
Adam Machanic

Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.

Tokenize UDF

Originally posted here.

 


Yes, another string splitting UDF from a guy who's obviously become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call "Tokenization."

So what is Tokenization? It's a word I made up for this problem.

But what is it, really? It's splitting up a string based on a delimiter -- in this case, a comma -- while ignoring any delimiter that falls inside a quoted substring. In this case, a substring is anything enclosed in a pair of apostrophes, because that's what TSQL uses for strings.

I think this is best illustrated with an example string:

 

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

A basic split string function (here, dbo.SplitString) will produce the following output:

 

SELECT * 
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'

Well, that's wrong, because what I want to do is keep the substrings intact (or "tokens," as I like to call them -- thus, Tokenization!).
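For reference, dbo.SplitString isn't defined anywhere in this post. A minimal sketch of that kind of basic splitter might look like the following. The function body, the @Delimiter parameter, and the dbo.Numbers table (integers from 1 up) are all assumptions for illustration, not necessarily the function that produced the output above.

```sql
-- Hypothetical basic splitter: treats EVERY delimiter as a split point.
-- Assumes a dbo.Numbers table holding the integers 1, 2, 3, ...
CREATE FUNCTION dbo.SplitString
(
    @Input NVARCHAR(2000),
    @Delimiter NCHAR(1)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        -- The item runs from position n up to the next delimiter
        -- (a delimiter is appended so the last item is terminated too)
        LTRIM(RTRIM(SUBSTRING(
            @Input,
            n.Number,
            CHARINDEX(@Delimiter, @Input + @Delimiter, n.Number) - n.Number
        ))) AS OutParam
    FROM dbo.Numbers n
    WHERE n.Number BETWEEN 1 AND LEN(@Input) + 1
        -- Position n starts an item when the character before it is a
        -- delimiter (a delimiter is prepended so position 1 qualifies)
        AND SUBSTRING(@Delimiter + @Input, n.Number, 1) = @Delimiter
)
```

Nothing in there knows about apostrophes, so a comma inside a quoted substring is split just like any other comma.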

The output I desire is:

 

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

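Notice that the substrings -- delimited with apostrophes -- are kept intact. A comma only splits the string when an even number of apostrophes has been seen since the previous split, which is why ''c', 'd', 'e'' comes back as a single token.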

And here's how I've solved this problem...

 

CREATE FUNCTION dbo.Tokenize
(
    @Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE
(
    TokenNum INT IDENTITY(1,1),
    Token NVARCHAR(2000)
)
AS
BEGIN
    DECLARE @i INT SET @i = 0
    DECLARE @StartChar INT SET @StartChar = 1
    DECLARE @Quote INT SET @Quote = 0

    -- One row per character of the padded input
    DECLARE @Chars TABLE
    (
        CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        TheChar CHAR(1),
        TheCount INT,
        StartChar INT
    )

    -- Pad with delimiters so the first and last tokens are bounded like the rest
    SET @Input = ' , ' + @Input + ' , '

    INSERT @Chars (TheChar)
    SELECT SUBSTRING(@Input, n.Number, 1)
    FROM Numbers n
    WHERE n.Number > 0
        AND n.Number <= LEN(@Input)
    ORDER BY n.Number

    -- Walk the characters, looking one character ahead (Chars1).
    -- TheCount is set to 0 on the character just before a delimiting comma,
    -- and a comma only delimits when the running apostrophe count (@Quote) is even.
    UPDATE Chars SET
        @i = Chars.TheCount =
            CASE
                WHEN Chars1.TheChar = ','
                    AND @Quote % 2 = 0 THEN 0
                ELSE @i + 1
            END,
        @Quote =
            CASE
                WHEN Chars1.TheChar = '''' THEN @Quote + 1
                WHEN @i = 0 THEN 0
                ELSE @Quote
            END,
        @StartChar = Chars.StartChar =
            CASE
                WHEN @i = 1 THEN Chars1.CharNum - 1
                WHEN @i = 0 THEN @StartChar + 1
                ELSE @StartChar
            END
    FROM @Chars Chars
    JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

    -- Each TheCount = 0 row closes a token; the second branch of the
    -- UNION ALL covers whatever follows the last such row
    INSERT @Tokens(Token)
    SELECT
        RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
    FROM (
        SELECT StartChar, CharNum
        FROM @Chars
        WHERE TheCount = 0

        UNION ALL

        SELECT
            MAX
            (
                CASE TheCount
                    WHEN 0 THEN CharNum
                    ELSE 0
                END
            ) + 1,
            MAX(CharNum)
        FROM @Chars
    ) x
    WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
    ORDER BY x.StartChar
    RETURN
END

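A word of warning: This UDF uses the undocumented -- and unsupported -- "aggregate update" functionality. That's the UPDATE above, which assigns to variables and columns in the same SET clause and relies on the variables carrying their values from one row to the next in CharNum order. Nothing guarantees that processing order. I've tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you not to use it in your own projects without extensive testing! MS doesn't support this one, so handle with care.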

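And by the way, you need a numbers table to use this thing: a table called Numbers with an integer column called Number, holding the integers from 1 up to at least the longest input you expect (the parameter caps that at 2000 characters). Of course.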
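If you don't already have one, a numbers table can be built along these lines. This is just a sketch: the UDF only tells us the table is called Numbers and has a column called Number, so the dbo schema, the primary key, and the upper bound of 10,000 are assumptions.

```sql
-- Assumed shape: the UDF only requires a table named Numbers
-- with an integer column named Number.
CREATE TABLE dbo.Numbers
(
    Number INT NOT NULL PRIMARY KEY
)
GO

-- Ten digits, cross joined four ways, give 10 * 10 * 10 * 10 = 10,000 rows.
DECLARE @Digits TABLE (d INT NOT NULL PRIMARY KEY)

INSERT @Digits (d)
SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9

-- Compose each number from its decimal digits; the 1 + shifts 0..9999 to 1..10000.
INSERT dbo.Numbers (Number)
SELECT 1 + ones.d + 10 * tens.d + 100 * hundreds.d + 1000 * thousands.d
FROM @Digits ones
CROSS JOIN @Digits tens
CROSS JOIN @Digits hundreds
CROSS JOIN @Digits thousands
```

GO is a batch separator understood by the client tools rather than a T-SQL statement, so run this from a query window. The digit cross join is used here only because it doesn't depend on any particular SQL Server version.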

As for using this thing, it's pretty easy:

 

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)


Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

... and it even appears to work properly!
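Given the warning above about testing, one way to keep an eye on this is to compare the UDF's output against the token list you expect. The following is only a sketch: the @Expected table variable and the messages are made up for illustration, and the expected rows are simply the five tokens shown above.

```sql
DECLARE @Tokens VARCHAR(50)
SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

-- Hypothetical table of expected results, one row per token
DECLARE @Expected TABLE
(
    TokenNum INT NOT NULL PRIMARY KEY,
    Token NVARCHAR(2000) NOT NULL
)

INSERT @Expected (TokenNum, Token)
SELECT 1, 'a'
UNION ALL SELECT 2, '''b'''
UNION ALL SELECT 3, '''''c'', ''d'', ''e'''''
UNION ALL SELECT 4, 'f'
UNION ALL SELECT 5, '''1,2,3,4'''

-- A FULL OUTER JOIN catches missing tokens, extra tokens, and mismatches
IF EXISTS
(
    SELECT *
    FROM @Expected e
    FULL OUTER JOIN dbo.Tokenize(@Tokens) t ON t.TokenNum = e.TokenNum
    WHERE e.TokenNum IS NULL
        OR t.TokenNum IS NULL
        OR t.Token IS NULL
        OR t.Token <> e.Token
)
    RAISERROR('dbo.Tokenize output differs from the expected tokens', 16, 1)
ELSE
    PRINT 'dbo.Tokenize output matches the expected tokens'
```

Note that the <> comparison follows the database collation and ignores trailing spaces, so this is a sanity check rather than a byte-for-byte comparison. Rerunning it after a service pack or version upgrade seems like a sensible habit with an unsupported technique like this one.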

Enjoy... an application for this and the other strange things I've been posting recently is coming very, very soon.



Published Wednesday, July 12, 2006 10:34 PM by Adam Machanic

About Adam Machanic

Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has been involved in dozens of SQL Server implementations for both high-availability OLTP and large-scale data warehouse applications, and has optimized data access layer performance for several data-intensive applications. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "Expert SQL Server 2005 Development" (Apress, 2007) and "Inside SQL Server 2005: Query Tuning and Optimization" (Microsoft Press, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server and a Microsoft Certified IT Professional (MCITP).