"hickymanz" asked in the
SQL Server Central forums
for a method of counting unique words in a text column. Wayne Lawton
recommended using a string split function, which was a good idea, but
not quite adequate for the job in my opinion.
Typical string split functions, like this one that I wrote
can handle only a single delimiter, e.g. a comma. But in the case of
splitting for unique words you want all sorts of possible delimiters --
punctuation marks, white space including spaces, possibly numerics,
etc.
I believe I've seen other, similar requests for getting unique
words before, so I decided to solve the problem. I modified that split
string function to accept a pattern of valid characters
for the second argument. Anything NOT in the pattern will be treated as
a delimiter. I figured that would be a bit more flexible than the other
way around, so that people don't have to type in (or figure out) every
single possible white space and/or punctuation character.
Anyway, I think the modification is pretty straightforward if
you understand the previous function, so check it out and if you have
any questions feel free to ask. Here is the function:
CREATE FUNCTION dbo.SplitStringPattern
(
@List TEXT,
@Pattern VARCHAR(50)
)
RETURNS @ReturnTbl TABLE (OutParam VARCHAR(20))
WITH SCHEMABINDING
AS
BEGIN
DECLARE @LeftSplit VARCHAR(7998)
DECLARE @SplitStart INT SET @SplitStart = 0
DECLARE @SplitEnd INT
SET @SplitEnd = 7998
SET @Pattern = '%' + '[^' + RIGHT(@Pattern, LEN(@Pattern) - 1) + '%'
DECLARE @Delimiter CHAR(1)
SELECT @Delimiter = CHAR(MAX(Number))
FROM dbo.Numbers
WHERE CHAR(Number) LIKE @Pattern
SELECT @SplitEnd = MAX(Number)
FROM dbo.Numbers
WHERE (SUBSTRING(@List, Number, 1) LIKE @Pattern
OR Number = DATALENGTH(@List) + 1)
AND Number BETWEEN @SplitStart AND @SplitEnd
WHILE @SplitStart < DATALENGTH(@List) - 1
BEGIN
SET @LeftSplit = @Delimiter + SUBSTRING(@List, @SplitStart, @SplitEnd - @SplitStart) + @Delimiter
INSERT @ReturnTbl (OutParam)
SELECT LTRIM(RTRIM(SUBSTRING(@LeftSplit, Number + 1,
PATINDEX(@Pattern, SUBSTRING(@LeftSplit, Number + 1, LEN(@LeftSplit))) - 1)))
AS Value
FROM dbo.Numbers
WHERE Number <= LEN(@LeftSplit) - 1
AND SUBSTRING(@LeftSplit, Number, 1) LIKE @Pattern
AND SUBSTRING(@LeftSplit, Number + 1,
PATINDEX(@Pattern, SUBSTRING(@LeftSplit, Number + 1, LEN(@LeftSplit))) - 1) <> ''
SET @SplitStart = @SplitEnd + 1
SET @SplitEnd = @SplitEnd + 7998
SELECT @SplitEnd = MAX(Number) + @SplitStart
FROM dbo.Numbers
WHERE (SUBSTRING(@List, Number + @SplitStart, 1) LIKE @Pattern
OR Number+@SplitStart = DATALENGTH(@List) + 1)
AND Number BETWEEN 1 AND @SplitEnd - @SplitStart
END
RETURN
END
A note on the pattern for input: The pattern should be
single-character based, and must be delimited with [ ]. Also, you must
type in the actual characters in the pattern. I do not recommend using
ranges; ranges cause strange side-effects because they tend to include
(depending on collation) characters that you wouldn't expect to show up
(e.g. characters with umlauts over them). So unless that's what you
really want, don't use ranges.
Here's a usage example:
SELECT OutParam
FROM dbo.SplitStringPattern('This is a test.', '[abcdefghijklmnopqrstuvwxyz]')
OutParam
-------------
This
is
a
test
Enjoy!