SQLblog.com - The SQL Server Blog Spot on the Web

The Rambling DBA: Jonathan Kehayias

The random ramblings and rantings of a frazzled SQL Server DBA

Breaking a String into "Words" the CLR way

A good friend of mine and fellow MSDN Forum Moderator, Aaron Alton, recently wrote a blog post entitled Breaking a String into “Words”, in which he rambles slightly more than I do on normal occasions (I sense an attempt to trump my claim to fame as The Rambling DBA).  In his post he offers a pure TSQL method for solving the problem of breaking a sentence into distinct words.  Now to be perfectly honest, it is a decent TSQL solution to the problem, but I am going to show how much simpler it is to solve by making use of SQLCLR.

To start with, please stop reading here and click the link to his original article so that you can take a look at the code to understand how he solved the problem originally.

This kind of problem lends itself to a SQLCLR solution: the code is simpler, more manageable, and better for performance.  First, let's evaluate the problems listed in Aaron's post.  The first is "I have a column that contains a string of text and I need to pull out any dates that show up in the column values."  From a purely CLR perspective, this is a simple Regular Expression problem.  To demonstrate, let's first create a Regex object in C#:

private static readonly Regex datefinder = new Regex(@"(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d");

Some things to note about this object.  First, it is a static member, so in order to keep our Assembly SAFE we need to mark it readonly, which is fine because we won't be changing the regular expression string that creates the Regex.  Second, making it static readonly means it only has to be created once; as long as the AppDomain stays loaded, the same object can be reused by the CLR stack for performing the matching.  Third, it assumes that the dates will exist in the format MM/DD/YYYY, MM DD YYYY, MM.DD.YYYY, or MM-DD-YYYY.  If you need a different format, perhaps DD/MM/YYYY, the regular expression can be changed to reverse the group order fairly easily.
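The pattern itself is easy to sanity-check outside SQL Server.  As a quick standalone sketch (using Python's re module purely for convenience, with a made-up sample sentence; this is not the SQLCLR code), the same pattern picks out valid MM/DD/YYYY-style dates while skipping strings like 13/40/2009 that only look like dates:

```python
import re

# Same pattern as the .NET Regex above: MM/DD/YYYY with -, space, / or . separators
datefinder = re.compile(r"(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d")

text = "Invoiced 01/20/2009, paid 02-15-2009, and 13/40/2009 is not a real date."

# group(0) is the full match, not just the captured month/day/year groups
dates = [m.group(0) for m in datefinder.finditer(text)]
print(dates)  # ['01/20/2009', '02-15-2009']
```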

One note is that you don't have to be an expert on Regular Expressions to be able to use them in CLR code.  Many of the most popular regular expression strings exist on the Internet already and are easily found using your favorite search engine.

So now that we have our Regex, let's put it to use:

    [SqlFunction(
      Name = "FindDatesInString",
      FillRowMethodName = "FindDatesInString_FillRow",
      DataAccess = DataAccessKind.None,
      IsDeterministic = true,
      TableDefinition = "string_found nvarchar(max)")]
    public static IEnumerable FindDatesInString([SqlFacet(MaxSize = -1)] SqlString input)
    {
        return datefinder.Matches(input.Value);
    }

    private static void FindDatesInString_FillRow(object obj, out string string_found)
    {
        string_found = ((Match)obj).Value;
    }


This is the code for a table valued function that can be used in a CROSS APPLY to return any dates matching the Regex format specified.  To put it to use, I had to build a somewhat larger table than Aaron's example.  To do this I used the GO 110000 trick that I blogged about previously and built a table that had 440000 rows of data in it.  I did this to ensure that there was enough data to run through both processes in a comparison that would show which performed better.  Using the following test code over a series of 10 runs, the CLR operation averaged about 11 seconds faster than the TSQL operation created by Aaron.

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT
RowKey, TxtCol, string_found AS Word
FROM   TestWordBreak tb
CROSS APPLY dbo.FindDatesInString(tb.TxtCol)

;WITH FindDates AS
(
SELECT     frst.RowKey, frst.TxtCol,
           SUBSTRING(frst.TxtCol,
                     frst.FirstCharacter,
                     (1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter)
                    ) AS Word
FROM       (SELECT  tb.RowKey,
                    tb.TxtCol,
                    n.Number + 1 AS FirstCharacter,
                    ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber
            FROM    TestWordBreak tb
            JOIN    Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' '
                             AND n.Number <= LEN(tb.TxtCol)) AS frst
LEFT JOIN  (SELECT  tb2.RowKey,
                    n.Number - 1 AS LastCharacter,
                    ROW_NUMBER() OVER (PARTITION BY tb2.RowKey ORDER BY n.Number ASC) AS RowNumber
            FROM    TestWordBreak tb2
            JOIN    Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' '
                             AND n.Number <= LEN(tb2.TxtCol)) AS lst
        ON frst.RowKey = lst.RowKey
       AND frst.RowNumber + 1 = lst.RowNumber
)
SELECT * FROM FindDates
WHERE ISDATE(Word) = 1

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

The Statistics output from the 6th test run on my machine is below:

(440000 row(s) affected)
Table 'Worktable'. Scan count 3, logical reads 220007, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestWordBreak'. Scan count 1, logical reads 3322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 3797 ms,  elapsed time = 43433 ms.

(440000 row(s) affected)
Table 'Numbers'. Scan count 880000, logical reads 2640000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestWordBreak'. Scan count 6, logical reads 7286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 45359 ms,  elapsed time = 56834 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

In my experience, Regular Expression matches can generally outperform TSQL string functions for operations like this one.  To make the function more flexible, it can be altered to accept the regular expression string as an input as follows:

    [SqlFunction(
      Name = "RegexMatchesInString",
      FillRowMethodName = "FindDatesInString_FillRow",
      DataAccess = DataAccessKind.None,
      IsDeterministic = true,
      TableDefinition = "string_found nvarchar(max)")]
    public static IEnumerable RegexMatchesInString([SqlFacet(MaxSize = -1)] SqlString input, [SqlFacet(MaxSize = -1)] SqlString expression)
    {
        return Regex.Matches(input.Value, expression.Value);
    }

This can then be reused for other regular expression match problems in a CROSS APPLY similar to how it is used to solve the problem in this post.

The other side of Aaron's post is splitting a string into words.  For this, another CLR function can be created that uses the String.Split() method to split the string on any delimiter that is passed in:

    [SqlFunction(
      Name = "StringSplitter",
      FillRowMethodName = "SplitString_FillRow",
      DataAccess = DataAccessKind.None,
      IsDeterministic = true,
      TableDefinition = "split_string nvarchar(max)")]
    public static IEnumerable StringSplitter([SqlFacet(MaxSize = -1)] SqlString input, [SqlFacet(MaxSize = 1)] SqlString delimiter)
    {
        return input.Value.Split(delimiter.Value.ToCharArray());
    }

    private static void SplitString_FillRow(object obj, out string split_string)
    {
        split_string = (string)obj;
    }
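One behavioral detail of String.Split() worth knowing: consecutive delimiters produce empty strings in the result, so double spaces in the source text will come back as empty rows unless you filter them out (in .NET via StringSplitOptions.RemoveEmptyEntries, or, for example, with a WHERE clause on the function's output).  Python's str.split() with an explicit separator behaves the same way, which makes for a quick standalone sketch (illustrative only, with made-up sample data; not the SQLCLR code itself):

```python
# Emulate the behavior of .NET String.Split(' ') on a sample sentence
sentence = "Breaking a  string into words"

words = sentence.split(' ')
print(words)  # ['Breaking', 'a', '', 'string', 'into', 'words']

# .NET offers StringSplitOptions.RemoveEmptyEntries to drop the empties;
# the equivalent filter here:
print([w for w in words if w])  # ['Breaking', 'a', 'string', 'into', 'words']
```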

Comparing this to the TSQL equivalent that Aaron produced can be done with the following script:

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT     frst.RowKey, frst.TxtCol,
           SUBSTRING(frst.TxtCol,
                     frst.FirstCharacter,
                     (1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter)
                    ) AS Word
FROM       (SELECT  tb.RowKey,
                    tb.TxtCol,
                    n.Number + 1 AS FirstCharacter,
                    ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber
            FROM    TestWordBreak tb
            JOIN    Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' '
                             AND n.Number <= LEN(tb.TxtCol)) AS frst
LEFT JOIN  (SELECT  tb2.RowKey,
                    n.Number - 1 AS LastCharacter,
                    ROW_NUMBER() OVER (PARTITION BY tb2.RowKey ORDER BY n.Number ASC) AS RowNumber
            FROM    TestWordBreak tb2
            JOIN    Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' '
                             AND n.Number <= LEN(tb2.TxtCol)) AS lst
        ON frst.RowKey = lst.RowKey
       AND frst.RowNumber + 1 = lst.RowNumber

SELECT RowKey, TxtCol, split_string as Word
FROM   TestWordBreak tb
CROSS APPLY dbo.StringSplitter(tb.TxtCol, ' ')


SET STATISTICS IO OFF
SET STATISTICS TIME OFF

In this case, I rebuilt the table on my test machine with only 10000 rows in it after the first test run took around 10 minutes to complete.  The two methods tended to switch back and forth as to which ran faster.  Out of ten test runs, the CLR method was generally faster, but only by milliseconds:


(290000 row(s) affected)
Table 'Numbers'. Scan count 80000, logical reads 240000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestWordBreak'. Scan count 6, logical reads 1808, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 3172 ms,  elapsed time = 23680 ms.

(290000 row(s) affected)
Table 'Worktable'. Scan count 4, logical reads 80057, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestWordBreak'. Scan count 1, logical reads 303, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 547 ms,  elapsed time = 22918 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

The CLR function is much easier to reuse and duplicate without having to rewrite the logic repeatedly as you would in TSQL.

Published Tuesday, January 20, 2009 2:40 PM by Jonathan Kehayias

Comments

 

Adam Machanic said:

You (and readers) may want to read Erland Sommarskog's article on this topic...

http://sommarskog.se/arrays-in-sql-2005.html

January 20, 2009 4:18 PM
 

Jonathan Kehayias said:

I've read everything on his site and it is all very good reading.  One thing he doesn't cover directly, or in any way that makes it easy to find in that article, is which of the numerous possible ways of splitting a string actually performs the fastest.

The specific point of interest I had was the search for dates, which is where CLR really has more of an advantage with Regular Expressions.  A basic split function is rarely run against large bulk operations such as the demos in this post, but to be able to accurately see a difference over multiple iterations, I have always had to play with large result sets.

January 20, 2009 4:31 PM
 

Plamen Ratchev said:

The second and third paragraphs of Erland's article have the link to the performance tests, which cover which method performs best in great detail:

http://sommarskog.se/arrays-in-sql-perftest.html

Erland did by far the most comprehensive testing of all splitting methods.

January 23, 2009 12:07 AM
 

Aaron Alton said:

Great article, Jonathan.  It's interesting to see how well an out-of-process .NET object can perform a task that (one would think) could be done more efficiently (though in an admittedly more complex fashion) in-process.

January 25, 2009 6:23 PM
 

Aaron Alton said:

@Adam

Great - not only did Erland steal my idea, but he went back in time and wrote a small novel on it!  His articles really are astounding though - I've never seen anyone do quite as much due diligence as Erland does.

January 25, 2009 6:26 PM
 

Jonathan Kehayias said:

Plamen,

Thanks.  I must have searched that thing a dozen times and missed that.  I wasn't being completely comprehensive since I read that over a year ago, and it is so detailed.  Have to slow down next time.

January 26, 2009 9:13 PM