|
|
|
|
The random ramblings and rantings of frazzled SQL Server DBA
A good friend of mine an fellow MSDN Forum Moderator Aaron Alton recently wrote a blog post entitled Breaking a String into “Words” in which he rambles slightly more than I do (I sense an attempt to trump my claim to fame as The Rambling DBA) on normal occasions. In his post he offers a method using pure TSQL to solve the problem of breaking a sentence into distinct words. Now to be perfectly honest, it is a decent TSQL solution to the problem, but I am going to show how much simpler it is to solve this problem by making use of SQLCLR. To start with, please stop reading here and click the link to his original article so that you can take a look at the code to understand how he solved the problem originally. This kind of problem lends itself to a SQLCLR solution with easier code that is more manageable and better for performance. First, lets evaluate the problems listed in Aaron's post. The first is "I have a column that contains a string of text and I need to pull out all any dates that show up in the column values”. From a purely CLR perspective, this a simple Regular Expression problem. To demonstrate this first lets create a Regex Object in C#: private static readonly Regex datefinder = new Regex(@"(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d") Some things to note about this object. First it is a static member so in order to keep our Assembly SAFE, we need to mark it read only which is fine because we won't be changing the regular expression string that creates the Regex. Second, we make it static readonly so that it only has to be created once, and as long as the AppDomain stays loaded, the same object can be reused by the CLR stack for performing the matching. Third, it assumes that the dates will exist in the format of MM/DD/YYYY, MM DD YYYY, MM.DD.YYYY, or MM-DD-YYYY. If you need a different format, perhaps DD/MM/YYYY the regular expression can be changed to reverse the group order fairly easily. One note is that you don't have to be an expert on Regular Expressions to be able to use them in CLR code. Many of the most popular regular expression strings exist on the Internet already and are easily found using your favorite search engine. So now that we have our our Regex lets put it to use: [SqlFunction( Name = "FindDatesInString", FillRowMethodName = "FindDatesInString_FillRow", DataAccess = DataAccessKind.None, IsDeterministic = true, TableDefinition = "string_found nvarchar(max)")] public static IEnumerable FindDatesInString([SqlFacet(MaxSize = -1)]SqlString input) { return datefinder.Matches(input.Value); } private static void FindDatesInString_FillRow(object obj, out string string_found) { string_found = ((Match)obj).Value; }
This is the code for a table valued function that can be used in a CROSS APPLY to return any dates matching the REGEX format specified. Now to put it to use, I had to build a bit larger table than Aaron's example. To do this I used to GO 110000 trick that I blogged about previously and built a table that had 440000 rows of data in it. I did this to ensure that there was enough data to be run through with both processes to do a comparison that would show which performed better. Using the following test code over a series of 10 runs the CLR operation averaged about 11 seconds faster than the TSQL operation created by Aaron. SET STATISTICS IO ON SET STATISTICS TIME ON SELECT RowKey, TxtCol, string_found AS Word FROM TestWordBreak tb CROSS APPLY dbo.FindDatesInString(tb.TxtCol) ;WITH FindDates AS ( SELECT frst.RowKey, frst.TxtCol, SUBSTRING( frst.TxtCol, frst.FirstCharacter, (1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter) ) AS Word FROM (SELECT tb.RowKey, tb.TxtCol, n.Number + 1 AS FirstCharacter, ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber FROM TestWordBreak tb JOIN Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' ' AND n.Number <= LEN(tb.TxtCol)) AS frst LEFT JOIN (SELECT tb2.RowKey, n.Number - 1 AS LastCharacter, ROW_NUMBER() OVER (PARTITION BY RowKey ORDER BY n.Number ASC) AS RowNumber FROM TestWordBreak tb2 JOIN Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' ' AND n.Number <= LEN(tb2.TxtCol)) AS lst ON frst.RowKey = lst.RowKey AND frst.RowNumber + 1 = lst.RowNumber ) SELECT * FROM FindDates WHERE ISDATE(Word) = 1 SET STATISTICS IO OFF SET STATISTICS TIME OFF
The Statistics output from the 6th test run on my machine is below: (440000 row(s) affected) Table 'Worktable'. Scan count 3, logical reads 220007, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestWordBreak'. Scan count 1, logical reads 3322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3797 ms, elapsed time = 43433 ms. (440000 row(s) affected) Table 'Numbers'. Scan count 880000, logical reads 2640000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestWordBreak'. Scan count 6, logical reads 7286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 45359 ms, elapsed time = 56834 ms. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
In my experience, Regular Expression matches can generally outperform TSQL string functions for operations like this one. To make the function more flexible, it can be altered to accept the regular expression string as an input as follows: [SqlFunction( Name = "RegexMatchesInString", FillRowMethodName = "FindDatesInString_FillRow", DataAccess = DataAccessKind.None, IsDeterministic = true, TableDefinition = "string_found nvarchar(max)")] public static IEnumerable RegexMatchesInString([SqlFacet(MaxSize = -1)]SqlString input, [SqlFacet(MaxSize = -1)]SqlString expression) { return Regex.Matches(input.Value, expression.Value); }
This can then be reused for other regular expression match problems in a CROSS APPLY similar to how it is used to solve the problem in this post. The other side of Aaron's post is splitting a string into words. For this, another CLR function can be created that uses the String.Split() method to split the string on any delimeter that is passed as follows: [SqlFunction( Name = "StringSplitter", FillRowMethodName = "SplitString_FillRow", DataAccess = DataAccessKind.None, IsDeterministic = true, TableDefinition = "split_string nvarchar(max)")] public static IEnumerable StringSplitter([SqlFacet(MaxSize = -1)]SqlString input, [SqlFacet(MaxSize = 1)]SqlString delimeter) { return input.Value.Split(delimeter.Value.ToCharArray()); } private static void SplitString_FillRow(object obj, out string split_string) { split_string = (string)obj; } Comparing this to the TSQL equivalent that Aaron produced can be done with the following script: SET STATISTICS IO ON SET STATISTICS TIME ON SELECT frst.RowKey, frst.TxtCol, SUBSTRING( frst.TxtCol, frst.FirstCharacter, (1 + COALESCE(lst.LastCharacter, LEN(frst.TxtCol)) - frst.FirstCharacter) ) AS Word FROM (SELECT tb.RowKey, tb.TxtCol, n.Number + 1 AS FirstCharacter, ROW_NUMBER() OVER (PARTITION BY tb.RowKey ORDER BY n.Number) AS RowNumber FROM TestWordBreak tb JOIN Numbers n ON SUBSTRING(tb.TxtCol, n.Number, 1) = ' ' AND n.Number <= LEN(tb.TxtCol)) AS frst LEFT JOIN (SELECT tb2.RowKey, n.Number - 1 AS LastCharacter, ROW_NUMBER() OVER (PARTITION BY RowKey ORDER BY n.Number ASC) AS RowNumber FROM TestWordBreak tb2 JOIN Numbers n ON SUBSTRING(tb2.TxtCol, n.Number, 1) = ' ' AND n.Number <= LEN(tb2.TxtCol)) AS lst ON frst.RowKey = lst.RowKey AND frst.RowNumber + 1 = lst.RowNumber SELECT RowKey, TxtCol, split_string as Word FROM TestWordBreak tb CROSS APPLY dbo.StringSplitter(tb.TxtCol, ' ') SET STATISTICS IO OFF SET STATISTICS TIME OFF
In this case on my test machine I rebuilt the table with only 10000 rows in it after the first test run took around 10 minutes to complete. The two methods tended to switch back and forth as to which was going to run faster than the other one. Out of ten test runs, the CLR method was generally faster but only by milliseconds: (290000 row(s) affected) Table 'Numbers'. Scan count 80000, logical reads 240000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestWordBreak'. Scan count 6, logical reads 1808, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3172 ms, elapsed time = 23680 ms. (290000 row(s) affected) Table 'Worktable'. Scan count 4, logical reads 80057, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestWordBreak'. Scan count 1, logical reads 303, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 547 ms, elapsed time = 22918 ms. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
The CLR function is much easier to reuse and duplicate without having to write the logic repetatively as you would in TSQL.
Anonymous comments are disabled
|
|
|
|
|