THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

SQLCLR String Splitting Part 2: Even Faster, Even More Scalable

Two days ago, after posting what I thought was a pretty solid SQLCLR string splitting method, I received a comment telling me about a big thread on SQLServerCentral dedicated to the question of how best to split strings. So I jumped in, and went back and forth, and back and forth, and back... and forth...

Many, many messages and several revisions and re-revisions later, I am happy to present the newer, better, more bug-free SQLCLR string splitting function. Not only is it faster than the previous one I posted, but it also handles multi-character delimiters, ignores consecutive delimiters as well as pre and post delimiters, and best of all, it won't get itself stuck in an infinite loop in certain cases (always a really great stability feature).

This version walks the SqlChars character array rather than using the IndexOf method on the string, which we found to be a somewhat faster technique--and it's certainly the most scalable and memory efficient method I can imagine.

Special thanks to SQLServerCentral member Florian Reischl, who was the main person keeping the thread going with me during the last couple of days, and who managed to re-write my versions and eke out even better performance by modifying the algorithms.  Great stuff--it was definitely the most fun I've had on a technical forum in quite a long time.

But without further ado, the code:

using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(
       FillRowMethodName = "FillRow_Multi",
       TableDefinition = "item nvarchar(4000)"
       )
    ]
    public static IEnumerator SplitString_Multi(
      [SqlFacet(MaxSize = -1)]
      SqlChars Input,
      [SqlFacet(MaxSize = 255)]
      SqlChars Delimiter
       )
    {
        return (
            (Input.IsNull || Delimiter.IsNull) ?
            new SplitStringMulti(new char[0], new char[0]) :
            new SplitStringMulti(Input.Value, Delimiter.Value));
    }

    public static void FillRow_Multi(object obj, out SqlString item)
    {
        item = new SqlString((string)obj);
    }

    public class SplitStringMulti : IEnumerator
    {
        public SplitStringMulti(char[] TheString, char[] Delimiter)
        {
            theString = TheString;
            stringLen = TheString.Length;
            delimiter = Delimiter;
            delimiterLen = (byte)(Delimiter.Length);
            isSingleCharDelim = (delimiterLen == 1);

            lastPos = 0;
            nextPos = delimiterLen * -1;
        }

        #region IEnumerator Members

        public object Current
        {
            get
            {
                return new string(theString, lastPos, nextPos - lastPos);
            }
        }

        public bool MoveNext()
        {
            if (nextPos >= stringLen)
                return false;
            else
            {
                lastPos = nextPos + delimiterLen;

                for (int i = lastPos; i < stringLen; i++)
                {
                    bool matches = true;

                    //Optimize for single-character delimiters
                    if (isSingleCharDelim)
                    {
                        if (theString[i] != delimiter[0])
                            matches = false;
                    }
                    else
                    {
                        for (byte j = 0; j < delimiterLen; j++)
                        {
                            if (((i + j) >= stringLen) || (theString[i + j] != delimiter[j]))
                            {
                                matches = false;
                                break;
                            }
                        }
                    }

                    if (matches)
                    {
                        nextPos = i;

                        //Deal with consecutive delimiters
                        if ((nextPos - lastPos) > 0)
                            return true;
                        else
                        {
                            i += (delimiterLen-1);
                            lastPos += delimiterLen;
                        }
                    }
                }

                lastPos = nextPos + delimiterLen;
                nextPos = stringLen;

                if ((nextPos - lastPos) > 0)
                    return true;
                else
                    return false;
            }
        }

        public void Reset()
        {
            lastPos = 0;
            nextPos = delimiterLen * -1;
        }

        #endregion

        private int lastPos;
        private int nextPos;

        private readonly char[] theString;
        private readonly char[] delimiter;
        private readonly int stringLen;
        private readonly byte delimiterLen;
        private readonly bool isSingleCharDelim;
    }
};

Enjoy!

Published Tuesday, April 28, 2009 3:54 PM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Paul White said:

I hope people read the background to appreciate what a work of art that code now is.  Awesome work.

April 29, 2009 7:20 AM
 

Michael said:

I think you should make skipping consecutive delimiters an option as there are plenty of cases where you are looking for the x value, and you have to include blank ones in your count.

May 10, 2009 5:43 AM
 

Matt said:

Was wondering if you considered including a "position" element in the output table along with the item element.  Also, what about making the item element nvarchar(max) rather than 4000?  And did you compare the performance difference between 4000 and max parameters?  In other senarios, I've created a few SQL functions to call the same C# code, so you can call a faster SQL function when you know your inputs/outputs will be limited to 4000 characters, or call the slower SQL function when you need unlimited length inputs/outputs.

November 26, 2009 10:03 AM
 

Aaron Bertrand said:

I've always been very reluctant to use the CLR. On a completely unrelated tangent recently, I begrudgingly

July 6, 2010 7:05 PM
 

Aaron Bertrand said:

Last August, I wrote a lengthy post about how I handle splitting up a list of integers that are passed

August 10, 2010 9:51 AM
 

Darek Spiewak said:

Hi there. Maybe I'm not entirely in the know here, so please pardon me if my remark is stupid and disregard it altogether in that case... Would it not be easier and better (faster???) to use regexes for splitting strings? I remember that there are built-in options in regex's methods that will allow one to split strings based not only on concrete chars but also on patterns and to automatically get rid of empty entries as well; if we compile the regex object, it could be even faster, even though it'll incur some start-up penalty (this shouldn't matter if there are millions of rows to process, though). And, of course, the IEnumerable is then for free! I have not checked the difference, though, between the code presented here and the regex-based solution that I propose performance-wise. But I think the regexes are a viable solution that gives even more flexibility than the (beautiful) code above.

February 26, 2012 10:43 PM
 

Adam Machanic said:

Hi Darek,

You are correct; you can use RegEx. It's not faster, but it certainly is much easier to implement. Unfortunately it has a major downside: the entire collection of substrings will be allocated upfront. It will be sitting in memory the entire time you're processing, using up precious server resources. And if you have a lot of elements in that collection, it will go on the LOB heap and performance across the entire AppDomain could be severely impacted. All in all, not the kind of solution I want inside my very expensive SQL Server instance.

--Adam

February 27, 2012 10:11 AM
 

Darek Spiewak said:

Hi Adam.

Very right you are in what you say. My first comment was misfired. After I have read the whole conversation and other bits and pieces here and there, I get the point and agree that the above solution would be the ultimate. It's a pity that one cannot in this case split on regexes... but I guess one cannot have a cake and eat it :-)

Best,

Darek

March 1, 2012 2:12 PM
 

jimbobmcgee said:

I know this post is really old, but if you are calling Input.Value, are you then not doing the same pre-allocation you are trying to avoid?

Should we not find some way to stream the chars in using SqlChars.Read and stream it back out using SqlChars.Write?

Something like (pseudo, single-char delim):

   SqlChars currentResult = new SqlChars();

   long inputOffset = 0, outputOffset = 0;

   while (true) {

       int read = Input.Read(inputOffset, Input.Buffer, 0, Input.Buffer.Length);

       if (read > 0) {

           for (int i = 0; i < read; i++) {

               if (Input.Buffer[i] == delimChar) {

                   // iterate the current string and create the next one

                   yield return currentResult;

                   currentResult = new SqlChars();

                   outputOffset = 0;

               }

               else {

                   // append the char to the current string

                   currentResult.Write(outputOffset++, Input.Buffer, i, 1);

               }

           }

           offset += read;

       }

       else break;    // no more characters in input

   }

I don't know for certain whether the calls to Read and Write are correct, as I'm still learning SQLCLR (in particular how to handle LOBs -- I came here looking for a possible means to split more than 4000 chars and return more than 4000 chars)...

May 9, 2012 4:29 PM
 

jimbobmcgee said:

(There are also clearly some bugs in my above pseudo-code, too: `offset += read;` should be `inputOffset += read;`, for one; and another `yield return currentResult;` is required to spool the last result -- but you should get the gist)

May 9, 2012 4:38 PM
 

Adam Machanic said:

jimbobmcgee: YES, we should find a way to stream. Unfortunately, Microsoft broke the streaming interface for LOBs in SQL Server 2008. (Here is a Connect item describing the issue: https://connect.microsoft.com/SQLServer/feedback/details/382235/clr-error-system-invalidoperationexception-accessing-members-of-an-object-from-a-wrong-thread) ... Apparently it is now fixed, but I've not tried it out yet.

That said, bringing in the string is very different from holding a reference to all of the sub-strings. The latter case is much more expensive due to pointer overhead.

May 11, 2012 12:05 AM
 

Dave said:

Hi Adam

Fantastic article.  I've used a slightly modified version of your code in my application and seen tremendous performance gains over native SQL attempts at the same thing.

I hope you don't mind, but because I struggled to create the actual DLL and integrate it (first time I've written a CLR UDF), I've written a how-to on my blog using your code above as an example.

Due credit is given in your direction obviously. Hope that's okay?

Keep up the good work.

July 20, 2012 9:37 AM
 

Dave said:

July 20, 2012 9:38 AM
 

Patrick said:

Hi,

I was very impressed by the performance of this function.

I would need this Split function to return empty strings if two delimiters are consecutive. Which changes would be necessary ?

Thanks

August 29, 2012 4:57 AM
 

Adam Machanic said:

Patrick,

See the part of the code with the following comment:

//Deal with consecutive delimiters

:-)

August 29, 2012 12:25 PM
 

Chris Rutherford said:

Hi there Adam,

Thanks for posting this, I've created my first CLR from this code and it works like a charm.  I read the original SQL Server Central thread and a number of solutions also returned a counter as well in the table definition for each row returned.

So table definition would be the following:

TableDefinition = "counter int, item nvarchar(4000)"

I'm no C# guru but I want to add this counter to the code but I'm not sure how to do this.  I've tried a few things but I can't get my new output parameter from the SplitStringMulti class into the FillRow_Multi object, i.e:

public static void FillRow_Multi(object obj, out SqlInt32 counter, out SqlString item)

{

   counter = 1; // I don't know???

   item = new SqlString((string)obj);

}

If you have a moment are you able to advise please?

Kind regards,

Chris

December 4, 2012 4:12 PM
 

Chris Rutherford said:

Uh, never mind, I just read page 50 of that thread on SQL Server Central and I see you've added the row number there =)

Thanks again!

December 4, 2012 4:27 PM
 

Chris Rutherford said:

Oh, actually your code in this blog post and the code on page 50 (which includes the row numbers) is quite different.

I thought I'd try to modify your code on this blog with the row number code but I'm not having much luck.  So my original request still stands, are you able to assist please?

Kind regards,

Chris

December 4, 2012 4:45 PM
 

Chris Rutherford said:

Hi there,

Sorry for all of the comments.  I think I've managed to do this.  I've also modified the code as I'm only ever interested in a one character delimiter.  I'm still a C# newb so may have missed things.

Essentially the change returns an additional column which shows the place of the delimited field within the original record.  So it's not meant to be a unique integer value.

Thank you so much again for posting this on your blog.

using System;

using System.Collections;

using System.Data;

using System.Data.SqlClient;

using System.Data.SqlTypes;

using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions

{

   [Microsoft.SqlServer.Server.SqlFunction(

       FillRowMethodName = "FillRow_Multi",

       TableDefinition = "row int, item nvarchar(4000)"

       )

   ]

   public static IEnumerator SplitString_Multi(

       [SqlFacet(MaxSize = -1)] SqlChars Input,

       [SqlFacet(MaxSize = 1)] SqlChars Delimiter

       )

   {

       return ((Input.IsNull || Delimiter.IsNull) ? new SplitStringMulti(new char[0], new char[0]) : new SplitStringMulti(Input.Value, Delimiter.Value));

   }

   public static void FillRow_Multi(object obj, out SqlInt32 row, out SqlString item)

   {

       ResultSet rs = (ResultSet)obj;

       row = new SqlInt32((int)rs.get_row);

       item = new SqlString((string)rs.get_value);

   }

   public class ResultSet

   {

       private int row;

       private string value;

       private int start;

       private int end;

       private char[] theString;

       public ResultSet(int input_row, char[] input_value, int input_start, int input_end)

       {

           row = input_row;

           value = new string(input_value, input_start, input_end);

       }

       public object get_row

       {

           get

           {

               return row;

           }

       }

       public object get_value

       {

           get

           {

               return value;

           }

       }

   }

   public class SplitStringMulti : IEnumerator

   {

       public SplitStringMulti(char[] TheString, char[] Delimiter)

       {

           theString = TheString;

           stringLen = TheString.Length;

           delimiter = Delimiter;

           lastPos = 0;

           nextPos = 1 * -1;

       }

       public object Current

       {

           get

           {

               ResultSet results = new ResultSet(row, theString, lastPos, nextPos - lastPos);

               return results;

           }

       }

       public bool MoveNext()

       {

           if (nextPos >= stringLen)

           {

               return false;

           }

           else

           {

               lastPos = nextPos + 1;

               for (int i = lastPos; i < stringLen; i++)

               {

                   bool matches = true;

                   if (theString[i] != delimiter[0])

                   {

                       matches = false;

                   }

                   else

                   {

                       nextPos = i;

                       row++;

                       return true;

                   }

               }

               lastPos = nextPos + 1;

               nextPos = stringLen;

               if ((nextPos - lastPos) > 0)

               {

                   return true;

               }

               else

               {

                   return false;

               }

           }

       }

       public void Reset()

       {

           lastPos = 0;

           nextPos = 1 * -1;

           row = 0;

       }

       private int lastPos;

       private int nextPos;

       private int row = 0;

       private readonly char[] theString;

       private readonly char[] delimiter;

       private readonly int stringLen;

   }

};

December 4, 2012 7:43 PM
 

Adam Machanic said:

Hi Chris,

Glad you got it working, but the code in the SSC thread would have been totally fine to use -- I've had it in a production environment since a couple of days after I posted there, with no problems at all.

That said, probably good that you got a chance to play with some C#. Always nice to learn new things.

--Adam

December 5, 2012 9:50 AM
 

Maurice said:

Great job, got it running just fine. Thanks !!

December 21, 2012 8:19 AM
 

Surya Pratap said:

Hi Chris,

I am a newbie myself but IMHO considering that the Enumeration will run only once for a given string, I suppose you can define your row number with just

public class ResultSet

{

private readonly int row;

private readonly string value;

public ResultSet(int row, string value)

{

this.row = row;

this.value = value;

}

}

in the SplitStringMulti class keep a field

private int row;

initalise row in "Constructor" and during "Reset" set row to 0 by adding

row = 0;

then in the iterator you can use

public object Current

{

get

{

return = new ResultSet(++row, theString( lastPos, nextPos - lastPos));

}

}

and use ResultSet in the  FillRow_Multi like

public static void FillRow_Multi(object obj, out SqlInt32 row, out SqlString item)

{

ResultSet rs = (ResultSet)obj;

row = new SqlInt32(rs.row);

item = new SqlString(rs.value);

}

December 26, 2012 8:56 AM
 

sheir said:

Hi,

I am new to SQL CLR.

Is there a usage example or even how to use this class in VS2012 / SQL2008R2?

The usage scenario I have is that I have a pipe (|) and comma delimited string.  The comma separates the "records/rows" and the pipe separates the "columns" (all string values).

So would like to pass in my string and have it (in SQL2008R2) spit out a table with the columns and rows.  

Is that possible?

September 13, 2013 3:16 PM
 

Ronen Ariely - pituach said:

Good day,

If I may offer a small streamlining.

Since you are changing the code from time to time and you add improvements. It is worth using the AssemblyVersion attribute.

This will allow the user to update the assembly directly by using "ALTER ASSEMBLY..."

all that need to be done is add this to the code:

using System.Reflection;

[assembly: AssemblyVersion("1.0.0.0")]

and you can add attributes like those if you want:

[assembly: AssemblyFileVersion("1.0.0.0")]

[assembly: AssemblyDescription("Adam Machanic Split CLR")]

[assembly: AssemblyCompany("Free Open Source Code")]

November 17, 2013 1:42 PM
 

FY said:

It would be interesting to have an example of how to run the function in SSMS. If I just call it there is an error that says it fails because is a table valued object function object.

April 25, 2014 9:27 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement