THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Adam Machanic

Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.

  • The Hidden Costs of INSERT EXEC

    INSERT and EXEC: relational brothers in arms, helping you get your data and then put it somewhere. But like all brothers, their relationship has its ups and downs and sometimes you need to look just below the surface to see the real issues. In this post I will cover hidden penalties that a decision--seemingly purely aesthetic--can bring to the fore.

    To illustrate the first of these examples I will be using the following table:

    USE tempdb
    GO


    CREATE TABLE dbo.MyTable
    (
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] [money] NOT NULL,
        [rowguid] [uniqueidentifier]  NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
    )
    GO

    Consider the following queries, disregarding the fact that this isn't a good dynamic SQL example (I'm trying to keep it simple), and take a moment to think about the differences between them from a behavioral point of view:

    EXEC
    ('
        INSERT dbo.MyTable
        SELECT *
        FROM AdventureWorks.Sales.SalesOrderDetail
    ')
    GO

    INSERT dbo.MyTable
    EXEC
    ('
        SELECT *
        FROM AdventureWorks.Sales.SalesOrderDetail
    ')
    GO

    It's quite common to insert some data into a table from dynamic SQL, and for as long as I can remember, I've believed the choice between these two forms of the query to be primarily an issue of style. My preference has been for the latter style, as I feel that it's a bit more readable. It also has the benefit of better supporting ownership chains, but that's something that, in most cases, we don't have to worry about any longer in SQL Server 2005 or 2008. But let's move beyond style and potential security issues and get to the substance. What actually happens when we run these two queries?

    In the first case--insert done inside of the EXEC--the first step is that the outer statement--the EXEC itself--is parsed and validated. Now we jump down one level of context, into the string that was passed to EXEC, and that string is parsed and validated. The referenced objects are resolved, and the plan cache is checked. If there is no plan, the query is compiled. And then it's executed. Data is streamed from SalesOrderDetail into MyTable and when the query inside of the EXEC is done control returns to the calling context--EXEC--and assuming there were no errors, the EXEC is now completed.

    In the second case, something quite a bit different occurs. The insert is parsed and validated, and an INSERT EXEC plan is generated. This plan does not involve SalesOrderDetail, as the query optimizer doesn't know at this point in the process where the data will be inserted from, since that step is dynamic. So the plan references something called the "Parameter Table" (plan truncated for simplicity):

    |--Table Insert(OBJECT:([tempdb].[dbo].[MyTable]))
       |--Top(ROWCOUNT est 0)
            |--Parameter Table Scan


    Once this plan has been generated, we again jump down one level of context, and the inner string is parsed and validated, the referenced object resolved, the cache checked, and the compiled plan executed. But what is this Parameter Table thing?

    Here's where things start really diverging. Data is not, at this point, streamed from SalesOrderDetail directly into MyTable. Rather, it is streamed from SalesOrderDetail into the Parameter Table. And the Parameter Table, as it turns out, is in actuality a hidden temporary table. And not until all of the data has streamed into that hidden temporary table is control resumed by the EXEC context, and only then does the data start moving into its final home.

    The natural question you might be asking yourself at this point is, just how much overhead does this Parameter Table introduce into the equation? The answer might just send you racing to check your stored procedures: the additional cost for the Parameter Table is well over 100% as compared with doing the insert inside of the EXEC. The fact that all of the data must be spooled to the Parameter Table before the insert can even begin tells us that this must be true, and it can be verified using a simple check against the sys.dm_exec_requests DMV, as in the following example:

    EXEC
    ('
        INSERT dbo.MyTable
        SELECT *
        FROM AdventureWorks.Sales.SalesOrderDetail
    ')

    SELECT
        r.writes AS [INSERT inside EXEC writes]
    FROM sys.dm_exec_requests r
    WHERE
        r.session_id = @@SPID
    GO

    INSERT dbo.MyTable
    EXEC
    ('
        SELECT *
        FROM AdventureWorks.Sales.SalesOrderDetail
    ')

    SELECT
        r.writes AS [INSERT EXEC writes]
    FROM sys.dm_exec_requests r
    WHERE
        r.session_id = @@SPID
    GO


    So that's that. We should avoid INSERT EXEC and try to do our inserts in the same context in which the SELECT is running--right?

    Well, yes and no. There is another element at play here which I haven't yet mentioned. What if we were only inserting a few rows, and the table we were inserting into looked something like the following:

    CREATE TABLE #MyTable
    (
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] [money] NOT NULL,
        [rowguid] [uniqueidentifier]  NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
    )
    GO


    The only difference between this table and the previous one is that this is a temporary table and the other is not. But temporary tables have their own interesting little twists, especially when it comes down to one of the key enemies in a highly-transactional system: recompilation. As it turns out, doing the insert inside the EXEC will cause the internal statement to recompile every time a new temp table is encountered. This means that if you have a stored procedure that creates a temp table, puts together a bit of dynamic SQL, and does an insert inside of that dynamic SQL, you'll now have yourself a recompilation problem.

    To see this illustrated, try the following script. Here only a single row is inserted as a result of the dynamic query, but it's complex enough that the compile time more than overshadows the overhead of the Parameter Table:

    USE tempdb
    GO

    DBCC FREEPROCCACHE
    GO

    CREATE TABLE #AvgTimes
    (
        CPU_time DECIMAL(19,4) NOT NULL,
        insert_type VARCHAR(25) NOT NULL
    )
    GO

    CREATE TABLE #MyTable
    (
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] [money] NOT NULL,
        [rowguid] [uniqueidentifier]  NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
    )

    EXEC
    ('
        INSERT #MyTable
        SELECT TOP(1)
            sod.*
        FROM AdventureWorks.Sales.SalesOrderDetail sod
        WHERE
            sod.UnitPrice > 10
            AND sod.LineTotal > 100
            AND EXISTS
            (
                SELECT *
                FROM AdventureWorks.Sales.SalesOrderHeader soh
                JOIN AdventureWorks.Person.Contact c ON
                    c.ContactID = soh.CustomerID
                WHERE
                    soh.SalesOrderID = sod.SalesOrderID
                    AND c.LastName LIKE ''Ab%''
            )
            AND EXISTS
            (
                SELECT *
                FROM AdventureWorks.Production.Product p
                WHERE
                    p.ProductID = sod.ProductID
                    AND p.Color IS NULL
            )
            AND NOT EXISTS
            (
                SELECT *
                FROM AdventureWorks.Sales.SalesOrderHeader soh
                JOIN AdventureWorks.Sales.SalesPerson sp ON
                    soh.SalesPersonID = sp.SalesPersonID
                WHERE
                    soh.SalesOrderID = sod.SalesOrderID
                    AND sp.CommissionPct > 50
            )
    ')

    INSERT #AvgTimes
    SELECT
        r.cpu_time,
        'INSERT inside EXEC'
    FROM sys.dm_exec_requests r
    WHERE
        r.session_id = @@SPID

    DROP TABLE #MyTable
    GO 5

    CREATE TABLE #MyTable
    (
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] [money] NOT NULL,
        [rowguid] [uniqueidentifier]  NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
    )

    INSERT #MyTable
    EXEC
    ('
        SELECT TOP(1)
            sod.*
        FROM AdventureWorks.Sales.SalesOrderDetail sod
        WHERE
            sod.UnitPrice > 10
            AND sod.LineTotal > 100
            AND EXISTS
            (
                SELECT *
                FROM AdventureWorks.Sales.SalesOrderHeader soh
                JOIN AdventureWorks.Person.Contact c ON
                    c.ContactID = soh.CustomerID
                WHERE
                    soh.SalesOrderID = sod.SalesOrderID
                    AND c.LastName LIKE ''Ab%''
            )
            AND EXISTS
            (
                SELECT *
                FROM AdventureWorks.Production.Product p
                WHERE
                    p.ProductID = sod.ProductID
                    AND p.Color IS NULL
            )
            AND NOT EXISTS
            (
                SELECT *
                FROM AdventureWorks.Sales.SalesOrderHeader soh
                JOIN AdventureWorks.Sales.SalesPerson sp ON
                    soh.SalesPersonID = sp.SalesPersonID
                WHERE
                    soh.SalesOrderID = sod.SalesOrderID
                    AND sp.CommissionPct > 50
            )
    ')

    INSERT #AvgTimes
    SELECT
        r.cpu_time,
        'INSERT EXEC'
    FROM sys.dm_exec_requests r
    WHERE
        r.session_id = @@SPID

    DROP TABLE #MyTable
    GO 5

    SELECT
        AVG
        (
            CASE a.insert_type
                WHEN 'INSERT inside EXEC' THEN a.CPU_time
                ELSE NULL
            END
        ) AS [CPU time - INSERT inside EXEC],
        AVG
        (
            CASE a.insert_type
                WHEN 'INSERT EXEC' THEN a.CPU_time
                ELSE NULL
            END
        ) AS [CPU time - INSERT EXEC]
    FROM #AvgTimes a
    GO

    DROP TABLE #AvgTimes
    GO


    So what have we learned today? The choice between INSERT EXEC and INSERT inside of EXEC is not purely stylistic and has definite performance implications. Here are the rules I'll be following from now on:

    • When working with permanent tables, always avoid INSERT EXEC if possible. There are some cases where it won't be possible to avoid. I only showed dynamic SQL in this post, but INSERT EXEC also applies to stored procedures. Can you safely rip apart all of the stored procedures in your system in order to avoid this issue? Maybe not quite as easily as you can rip apart the dynamic SQL within stored procedures.
    • When working with temporary tables, evaluate the complexity of the operations, the amount of data to be inserted, and most importantly, test every solution! The more rows that are inserted as a result of the INSERT EXEC, the more the overhead you'll get from the Parameter Table. On the flip side, the more complex the dynamic statement, the more overhead you'll get from recompilations. Every scenario is bound to be different and you may just learn something about your processes by doing this extra bit of analysis.
    Enjoy!

  • SQL PFE Team: New Must-Read Blog?

    Several months ago I asked Joe Sack why his team, the SQL Server Premier Field Engineers, wasn't blogging and sharing the insights that result from their work with a number of high-end SQL Server customers. He replied that he thought it was a great idea, and then I heard nothing until last week, when I was happy to receive an e-mail from Joe saying that the blog had finally been created.

    The team has already put up a few interesting posts and I'm sure there are a lot more to come, so check it out! I'll be adding this blog to the SQLblog External Feeds Roller shortly, if you would rather grab the content from there.


  • Demos From Cape Cod .NET: T-SQL Power! Presentation

    On Tuesday night I was honored to present a new talk to the Cape Cod .NET Users Group that meets in Plymouth, MA. The talk is called "T-SQL Power! Learning to Harness the Under-Used OVER Clause". Following is the abstract for the talk:

    First introduced in SQL Server 2005, the OVER clause is an ANSI SQL enhancement that gives you tremendous control when dealing with aggregations. By using the OVER clause, query writers can simultaneously aggregate columns based on multiple groups. The feature also enables the query engine to provide windowing mechanisms for ranking and row numbering. Leveraging these powerful language enhancements allows you to solve a surprisingly large number of difficult query problems—including custom paging schemes, data de-duplication, "top-N" problems, and complex statistical calculations. Even better, this feature can be creatively applied to help with performance optimization of certain tough queries. In this session, you will learn all of these techniques and see why, after applying the OVER clause in dozens of projects since the release of SQL Server 2005, I consider it to be one of the most powerful T-SQL features available.

    I always enjoy trying out a new talk for the first time, especially when the talk is on a topic that I really enjoy. Based on some of the feedback and reactions I got, I have a few tweaks to make, but I think the talk is on the right track. My next scheduled delivery of this talk is in November, at the SQL Server Connections conference. Should be a great time!

    As promised on Tuesday night, demos are attached to this blog post. Enjoy!


  • Grouped String Concatenation: ... The Winner Is ...

    After weeks of putting it off, I finally found the time and spent the last day and a half judging the Grouped String Concatenation Challenge. I would like to congratulate the winner, Peter Larsson, who submitted a great query and walks away with a shiny new MSDN Premium subscription.

    For those who are interested, following is a breakdown of the judging process, along with some commentary:

     

    Submission Process

    To begin with, e-mails. As I mentioned in the first post, I ignored all e-mails that didn't follow the directions. Luckily this was only a few submissions. I felt it rather odd that people would spend a not insignificant amount of time working up a solution only to not bother to read the guidelines thoroughly. But that's human nature, I suppose.

     

    Round 1

    Once I collected all of the queries that followed the e-mail rules (all of which are included in the attached ZIP file), I began testing against an expanded version of AdventureWorks (the script for that is also included). I decided to eliminate any queries that did not produce the correct output data based on my sample set, or which took longer than 30 seconds to complete. The majority of queries did complete in a reasonable amount of time, and many were eliminated because the output simply wasn't correct. The biggest issue was ordering of the elements in the comma-delimited sets. I also deducted points from one person's entry because of invalid column names, but I decided to let the entry ride to the next round.

    An important side note is that I created this competition with the sole intention of discovering new and different ways to do grouped string concatenation, and my hope was that someone would come up with a clever, fast solution. Unfortunately, that didn't happen, and every submission that used any technique except FOR XML PATH was eliminated in the first round of testing. I received some extremely creative solutions from a couple of people and I would like to mention them here:

    • Alejandro Mesa's submissions made use of various XQuery techniques, and are very interesting to look at, although fairly slow
    • Dean Cochrane's submission used an interesting idea of doing a MAX(CASE ...) pivot for the lists. Alas, the product names lists were not correct, so the submission didn't make it to the stress testing phase
    • Scott Coleman tried a similar technique, actually using the PIVOT keyword. Unfortunately, this ran for over 200 seconds, so it was eliminated

    Also interesting to note is that a few people tried recursive CTE solutions. These were all cancelled at the 300 second mark. Recursive CTEs, as mentioned before on here on SQLblog, simply do not scale in their current implementation.

     

    Round 2

    After tabulating the Round 1 results I was left with 18 queries, and some obvious contenders.  I ran each query through a SQLQueryStress session with 10 threads running 5 iterations each. In this phase the queries were separated into fairly distinct groups: Those that ran for around 5 minutes, those that ran for around 7-8 minutes, and those that ran longer. These groups were based, not surprisingly, on how much attention was paid by the query writers to the little details. For example, Peter Larsson's winning query cut down on logical reads dramatically by doing some of the grouping in a derived table, rather than in the outermost query as some of the other submissions did.

    Lesson learned: When doing aggregations, especially when joining a lot of tables, think about what you're really aggregating, and do the aggregation as early as possible. For example, if you need to aggregate sales per customer and get customer names, do the aggregation of the sales numbers first, then join out to get the customer names. Otherwise the query processor is forced to do more work than it has to do, and your query won't be as fast. Peter and a few other contestants understood this distinction and wrote queries that were much faster as a result.

     

    Round 3

    Round 2 eliminated 4 queries, leaving me with 14 to judge based on query style. In order to judge consistently, I came up with 10 factors. A query was allotted 500 points to start, and failure to meet each factor resulted in a 50 point penalty. These factors were:

    • Consistent Indentation
      • Does the query use the same rules for indentation in all parts? This is huge for readability and helps people understand where each section of the query starts and ends.
    • Consistent Capitalization
      • Does the query use the same rules for capitalization throughout? For example, keywords should be either all capitalized, or all lowercase.
    • Capitalize Keywords
      • I like to see keywords capitalized.
    • Use AS for Alias Names
      • AS is optional, and I've left it out in many queries I've written. But the more of other peoples' code I read, the more I realize that it really does help on the readability front. Use it. Always.
    • Follow Capitalization of Base Tables/Columns
      • If the base table is called OrderHeader, I want to see it used as OrderHeader when referenced in your query, rather than orderheader. A trainer I know found this out the hard way, when he reinstalled SQL Server on his laptop shortly before a training session, and used a case-sensitive collation rather than his previously-installed case-insensitive collation. He had been careless in adhering to capitalization for his training materials, and discovered the issue in front of the class. Oops.
    • Long Horizontal Lists
      • I don't like horizontal scrolling, and I find long lists difficult to read.
    • Consistent Vertical Lists
      • Put either a comma after each element or before each element, not both. Indent your lists the same way throughout. If you indent some items below the SELECT, don't put other items on the same line as the SELECT (or GROUP BY, or ORDER BY, etc)
    • Alignment of Delimiters
      • I follow a .NET-inspired style where I put delimiters on their own lines, and line them up vertically. This gives my code what I feel is an airy, easy-to-read feel. When reading others' code I look for some kind of alignment. Failure to align delimiters makes it very difficult to understand, again, where one section begins and another ends. By the way, common delimiters for this challenge included both parens and CASE...END.
    • Comments
      • Does the query have comments? Are the comments useful in understanding the logic?
    • Aesthetics
      • This is perhaps the most subjective. My general feeling on how I enjoyed reading the code.

    All in all, the queries were pretty good. I would like to call out Rick Halliday, who had the highest score in this round with some very well formatted and highly readable code.

    Round 4

    After judging Round 3 I tallied all of the scores and was left with a tie for top 3:

    • Rick Halliday
    • Leonid Koyfman
    • Peter Larsson's query #4

    All three of these queries were well thought out, but only one could win, so I took another pass through each. Rick's query, though extremely well written and readable, was eliminetad first due to the fact that it performed worse than the other two. This left Leonid and Peter. It was a tough choice, but I had to give the prize to Peter for taking the time to really think through the problem and figure out exactly how best to do the aggregations. Leonid was a very, very close second, and I really wish I had a consolation prize for him.


    The End

    And that's that. Thank you to everyone who participated in the challenge. I hope it was as much a learning experience for you as it was for me. Congratulations again to Peter. All of the materials are attached in the ZIP file; please let me know if you have any questions, comments, etc.

     


  • Peter DeBetta and I Rock .NET

    ... or something like that.

    We're featured on this week's .NET Rocks! show, discussing some of the intricacies of database design with Richard and Carl. Enjoy!


  • Demos - TechEd DAT402 - Microsoft SQL Server 2008: Performance Profiling and Troubleshooting with Extended Events

    Whew! Another TechEd session completed, and now I'm done for the week. Thanks to everyone who joined me today; I had a fantastic time talking about the ins and outs of Extended Events.  Demos are attached.  I realize how complex some of these topics are, so please feel free to drop me a note in the comments here with any questions or concerns.  Also stay tuned--I'm continuing to work with and improve these samples and will be publishing updates from time to time in the coming months.


  • Demos - TechEd DAT305 - Best Practices for Exception Handling and Defensive Programming in Microsoft SQL Server

    A huge thank you to everyone who attended today's session on exception handling! I had a great time and got some very good, on-point questions from the audience.  As promised, the demos are attached to this blog post.  Please let me know in the comments if you have any questions (whether or not you attended--everyone is allowed to join the party).


  • I Swallowed the Twitter-Colored Pill

    I'm not entirely certain what color that is, but follow me here (if you dare).

    In keeping with the Twitter mentality I'll keep this post ultra-short.


  • SQLCLR String Splitting Part 2: Even Faster, Even More Scalable

    Two days ago, after posting what I thought was a pretty solid SQLCLR string splitting method, I received a comment telling me about a big thread on SQLServerCentral dedicated to the question of how best to split strings. So I jumped in, and went back and forth, and back and forth, and back... and forth...

    Many, many messages and several revisions and re-revisions later, I am happy to present the newer, better, more bug-free SQLCLR string splitting function. Not only is it faster than the previous one I posted, but it also handles multi-character delimiters, ignores consecutive delimiters as well as pre and post delimiters, and best of all, it won't get itself stuck in an infinite loop in certain cases (always a really great stability feature).

    This version walks the SqlChars character array rather than using the IndexOf method on the string, which we found to be a somewhat faster technique--and it's certainly the most scalable and memory efficient method I can imagine.

    Special thanks to SQLServerCentral member Florian Reischl, who was the main person keeping the thread going with me during the last couple of days, and who managed to re-write my versions and eke out even better performance by modifying the algorithms.  Great stuff--it was definitely the most fun I've had on a technical forum in quite a long time.

    But without further ado, the code:

    using System;
    using System.Collections;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(
           FillRowMethodName = "FillRow_Multi",
           TableDefinition = "item nvarchar(4000)"
           )
        ]
        public static IEnumerator SplitString_Multi(
          [SqlFacet(MaxSize = -1)]
          SqlChars Input,
          [SqlFacet(MaxSize = 255)]
          SqlChars Delimiter
           )
        {
            return (
                (Input.IsNull || Delimiter.IsNull) ?
                new SplitStringMulti(new char[0], new char[0]) :
                new SplitStringMulti(Input.Value, Delimiter.Value));
        }

        public static void FillRow_Multi(object obj, out SqlString item)
        {
            item = new SqlString((string)obj);
        }

        public class SplitStringMulti : IEnumerator
        {
            public SplitStringMulti(char[] TheString, char[] Delimiter)
            {
                theString = TheString;
                stringLen = TheString.Length;
                delimiter = Delimiter;
                delimiterLen = (byte)(Delimiter.Length);
                isSingleCharDelim = (delimiterLen == 1);

                lastPos = 0;
                nextPos = delimiterLen * -1;
            }

            #region IEnumerator Members

            public object Current
            {
                get
                {
                    return new string(theString, lastPos, nextPos - lastPos);
                }
            }

            public bool MoveNext()
            {
                if (nextPos >= stringLen)
                    return false;
                else
                {
                    lastPos = nextPos + delimiterLen;

                    for (int i = lastPos; i < stringLen; i++)
                    {
                        bool matches = true;

                        //Optimize for single-character delimiters
                        if (isSingleCharDelim)
                        {
                            if (theString[i] != delimiter[0])
                                matches = false;
                        }
                        else
                        {
                            for (byte j = 0; j < delimiterLen; j++)
                            {
                                if (((i + j) >= stringLen) || (theString[i + j] != delimiter[j]))
                                {
                                    matches = false;
                                    break;
                                }
                            }
                        }

                        if (matches)
                        {
                            nextPos = i;

                            //Deal with consecutive delimiters
                            if ((nextPos - lastPos) > 0)
                                return true;
                            else
                            {
                                i += (delimiterLen-1);
                                lastPos += delimiterLen;
                            }
                        }
                    }

                    lastPos = nextPos + delimiterLen;
                    nextPos = stringLen;

                    if ((nextPos - lastPos) > 0)
                        return true;
                    else
                        return false;
                }
            }

            public void Reset()
            {
                lastPos = 0;
                nextPos = delimiterLen * -1;
            }

            #endregion

            private int lastPos;
            private int nextPos;

            private readonly char[] theString;
            private readonly char[] delimiter;
            private readonly int stringLen;
            private readonly byte delimiterLen;
            private readonly bool isSingleCharDelim;
        }
    };

    Enjoy!


  • Faster, More Scalable SQLCLR String Splitting

    It seems like every couple of months we see yet another post on SQLCLR string splitting routines. Many bloggers, I suppose, are still struggling, even three years later, to find that "perfect" use case for SQLCLR. Is string splitting it? Probably not. And with SQL Server 2008 table-valued parameters now available, SQLCLR string splitting has become an even less interesting exercise than it was before. None the less, a recent post on the Less Than Dot site has inspired me to counter with my own best SQLCLR string splitting method, for those of you who are still interested in solving this problem.

    I've noticed that almost invariably, the methods posted online stress how very easy it is to do string splitting in .NET, thanks to the String.Split method. And while this easy method tends to work pretty well for small strings and on workloads that don't need to scale, it quickly breaks down when any amount of load is introduced (something that, unfortunately, most writers don't bother considering). The Less Than Dot writer, "onpnt" did do some testing, and discovered that--surprise, surprise--String.Split isn't all that great.

    The issue? Well, it all comes down to large memory allocations and the art of scalable .NET programming--an area about which many SQL Server developers can (and do) remain blissfully naïve. In .NET, reduction of memory utilization--especially large allocation done en masse--is king, and String.Split does exactly the wrong thing. It takes the input string, breaks it into N chunks, and allocates all of the memory needed to store those chunks and pointers to those chunks, in one big huge operation. This can't possibly scale, and indeed it doesn't.  I did a quick SQLQueryStress test of a TVF based on String.Split and saw fairly good performance when the input sentences were small (in the 40-400 byte range--see below), but after a certain point the AppDomains began recycling and performance became abysmal. Protections put in place for stability of the CLR host include memory leak detection, and this kicks in quite readily when we force allocation of so much memory in one shot--a good thing for the SQL Server instance as a whole, but not great when we're trying to really split a huge string.

    Students of .NET who are concerned with scale (and really, everyone should be) are urged to look at the way problems are handled in LINQ. Here the vast majority of requests are internally handled using streaming iterator patterns, rather than moving around huge chunks of memory. This turns out to a much more scalable option for several reasons: Lower in-flight memory utilization, fewer large object heap allocations, and better access by the garbage collector to collect intermediate data that is no longer needed.

    So how can we apply streaming to the string splitting problem? Rather than break the string up into all of its component parts upfront, we can walk the string step-by-step, only finding the next piece as required. In order to handle this, I created the following worker class:

        public class splitIt : IEnumerator
        {
            public splitIt(string theString, char delimiter)
            {
                this.theString = theString;
                this.delimiter = delimiter;
                this.lastPos = -1;
                this.nextPos = -1;
            }

            #region IEnumerator Members

            public object Current
            {
                get { return theString.Substring(lastPos, nextPos - lastPos).Trim(); }
            }

            public bool MoveNext()
            {
                if (nextPos >= theString.Length)
                    return false;
                else
                {
                    lastPos = nextPos + 1;

                    if (lastPos == theString.Length)
                        return false;

                    nextPos = theString.IndexOf(delimiter, lastPos);
                    if (nextPos == -1)
                        nextPos = theString.Length;

                    return true;
                }
            }

            public void Reset()
            {
                this.lastPos = -1;
                this.nextPos = -1;
            }

            #endregion

            private int lastPos;
            private int nextPos;
            private string theString;
            private char delimiter;
        }


    This class is a simple enumerator implementation that looks for the next delimiter on each iteration, only when requested. Splitting strings in this way, rather than using String.Split, means that no huge upfront allocation takes place. Aside from the sentence itself, only one "chunk" is in play at any given time, and any chunks that have already been used can be garbage collected as needed when memory is tight.

    Wiring this class up in a SQLCR TVF is just about as simple as when using String.Split:

        [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillIt", TableDefinition = "output nvarchar(4000)")]
        public static IEnumerator faster_split
            (
                SqlChars instr,
                [SqlFacet(IsFixedLength=true, MaxSize=1)]
                SqlString delimiter
            )
        {
            return (
                (instr.IsNull || delimiter.IsNull) ?
                new splitIt("", ',') :
                new splitIt(instr.ToSqlString().Value, Convert.ToChar(delimiter.Value)));
        }

        public static void FillIt(object obj, out SqlString output)
        {
            output = (new SqlString((string)obj));
        }

    I've enhanced this example slightly compared with most of the usual suspects: A SqlFacet attribute is used to make sure that the delimiter is only a single character, and I've added a bit of additional code in the main method to deal with NULL inputs.

    The scalability difference between this method and the String.Split method is staggering in the simple tests I ran today on my SQL Server 2008 test server.  With small sentences, even under moderate load (100 concurrent threads), each method performs more or less equivalently.  But as sentence size increases to 50KB, the String.Split method begins slowing, taking almost 2 seconds per iteration, and the occasional AppDomain recycle is seen in the SQL Server log.  The streaming method, on the other hand, continues to complete its job in just over 1/10th of a second, and causes no AppDomain recycles. Increasing to 500KB sentences, String.Split causes numerous AppDomain recycles and time per iteration increases to over 30 seconds, while the streaming method averages just 16 seconds per iteration. Jumping to 5MB sentences, String.Split causes almost continuous AppDomain recycles, and each iteration takes almost 6 minutes to complete. Yet with the streaming method, even with sentences of this size I am still unable to cause an AppDomain recycle to occur, and iterations complete in around 55 seconds.

    The test I did was quite simple, and I won't post too many details here as I prefer that you test with your own workloads and draw your own conclusion about how this method fares when compared with T-SQL versions or the naïve String.Split method. I hope that if you do test you'll post back here with your results so that we can all learn the best way to handle these problems--whether or not string splitting really is all that interesting in the post-SQL Server 2005 world.


  • Do You Change the Results to Grid Font in SSMS?

    Update: Please respond here instead.  Thanks for the great idea, Alex!

     

    Please respond and let me know:

    A) Whether you change it at all

    and

    B) If so, whether you use a fixed width font (the default font is non-fixed width)

    Background, in case you're curious: I modified a newer version of Who is Active?, based on some feedback I received from Aaron Bertrand asking for right-padding of the numbers in the result.  I did this modification based on my assumption that most people won't change the default grid font, and even if they do change it they'll still use a non-fixed width font.  I tested my modification on my end and sent it over to Aaron to see what he thought, and low and behold he has changed his own SSMS settings to use a fixed width font.  Oops!  I would be very interested in finding out how many of you do the same.  If the vast majority of people do indeed leave the font unchanged, or at least in the same font family, perhaps I'll leave the right-padding as-is (I happen to like it, personally -- sorry, Aaron!).  If I'm mistaken in my assumption, I'll rip it out.

    Thanks for letting me know, either way.


  • TechEd Coming Up - Recommended Background for My Extended Events Talk

    TechEd North America 2009 is just a few short weeks away, and I'm really looking forward not only to the show, but also the opportunity to visit Los Angeles for the first time in several years.  I'm busy putting the finishing touches on my Extended Events talk, DAT402, "Microsoft SQL Server 2008: Performance Profiling and Troubleshooting with Extended Events", and I thought I should write a quick note to anyone out there who is thinking of attending. 

    What's noteworthy about this talk is that the TechEd team modified the title a bit and removed the word "Advanced", but left the talk at 400-level and also left the original abstract untouched:

    Imagine tracking and aggregating wait statistics at the session level rather than the server level. Imagine seeing exactly how long each step in your query took and figuring out the real cost of that index scan. Imagine capturing an exception with an associated callstack--no more guessing about exactly what component failed and why. When you're done imagining, open your eyes and attend this session to learn all of these techniques and more, all thanks to Extended Events (XEvents)--the powerful new tracing infrastructure in SQL Server 2008. Designed for DBAs and developers who already understand the basics of XEvents, this session goes from 0 to 60 in the first few minutes. See a number of code examples and gain an understanding of how to maximize XEvents for performance profiling and troubleshooting purposes. If you're serious about making your SQL Servers fly, this is one session not to be missed.

    When I gave this same talk earlier this year at SQL Server Connections in Orlando, some attendees apparently didn't understand that it really is an advanced talk, and showed up without quite enough background to follow along as well as they could have.  I'm hoping to minimize that phenomenon at TechEd; so please keep in mind that although I do a quick (five-minute) refresher, that's it, and Extended Events is a much bigger topic than can be fully introduced in five minutes.  Once the refresher part is over I jump into four fairly intense, demo-heavy deep-dive sections, so please, if you want to get your money's worth, make sure that you read one or more of the following in the next few weeks:

    Reading through these sources will give you the background so that you can focus on the performance troubleshooting aspects of the talk, rather than sitting there bogged down in the unfamiliar Extended Events syntax and terminology.

    I'm looking forward to meeting some SQLblog readers at the event; if you're attending, even if you decide to skip my Extended Events talk, stop by the Database Practices TLC and say hi.  Looking forward to seeing you in LA!


    Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! |