
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

  • Cardinality Estimation for Disjunctive Predicates in SQL Server 2014

    Introduction

    Back in January 2014, I wrote an article for SQLperformance.com describing the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators. The article describes the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags.

    I was able to describe the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected by AND (conjunctive predicates), which was already relatively well-known. Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by OR. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.

I gave up on it until a few weeks ago, when I was invited to contribute to a technical review of a White Paper Joe Sack was writing with assistance from Microsoft. One of the small contributions I was able to make was to point out that exponential backoff was used for disjunctions as well as conjunctions. The paper has now been released, so I can write about the detail behind the statement that appears there concerning cardinality estimation for disjunctive predicates (thank you Joe for pointing this out to me):

[Image: excerpt from the White Paper describing cardinality estimation for disjunctive predicates]

    A Bit Of Boolean

    The key bit of new information is the second sentence. As soon as I read it, a vague memory from my patchy education came back to me, as well as a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:

[Image: Wikipedia’s summary of DeMorgan’s Laws]

    This gives us a way to turn a disjunction into a conjunction. It is a short step from there to understand that exponential backoff for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws. Converting the troublesome OR to an AND (with the appropriate negations) allows us to reuse the AND exponential backoff calculation for OR queries!

    The Theory

    Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:

[Image: DeMorgan’s Laws in formal notation]

    The interesting transformation from disjunction to conjunction can be expressed as:

    not (A or B) is the same as (not A) and (not B)

    This does not quite give us a direct translation from OR to AND. We need an extra step of observing:

    (A or B) is the same as not (not (A or B))

    This trivial double-negative means we can now directly substitute the not (A or B) above, to get:

    (A or B) is the same as not ((not A) and (not B))

    This gives us OR expressed in terms of NOT and AND.

    The very last bit of theory is to remember that A and B are probabilities between 0 and 1, so:

    (not X) = (1 – X)

    Worked Example

    Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014:

    SELECT 
        COUNT_BIG(*) AS NumRows
    FROM Production.TransactionHistory AS TH
    WHERE 
        TH.TransactionID BETWEEN 100000 AND 168412
        OR TH.TransactionDate BETWEEN '20070901' AND '20080313';

    There are two predicates in this query, connected by OR.

    The selectivity of the predicate on Transaction ID can be derived directly from the statistics histogram:

    DBCC SHOW_STATISTICS
    (
        'Production.TransactionHistory', 
        'PK_TransactionHistory_TransactionID'
    );

[Image: DBCC SHOW_STATISTICS histogram output]

    This very compact histogram gives us all the information we need to say that the range of 68,413 Transaction ID values can be expected to produce 68,413 rows.

    Expressed as a fraction of the 113,443 rows in the table, the selectivity of this predicate is 68413 / 113443 = 0.603061. We will call this value selectivity A.

    The second predicate (on TransactionDate) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here). The histogram calculation results in a cardinality estimate for this predicate that is also 68,413 (because I deliberately chose the predicates to qualify exactly the same physical rows). The selectivity is likewise 0.603061.  We will call this value selectivity B.

    Enter DeMorgan

    We know from our work with DeMorgan that:

    (A or B) is the same as not ((not A) and (not B))

    So we can rewrite our (A or B) predicate immediately as not ((not A) and (not B)), enabling us to avoid worrying about the OR.

    Note this transform is done purely for cardinality estimation purposes. The optimizer tree is not affected by this work.

    Doing the Math

    Now, given that selectivities A and B are both equal to 0.603061 in this contrived example, our expanded expression becomes:

    not ((not 0.603061) and (not 0.603061))

    We also know:

    (not X) = (1 – X)

    Expanding the inner nots means the computation becomes:

    not ((1 - 0.603061) and (1 - 0.603061))

    We can perform the numeric math inside the parentheses to get:

    not (0.396939 and 0.396939)

    Now we use exponential backoff to compute the and selectivity:

    not (0.396939 * SQRT(0.396939))

    And replace the outer not just as we did before:

    1 - (0.396939 * SQRT(0.396939))

    Now we have just numbers. The final selectivity is 0.749916.

    The Result

    The very final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:

0.749916 * 113443 = 85,072.72, which rounds up to 85,073
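
If you would like to check the arithmetic, the whole calculation is easy to reproduce in T-SQL. This is just a sanity check of the formula derived above (the variable names are mine); it is not anything the optimizer itself executes:

DECLARE 
    @A float = 0.603061,        -- selectivity of the TransactionID predicate
    @B float = 0.603061,        -- selectivity of the TransactionDate predicate
    @TableRows float = 113443;  -- table cardinality
 
-- not ((not A) and (not B)), with exponential backoff for the AND
DECLARE @Selectivity float =
    1e0 - ((1e0 - @A) * SQRT(1e0 - @B));
 
SELECT
    Selectivity = @Selectivity,                     -- ~0.7499
    Estimate = CEILING(@Selectivity * @TableRows);  -- 85073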

    The 2014 execution plan for this query, when using the new cardinality estimator, is:

[Image: SQL Server 2014 execution plan showing an estimate of 85,073 rows]

    The cardinality estimate is 85,073 rows, just as we predicted!

    Sadly, the actual number of rows returned by this query is 68,413 rows (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.

     

    © 2014 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi (note the underscore!)

  • Nested Loops Prefetching

Nested loops join query plans can be a lot more interesting (and complicated!) than is commonly realized.

    One query plan area I get asked about a lot is prefetching. It is not documented in full detail anywhere, so this seems like a good topic to address in a blog post. The examples used in this article are based on questions asked by Adam Machanic.

    Test Query

    The following query uses the AdventureWorks sample database (as usual). Run it a couple of times if necessary so you see the query plan when all the data needed by the query is in memory (the “warm cache” case):

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = P.ProductID
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY 
        P.Name, 
        TH.TransactionID;

    The post-execution (“actual”) plan is shown below:

    Query Plan

    Plan Execution

    Some of you will already know how query plans are executed in detail, but it will be very important in a moment so please take the time to read this next bit, even if it is just as a refresher:

    Query execution is an iterative (cursor-like) process. It starts at the far left (the green SELECT icon) and control passes first to the Top iterator, which requests a single row from the Nested Loops join. The join reads one row from its outer input (using the Index Seek on the Product table) and then looks for the first match on its inner input (the Index Seek on Transaction History).

    The single-row result is passed back to the Top operator, and ultimately queued for transmission to the client in a TDS packet. The Top operator then asks the join for another row. The join looks for the next match on its inner input (it does not access the outer input this time). If a row is found, it is passed back to the Top (and on to the client) and so on.

    At some point, the join will run out of matches from the inner input (for the current value of Product ID) so it switches back to its outer input to get the next Product ID from the Product table. The new Product ID is used to reinitialize (rebind) the inner input, which returns the first match from the Transaction History table index seek using the new Product ID.

Eventually, the whole iterative process results in the 1,000th row passing through the Top operator, and the process comes to an end (because the query specifies TOP 1000). There is also the possibility that the query tree below (i.e. to the right of) the Top operator produces fewer than 1,000 rows in total. Execution would also come to an end in that case, but that is not what happens here.

    The execution plan does not show all this detail (and in particular it aggregates over all the inner-side executions) but it does provide a summary of what happened:

    • A total of four rows were read (at different times) from the Product table
    • The Transaction History table seek was executed four times (with a different Product ID each time)
    • A total of 1,000 rows were returned to the client

The particular detail I want to draw your attention to is highlighted in the query plan above: the inner side index seek is executed four times, once for each row produced by the outer input to the join. Given our current understanding of query execution, the number of rows produced by the nested loops join outer input should always match the number of executions reported for the inner input. Right? Wrong.

    Cold Cache Test

    The next script runs exactly the same query as before, but with a cold cache:

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
    GO
    SELECT TOP (1000)
        P.Name,
        TH.TransactionID
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = P.ProductID
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY 
        P.Name, 
        TH.TransactionID;

    Query Plan 2

    Well that’s odd: 35 rows are read from the outer (Product) input but the inner side of the join is only executed 4 times. What happened to the extra 31 outer side rows? Where did they go?

    Nested Loops Prefetching

    The SQL Server query processor contains a number of optimizations for nested loops joins, primarily aimed at making I/O more efficient.

    I/O efficiency on the outer side of the join is not usually a problem. Regular (storage engine) read-ahead is enough to ensure that pages are brought into memory before they are needed. The same read-ahead mechanism also applies to inner side executions, but the number of rows returned for each outer row is often quite small, so the benefit can be limited or even non-existent.

Remember that the inner side of a nested loops join rebinds (restarts) for each new correlated value from the outer side. As a result, a common access pattern for a nested loops join is a number of small range scans on the inner side, each starting at a different point in the same index.

One of the available SQL Server optimizations is nested loops prefetching. The general idea is to issue asynchronous I/O for index pages that will be needed by the inner side – and not just for the current correlated join parameter value, but for future values too. To be clear: say the outer side Product table seek produces ProductID rows with values (1, 17, 60, 99). Nested loops prefetching will try to bring pages into memory for the inner side Transaction History table index (which has ProductID as its leading key) for the values (1, 17, 60, 99…) before the inner side starts working on ProductID 1. When this mechanism is effective, it can produce a significant performance boost for nested loops joins by overlapping prefetch I/O with ongoing nested loops execution.
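
As an aside, if you want to measure the effect of prefetching on your own system, the official trace flag list documents trace flag 8744 as disabling prefetching for the Nested Loops operator. Treat the following as an experimental sketch for a test server only (QUERYTRACEON requires sysadmin):

SELECT TOP (1000)
    P.Name,
    TH.TransactionID
FROM Production.Product AS P
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
WHERE
    P.Name LIKE N'[K-P]%'
ORDER BY 
    P.Name, 
    TH.TransactionID
-- TF 8744: disable nested loops prefetching (test systems only)
OPTION (QUERYTRACEON 8744);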

    The way this works is not really discussed anywhere in detail, so let us look at it now. It will also explain why our cold-cache test query plan shows 35 rows on the outer input but only 4 executions on the inner side.

    The Missing Node

    The execution plan shows a good deal of useful summary information, but it does not contain every important detail. One indication that there is something missing from the XML or graphical showplan information comes from looking at the node IDs of the iterators in the plan (if you are using SSMS, the node IDs are shown in the tooltip for each plan iterator):

[Image: plan iterator tooltips showing node IDs (click to enlarge)]

    Query plan iterators (nodes) are numbered (starting with zero, in geek tradition) from the far left. A discontinuity in the numbering is often a sign that something has been replaced, or is being hidden. In our test plan, nodes (0, 1, 3, 4) are shown, but node 2 is conspicuously missing. We can sometimes see extra information about query tree nodes using undocumented trace flag 7352 (output shown below, with node IDs in parentheses):

    Trace flag 7352 output

    The output reveals the mystery node 2, but the trace flag is unable to give a name to it. The output also shows that the mystery node uses an expression labelled [Expr1004] with a type of binary(24). This expression is shown in regular showplan output as an outer reference (correlated parameter) of the nested loops join:

    Outer references

    No definition is provided for this expression, and no other explicit reference is made to it anywhere else in the plan.

    The Real Execution Plan

    If showplan output included the missing node 2, it could look like this:

    Node 2 Plan

    I have labelled the new iterator, “Delayed Prefetch” and used the standard catch-all iterator icon for it because I am too lazy to draw something appropriate (I do not have the illustrative skills of Kendra Little or Michael Swart).

    Anyway, the execution engine class that implements this iterator is QScanRangePrefetchDelayNew. It uses the correlated join parameter values (ProductID here) to issue asynchronous scatter/gather I/O for the inner side (Transaction History) index, as described earlier. We have to attach the debugger to see this hidden iterator in action:

    Prefetch Delay

That call stack shows delayed prefetch for the nested loops join resulting in an asynchronous scatter read call to the operating system (right at the top in this capture). The prefetch operator can issue multiple asynchronous reads based on outer-input values. It uses a small memory buffer to keep track, and to save index leaf page information in particular. A reference to this information was seen earlier as the binary(24) outer reference expression labelled Expr1004. My understanding is that the leaf page information is used on the inner side of the join as hints to the storage engine, to prevent duplicated effort by the normal read-ahead mechanisms. It may be used for other purposes too.

    The delayed prefetch iterator also checks for completion of previously-issued asynchronous I/Os:

    IO Completions

    …and completes SQL Server processing for any asynchronous I/Os that are found to have completed (the example below shows page checksums being validated):

    Page Checksum

    Details and Explanations

    The delayed prefetch iterator only reads extra rows from its input when it encounters a page that requires a physical I/O. Scatter reads can bring in multiple discontinuous pages in one I/O request, so extra outer side rows are read to make better use of the scatter read mechanism. If we are going to issue an I/O call, we might as well grab as much as we can.

If an inner-side index page checked by delayed prefetch is already present in the buffer pool, no physical I/O is issued (because there is nothing to do) but a logical read is counted (because prefetch touched the buffer pool page to check it was there). Control then returns to the nested loops join iterator to continue its normal processing. If we know the inner-side page is in memory, we might as well get on with processing it – there’s no need to read extra rows from the outer input.

    We can now fully explain the query plan differences we saw earlier.

    In the cold-cache test, the delayed prefetch iterator tests inner-side index pages, finds they are not in cache, so asynchronous scatter reads are issued. A total of 35 rows are read by the delayed prefetch iterator in this test. The 1,000 rows needed for the query are produced by the nested loop join from just 4 of these rows, so we only see 4 executions on the inner side. (Remember the row-at-a-time cursor-like execution of query plans).

In the warm-cache test, the delayed prefetch iterator does not encounter any pages that are not in memory, so no asynchronous I/Os are issued, and no extra rows are read. Prefetching just adds a few logical reads (of pages that were already in memory) to the reported cost of the query.

    Ordered and Unordered Prefetching

The prefetching mode is shown on the nested loops join iterator. If prefetching is used, either the WithUnorderedPrefetch or the WithOrderedPrefetch attribute will be set to true (SSMS users will need to click on the nested loops join and inspect the properties window to see these attributes). If prefetching is not used, neither attribute will be present.
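
If you would rather hunt for prefetching plans in bulk than click through them one at a time, the same attributes can be searched for in the plan cache. A rough sketch (the element and attribute names follow the public showplan XML schema):

WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    cp.usecounts,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE qp.query_plan.exist(
    '//NestedLoops[@WithOrderedPrefetch="true" or @WithUnorderedPrefetch="true"]') = 1;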

    An unordered prefetch allows the inner side of the join to proceed using data from whichever I/Os happened to complete first.

    An ordered prefetch preserves key order on the inner side of the nested loops join, so SQL Server ensures that asynchronous I/Os are processed in the correct order.

    Unordered prefetching may therefore be somewhat more efficient than ordered prefetching. That said, the optimizer may choose to use ordered prefetching where it can make use of the preserved-order guarantee to avoid a more expensive join type (e.g. hash join) or where it can avoid an explicit sort later in the execution plan.

Our test query features an ORDER BY clause; by using ordered prefetch, the optimizer can avoid an explicit sort:

    Plan Tree

    Second Example – Function Calls

    Regular readers will know that I am not a fan of scalar T-SQL functions, but I am going to use one here to illustrate another important aspect of prefetching. The function in question does nothing except return its input value (I have deliberately created this function without schema-binding so SQL Server will consider it to be non-deterministic):

    CREATE FUNCTION dbo.F
        (@ProductID integer)
    RETURNS integer
    AS
    BEGIN
        RETURN @ProductID;
    END;
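
For contrast, a schema-bound version would look like the sketch below (dbo.F2 is illustrative and is not used in the tests that follow). WITH SCHEMABINDING allows SQL Server to derive that the function is deterministic, which can change how it is treated:

CREATE FUNCTION dbo.F2
    (@ProductID integer)
RETURNS integer
WITH SCHEMABINDING  -- lets SQL Server derive determinism
AS
BEGIN
    RETURN @ProductID;
END;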

    The next script uses the function in our test query as part of the join predicate:

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = dbo.F(P.ProductID) -- Changed!
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY 
        P.Name, 
        TH.TransactionID;

    The execution plan (with a warm cache) shows 4 rows read from the outer input, and 4 executions of the inner side of the nested loops join:

    Warm cache plan

    The Seek Predicate for the inner-side seek now references the function. The function is not mentioned anywhere else in the query plan.

    How many times is the function executed?

    If you said four, you were wrong!

    The number of function invocations can be monitored using the Profiler event SP:StatementCompleted (despite the name and documentation suggesting it is only for stored procedures). Profiler shows the function was called eight times:

    Profiler output

    Four calls correspond to the four executions of the inner-side index seek. The other four calls are caused by the hidden delayed prefetch iterator. Remember this is the warm cache case, where no prefetching is necessary.

    The delayed prefetch iterator still tries to check if the index page is in the buffer pool. In order to find the inner-side index page for the current value of ProductID, the prefetch iterator must execute the function. It needs the result of the function to seek into the inner-side index.

    The debugger call stack below shows the UDF being called from the delayed prefetch iterator:

    UDF call

    Cold Cache

    When pages need to be prefetched, the effect is even more dramatic:

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
     
    SELECT TOP (1000)
        P.Name,
        TH.TransactionID
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = dbo.F(P.ProductID) -- Changed!
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY 
        P.Name, 
        TH.TransactionID;

    The outer input shows 35 rows (as we now expect). There are (as usual) only 4 executions of the inner side:

    Query plan

    Profiler shows the function is executed 39 times – 4 times by the inner side seek, and 35 times by prefetch!

    Profiler output 2
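
If you prefer Extended Events to Profiler, an event session using sp_statement_completed should capture the same statement-level activity. A sketch (the session name and target are my choices; adjust the database predicate to suit):

CREATE EVENT SESSION UdfCalls ON SERVER
ADD EVENT sqlserver.sp_statement_completed
(
    ACTION (sqlserver.sql_text)
    WHERE (sqlserver.database_name = N'AdventureWorks')
)
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION UdfCalls ON SERVER STATE = START;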

    Final Words

The message here is not that prefetching is bad (it is generally a good thing), but it can have unexpected side-effects, and it can make interpreting nested loops joins more challenging. Never use T-SQL scalar functions!

;-)

    © Paul White 2013 – All Rights Reserved
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    Additional Reading:

    Prefetch and Query Performance – Fabiano Amorim
    Random Prefetching – Craig Freedman
    Optimized Nested Loops Joins – Craig Freedman
    Optimizing I/O Performance by Sorting – Craig Freedman

  • Incorrect Results Caused By Adding an Index

    It’s not often that a SQL Server bug really surprises me or makes me wonder how it was never spotted before, but this is one of those.

    To get right into it, say you have the following two tables:

    CREATE PARTITION FUNCTION PF (integer)
    AS RANGE RIGHT
    FOR VALUES (1000, 2000, 3000, 4000, 5000);
     
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
     
    -- Partitioned
    CREATE TABLE dbo.T1
    (
        T1ID    integer NOT NULL,
        SomeID  integer NOT NULL,
     
        CONSTRAINT [PK dbo.T1 T1ID]
            PRIMARY KEY CLUSTERED (T1ID)
            ON PS (T1ID)
    );
     
    -- Not partitioned
    CREATE TABLE dbo.T2
    (
        T2ID    integer IDENTITY (1,1) NOT NULL,
        T1ID    integer NOT NULL,
        
        CONSTRAINT [PK dbo.T2 T2ID]
            PRIMARY KEY CLUSTERED (T2ID)
            ON [PRIMARY]
    );

    Load table T1 with 4,999 rows. All of the rows have a SomeID value of 1234, and the T1ID primary key is sequentially numbered from 1 to 4,999:

    INSERT dbo.T1 (T1ID, SomeID)
    SELECT N.n, 1234
    FROM dbo.Numbers AS N
    WHERE N.n BETWEEN 1 AND 4999;

    Table T2 gets 999 rows, generated by adding T1ID values from T1 that divide exactly by 5:

    INSERT dbo.T2 (T1ID)
    SELECT T1ID
    FROM dbo.T1
    WHERE T1ID % 5 = 0;

    More visually, T1 looks like this (T1ID goes up to 4,999):

    T1 data

    And T2 looks like this (T2ID goes up to 999):

    T2 data
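
Before we run the test query, it is worth pinning down the expected result in a way that does not depend on any particular join strategy. Every T1ID value in T2 is a multiple of 5 copied from T1, and every T1 row has SomeID = 1234, so a simple existence check should count 999 matches:

-- Expected number of join matches, independent of plan shape
SELECT ExpectedMatches = COUNT_BIG(*)
FROM dbo.T2 AS T2
WHERE EXISTS
(
    SELECT 1
    FROM dbo.T1 AS T1
    WHERE 
        T1.T1ID = T2.T1ID
        AND T1.SomeID = 1234
);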

    The test query simply counts the rows that match between the two tables when joined on T1ID:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.T1ID = T1.T1ID
    WHERE T1.SomeID = 1234;

    The execution plan features a merge join:

    Merge Join Execution Plan

    The correct result (999) is returned and everyone is happy:

    Correct Result

    Enter the Index

    Now someone comes along and adds a new index to table T1:

    CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
    ON dbo.T1 (SomeID DESC);

    This is a perfectly reasonable index, apparently essential for some crucial query or other. Let’s run our COUNT_BIG(*) query again:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.T1ID = T1.T1ID
    WHERE T1.SomeID = 1234;

    The execution plan looks similar:

    Execution Plan with New Index

    But the result is wrong! There are still 999 matches in the underlying data.

    Wrong Result

    An Even Simpler Query

    With the new index still in place, we run this query:

    SELECT T1ID
    FROM dbo.T1 
    WHERE SomeID = 1234
    ORDER BY T1ID ASC;

    This query should obviously return all the T1IDs from 1 to 4,999 in ascending order. Instead, we get:

    Wrongly Ordered Result Part 1

    The list starts at 4000 not 1! Also, out-of-order rows are found further down:

    Wrongly Ordered Result Part 2

The results are not ordered by T1ID despite the ORDER BY T1ID ASC clause. Quite astonishing.

    Cause

    Both problems are caused by a bug in the query optimizer, which is present in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. The bug produces a query plan that does not provide the ordering guarantees the optimizer thinks it does, leading to incorrect results. A partitioned table is required to reproduce the bug.

    The sneaky aspect to it is that the index which causes the problem could be added at any time, without the original query-writer’s knowledge. Equally, data changes could mean that a query plan that used to use a hash or nested loops join suddenly recompiles to choose a merge join. Since a merge join requires sorted input, the opportunity for suddenly incorrect (incomplete!) results is obvious (and an example was shown above).

    There is no trace flag that I am aware of that fixes this issue.

I have opened a Connect item for this bug, and written a more detailed explanation in a guest post on SQLperformance.com.

    Resolution

    The fix for this issue is now available and documented in a Knowledge Base article. Please note the fix requires a code update and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect-results bug to be fixed under 4199. I asked for clarification on that and the response was:

    Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.

    Paul White

    Twitter : @SQL_Kiwi

    © 2013 All Rights Reserved

  • Improving Partitioned Table Join Performance

    The query optimizer does not always choose an optimal strategy when joining partitioned tables. This post looks at an example, showing how a manual rewrite of the query can almost double performance, while reducing the memory grant to almost nothing.

    Test Data

The two tables in this example share a common partition scheme. The partition function uses 41 equal-size partitions:

    CREATE PARTITION FUNCTION PFT (integer)
    AS RANGE RIGHT
    FOR VALUES 
        (
            125000, 250000, 375000, 500000,
            625000, 750000, 875000, 1000000,
            1125000, 1250000, 1375000, 1500000,
            1625000, 1750000, 1875000, 2000000,
            2125000, 2250000, 2375000, 2500000,
            2625000, 2750000, 2875000, 3000000,
            3125000, 3250000, 3375000, 3500000,
            3625000, 3750000, 3875000, 4000000,
            4125000, 4250000, 4375000, 4500000,
            4625000, 4750000, 4875000, 5000000
        );
    GO
    CREATE PARTITION SCHEME PST
    AS PARTITION PFT
    ALL TO ([PRIMARY]);

The two tables are:

    CREATE TABLE dbo.T1
    (
        TID integer NOT NULL IDENTITY(0,1),
        Column1 integer NOT NULL,
        Padding binary(100) NOT NULL DEFAULT 0x,
     
        CONSTRAINT PK_T1
        PRIMARY KEY CLUSTERED (TID)
        ON PST (TID)
    );
     
    CREATE TABLE dbo.T2
    (
        TID integer NOT NULL,
        Column1 integer NOT NULL,
        Padding binary(100) NOT NULL DEFAULT 0x,
     
        CONSTRAINT PK_T2
        PRIMARY KEY CLUSTERED (TID, Column1)
        ON PST (TID)
    );

    The next script loads 5 million rows into T1 with a pseudo-random value between 1 and 5 for Column1. The table is partitioned on the IDENTITY column TID:

    INSERT dbo.T1 WITH (TABLOCKX)
        (Column1)
    SELECT
        (ABS(CHECKSUM(NEWID())) % 5) + 1
    FROM dbo.Numbers AS N
    WHERE n BETWEEN 1 AND 5000000;

    In case you don’t already have an auxiliary table of numbers lying around, here’s a script to create one with 10 million rows:

    CREATE TABLE dbo.Numbers (n bigint PRIMARY KEY);
     
    WITH
      L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
      L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
      L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
      L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
      L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
      L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
      Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5)
    INSERT dbo.Numbers WITH (TABLOCKX)
    SELECT TOP (10000000) n 
    FROM Nums 
    ORDER BY n
    OPTION (MAXDOP 1);

    Table T1 contains data like this:

    T1 sample

    Next we load data into table T2. The relationship between the two tables is that table 2 contains ‘n’ rows for each row in table 1, where ‘n’ is determined by the value in Column1 of table T1. There is nothing particularly special about the data or distribution, by the way.

    INSERT dbo.T2 WITH (TABLOCKX)
        (TID, Column1)
    SELECT
        T.TID,
        N.n
    FROM dbo.T1 AS T
    JOIN dbo.Numbers AS N
        ON N.n >= 1
        AND N.n <= T.Column1;

    Table T2 ends up containing about 15 million rows:

    T2 sample

    The primary key for table T2 is a combination of TID and Column1. The data is partitioned according to the value in column TID alone.
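
As a quick cross-check on the “about 15 million” figure: by construction, T2 contains exactly Column1 rows for each T1 row, so the T2 row count equals the sum of Column1 over T1 (roughly 3 × 5 million, since Column1 is uniform on 1 to 5):

-- T2 row count = SUM(Column1) over T1 (~15 million)
SELECT ExpectedT2Rows = SUM(CONVERT(bigint, Column1))
FROM dbo.T1;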

    Partition Distribution

    The following query shows the number of rows in each partition of table T1:

    SELECT
        PartitionID = CA1.P,
        NumRows = COUNT_BIG(*)
    FROM dbo.T1 AS T
    CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P)
    GROUP BY CA1.P
    ORDER BY CA1.P;

Partition Distribution T1

    There are 40 partitions containing 125,000 rows (40 * 125k = 5m rows). The rightmost partition remains empty. The next query shows the distribution for table 2:

    SELECT
        PartitionID = CA1.P,
        NumRows = COUNT_BIG(*)
    FROM dbo.T2 AS T
    CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P)
    GROUP BY CA1.P
    ORDER BY CA1.P;

    There are roughly 375,000 rows in each partition (the rightmost partition is also empty):

Partition Distribution T2

    Ok, that’s the test data done.

    Test Query and Execution Plan

    The task is to count the rows resulting from joining tables 1 and 2 on the TID column:

    SET STATISTICS IO ON;
    DECLARE @s datetime2 = SYSUTCDATETIME();
     
    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.TID = T1.TID;
     
    SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME());
    SET STATISTICS IO OFF;

    The optimizer chooses a plan using parallel hash join, and partial aggregation:

    Parallel Hash Join Plan

    The Plan Explorer plan tree view shows accurate cardinality estimates and an even distribution of rows across threads (click to enlarge the image):

    Plan Explorer Plan Tree View 1

    With a warm data cache, the STATISTICS IO output shows that no physical I/O was needed, and all 41 partitions were touched:

    STATISTICS IO

Run without the actual execution plan or STATISTICS IO options (for maximum performance), the query returns in around 2600ms.

    Execution Plan Analysis

    The first step toward improving on the execution plan produced by the query optimizer is to understand how it works, at least in outline.

    The two parallel Clustered Index Scans use multiple threads to read rows from tables T1 and T2. Parallel scan uses a demand-based scheme where threads are given page(s) to scan from the table as needed. This arrangement has certain important advantages, but does result in an unpredictable distribution of rows amongst threads. The point is that multiple threads cooperate to scan the whole table, but it is impossible to predict which rows end up on which threads.

    For correct results from the parallel hash join, the execution plan has to ensure that rows from T1 and T2 that might join are processed on the same thread. For example, if a row from T1 with join key value ‘1234’ is placed in thread 5’s hash table, the execution plan must guarantee that any rows from T2 that also have join key value ‘1234’ probe thread 5’s hash table for matches.

    The way this guarantee is enforced in this parallel hash join plan is by repartitioning rows to threads after each parallel scan. The two repartitioning exchanges route rows to threads using a hash function over the hash join keys. The two repartitioning exchanges use the same hash function so rows from T1 and T2 with the same join key must end up on the same hash join thread.

    Expensive Exchanges

    This business of repartitioning rows between threads can be very expensive, especially if a large number of rows is involved. The execution plan selected by the optimizer moves 5 million rows through one repartitioning exchange and around 15 million across the other.

    As a first step toward removing these exchanges, consider the execution plan selected by the optimizer if we join just one partition from each table, disallowing parallelism:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.TID = T1.TID
    WHERE
        $PARTITION.PFT(T1.TID) = 1
        AND $PARTITION.PFT(T2.TID) = 1
    OPTION (MAXDOP 1);

    Single Partition Plan

    The optimizer has chosen a (one-to-many) merge join instead of a hash join. The single-partition query completes in around 100ms. If everything scaled linearly, we would expect that extending this strategy to all 40 populated partitions would result in an execution time around 4000ms. Using parallelism could reduce that further, perhaps to be competitive with the parallel hash join chosen by the optimizer.

    This raises a question. If the most efficient way to join one partition from each of the tables is to use a merge join, why does the optimizer not choose a merge join for the full query?

    Forcing a Merge Join

    Let’s force the optimizer to use a merge join on the test query using a hint:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.TID = T1.TID
    OPTION (MERGE JOIN);

    This is the execution plan selected by the optimizer:

    Optimizer Merge Join Plan

    This plan results in the same number of logical reads reported previously, but instead of 2600ms the query takes 5000ms. The natural explanation for this drop in performance is that the merge join plan is only using a single thread, whereas the parallel hash join plan could use multiple threads.

    Parallel Merge Join

    We can get a parallel merge join plan using the same query hint as before, and adding trace flag 8649:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.TID = T1.TID
OPTION (MERGE JOIN, QUERYTRACEON 8649);

The execution plan is:

    Optimizer Parallel Merge Join

    This looks promising. It uses a similar strategy to distribute work across threads as seen for the parallel hash join. In practice though, performance is disappointing. On a typical run, the parallel merge plan runs for around 8400ms; slower than the single-threaded merge join plan (5000ms) and much worse than the 2600ms for the parallel hash join. We seem to be going backwards!

    The logical reads for the parallel merge are still exactly the same as before, with no physical IOs. The cardinality estimates and thread distribution are also still very good (click to enlarge):

    Plan Explorer Tree View 2

    A big clue to the reason for the poor performance is shown in the wait statistics (captured by Plan Explorer Pro):

    Plan Explorer Pro Waits

    CXPACKET waits require careful interpretation, and are most often benign, but in this case excessive waiting occurs at the repartitioning exchanges. Unlike the parallel hash join, the repartitioning exchanges in this plan are order-preserving ‘merging’ exchanges (because merge join requires ordered inputs):

Repartition Streams Operator Properties

Parallelism works best when threads can just grab any available unit of work and get on with processing it. Preserving order introduces inter-thread dependencies that can easily lead to significant waits occurring. In extreme cases, these dependencies can result in an intra-query deadlock, though the details of that will have to wait for another time.

    The potential for waits and deadlocks leads the query optimizer to cost parallel merge join relatively highly, especially as the degree of parallelism (DOP) increases. This high costing resulted in the optimizer choosing a serial merge join rather than parallel in this case. The test results certainly confirm its reasoning.

    Collocated Joins

The optimizer has another available strategy when joining tables that share a common partition scheme. This strategy is a collocated join, also known as a per-partition join. It can be applied in both serial and parallel execution plans, though it is limited to 2-way joins in the current optimizer. Whether the optimizer chooses a collocated join or not depends on cost estimation. The primary benefits of a collocated join are that it eliminates an exchange and requires less memory, as we will see next.

    Costing and Plan Selection

    The query optimizer did consider a collocated join for our original query, but it was rejected on cost grounds. The parallel hash join with repartitioning exchanges appeared to be a cheaper option. There is no query hint to force a collocated join, so we have to mess with the costing framework to produce one for our test query. Pretending that IOs cost 50 times more than usual is enough to convince the optimizer to use collocated join with our test query:

    -- Pretend IOs are 50x cost temporarily
    DBCC SETIOWEIGHT(50);
     
    -- Co-located hash join
    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.TID = T1.TID
    OPTION (RECOMPILE);
     
    -- Reset IO costing
    DBCC SETIOWEIGHT(1);

    Collocated Join Plan

    The estimated execution plan for the collocated join is:

    Collocated Hash Join

    The Constant Scan contains one row for each partition of the shared partitioning scheme, from 1 to 41. The hash repartitioning exchanges seen previously are replaced by a single Distribute Streams exchange using Demand partitioning. Demand partitioning means that the next partition id is given to the next parallel thread that asks for one.

    My test machine has eight logical processors, and all are available for SQL Server to use. As a result, there are eight threads in the single parallel branch in this plan, each processing one partition from each table at a time. Once a thread finishes processing a partition, it grabs a new partition number from the Distribute Streams exchange…and so on until all partitions have been processed.

    It is important to understand that the parallel scans in this plan are different from the parallel hash join plan. Although the scans have the same parallelism icon, tables T1 and T2 are not being co-operatively scanned by multiple threads in the same way. Each thread reads a single partition of T1 and performs a hash match join with the same partition from table T2. The properties of the two Clustered Index Scans show a Seek Predicate (unusual for a scan!) limiting the rows to a single partition:

    Clustered Index Scan Properties

    The crucial point is that the join between T1 and T2 is on TID, and TID is the partitioning column for both tables. A thread that processes partition ‘n’ is guaranteed to see all rows that can possibly join on TID for that partition. In addition, no other thread will see rows from that partition, so this removes the need for repartitioning exchanges.

    CPU and Memory Efficiency Improvements

    The collocated join has removed two expensive repartitioning exchanges and added a single exchange processing 41 rows (one for each partition id). Remember, the parallel hash join plan exchanges had to process 5 million and 15 million rows. The amount of processor time spent on exchanges will be much lower in the collocated join plan.

    In addition, the collocated join plan has a maximum of 8 threads processing single partitions at any one time. The 41 partitions will all be processed eventually, but a new partition is not started until a thread asks for it. Threads can reuse hash table memory for the new partition.

    The parallel hash join plan also had 8 hash tables, but with all 5,000,000 build rows loaded at the same time. The collocated plan needs memory for only 8 * 125,000 = 1,000,000 rows at any one time.
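
If you want to see the grant for yourself, sys.dm_exec_query_memory_grants shows the memory grant for each executing query. Run something like this from a second session while the test query is executing (column list trimmed for space):

SELECT
    session_id,
    requested_memory_kb,
    granted_memory_kb,
    max_used_memory_kb
FROM sys.dm_exec_query_memory_grants;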

    Collocated Hash Join Performance

The collocated join plan has disappointing performance in this case. The query runs for around 25,300ms despite the same IO statistics as usual. This is easily the worst result so far, so what went wrong?

    It turns out that cardinality estimation for the single partition scans of table T1 is slightly low. The properties of the Clustered Index Scan of T1 (graphic immediately above) show the estimation was for 121,951 rows. This is a small shortfall compared with the 125,000 rows actually encountered, but it was enough to cause the hash join to spill to physical tempdb:

    Collocated Hash Join Performance

    A level 1 spill doesn’t sound too bad, until you realize that the spill to tempdb probably occurs for each of the 41 partitions.

    As a side note, the cardinality estimation error is a little surprising because the system tables accurately show there are 125,000 rows in every partition of T1. Unfortunately, the optimizer uses regular column and index statistics to derive cardinality estimates here rather than system table information (e.g. sys.partitions).

    Collocated Merge Join

    We will never know how well the collocated parallel hash join plan might have worked without the cardinality estimation error (and the resulting 41 spills to tempdb) but we do know:

    • Merge join does not require a memory grant; and
    • Merge join was the optimizer’s preferred join option for a single partition join

Putting this all together, what we would really like to see is the same collocated join strategy, but using merge join instead of hash join. Unfortunately, the current query optimizer (SQL Server 2008 and later) cannot produce a collocated merge join; it only knows how to do a collocated hash join. SQL Server 2005 used an APPLY pattern to join partitioned tables, so a collocated merge join was possible in that version.

    So where does this leave us?

    CROSS APPLY sys.partitions

    We can try to write our own collocated join query. We can use sys.partitions to find the partition numbers, and CROSS APPLY to get a count per partition, with a final step to sum the partial counts. The following query implements this idea:

    SELECT
        row_count = SUM(Subtotals.cnt)
    FROM
    (
        -- Partition numbers
        SELECT
            p.partition_number
        FROM sys.partitions AS p 
        WHERE 
            p.[object_id] = OBJECT_ID(N'T1', N'U')
            AND p.index_id = 1
    ) AS P
    CROSS APPLY
    (
        -- Count per collocated join
        SELECT
            cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE 
            $PARTITION.PFT(T1.TID) = p.partition_number
            AND $PARTITION.PFT(T2.TID) = p.partition_number
    ) AS SubTotals;

    The estimated plan is:

    Partition Apply

    The cardinality estimates aren’t all that good here, especially the estimate for the scan of the system table underlying the sys.partitions view.

    Nevertheless, the plan shape is heading toward where we would like to be. Each partition number from the system table results in a per-partition scan of T1 and T2, a one-to-many Merge Join, and a Stream Aggregate to compute the partial counts. The final Stream Aggregate just sums the partial counts.

    Execution time for this query is around 3,500ms, with the same IO statistics as always. This compares favourably with 5,000ms for the serial plan produced by the optimizer with the OPTION (MERGE JOIN) hint. This is another case of the sum of the parts being less than the whole – summing 41 partial counts from 41 single-partition merge joins is faster than a single merge join and count over all partitions.

    Even so, this single-threaded collocated merge join is not as quick as the original parallel hash join plan, which executed in 2,600ms.

    On the positive side, our collocated merge join uses only one logical processor and requires no memory grant. The parallel hash join plan used 16 threads and reserved 569 MB of memory:

[Image: parallel hash join plan showing 16 threads and a 569MB memory grant]

    Using a Temporary Table

    Our collocated merge join plan should benefit from parallelism. The reason parallelism is not being used is that the query references a system table. We can work around that by writing the partition numbers to a temporary table (or table variable):

    SET STATISTICS IO ON;
    DECLARE @s datetime2 = SYSUTCDATETIME();
     
    CREATE TABLE #P
    (
        partition_number integer PRIMARY KEY);
     
    INSERT #P (partition_number)
    SELECT
        p.partition_number
    FROM sys.partitions AS p 
    WHERE 
        p.[object_id] = OBJECT_ID(N'T1', N'U')
        AND p.index_id = 1;
     
    SELECT
        row_count = SUM(Subtotals.cnt)
    FROM #P AS p
    CROSS APPLY
    (
        SELECT
            cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE 
            $PARTITION.PFT(T1.TID) = p.partition_number
            AND $PARTITION.PFT(T2.TID) = p.partition_number
    ) AS SubTotals;
     
    DROP TABLE #P;
     
    SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME());
    SET STATISTICS IO OFF;

    Using the temporary table adds a few logical reads, but the overall execution time is still around 3500ms, indistinguishable from the same query without the temporary table. The problem is that the query optimizer still doesn’t choose a parallel plan for this query, though the removal of the system table reference means that it could if it chose to:

    Temporary Table Plan

    In fact the optimizer did enter the parallel plan phase of query optimization (running search 1 for a second time):

    Optimizer Search Phases

    Unfortunately, the parallel plan found seemed to be more expensive than the serial plan. This is a crazy result, caused by the optimizer’s cost model not reducing operator CPU costs on the inner side of a nested loops join. Don’t get me started on that, we’ll be here all night.

    In this plan, everything expensive happens on the inner side of a nested loops join. Without a CPU cost reduction to compensate for the added cost of exchange operators, candidate parallel plans always look more expensive to the optimizer than the equivalent serial plan.

    Parallel Collocated Merge Join

    We can produce the desired parallel plan using trace flag 8649 again:

    SELECT
        row_count = SUM(Subtotals.cnt)
    FROM #P AS p
    CROSS APPLY
    (
        SELECT
            cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE 
            $PARTITION.PFT(T1.TID) = p.partition_number
            AND $PARTITION.PFT(T2.TID) = p.partition_number
    ) AS SubTotals
    OPTION (QUERYTRACEON 8649);

    The actual execution plan is:

    Parallel Collocated Merge Join

    One difference between this plan and the collocated hash join plan is that a Repartition Streams exchange operator is used instead of Distribute Streams. The effect is similar, though not quite identical. The Repartition uses round-robin partitioning, meaning the next partition id is pushed to the next thread in sequence. The Distribute Streams exchange seen earlier used Demand partitioning, meaning the next partition id is pulled across the exchange by the next thread that is ready for more work. There are subtle performance implications for each partitioning option, but going into that would again take us too far off the main point of this post.

    Performance

    The important thing is the performance of this parallel collocated merge join – just 1350ms on a typical run.

    The list below shows all the alternatives from this post (all timings include creation, population, and deletion of the temporary table where appropriate) from quickest to slowest:

    • Collocated parallel merge join: 1350ms
    • Parallel hash join: 2600ms
    • Collocated serial merge join: 3500ms
    • Serial merge join: 5000ms
    • Parallel merge join: 8400ms
• Collocated parallel hash join: 25,300ms (hash spill per partition)

    The parallel collocated merge join requires no memory grant (aside from a paltry 1.2MB used for exchange buffers). This plan uses 16 threads at DOP 8; but 8 of those are (rather pointlessly) allocated to the parallel scan of the temporary table. These are minor concerns, but it turns out there is a way to address them if it bothers you.

    Parallel Collocated Merge Join with Demand Partitioning

This final tweak replaces the temporary table with a hard-coded list of partition ids (dynamic SQL could be used to generate this query from sys.partitions – a sketch appears after the performance numbers below):

    SELECT
        row_count = SUM(Subtotals.cnt)
    FROM
    (
        VALUES 
            (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
            (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
            (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
            (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41)
    ) AS P (partition_number)
    CROSS APPLY
    (
        SELECT
            cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE 
            $PARTITION.PFT(T1.TID) = p.partition_number
            AND $PARTITION.PFT(T2.TID) = p.partition_number
    ) AS SubTotals
    OPTION (QUERYTRACEON 8649);

    The actual execution plan is:

    Parallel Collocated Merge Join - Demand

    The parallel collocated hash join plan is reproduced below for comparison:

    Parallel Collocated Hash Join

    The manual rewrite has another advantage that has not been mentioned so far: the partial counts (per partition) can be computed earlier than the partial counts (per thread) in the optimizer’s collocated join plan. The earlier aggregation is performed by the extra Stream Aggregate under the nested loops join.

    The performance of the parallel collocated merge join is unchanged at around 1350ms.
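
As promised, here is one way the hard-coded VALUES list could be generated from sys.partitions with dynamic SQL. This is only a sketch (the string-concatenation trick is informal, but the order of the list does not matter here):

DECLARE @sql nvarchar(max) = N'';
 
-- Build ',(1),(2),...' from the partition numbers of T1
SELECT @sql = @sql + N',(' + CONVERT(nvarchar(11), p.partition_number) + N')'
FROM sys.partitions AS p
WHERE 
    p.[object_id] = OBJECT_ID(N'dbo.T1', N'U')
    AND p.index_id = 1;
 
SET @sql =
    N'SELECT row_count = SUM(Subtotals.cnt)
    FROM (VALUES ' + STUFF(@sql, 1, 1, N'') + N') AS P (partition_number)
    CROSS APPLY
    (
        SELECT cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE 
            $PARTITION.PFT(T1.TID) = P.partition_number
            AND $PARTITION.PFT(T2.TID) = P.partition_number
    ) AS SubTotals
    OPTION (QUERYTRACEON 8649);';
 
EXECUTE sys.sp_executesql @sql;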

    Final Words

    It is a shame that the current query optimizer does not consider a collocated merge join (Connect item closed as Won’t Fix). The example used in this post showed an improvement in execution time from 2600ms to 1350ms using a modestly-sized data set and limited parallelism. In addition, the memory requirement for the query was almost completely eliminated  – down from 569MB to 1.2MB.

    The problem with the parallel hash join selected by the optimizer is that it attempts to process the full data set all at once (albeit using eight threads). It requires a large memory grant to hold all 5 million rows from table T1 across the eight hash tables, and does not take advantage of the divide-and-conquer opportunity offered by the common partitioning.

    The great thing about the collocated join strategies is that each parallel thread works on a single partition from both tables, reading rows, performing the join, and computing a per-partition subtotal, before moving on to a new partition.

    Note that the general technique is not limited to partitioned tables. Any table that can be naturally segmented using a driver table of values can be written to use a collocated merge join, by substituting a driver value for the partition id.

    From a thread’s point of view…

    If you have trouble visualizing what is happening from just looking at the parallel collocated merge join execution plan, let’s look at it again, but from the point of view of just one thread operating between the two Parallelism (exchange) operators.

[Image: one thread’s portion of the parallel collocated merge join plan]

    Our thread picks up a single partition id from the Distribute Streams exchange, and starts a merge join using ordered rows from partition 1 of table T1 and partition 1 of table T2. By definition, this is all happening on a single thread. As rows join, they are added to a (per-partition) count in the Stream Aggregate immediately above the Merge Join. Eventually, either T1 (partition 1) or T2 (partition 1) runs out of rows and the merge join stops.

    The per-partition count from the aggregate passes on through the Nested Loops join to another Stream Aggregate, which is maintaining a per-thread subtotal. Our same thread now picks up a new partition id from the exchange (say it gets id 9 this time). The count in the per-partition aggregate is reset to zero, and the processing of partition 9 of both tables proceeds just as it did for partition 1, and on the same thread.

    Each thread picks up a single partition id and processes all the data for that partition, completely independently from other threads working on other partitions. One thread might eventually process partitions (1, 9, 17, 25, 33, 41) while another is concurrently processing partitions (2, 10, 18, 26, 34) and so on for the other six threads at DOP 8.

    The point is that all 8 threads can execute independently and concurrently, continuing to process new partitions until the wider job (of which the thread has no knowledge!) is done. This divide-and-conquer technique can be much more efficient than simply splitting the entire workload across eight threads all at once.


    Related Reading

    Understanding and Using Parallelism in SQL Server

    Parallel Execution Plans Suck

    © 2013 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Hello Operator, My Switch Is Bored

    This is a post for T-SQL Tuesday #43 hosted by my good friend Rob Farley. The topic this month is Plan Operators. I haven’t taken part in T-SQL Tuesday before, but I do like to write about execution plans, so this seemed like a good time to start.

    This post is in two parts. The first part is primarily an excuse to use a pretty bad play on words in the title of this blog post (if you’re too young to know what a telephone operator or a switchboard is, I hate you). The second part of the post looks at an invisible query plan operator (so to speak).

    1. My Switch Is Bored

    Allow me to present the rare and interesting execution plan operator, Switch:

    Switch Showplan Operator

    Books Online has this to say about Switch:

    BOL description

Following that description, I had a go at producing a Fast Forward Cursor plan that used the TOP operator, but had no luck. That may be due to my lack of skill with cursors; I’m not too sure.

    The only application of Switch in SQL Server 2012 that I am familiar with requires a local partitioned view:

    CREATE TABLE dbo.T1 (c1 int NOT NULL CHECK (c1 BETWEEN 00 AND 24));
    CREATE TABLE dbo.T2 (c1 int NOT NULL CHECK (c1 BETWEEN 25 AND 49));
    CREATE TABLE dbo.T3 (c1 int NOT NULL CHECK (c1 BETWEEN 50 AND 74));
    CREATE TABLE dbo.T4 (c1 int NOT NULL CHECK (c1 BETWEEN 75 AND 99));
    GO
    CREATE VIEW V1 AS
    SELECT c1 FROM dbo.T1
    UNION ALL
    SELECT c1 FROM dbo.T2
    UNION ALL
    SELECT c1 FROM dbo.T3
    UNION ALL
    SELECT c1 FROM dbo.T4;

    Not only that, but it needs an updatable local partitioned view. We’ll need some primary keys to meet that requirement:

    ALTER TABLE dbo.T1
    ADD CONSTRAINT PK_T1
    PRIMARY KEY (c1);
     
    ALTER TABLE dbo.T2
    ADD CONSTRAINT PK_T2
    PRIMARY KEY (c1);
     
    ALTER TABLE dbo.T3
    ADD CONSTRAINT PK_T3
    PRIMARY KEY (c1);
     
    ALTER TABLE dbo.T4
    ADD CONSTRAINT PK_T4
    PRIMARY KEY (c1);

    We also need an INSERT statement that references the view. Even more specifically, to see a Switch operator, we need to perform a single-row insert (multi-row inserts use a different plan shape):

    INSERT dbo.V1 (c1)
    VALUES (1);

    And now…the execution plan:

    Switch Execution Plan

    The Constant Scan manufactures a single row with no columns. The Compute Scalar works out which partition of the view the new value should go in. The Assert checks that the computed partition number is not null (if it is, an error is returned). The Nested Loops Join executes exactly once, with the partition id as an outer reference (correlated parameter).

    The Switch operator checks the value of the parameter and executes the corresponding input only. If the partition id is 0, the uppermost Clustered Index Insert is executed, adding a row to table T1. If the partition id is 1, the next lower Clustered Index Insert is executed, adding a row to table T2…and so on.
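
    The partition-selection expression computed by the Compute Scalar is along these lines (a sketch reconstructed from the CHECK constraint ranges, not the exact internal form):

    -- Hypothetical reconstruction of the partition id calculation
    DECLARE @c1 int = 1;
     
    SELECT PartitionID =
        CASE
            WHEN @c1 BETWEEN 00 AND 24 THEN 0
            WHEN @c1 BETWEEN 25 AND 49 THEN 1
            WHEN @c1 BETWEEN 50 AND 74 THEN 2
            WHEN @c1 BETWEEN 75 AND 99 THEN 3
        END; -- NULL if out of range; the Assert turns that into an error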

    In case you were wondering, here’s a query and execution plan for a multi-row insert to the view:

    INSERT dbo.V1 (c1)
    VALUES (1), (2);

    Multi-row partitioned view insert

    Yuck! An Eager Table Spool and four Filters! I prefer the Switch plan.

    My guess is that almost all the old strategies that used a Switch operator have been replaced over time, using things like a regular Concatenation Union All combined with Start-Up Filters on its inputs. Other newer features (relative to the Switch operator), like table partitioning, have specific execution plan support that doesn’t need the Switch operator either. This feels like a bit of a shame, but perhaps it is just nostalgia on my part; it’s hard to know.

    Please do let me know if you encounter a query that can still use the Switch operator in 2012 – it must be very bored if this is the only possible modern usage!

    2. Invisible Plan Operators

    The second part of this post uses an example based on a question Dave Ballantyne asked using the SQL Sentry Plan Explorer plan upload facility. If you haven’t tried that yet, make sure you’re on the latest version of the (free) Plan Explorer software, and then click the Post to SQLPerformance.com button. That will create a site question with the query plan attached (which can be anonymized if the plan contains sensitive information). Aaron Bertrand and I keep a close eye on questions there, so if you have ever wanted to ask a query plan question of either of us, that’s a good way to do it.

    The problem

    The issue I want to talk about revolves around a query issued against a calendar table. The script below creates a simplified version and adds 100 years of per-day information to it:

    USE tempdb;
    GO
    CREATE TABLE dbo.Calendar  
    (  
        dt          date     NOT NULL,
        isWeekday   bit      NOT NULL,
        theYear     smallint NOT NULL,
     
        CONSTRAINT PK__dbo_Calendar_dt
            PRIMARY KEY CLUSTERED (dt)
    );
    GO
    -- Monday is the first day of the week for me
    SET DATEFIRST 1;
     
    -- Add 100 years of data
    INSERT dbo.Calendar WITH (TABLOCKX)
        (dt, isWeekday, theYear)
    SELECT
        CA.dt,
        isWeekday =
            CASE
                WHEN DATEPART(WEEKDAY, CA.dt) IN (6, 7)
                THEN 0
                ELSE 1
            END,
        theYear = YEAR(CA.dt)
    FROM Sandpit.dbo.Numbers AS N
    CROSS APPLY
    (
        VALUES (DATEADD(DAY, N.n - 1, CONVERT(date, '01 Jan 2000', 113)))
    ) AS CA (dt)
    WHERE
        N.n BETWEEN 1 AND 36525;
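
    The script above assumes a Numbers helper table of sequential integers. If you do not have one, a minimal stand-in can be built like this (a sketch – the main script reads Sandpit.dbo.Numbers, so create it there or adjust the reference):

    -- Minimal Numbers helper table (sketch)
    CREATE TABLE dbo.Numbers (n integer NOT NULL PRIMARY KEY);
     
    INSERT dbo.Numbers WITH (TABLOCKX) (n)
    SELECT TOP (36525)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns AS AC1
    CROSS JOIN sys.all_columns AS AC2;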

    The following query counts the number of weekend days in 2013:

    SELECT
        Days = COUNT_BIG(*)
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0;

    It returns the correct result (104) using the following execution plan:

    Original Plan

    The query optimizer has managed to estimate the number of rows returned from the table exactly, based purely on the default statistics created separately on the two columns referenced in the query’s WHERE clause. (Well, almost exactly, the unrounded estimate is 104.289 rows.)

    There is already an invisible operator in this query plan – a Filter operator used to apply the WHERE clause predicates. We can see it by re-running the query with the enormously useful (but undocumented) trace flag 9130 enabled:

    TF 9130 plan

    Now we can see the full picture. The whole table is scanned, returning all 36,525 rows, before the Filter narrows that down to just the 104 we want. Without the trace flag, the Filter is incorporated in the Clustered Index Scan as a residual predicate. That is a little more efficient than a separate Filter operator, but the whole table is still read and tested row by row – something to avoid where possible. The estimates are still spot on though:

    Plan Explorer Actual vs Estimate
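
    For reference, the trace flag can be applied to a single statement with QUERYTRACEON (undocumented, so strictly for test systems; a sketch):

    -- Re-run the count with the predicate shown as an explicit Filter
    SELECT
        Days = COUNT_BIG(*)
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0
    OPTION (RECOMPILE, QUERYTRACEON 9130);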

    Anyway, looking to improve the performance of this query, Dave added the following filtered index to the Calendar table:

    CREATE NONCLUSTERED INDEX Weekends
    ON dbo.Calendar(theYear)
    WHERE isWeekday = 0;

    The original query now produces a much more efficient plan:

    Filtered index plan

    Unfortunately, the estimated number of rows produced by the seek is now wrong (365 instead of 104):

    Wrong Estimate!

    What’s going on? The estimate was spot on before we added the index!

    Explanation

    You might want to grab a coffee for this bit.

    Using another trace flag or two (8606 and 8612) we can see that the cardinality estimates were exactly right initially:

    8606 and 8612

    The highlighted information shows the initial cardinality estimates for the base table (36,525 rows), the result of applying the two relational selects in our WHERE clause (104 rows), and after performing the COUNT_BIG(*) group by aggregate (1 row). All of these are correct, but that was before cost-based optimization got involved :)
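
    For reference, a sketch of how that output can be produced for this query (undocumented flags, test systems only; TF 3604 redirects the optimizer output to the client):

    SELECT
        Days = COUNT_BIG(*)
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0
    OPTION (RECOMPILE, QUERYTRACEON 3604, QUERYTRACEON 8606, QUERYTRACEON 8612);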

    Cost-based optimization

    When cost-based optimization starts up, the logical tree above is copied into a structure (the ‘memo’) that has one group per logical operation (roughly speaking). The logical read of the base table (LogOp_Get) ends up in group 7; the two predicates (LogOp_Select) end up in group 8 (with the details of the selections in subgroups 0-6). These two groups still have the correct cardinalities as trace flag 8608 output (initial memo contents) shows:

    8608

    During cost-based optimization, a rule called SelToIdxStrategy runs on group 8. Its job is to match logical selections to indexable expressions (SARGs). It successfully matches the selections (theYear = 2013, isWeekday = 0) to the filtered index, and writes a new alternative into the memo structure. The new alternative is entered into group 8 as option 1 (option 0 was the original LogOp_Select):

    Rule result

    The new alternative is to do nothing (PhyOp_NOP = no operation), but to instead follow the new logical instructions listed below the NOP.

    The LogOp_GetIdx (full read of an index) goes into group 21, and the LogOp_SelectIdx (selection on an index) is placed in group 22, operating on the result of group 21. The definition of the comparison ‘theYear = 2013’ (ScaOp_Comp downwards) was already present in the memo starting at group 2, so no new memo groups are created for that.

    New Cardinality Estimates

    The new memo groups require two new cardinality estimates to be derived. First, LogOp_GetIdx (full read of the index) gets a predicted cardinality of 10,436. This number comes from the filtered index statistics:

    DBCC SHOW_STATISTICS (Calendar, Weekends) 
    WITH STAT_HEADER;

    Filtered Index Header

    The second new cardinality derivation is for the LogOp_SelectIdx applying the predicate (theYear = 2013). To get a number for this, the cardinality estimator uses statistics for the column ‘theYear’, producing an estimate of 365 rows (there are 365 days in 2013!):

    DBCC SHOW_STATISTICS (Calendar, theYear)
    WITH HISTOGRAM;

    365 row estimate

    This is where the mistake happens. Cardinality estimation should have used the filtered index statistics here, to get an estimate of 104 rows:

    DBCC SHOW_STATISTICS (Calendar, Weekends)
    WITH HISTOGRAM;

    Filtered stats histogram

    Unfortunately, the logic has lost sight of the link between the read of the filtered index (LogOp_GetIdx) in group 21, and the selection on that index (LogOp_SelectIdx) that it is deriving a cardinality estimate for, in group 22. The correct cardinality estimate (104 rows) is still present in the memo, attached to group 8, but that group now has a PhyOp_NOP implementation.

    Skipping over the rest of cost-based optimization (in a belated attempt at brevity) we can see the optimizer’s final output using trace flag 8607:

    8607

    This output shows the (incorrect, but understandable) 365 row estimate for the index range operation, and the correct 104 estimate still attached to its PhyOp_NOP. This tree still has to go through a few post-optimizer rewrites and ‘copy out’ from the memo structure into a tree suitable for the execution engine. One step in this process removes PhyOp_NOP, discarding its 104-row cardinality estimate as it does so.

    To finish this section on a more positive note, consider what happens if we add an OVER clause to the query aggregate. This isn’t intended to be a ‘fix’ of any sort; I just want to show you that the 104 estimate can survive and be used if later cardinality estimation needs it:

    SELECT
        Days = COUNT_BIG(*) OVER ()
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0;

    The estimated execution plan is:

    OVER plan

    Note the 365 estimate at the Index Seek, but the 104 lives again at the Segment! We can imagine the lost predicate ‘isWeekday = 0’ as sitting between the seek and the segment in an invisible Filter operator that drops the estimate from 365 to 104.

    Even though the NOP group is removed after optimization (so we don’t see it in the execution plan) bear in mind that all cost-based choices were made with the 104-row memo group present, so although things look a bit odd, it shouldn’t affect the optimizer’s plan selection. I should also mention that we can work around the estimation issue by including the index’s filtering columns in the index key:

    CREATE NONCLUSTERED INDEX Weekends
    ON dbo.Calendar(theYear, isWeekday)
    WHERE isWeekday = 0
    WITH (DROP_EXISTING = ON);

    There are some downsides to doing this, including that changes to the isWeekday column may now require Halloween Protection, but that is unlikely to be a big problem for a static calendar table ;)  With the updated index in place, the original query produces an execution plan with the correct cardinality estimation showing at the Index Seek:

    Fixed plan

    That’s all for today, remember to let me know about any Switch plans you come across on a modern instance of SQL Server!

    Finally, here are some other posts of mine that cover other plan operators:

    Segment and Sequence Project

    Common Subexpression Spools

    Why Plan Operators Run Backwards

    Row Goals and the Top Operator

    Hash Match Flow Distinct

    Top N Sort

    Index Spools and Page Splits

    Singleton and Range Seeks

    Bitmaps

    Hash Join Performance

    Compute Scalar

    © 2013 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi

  • SQL Intersection Conference, Las Vegas MGM Grand April 8-11

    You might have noticed the SQL Intersection logo appearing in my recent blog posts. This is a brand new event focussed on tuning and troubleshooting the SQL Server core engine (including Windows Azure SQL Database). If you read and enjoy at least some of the things I write about, this conference is one you should seriously consider attending.

    For my part, I’ll be presenting three sessions: two on Interpreting Execution Plans, and one on Parallel Query Execution. I have to promote my sessions, because just look at the other speakers you have to choose from:

    SQL Speakers

    As well as speaking, all these people will be around to chat to over the four conference days, and some are offering complete days of training at pre- and post-conference workshops as well.

    Personally, I’m particularly excited about talking to the following people and attending their sessions:

    • Bob Ward
    • Conor Cunningham
    • Paul Randal
    • Kimberly Tripp
    • Brent Ozar
    • Kendra Little
    • Grant Fritchey
    • Joe Sack
    • Jeremiah Peschka
    • Aaron Bertrand
    • Steve Jones

    Main Conference Overview

    Each of the main conference days has three tracks to choose from:

    Tuesday

    Tuesday Schedule

    Wednesday

    Wednesday Schedule

    Thursday

    Thursday Schedule

    The full schedule is available here as a PDF document, or you can visit the conference website for all the details.

    Pricing

    The conference pricing is very competitive (especially so, considering the location and speaker line-up):

    Pricing Grid

    Discount Code

    When registering, quote the Discount Code WHITE to receive an extra $50 off these prices.

    I really hope to see you in Las Vegas – please feel free to come and talk to me about anything concerning execution plans or the query optimizer :)

    Paul White

    SQL Server MVP 2011, 2012
    Twitter:
    @SQL_Kiwi

  • Execution Plan Analysis: The Mystery Work Table

    I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.

    The Execution Plan

    Original Query Plan

    This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.

    High-Cost Operators

    There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:

    100% operator cost

    There is no doubt that this seek is a busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.

    Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.

    I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.

    Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.

    Execution Warnings

    Sort Warning

    This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.

    The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:

    Plan subtree replaced with a temp table
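
    In outline, the rewrite would look something like this (a minimal sketch with hypothetical names; the VALUES clause stands in for the original query text up to and including the Filter):

    -- Materialize the intermediate result as a heap
    SELECT v.key_col, v.payload
    INTO #Stage
    FROM (VALUES (2, 'b'), (1, 'a')) AS v (key_col, payload);
     
    -- Index the now-known-size result on the Sort operator's keys
    CREATE CLUSTERED INDEX cx ON #Stage (key_col);
     
    -- The remainder of the original query then reads from #Stage
    SELECT key_col, payload FROM #Stage ORDER BY key_col;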

    I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.

    I/O Information

    The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:

    I/O data - fast

    Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (but certainly not 100% of the total, ha ha), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.

    The second set of I/O data is from a slow run (18 minutes or so):

    I/O data - slow

    The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?

    My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.

    Execution Outline

    Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:

    Nested Loop Lookups

    The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).

    Reproducing the problem

    Table Creation

    The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2048-character single-byte string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.

    CREATE TABLE dbo.T1 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T2 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T3 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T4 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T5 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T6 (id integer IDENTITY PRIMARY KEY, d char(2048));
    GO
    INSERT dbo.T1 WITH (TABLOCKX)
    SELECT REPLICATE('A', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T2 WITH (TABLOCKX)
    SELECT REPLICATE('B', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T3 WITH (TABLOCKX)
    SELECT REPLICATE('C', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T4 WITH (TABLOCKX)
    SELECT REPLICATE('D', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T5 WITH (TABLOCKX)
    SELECT REPLICATE('E', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T6 WITH (TABLOCKX)
    SELECT REPLICATE('F', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;

    The next step is to ensure that each lookup table is optimally organized for read-ahead:

    ALTER TABLE dbo.T1 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T2 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T3 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T4 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T5 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T6 REBUILD WITH (MAXDOP = 1);

    Test Query

    The original query translates into our simplified test rig as:

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT n.n,
        DATALENGTH
        (
            CONCAT
            (
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T1 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T2 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T3 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T4 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T5 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T6 AS t WHERE t.id = n.n)
            )
        )
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    The broad idea there is to concatenate our 2048-character column to itself five times and include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final result is the result of concatenating all the intermediate results. The query hints are necessary to get the right plan shape, just because my test rig tables are so much smaller than the real ones.

    Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):

    Test query execution plan

    I should also stress that the CONCAT function (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT n.n,
        DATALENGTH
        (
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T1 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T2 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T3 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T4 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T5 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T6 AS t WHERE t.id = n.n)
        )
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    Warm cache results

    With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data. A typical set of I/O statistics follows:

    image

    Nothing too exciting to see there, but this is just our baseline.

    Cold cache results

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:

    image

    The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:

    image

    Explanation

    There are a number of factors in play here, which we will look at in turn.

    Nested Loops Prefetching

    One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.

    A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.

    SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.

    This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.

    In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:

    Ordered Prefetch

    The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.
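
    If you want to check existing plans for prefetching joins, the attributes are exposed in showplan XML. The following cache search is a sketch (it needs VIEW SERVER STATE permission and can be expensive on a busy instance):

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT
        cp.usecounts,
        qp.query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    WHERE
        qp.query_plan.exist('//NestedLoops[@WithOrderedPrefetch="true" or @WithUnorderedPrefetch="true"]') = 1;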

    Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.

    I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.
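
    For completeness, TF 8744 can also be scoped to a single statement with QUERYTRACEON – a sketch against the test rig (again, test carefully before using such flags anywhere important):

    -- Disable Nested Loops prefetching for this statement only
    SELECT n.n, DATALENGTH(t.d)
    FROM dbo.Numbers AS n
    JOIN dbo.T1 AS t ON t.id = n.n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1, QUERYTRACEON 8744);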

    Manufactured LOBs

    The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:

    Manufactured LOB

    By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.
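
    To make the distinction concrete, here is a small sketch against the test rig. The column d is char(2048) – not a LOB – but including the nvarchar(max) variable makes the expression result nvarchar(max):

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    -- The inputs include no LOB columns, but the result is a
    -- 'manufactured' nvarchar(max) LOB
    SELECT manufactured = CONCAT(t.d, t.d, @d)
    FROM dbo.T1 AS t
    WHERE t.id = 1;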

    The Outer Join

    The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.

    Memory Allocation

    When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.

    However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.

    As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.

    The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.

    Workaround

    We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT 
        n.n,
        DATALENGTH
        (
            CONCAT
            (
                CONCAT(oa1.i0, oa1.i1, oa1.i2, oa1.i3, oa1.i4, oa1.i5, oa1.i6),
                CONCAT(oa2.i0, oa2.i1, oa2.i2, oa2.i3, oa2.i4, oa2.i5, oa2.i6),
                CONCAT(oa3.i0, oa3.i1, oa3.i2, oa3.i3, oa3.i4, oa3.i5, oa3.i6),
                CONCAT(oa4.i0, oa4.i1, oa4.i2, oa4.i3, oa4.i4, oa4.i5, oa4.i6),
                CONCAT(oa5.i0, oa5.i1, oa5.i2, oa5.i3, oa5.i4, oa5.i5, oa5.i6),
                CONCAT(oa6.i0, oa6.i1, oa6.i2, oa6.i3, oa6.i4, oa6.i5, oa6.i6)
            )
        )
    FROM dbo.Numbers AS n
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE t.id = n.n) AS oa1
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE t.id = n.n) AS oa2
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE t.id = n.n) AS oa3
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE t.id = n.n) AS oa4
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE t.id = n.n) AS oa5
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE t.id = n.n) AS oa6
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    The execution plan for the rewritten query looks visually similar to the problematic one:

    Rewritten query plan

    However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:

    image

    All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:

    image

    Warm cache results

    With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):

    image

    Cold cache results

    When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:

    image

    The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!)

    image

    Final Thoughts

    Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.

    I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.

    I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)

    The real world does not always make these things possible, of course, but it is good to have goals :)

    © 2013 Paul White – All Rights Reserved
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    Screenshots acquired using SnagIt by TechSmith
    Query plan details obtained using Plan Explorer PRO by SQLSentry

  • Halloween Protection – The Complete Series

    image

    I have just published a four-part series for SQLPerformance.com on the Halloween Problem. Some of you will never have heard of this issue, and those that have might associate it only with T-SQL UPDATE queries. In fact, the Halloween problem affects execution plans for INSERT, UPDATE, DELETE and MERGE statements.
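
    As a reminder of the classic shape of the problem, here is a minimal sketch (a hypothetical table, not an example from the series): give everyone earning under $25,000 a ten percent raise. Without proper phase separation, a plan that drives the update from an index on salary could meet a row again after moving it within that index, raising it repeatedly until it passes the threshold:

    CREATE TABLE dbo.Employees
    (
        id     integer IDENTITY NOT NULL PRIMARY KEY,
        salary money NOT NULL
    );
    CREATE NONCLUSTERED INDEX ix_salary ON dbo.Employees (salary);
     
    INSERT dbo.Employees (salary) VALUES ($20000), ($22000), ($30000);
     
    -- SQL Server adds Halloween Protection where the plan requires it
    UPDATE dbo.Employees
    SET salary = salary * 1.1
    WHERE salary < $25000;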

    This is a topic I have been meaning to write about properly for years, ever since I read Craig Freedman’s 2008 blog post on the topic, which ended with the cryptic comment:

    “…although I've used update statements for all of the examples in this post, some insert and delete statements also require Halloween protection, but I'll save that topic for a future post.”

    That future post never materialized, sadly, so I thought I would have a go. The four parts of the series are summarized and linked below, I hope you find the material interesting.


    Part 1 – The Halloween Problem and UPDATE statements

    • The SQL standard and three-phase separation
    • Logical update processing
    • Pipelined execution
    • The Halloween problem
    • Avoiding the problem in UPDATE statements

    Part 2 – The Halloween Problem in INSERT and DELETE queries

    • INSERT examples
    • DELETE examples
    • Constraint checking and phase separation

    Part 3 – Halloween Problem optimizations for MERGE

    • MERGE contains several optimizations the other DML statements do not
    • Hole-filling with merge join
    • Hole-filling with nested loops
    • Avoiding an extra B-tree navigation
    • Avoiding the join

    Part 4 – The Halloween Problem and the Query Optimizer

    • Early optimization approaches
    • The SQL Server optimizer approach
    • The case of the redundant sort
    • HP levels and properties
    • Plan changes for Halloween Protection
    • Non-spool options
    • Row versioning
    • Heaps and forwarded records
    • T-SQL functions

    As always, I appreciate your comments and feedback.

    Paul White
    @SQL_Kiwi
    SQLkiwi@gmail.com

  • Incorrect Results with Indexed Views


    Summary: If you use MERGE, indexed views and foreign keys, your queries can return incorrect results.

    Microsoft have released a fix for incorrect results returned when querying an indexed view. The problem applies to:

    • SQL Server 2012
    • SQL Server 2008 R2
    • SQL Server 2008

    The Knowledge Base article does not go into much detail, or provide a reproduction script, but this blog entry does :)

    The KB does say that reproducing the bug requires these features:

    • An indexed view on two tables that have a foreign key relationship
    • An update performed against the base tables
    • A query executed against the indexed view using a NOEXPAND hint

    There are two important details I would like to add right away:

    • The NOEXPAND hint is not required to reproduce the bug on Enterprise Edition
    • The update must be performed by a MERGE statement

    The fix is available in the following cumulative update packages:

    Cumulative Update 2 for SQL Server 2012 SP1 [build 11.0.3339]
    Cumulative Update 5 for SQL Server 2012 RTM [build 11.0.2395]
    Cumulative Update 4 for SQL Server 2008 R2 SP2 [build 10.50.4270]
    Cumulative Update 10 for SQL Server 2008 R2 SP1 [build 10.50.2868]
    Cumulative Update 8 for SQL Server 2008 SP3 [build 10.00.5828]

    No service pack contains this fix; you must apply one of the hotfix packages above.

    Steps to Reproduce

    The first thing we will need is two tables:

    CREATE TABLE dbo.Parent 
    (
        parent_id integer IDENTITY NOT NULL,
        value varchar(20) NOT NULL,
     
        CONSTRAINT PK_Parent_id 
            PRIMARY KEY CLUSTERED (parent_id)
    );
    GO
    CREATE TABLE dbo.Child
    (
        child_id integer IDENTITY NOT NULL,
        parent_id integer NOT NULL,
     
        CONSTRAINT PK_Child_id 
            PRIMARY KEY CLUSTERED (child_id)
    );

    And a few rows of data:

    INSERT dbo.Child (parent_id)
    SELECT New.parent_id 
    FROM 
        (
        INSERT dbo.Parent 
        OUTPUT inserted.parent_id 
        VALUES 
            ('Apple'), 
            ('Banana'), 
            ('Cherry')
        ) AS New;

    The tables now look like this (parent first):

    Original Data

    We can now add the required FOREIGN KEY relationship:

    ALTER TABLE dbo.Child
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (parent_id)
    REFERENCES dbo.Parent (parent_id);

    Next, a very simple indexed view that joins the two tables (the view could contain more complex features like aggregates):

    CREATE VIEW dbo.ParentsAndChildren
    WITH SCHEMABINDING 
    AS
    SELECT 
        p.parent_id, 
        p.value, 
        c.child_id
    FROM dbo.Parent AS p
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq 
    ON dbo.ParentsAndChildren (child_id);

    The final step is to use a MERGE statement to make some changes to the Parent table:

    DECLARE @ParentMerge AS TABLE
    (
        parent_id integer PRIMARY KEY, 
        value varchar(20) NOT NULL
    );
     
    INSERT @ParentMerge
        (parent_id, value)
    VALUES
        (1, 'Kiwi Fruit'),
        (4, 'Dragon Fruit');
     
    MERGE dbo.Parent AS p
    USING @ParentMerge AS s ON 
        s.parent_id = p.parent_id
    WHEN MATCHED THEN 
        UPDATE SET value = s.value
    WHEN NOT MATCHED THEN 
        INSERT (value) VALUES (s.value)
    OUTPUT 
        $action, 
        inserted.parent_id, 
        deleted.value AS old_value, 
        inserted.value AS new_value;

    This MERGE performs two actions:

    1. Updates the value column of parent row 1 from Apple to Kiwi Fruit
    2. Adds a new parent row 4 for Dragon Fruit

    The statement includes an OUTPUT clause to show the changes it makes (this is not required for the repro):

    MERGE changes

    This confirms that the changes have been made as we requested: parent row 1 has changed; and row 4 has been added. The changes are reflected in the base tables:

    SELECT * FROM dbo.Parent AS p;
    SELECT * FROM dbo.Child AS c;

    Updated Base Table Data

    As highlighted, row 1 has changed from Apple to Kiwi Fruit and row 4 has been added.

    We do not expect to see row 4 in the indexed view because there are no child records for that row, and the indexed view uses an inner join. Checking the indexed view using the NOEXPAND table hint (required in non-Enterprise SKUs to use indexes on a view):

    SELECT *
    FROM dbo.ParentsAndChildren 
    WITH (NOEXPAND);

    Incorrect indexed view data

    The results are incorrect. They show the old value of the data for parent row 1.

    Now we try using the EXPAND VIEWS query hint to force SQL Server to access the base tables rather than reading view indexes:

    SELECT *
    FROM dbo.ParentsAndChildren
    OPTION (EXPAND VIEWS);

    Expand Views Hint

    This query produces correct results.

    On SQL Server Enterprise Edition, the optimizer chooses whether to access the indexed view or the base tables. For the following query, without any hints, the optimizer chooses not to expand the view. It reads the view index and produces incorrect results:

    -- Enterprise Edition ONLY
    SELECT * 
    FROM dbo.ParentsAndChildren;

    Indexed View Matching

    Perhaps adding a child row to match the new parent row 4 will somehow fix things up?

    INSERT dbo.Child (parent_id) VALUES (4);
    GO
    SELECT * FROM dbo.ParentsAndChildren WITH (NOEXPAND);
    SELECT * FROM dbo.ParentsAndChildren OPTION (EXPAND VIEWS);

    After Insert

    No. The query plan that accesses the view index still returns an incorrect value for row 1. It seems MERGE has corrupted our indexed view.

    Analysis using DBCC CHECKTABLE

    Checking the view with DBCC CHECKTABLE returns no errors:

    DBCC CHECKTABLE (ParentsAndChildren);

    DBCC output 1

    Unless we use the EXTENDED_LOGICAL_CHECKS option:

    DBCC CHECKTABLE (ParentsAndChildren) WITH EXTENDED_LOGICAL_CHECKS;

    DBCC output 2

    The damage is repairable:

    ALTER DATABASE Sandpit 
    SET SINGLE_USER 
    WITH ROLLBACK IMMEDIATE;
    GO
    DBCC CHECKTABLE (ParentsAndChildren, REPAIR_REBUILD) WITH EXTENDED_LOGICAL_CHECKS;
    GO
    DBCC CHECKTABLE (ParentsAndChildren) WITH EXTENDED_LOGICAL_CHECKS;
    GO
    ALTER DATABASE Sandpit SET MULTI_USER;

    DBCC repair

    You probably do not want to set your database to SINGLE_USER mode and run a DBCC repair after every MERGE statement, however. We could also rebuild the indexed view’s clustered index manually, of course.

    Cause

    For the MERGE statement above, the query optimizer builds a plan that does not update the indexed view (click to enlarge):

    Incorrect Update Plan

    In a version of SQL Server with the fix applied, the same MERGE statement produces a plan that does maintain the indexed view:

    Correct Update Plan

    The plan operators used to keep the view index in step with the base tables are highlighted. Without these operators, the changes to the base table are not correctly written to any indexes defined on the view. The root cause is related to the same simplification that allows the optimizer to remove the reference to the Parent table in this query:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.Parent AS p 
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;

    Simplification Example Plan

    The FOREIGN KEY relationship and NOT NULL constraints on the referencing column together mean that the join to Parent cannot affect the result of the query, so the join is simplified away. In SQL Server 2012, we can see when this simplification is performed because the following message appears when undocumented trace flags 8619 and 3604 are enabled during compilation:

    Trace Flag 8619 output
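
    A sketch of producing that message for the simplification query above (undocumented flags, test systems only):

    SELECT 
        COUNT_BIG(*)
    FROM dbo.Parent AS p 
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id
    OPTION (RECOMPILE, QUERYTRACEON 3604, QUERYTRACEON 8619);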

    The same message is emitted when a MERGE statement contains a WHEN MATCHED THEN UPDATE clause and either a WHEN NOT MATCHED THEN INSERT or WHEN MATCHED … THEN DELETE clause. These conditions combine such that the optimizer incorrectly concludes that a table reference can be removed, when in fact it is needed later on when the update side of the plan is built.

    Other details of the query and database can affect whether the simplification can be misapplied. For example, if the FOREIGN KEY constraint contains an ON DELETE CASCADE clause, and the MERGE contains a DELETE clause, the simplification is not performed, the TF 8619 message does not appear, and the bug does not manifest.

    The key to determining whether a particular query is vulnerable to this bug (TF 8619 aside) is to check whether the query plan includes operators to maintain the indexed view. At a minimum, you should see an update operator for the view:

    View Update Operator

    SQL Sentry Plan Explorer explicitly identifies the operator as applying to a view; in SSMS you need to click on the graphical operator and inspect the Properties window.

    Summary

    The updated conditions for incorrect results are:

    • An indexed view that joins tables
    • Two of the tables have a single-column FOREIGN KEY constraint
    • A MERGE statement contains an UPDATE action that affects one of the tables
    • The MERGE statement also contains an INSERT or DELETE action (or both)
    • The optimizer applies a simplification that removes a table reference based on the FK relationship and other metadata
    • As a result, the MERGE execution plan does not contain the operators necessary to correctly maintain the indexed view
    • A subsequent query plan accesses an index on the view, either explicitly or via indexed-view matching (Enterprise Edition)

    Note:

    • The simplification is not applied in tempdb
    • The simplification is not applied to multi-column foreign key constraints

    Under these conditions, the indexes on the view do not reflect the state of the base tables and incorrect results are returned. Once the hot fix is applied, the optimizer does not misapply the simplification so the correct indexed view maintenance features are built into execution plans.

    Update: I am adding this based on Ian Yates’ great question in the comments: my expectation is that applying this hotfix will not remove any existing indexed view corruption. You would need to test the hotfix as usual, apply it, and then either rebuild all affected indexed views manually, or run DBCC CHECKDB with the EXTENDED_LOGICAL_CHECKS option (which could take a long time).

    © 2013 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi
    Email: SQLKiwi@gmail.com

  • A creative use of IGNORE_DUP_KEY

    Let’s say you have a big table with a clustered primary key, and an application that inserts batches of rows into it. The nature of the business is that the batch will inevitably sometimes contain rows that already exist in the table.

    The default SQL Server INSERT behaviour for such a batch is to throw error 2627 (primary key violation), terminate the statement, roll back all the inserts (not just the rows that conflicted) and keep any active transaction open:

    CREATE TABLE #Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    -- Start with values 1 & 5 in the table
    INSERT #Big (pk) VALUES (1), (5);
     
    -- Our batch transaction
    BEGIN TRANSACTION;
        -- Insert a batch containing some pre-existing rows
        INSERT #Big (pk) VALUES (1), (2), (3), (4), (5);
     
        -- Show the contents of the table after the insert statement
        SELECT pk FROM #Big;
     
        -- Show the transaction count
        SELECT tran_count = @@TRANCOUNT;
     
    -- Rollback
    ROLLBACK TRANSACTION;
     
    -- Final table contents
    SELECT pk FROM #Big;
    GO
    -- Tidy up
    DROP TABLE #Big;

    The output is:

    INSERT with duplicate values

    Ignoring Duplicates

    Let us further imagine that the desired behaviour is that new rows in a batch should be inserted, and any duplicates silently rejected. Most importantly, no error messages should be returned to the client. Ideally, this would be achieved without immediate application changes, and without impacting concurrent users of the table (the instance is Enterprise Edition, so online operations are available).

    This seems like an ideal use for the IGNORE_DUP_KEY option:

    CREATE TABLE dbo.Big
    (
        pk int NOT NULL,
        
        CONSTRAINT PK_Big 
        PRIMARY KEY (pk)
        WITH (IGNORE_DUP_KEY = ON)
    );
    GO
    -- Unique values
    INSERT dbo.Big (pk)
    VALUES (1), (3), (5);
    GO
    -- key 3 already exists
    INSERT dbo.Big (pk)
    VALUES (2), (3), (4);

    That script executes successfully with just a warning (not an error!) about the duplicate key:

    IGNORE_DUP_KEY output

    The problem

    We would like to add the IGNORE_DUP_KEY option to the existing primary key constraint, but the ALTER TABLE command does not have an ALTER CONSTRAINT clause. We do not want to drop the existing primary key and recreate it with the new option, because the table is large and we want to avoid disrupting concurrent users. Dropping and recreating would result in rebuilding the entire table first as a heap and then as a clustered table again.

    Although there is no ALTER CONSTRAINT syntax, we do know that certain constraint modifications can be performed using ALTER INDEX REBUILD on the index used to enforce the constraint. We can, for example, change the type of compression used, the index fill factor, and whether row locking is enabled:

    CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    INSERT dbo.Big (pk)
    VALUES (1), (2), (3), (4), (5);
    GO
    ALTER INDEX PK_Big ON dbo.Big
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
    GO
    ALTER INDEX PK_Big ON dbo.Big
    REBUILD WITH (FILLFACTOR = 90, ALLOW_ROW_LOCKS = OFF, ONLINE = ON);
    GO
    DROP TABLE dbo.Big;

    Unfortunately, we cannot use the same trick to add the IGNORE_DUP_KEY option to the underlying index for the primary key:

    ALTER INDEX with IGNORE_DUP_KEY

    The same error message results even if the ONLINE = ON option is not specified. There will be a range of views about how accurate the error message text is here, but we cannot avoid the fact that the operation we want to perform is not supported by SQL Server.

    A creative workaround

    One idea is to add a new UNIQUE constraint (or index) on the same columns as the primary key, but with the IGNORE_DUP_KEY option added. This operation can be performed ONLINE:

    -- Existing table
    CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    -- Existing data
    INSERT dbo.Big (pk) VALUES (1), (2), (3);
    GO
    -- New constraint (or index) with IGNORE_DUP_KEY, added ONLINE
    ALTER TABLE dbo.Big
    ADD CONSTRAINT UQ_idk
    UNIQUE NONCLUSTERED (pk)
    WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
     
    --CREATE UNIQUE NONCLUSTERED INDEX UQ_idk 
    --ON dbo.Big (pk)
    --WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
    GO
    -- Key 3 is a duplicate, just a warning now
    INSERT dbo.Big (pk) VALUES (3), (4), (5);
    GO
    SELECT pk FROM dbo.Big;

    The new arrangement results in the correct final state of the database, without throwing an error. The effect is the same as if we had been able to modify the existing primary key constraint to add IGNORE_DUP_KEY:

    Output with duplicate unique index

    Execution plan analysis

    There are some drawbacks to this idea, which we will explore in some detail. The drawbacks are significant, so adding the extra index with IGNORE_DUP_KEY is only really suitable as a temporary solution, as we will see. The first INSERT in the previous script (without the extra constraint or index with IGNORE_DUP_KEY) is pretty trivial; it just shows the constants from the VALUES clause being inserted to the clustered index:

    Trivial INSERT

    The second INSERT (with the IGNORE_DUP_KEY index) is rather more complex (click to enlarge):

    INSERT with IGNORE_DUP_KEY

    An extra index to maintain

    One fairly obvious consequence of adding the new index is that the Clustered Index Insert operator now has to maintain the new nonclustered index too. I used Plan Explorer above because it shows the per-row nonclustered index insert more explicitly than SSMS, where you have to dig into the Object node in the Properties window with the relevant graphical operator selected:

    SSMS Object Node

    Another way to see the nonclustered index maintenance explicitly is to run the query again with undocumented trace flag 8670 to produce a wide update plan:

    Wide update plan

    Besides maintaining the extra index, the IGNORE_DUP_KEY plans seem to be doing a lot of extra work: there are lots of new operators compared with the simple insert. As you would expect, all the new operators are associated with ignoring duplicate keys, and there are two distinct cases to consider.

    Rows that already exist

    The first case relates to checking for INSERT rows that already exist in the base table. This checking is implemented in the execution plan by the Left Semi Join, Index Seek, and Assert operators:

    Plan Right Fragment

    The Index Seek looks for the key value of the current row we are looking to insert. The Semi Join cannot be an Inner Join because that would reject rows where a match was not found (and we need to insert a row in that case). Nevertheless, the query processor needs to know if an existing row was found with the same key value. The Semi Join sets a value to indicate this, which is stored in the Probe Column:

    Nested Loops Operator Properties

    The Assert operator (known internally as a Stream Check) tests a condition and raises an error if the test fails. Assert operators are often seen in plans that affect columns with CHECK constraints, for example. In this case, however, the Assert does not raise an error; it emits the ‘Duplicate key was ignored.’ message instead, and only passes a row on to its parent (the Sort operator) if the condition passes. The check performed by the Assert is based on the Probe value stored in the expression labelled [Expr1013] above:

    Assert Operator Tooltip

    The Assert passes rows where the predicate evaluates to anything other than NULL. If a match was found by the Index Seek (so the Probe Column, Expr1013, is not NULL) the Assert does not pass the row on, and raises a warning instead. The following TF 8607 output fragment for the INSERT statement shows the option (there is nothing equivalent in regular show plan output, unfortunately):

    WarnIgnoreDuplicate

    Duplicates within the Insert Set

    The first check only looks for rows from the insert set that already exist in the base table. The query processor also needs to check for duplicates within the set of values we are inserting. The Segment and Top operators in the plan combine to meet this requirement:

    Plan Middle Fragment

    The Segment requires a stream ordered by the index keys, and adds a flag to indicate the start of a new group of key values. The Top operator returns only one row from each group. The overall effect is to remove rows with duplicate index key values.
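
    The effect of the Segment and Top pair can be expressed in familiar T-SQL. The following sketch is purely my illustration (it is not anything SQL Server generates): it keeps one row per key value using ROW_NUMBER, which the optimizer typically implements with a similar Segment-based pattern:

    -- Illustration only: keep one row per duplicate key value,
    -- as the Segment and Top operators do in the INSERT plan
    WITH NewRows AS
    (
        SELECT
            v.pk,
            rn = ROW_NUMBER() OVER (PARTITION BY v.pk ORDER BY v.pk)
        FROM (VALUES (3), (4), (4), (5)) AS v (pk)
    )
    SELECT pk
    FROM NewRows
    WHERE rn = 1;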

    The Top and Segment execution operators together implement a single physical query processor operator called a “Group-By Top”. I mention this detail because we need to be familiar with the internal name to understand the TF 8607 output, which indicates that this operator also emits a ‘Duplicate key was ignored.’ warning when it encounters a group containing more than one row:

    WARN-DUP

    Locking

    Adding the new index means that locks will now be taken on the nonclustered index when checking for existing rows:

    Index Seek Locks

    It might surprise you to learn that the index seek acquires Range S-U locks on the nonclustered index (extract from the Books Online link below):

    Range Locks

    Key-range locks like this are normally only taken under the SERIALIZABLE isolation level, but this INSERT statement takes them regardless of the isolation level the user runs the query under (for example, range locks are still taken if the session is running under the default READ COMMITTED isolation level).

    SQL Server raises the effective isolation level to SERIALIZABLE for the Index Seek (and just that operator) because it needs to be sure that if it does not find a match in the index, that situation remains the same until it inserts the new row. The Index Seek is looking for a key value that does not exist, so a range lock is necessary to prevent a concurrent transaction inserting a row with that key value before we do. There is no existing key in the index to lock (because the row does not exist) so a range lock is the only option. If SQL Server did not do this, our INSERT query might fail with a duplicate key error despite the IGNORE_DUP_KEY setting (we check for a row, don’t find it, someone else inserts it, then we try to).
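
    If you want to see these locks for yourself, one option (a sketch I am adding here, not part of the original demo) is to hold an insert open in a transaction and inspect sys.dm_tran_locks before rolling back:

    BEGIN TRANSACTION;

    -- Key 6 does not exist yet, so a range lock is needed
    INSERT dbo.Big (pk) VALUES (6);

    -- Locks held by this session, including the key-range locks
    SELECT
        TL.resource_type,
        TL.request_mode,
        TL.request_status
    FROM sys.dm_tran_locks AS TL
    WHERE
        TL.request_session_id = @@SPID;

    ROLLBACK TRANSACTION;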

    The optimizer adds a SERIALIZABLE hint for the Index Seek operator (internally, a physical Range operator) as we can see using another extract from the TF 8607 output:

    Query Processor Hints

    The execution plan is also forced to use the nonclustered index with the IGNORE_DUP_KEY setting for this seek (FORCEDINDEX), and an update lock (UPDLOCK) hint is given to help prevent conversion deadlocks. Nevertheless, you may find increased deadlocking if you choose to add an extra IGNORE_DUP_KEY index like this.

    Permanent solutions

    Adding a nonclustered unique index with the IGNORE_DUP_KEY option and the same key as the clustered primary key allowed us to solve an immediate problem without code changes, while keeping the table online, but it does come at a price. The execution plan is much more complex (and expensive) than the original INSERT, and there is a chance of deadlocks. The biggest performance impact of adding the extra index is of course the cost of maintaining it, meaning we need to look at a more permanent solution via a code change.

    IGNORE_DUP_KEY

    The (trivial) test I am going to run inserts 5000 unique rows from my table of numbers into the table we have been using so far. To establish a baseline, we will first look at the execution plan for an INSERT to the table with a nonclustered IGNORE_DUP_KEY index:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000;

    IGNORE_DUP_KEY test

    This plan is very similar to the one we analysed earlier, though a Merge is used to perform the Left Semi Join, and a wide update plan has been chosen, making the nonclustered index insert easier to see. The other main feature is an Eager Table Spool, required for Halloween Protection. The estimated cost of this execution plan is 0.363269 cost units on my installation.

    Modified INSERT

    The first code alternative to the extra index is to modify the INSERT statement to check that rows do not already exist in the destination. There are a number of SQL syntaxes for this, each with different characteristics and performance in different circumstances. To keep things simple, and because I only want to make a couple of specific points here, I have chosen my simple example to produce the same execution plans for all the common syntax variants. The first thing to do is to drop our extra constraint:

    ALTER TABLE dbo.Big
    DROP CONSTRAINT UQ_idk;

    Now we can evaluate the modified INSERT statement:

    -- I prefer this syntax
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS (SELECT * FROM dbo.Big WHERE pk = n);
     
    -- With EXCEPT
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    EXCEPT
    SELECT pk FROM dbo.Big;
     
    -- Not recommended
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND n NOT IN (SELECT pk FROM dbo.Big);

    All three produce this execution plan:

    Modified INSERT test

    This has an estimated cost of 0.0981188 units – much cheaper than the 0.363269 cost seen previously. The Halloween Protection spool still features in the plan, but there is a weakness in our queries that you might have already spotted. We are not doing anything to protect against duplicate key violation errors in case someone else inserts a row after we have checked to see if it exists, but before we insert it. The query optimizer added a SERIALIZABLE hint when it added an existence check, so if avoiding errors is important to us, we need to do the same:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS (SELECT * FROM dbo.Big WITH (SERIALIZABLE) WHERE pk = n);

    We do not need the UPDLOCK hint for two reasons. First, the engine automatically takes update locks for us when reading the source table. Second, we are reading from the same index we are inserting to (not reading from a nonclustered index and inserting to the clustered index) so the previous deadlock scenario is not applicable.

    Using MERGE

    Another option is to use the WHEN NOT MATCHED feature of the MERGE statement. This time we will add the necessary SERIALIZABLE hint up front:

    MERGE dbo.Big WITH (SERIALIZABLE) AS b
    USING (SELECT n FROM dbo.Numbers WHERE n BETWEEN 1 AND 5000) AS s
    ON s.n = b.pk
    WHEN NOT MATCHED THEN 
    INSERT (pk) VALUES (s.n);

    MERGE test

    This plan has an estimated cost of 0.0950127 units – slightly less than the 0.0981188 units for the modified INSERT plans. Some of this improvement is due to the lack of a Halloween Protection spool, for interesting reasons I cover in depth in a short series of articles to be published shortly (update: part one is out).

    These are not meant to be performance tests by any stretch of the imagination. There are any number of subtle factors that will affect the execution plans and run times for different numbers of rows, different distributions, and so on. I should stress that I find MERGE plans perform less well than separate INSERT/UPDATE/DELETE statements more often than not. Anyway, in case you are interested, typical performance results on my machine for this specific test are (INSERT first, then MERGE), timings in milliseconds:

    Performance results

    IGNORE_DUP_KEY and Clustered Indexes

    When IGNORE_DUP_KEY is specified for a unique clustered index (primary key or otherwise), duplicates are handled by the storage engine rather than the query processor. For example:

    CREATE TABLE dbo.T (pk int NOT NULL CONSTRAINT PK_idk PRIMARY KEY WITH (IGNORE_DUP_KEY = ON));
    INSERT dbo.T (pk) VALUES (1), (1), (2), (3);

    The execution plan shows none of the complexity of the nonclustered index case:

    image

    The TF 8607 output contains none of the query processor hints to ignore duplicates and raise warnings:

    image

    The IGNORE_DUP_KEY side of things is handled entirely by the storage engine – if it finds a row in the unique index where it was going to insert one, it flags a warning but does not try to insert (which would cause an error). The estimated cost of this query plan is identical whether the unique clustered index has IGNORE_DUP_KEY or not – there is no extra work for the query processor.

    If you are wondering why the same mechanism is not used for a nonclustered unique index with the IGNORE_DUP_KEY option, my understanding is that it is because the clustered index is always updated before the nonclustered ones (even in a narrow plan). By the time the storage engine detected a duplicate in the nonclustered index, it would have already added the row to the base table. So, the query processor handles IGNORE_DUP_KEY for nonclustered indexes.

    This is my understanding of the mechanics of IGNORE_DUP_KEY; I don’t know for sure, and will happily correct any details I may have wrong. Any storage engine experts reading this, please contact me if so.

    Acknowledgement

    This post was inspired by one written by Daniel Adeniji, and by our subsequent discussion of it. My thanks to him for permission to analyse the issue further here, and for allowing me to reference his original. The issue described here may not accurately reflect Daniel’s original problem – I used a certain amount of artistic licence. Even if the specific issue is not particularly interesting to you, I hope you enjoyed some aspects of the analysis and maybe picked up some new information along the way.

    © 2013 Paul White – All rights reserved
    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Optimizing T-SQL queries that change data

    Most tuning efforts for data-changing operations concentrate on the SELECT side of the query plan. Sometimes people will also look at important storage engine considerations like locking and transaction log throughput, which can have dramatic effects. As a consequence, a number of common practices have emerged, such as avoiding large numbers of row locks and lock escalation, splitting large changes into smaller batches of a few thousand rows, and combining a number of small changes into a single transaction in order to optimize log flushes.

    This is all good of course, but what about the data-changing side of the query plan – the INSERT, UPDATE, DELETE or MERGE operation itself – are there any query processor considerations we should take into account? The short answer is yes. The query optimizer considers different plan options for the write-side of an I/U/D/M query, though there isn’t a huge amount of T-SQL language support that allows us to affect these choices directly. Nevertheless, there are things to be aware of, and things we can look to change.

    Side note: In this post I am going to use the term update (lower case) to apply to any operation that changes the state of the database (INSERT, UPDATE, DELETE and MERGE in T-SQL). This is a common practice in the literature, and is used inside SQL Server too as we will see.

    The Three Basic Update Forms

    First a bit of background. All query plans execute using a demand-driven iterator model, and updates are no exception. Parent operators drive child operators (to the right of the parent) to do work by asking them for a row at a time. Take the following simple AdventureWorks INSERT for example:

    DECLARE @T AS TABLE (ProductID int);
    INSERT @T (ProductID) 
    SELECT p.ProductID 
    FROM Production.Product AS p;

    Query Plan Execution Order Example

    Plan execution starts at the far left, where you can think of the green T-SQL INSERT icon as representing rows returned to the client. This root node asks its immediate child operator (the Table Insert) for a row. The Table Insert requests a row from the Index Scan, which provides one. This row is inserted into the heap, and an empty row is returned to the root node (if the INSERT query contained an OUTPUT clause, the returned row would contain data). This row-by-row process continues until all source rows have been processed. Notice that the XML show plan output shows the Heap Table Insert is performed by an Update operator internally. Ok, on to the matter at hand:

    1. Wide, per-index updates

    Wide per-index update plans have a separate update operator for each clustered and nonclustered index. It is always possible for the optimizer to produce this type of plan; indeed, the alternatives described later are not always available (or have not been implemented) when certain engine features are used. For example, if the update target has an indexed view defined on it, a wide update plan has to be used to maintain the indexed view. An example per-index update plan is shown below:

    Wide update plan

    This plan updates the base table using a Clustered Index Delete operator. This operator may also read and output extra column values necessary to find and delete rows from nonclustered indexes. The iterative delete activity on the base table is driven by the Eager Table Spool on the top branch. The spool is eager because it stores all the rows from its input in a worktable before returning the first row to its parent operator (the Index Delete on the top branch). The effect is that all qualifying base table rows are deleted before any nonclustered index changes occur.

    The spool in this plan is a common subexpression spool. It is populated once and then acts as a row source for multiple consumers. In this case, the contents of the spool are consumed first by the top-branch Index Delete, which removes index entries from one of the nonclustered indexes. When the Sequence operator switches to asking for rows from the lower branch, the spool is rewound and replays its rows for the second nonclustered index. Note that the spool contains the UNION of all columns required by nonclustered index changes.

    2. Narrow, per-row updates

    Narrow per-row updates have a single update operator which maintains the base table (heap or clustered) and one or more nonclustered indexes. Each row arriving at the update operator updates all indexes associated with the operator before processing the next row. An example:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int UNIQUE, col2 int UNIQUE);
    DECLARE @S AS TABLE (pk int PRIMARY KEY, col1 int, col2 int);
     
    INSERT @T (pk, col1)
    SELECT
        s.pk,
        s.col1 
    FROM @S AS s;

    The execution plan viewed using SQL Sentry Plan Explorer shows the nonclustered index maintenance explicitly (hover over the Clustered Index Insert for a tooltip showing the names of the nonclustered indexes involved):

    Plan Explorer Narrow Update Plan

    In SQL Server Management Studio, there is no information about the nonclustered index maintenance in the graphical plan or tooltips; we need to click on the update operator (Clustered Index Insert) and then check the Properties window:

    SSMS narrow index updates

    This shows the clustered index and two nonclustered indexes being maintained. Because all indexes are updated for each row streaming through the update operator, this plan shape is also known as a per-row update plan.

    The query optimizer uses cost-based reasoning to decide whether to update each nonclustered index separately (per-index) or as part of the base table changes (per-row). An example that updates one nonclustered index per-row, and another per-index is shown below:

    CREATE TABLE #T (pk int IDENTITY NOT NULL, col1 int, col2 varchar(8000) DEFAULT '');
    ALTER TABLE #T ADD CONSTRAINT PK_T PRIMARY KEY (pk);
    CREATE INDEX nc1 ON #T (col1);
    CREATE INDEX nc2 ON #T (col1) INCLUDE (col2);
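
    The INSERT used to generate the plan is not shown above. Any reasonably large insert will do; for example, the following hypothetical statement (reusing the dbo.Numbers helper table from earlier – the per-index versus per-row choice for each index ultimately depends on the optimizer’s cost estimates):

    -- Hypothetical data load to exercise both nonclustered indexes
    INSERT #T (col1, col2)
    SELECT n, REPLICATE('X', 100)
    FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 100000;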

    The combination strategy can be seen with Plan Explorer (click to expand in a new tab):

    Combination index update

    The details are also displayed in the SSMS Properties window when the Clustered Index Insert operator is selected.

    3. Single-operator updates

    The third form of update is an optimization of the per-row update plan for very simple operations. The cost-based optimizer still emits a per-row update plan, but a rewrite is subsequently applied to collapse the reading and writing operations into a single operator:

    -- Simple operations
    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    INSERT @T (pk, col1) VALUES (1, 1000);
    UPDATE @T SET col1 = 1234 WHERE pk = 1;
    DELETE @T WHERE pk = 1;

    The complete execution plans (and extracts from the XML plans) are shown below:

    Single operator update plans

    The UPDATE and DELETE plans have been collapsed to a ‘Simple Update’ operation, while the INSERT is simplified to a ‘Scalar Insert’. We can see the original optimizer output tree with trace flags 3604 and 8607:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
     
    DBCC TRACEON (3604, 8607);
     
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DELETE @T WHERE pk = 1 OPTION (RECOMPILE);
     
    DBCC TRACEOFF (3604, 8607);

    The trace output for the INSERT statement shown above is:

    Optimizer output tree

    Notice the two physical operators: a constant scan (in-memory table) containing the literal values specified in the query; and a stream update operator that performs the database changes per row. All three statements produce a similar optimizer tree output (and all use a stream update). We can prevent the single-operator optimization being applied with undocumented trace flag 8758:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
     
    DBCC TRACEON (8758);
     
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DELETE @T WHERE pk = 1 OPTION (RECOMPILE);
     
    DBCC TRACEOFF (8758);

    This exposes the expanded per-row update plans produced by the optimizer before the single-operator rewrite:

    Expanded single operator plan

    Single operator plans can be significantly more efficient than the original multiple-operator form in the right circumstances, for example if the plan is executed very frequently.

    Update plan optimizations

    From this point forward, I’m going to use simple AdventureWorks DELETE examples, but the general points apply equally to INSERT, UPDATE, and MERGE. The examples will not have a complicated SELECT component, because I want to concentrate on the update side of the plan rather than the reading side.

    Per-row updates

    It’s worth emphasising again that per-row update plans are an optimization that is not available for every update query. The optimizer favours a per-row plan if the expected number of rows at the update operator is low. In the example below, the optimizer expects to delete 25 rows:

    Per-row updates

    This plan updates the base table (a clustered index in this case) and all nonclustered indexes in sequence per row. The row source here is an ordered scan of a nonclustered index, which means rows will not generally be presented in clustered index key order to the update operator.

    An unordered stream like this tends to result in random I/O (assuming physical I/O is required at all, of course). If the plan is expected to process only a small number of rows, the optimizer decides it is not worth adding extra operations to encourage sequential I/O.

    Unordered Prefetch

    If the expected number of rows is a bit larger, the optimizer may decide to build a plan that applies prefetching to one or more update operators. The basic idea is the same as ordinary prefetching (a.k.a. read-ahead) for scans and range seeks. The engine looks ahead at the keys of the incoming stream and issues asynchronous I/Os for pages that will be needed by the update operator soon. Prefetching helps reduce the impact of the random I/O mentioned a moment ago. An example of prefetching on the update operator is shown below for the same query and an expected cardinality of 50 rows:

    image

    The prefetching is unordered from the perspective of the clustered index. Neither SSMS nor Plan Explorer show the prefetch information prominently. In Plan Explorer, we need to have the With Unordered Prefetch optional column selected. To do this, switch to the Plan Tree tab, open the Column Chooser, and drag the column to the grid. The unordered prefetch property is ticked for the Clustered Index Delete operator in the screenshot below:

    Unordered Prefetch in Plan Explorer

    In SSMS, click on the Clustered Index Delete icon in the graphical plan, and then look in the Properties window:

    SSMS Unordered Prefetch

    If the query plan were reading from the clustered index, the read operation would issue regular read-ahead, so there would be no point prefetching from the Clustered Index Delete as well. The example below shows the expected cardinality increased to 100, where the optimizer switches from scanning the nonclustered index with unordered prefetching to scanning the clustered index. No prefetching occurs on the update operator in this plan:

    No clustered index prefetch

    Where the plan property With Unordered Prefetch is not set, it is simply omitted – it does not appear set to False as you might expect.

    Ordered Prefetch

    Where rows are expected to arrive at an update operator in non-key order, the optimizer might consider adding an explicit sort. The idea is that presenting rows in key order will encourage sequential I/O for the index pages. The optimizer weighs the cost of the sort against the expected benefits of avoiding random I/O. The execution plan below shows an example of a sort being added for this reason:

    Sort to reduce random I/O

    The sort is on Transaction ID, the clustering key for this table. With the incoming stream now sorted, the update operator can use ordered prefetching. The plan is still scanning the nonclustered index on the read side, so read-ahead issued by that operator does not bring in the clustered index pages needed by the update operator. Ordered prefetching tends to result in sequential I/O, compared with the mostly random I/O of unordered prefetching. The With Ordered Prefetch column is also not shown by default by Plan Explorer, so it needs to be added as we did before:

    Plan Explorer Ordered Prefetch

    Not every update operator with ordered prefetching requires a sort. If the optimizer finds a plan that naturally presents keys in nonclustered index order, an update operator may still use ordered prefetching.

    DML Request Sort

    When the optimizer decides to explore a plan alternative that requires ordered input to an update operator, it will set the DML Request Sort property for that operator. Plan Explorer shows this setting in the update operator tooltip:

    Plan Explorer DML Request Sort

    SSMS shows With Ordered Prefetch and DML Request Sort in the Properties window:

    SSMS DMLRequestSort and WithOrderedPrefetch

    Nonclustered index updates

    Sorting and ordered prefetching may also be considered for the nonclustered indexes in a per-index update plan.

    Sort for nonclustered index

    This plan shows all three update operators with DML Request Sort set. The Clustered Index Delete does not require an explicit sort because rows are being read (with internal ordering guarantees) from the Clustered Index Scan. The two nonclustered indexes do require explicit sorts, however. Both nonclustered index update operators also use ordered prefetching; the clustered index update does not, because the scan automatically invokes read-ahead for that index.

    The next example shows that each update operator is considered separately for the various optimizations:

    Indexes optimized separately

    That plan features unordered prefetching on the Clustered Index Delete (because the row source is a scan of a nonclustered index), an explicit sort with ordered prefetching on the top branch Index Delete, and unordered prefetching on the bottom branch Index Delete.

    The per-index and per-row update plan

    Is the following a per-index or per-row update plan?

    Per-index and per-row plan

    It appears to be a per-index plan because there are two separate update operators, but there is no blocking operator (eager spool or sort) between the two. This plan is a pipeline: each row returned from the seek will be processed first by the Clustered Index Delete and then by the Index Delete. In that sense, both updates (clustered and nonclustered) are certainly per-row.

    The important thing is not the terminology, it is being able to interpret the plan. Now that we know what the various optimizations are for, we can see why the optimizer chose the plan above over the seemingly equivalent, but more efficient-looking, narrow plan:

    Narrow plan

    The narrow plan has no prefetching. The Seek provides read-ahead for the clustered index pages, but the single nonclustered index has no way to prefetch any pages it might need from disk. By contrast, the previous wide update plan has two update operators. The Clustered Index Delete has no prefetching for the same reason as in the narrow plan, but the Index Delete has unordered prefetching enabled.

    So, although the smaller narrow plan looks like it ought to be more efficient, it might perform less well if nonclustered index pages are required from disk and unordered prefetching proves effective. On the other hand, if all required nonclustered index pages are in memory at the time the query is executed, the wide plan might perform less well since it features an extra operator and has to perform all the work associated with prefetching (even though ultimately no physical reads are issued).

    Performance Impacts

    You may have noticed a lack of performance numbers (I/O statistics, elapsed times and so on) in this post. There is a good reason for this. The optimizations presented here are quite dependent on SQL Server configuration and hardware. I don’t want you drawing general conclusions based on the performance of the rather ordinary single SATA drive in my laptop, the 256MB of RAM I have allowed for my SQL Server 2012 instance, and some rather trivial AdventureWorks examples.

    Which Optimizations are Good?

    It depends, naturally.

    If the data structures you are changing are very likely to be in memory, and/or if you have a very fast random I/O system, the optimizer’s goal of minimizing random I/O and enhancing sequential I/O may not make much sense for you. The strategies the optimizer employs may well end up costing you much more than they save.

    On the other hand, if your system has a much smaller buffer pool than database working set size, and your I/O system works very much faster with large sequential I/O than with smaller random I/O requests, you should generally find the optimizer makes reasonable choices for you, subject to the usual caveats about useful up-to-date statistics and accurate cardinality estimation, of course.

    If your system fits the optimizer’s model, at least to a first approximation, you will usually find that narrow update plans are best for low cardinality update operations, unordered prefetching helps a bit when more rows are involved, ordered prefetching helps more, and explicit sorts before nonclustered index updates help most of all for the largest sets. That is the theory, anyway.

    What are the Problems?

    The problem I see most often is with the optimization that is supposed to help most: the explicit sort before a nonclustered index update. The idea is that sorting the input to the index update in key order promotes sequential I/O, and often it does. The problem occurs if the workspace memory allocated to the sort at execution time proves to be too small to perform the sort entirely in memory. The memory grant is fixed based on cardinality estimation and row size estimates, and may be reduced if the server is under memory pressure at the time.

    In any case, however it occurs, a sort that runs out of memory spills whole sort runs to physical tempdb disk, often repeatedly. Unsurprisingly, this is not at all good for performance, and can result in queries that complete very much slower than if the sort had not been introduced at all. SQL Server reuses the general Sort operator for sorting index keys – it does not have a sort specially optimized for index updates which could make a best effort to sort within the memory allocated, but never spill to disk. Perhaps I should suggest this on Connect :)

    The narrow plan optimization can cause problems where it is selected due to a low cardinality estimate, but several nonclustered indexes need to be maintained, the necessary pages are not in memory, and the I/O system is slow at random reads.

    The ordered and unordered prefetch options cause problems much more rarely. For a system with all data in memory, there is a small but possibly measurable impact due to the prefetch overhead (finding pages to consider read-ahead for, checking to see if they are already in the buffer pool, and so on). Even if no asynchronous I/O is ever issued by the worker, it still spends time on the task that it could spend doing real work.

    What options are there?

    The usual answer to deal with poor execution plan choices due to a system’s configuration or performance characteristics not matching the optimizer’s model is to try different T-SQL syntax, and/or to try query hints. There are times when it is possible to rewrite the query to get a plan that performs better, and other times where rethinking the problem a bit can pay dividends. Various creative uses of partitioning, minimal logging, and bulk loading are possible in some cases, for example.

    There are very few hints that can help with the update side of a plan that does not respond to the usual tricks, however. The two I use most often affect the optimizer’s cardinality estimation: OPTION (FAST n), and a trick involving TOP and the OPTIMIZE FOR query hint. OPTION (FAST n) affects cardinality estimates in a plan by setting a final row goal, but its effects can be difficult to predict, and may not have much of an effect if the plan contains blocking operators. Anyway, the general idea is to vary the value of ‘n’ in the hint until the optimizer chooses the desired plan shape or optimization options as described in this post. Best of luck.
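
    As a sketch of the first option (with an illustrative row-goal value only), the hint is simply appended to the statement:

    -- Hypothetical example: set a row goal of 50 for costing purposes
    DELETE Production.TransactionHistory
    WHERE TransactionDate < '20030901'
    OPTION (FAST 50);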

    The idea with TOP is similar, but tends to work better in my experience. The trick is to declare a bigint variable with the number of rows the query should return (or a very large value such as the maximum value of a bigint if all rows are required), use that as the parameter to a TOP, and then use the OPTIMIZE FOR hint to set a value for ‘n’ that the optimizer should use when considering alternative plans. This option is particularly useful for encouraging a narrow plan.

    Most of the examples in this post used the TOP trick in the following general form:

    DECLARE @n bigint = 9223372036854775807;
    DELETE TOP (@n) 
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 50));

    No magic trace flags?

    Sure there are trace flags that affect the optimizations in this post. They are undocumented and unsupported of course, and may only work on some versions, but they can be handy to validate a performance analysis, or to generate a plan guide for a particularly crucial query. They can also be fun to play with on a personal system to explore their effects. The main ones that affect the optimizations described here are:


    • 2332 : Force DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)
    • 8633 : Enable prefetch (CUpdUtil::FPrefetchAllowedForDML and CPhyOp_StreamUpdate::FDoNotPrefetch)
    • 8744 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)
    • 8758 : Disable rewrite to a single operator plan (CPhyOp_StreamUpdate::PqteConvert)
    • 8790 : Force a wide update plan (CUpdUtil::PexprBuildWideUpdatePlan)
    • 8795 : Disable DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)
    • 9115 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)

    These trace flags can all be manipulated using the usual DBCC commands or enabled temporarily for a particular query using the OPTION (QUERYTRACEON xxxx) hint.
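
    For example, to force a wide update plan for a single statement using flag 8790 from the list above (for experimentation on a personal system only, of course), the hint combines with the TOP trick shown earlier:

    DECLARE @n bigint = 9223372036854775807;
    DELETE TOP (@n) 
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 50), QUERYTRACEON 8790);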

    Final Thoughts

    The optimizer has a wide range of choices when building the writing side of an update plan. It may choose to update one or more indexes separately, or as part of the base table update. It may choose to explicitly sort rows as part of a per-index update strategy, and may elect to perform unordered or ordered prefetching as well. As usual, these decisions are made based on costs computed using the optimizer’s model. This model may not always produce plans that are optimal for your hardware (and as usual the optimizer’s choices are only as good as the information you give it).

    If a particular update query is performance critical for you, make sure cardinality estimates are reasonably accurate. Test with alternate syntax and/or trace flags to see if an alternative plan would perform significantly better in practice. It is usually possible to use documented techniques like TOP clauses and OPTIMIZE FOR hints to produce an execution plan that performs well. Where that is not possible, more advanced tricks and techniques like plan guides may be called for.

    Thanks for reading today, I hope this post was interesting and provided some new insights into update query optimization.

    © 2013 Paul White – All rights reserved
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • MERGE Bug with Filtered Indexes

    A MERGE statement can fail, and incorrectly report a unique key violation when:

    • The target table uses a unique filtered index; and
    • No key column of the filtered index is updated; and
    • A column from the filtering condition is updated; and
    • Transient key violations are possible

    Example Tables

    Say we have two tables, one that is the target of a MERGE statement, and another that contains updates to be applied to the target.  The target table contains three columns, an integer primary key, a single character alternate key, and a status code column.  A filtered unique index exists on the alternate key, but is only enforced where the status code is ‘a’:

    CREATE TABLE #Target 
    (
        pk          integer NOT NULL, 
        ak          character(1) NOT NULL,
        status_code character(1) NOT NULL,
     
        PRIMARY KEY (pk)
    );
     
    CREATE UNIQUE INDEX uq1
    ON #Target (ak)
    INCLUDE (status_code)
    WHERE status_code = 'a';

    The changes table contains just an integer primary key (to identify the target row to change) and the new status code:

    CREATE TABLE #Changes
    (
        pk          integer NOT NULL,
        status_code character(1) NOT NULL,
     
        PRIMARY KEY (pk)
    );

    Sample Data

    The sample data for the example is:

    INSERT #Target 
        (pk, ak, status_code)
    VALUES 
        (1, 'A', 'a'),
        (2, 'B', 'a'),
        (3, 'C', 'a'),
        (4, 'A', 'd');
     
    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'd'),
        (4, 'a');

             Target                     Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ a           ║    ║  1 ║ d           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝

    The target table’s alternate key (ak) column is unique, for rows where status_code = ‘a’.  Applying the changes to the target will change row 1 from status ‘a’ to status ‘d’, and row 4 from status ‘d’ to status ‘a’.  The result of applying all the changes will still satisfy the filtered unique index, because the ‘A’ in row 1 will be deleted from the index and the ‘A’ in row 4 will be added.

    Merge Test One

    Let’s now execute a MERGE statement to apply the changes:

    MERGE #Target AS t
    USING #Changes AS c ON 
        c.pk = t.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    The MERGE changes the two target rows as expected.  The updated target table now contains:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║ <—changed from ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║ <—changed from ‘d’
    ╚════╩════╩═════════════╝

    Merge Test Two

    Now let’s repopulate the changes table to reverse the updates we just performed:

    TRUNCATE TABLE #Changes;
     
    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'a'),
        (4, 'd');

    This will change row 1 back to status ‘a’ and row 4 back to status ‘d’.  As a reminder, the current state of the tables is:

             Target                        Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ d           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ a           ║
    ╚════╩════╩═════════════╝

    We execute the same MERGE statement:

    MERGE #Target AS t
    USING #Changes AS c ON 
        c.pk = t.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    However this time we receive the following message:

    Msg 2601, Level 14, State 1, Line 1
    Cannot insert duplicate key row in object 'dbo.#Target' with unique index 'uq1'. The duplicate key value is (A).
    The statement has been terminated.

    Applying the changes using UPDATE

    Let’s now rewrite the MERGE to use UPDATE instead:

    UPDATE t
    SET status_code = c.status_code
    FROM #Target AS t
    JOIN #Changes AS c ON
        t.pk = c.pk
    WHERE
        c.status_code <> t.status_code;

    This query succeeds where the MERGE failed.  The two rows are updated as expected:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ a           ║ <—changed back to ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ d           ║ <—changed back to ‘d’
    ╚════╩════╩═════════════╝

    What went wrong with the MERGE?

    In this test, the MERGE query execution happens to apply the changes in the order of the ‘pk’ column.

    In test one, this was not a problem: row 1 is removed from the unique filtered index by changing status_code from ‘a’ to ‘d’ before row 4 is added.  At no point does the table contain two rows where ak = ‘A’ and status_code = ‘a’.

    In test two, however, the first change applied was to row 1, changing its status from ‘d’ to ‘a’.  At that point there would be two rows in the filtered unique index where ak = ‘A’ (both row 1 and row 4 meet the index filter condition status_code = ‘a’).

    The storage engine does not allow the query processor to violate a unique key (unless IGNORE_DUP_KEY is ON, but that is a different story, and doesn’t apply to MERGE in any case).  This strict rule applies regardless of the fact that if all changes were applied, there would be no unique key violation (row 4 would eventually be changed from ‘a’ to ‘d’, removing it from the filtered unique index, and resolving the key violation).

    Why it went wrong

    The query optimizer usually detects when this sort of temporary uniqueness violation could occur, and builds a plan that avoids the issue.  I wrote about this a couple of years ago in my post Beware Sneaky Reads with Unique Indexes (you can read more about the details on pages 495-497 of Microsoft SQL Server 2008 Internals or in Craig Freedman’s blog post on maintaining unique indexes).  To summarize though, the optimizer introduces Split, Filter, Sort, and Collapse operators into the query plan to:

    1. Split each row update into a delete followed by an insert
    2. Filter out rows that would not change the index (due to the filter on the index, or a non-updating update)
    3. Sort the resulting stream by index key, with deletes before inserts
    4. Collapse delete/insert pairs on the same index key back into an update

    The effect of all this is that only net changes are applied to an index (as one or more insert, update, and/or delete operations).  In this case, the net effect is a single update of the filtered unique index: changing the row for ak = ‘A’ from pk = 4 to pk = 1.  In case that is less than 100% clear, let’s look at the operation in test two again:

             Target                     Changes                   Result
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ d           ║    ║  1 ║ A  ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║                            ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝                            ╚════╩════╩═════════════╝

    From the filtered index’s point of view (filtered for status_code = ‘a’ and shown in nonclustered index key order) the overall effect of the query is:

      Before           After
    ╔════╦════╗    ╔════╦════╗
    ║ pk ║ ak ║    ║ pk ║ ak ║
    ╠════╬════╣    ╠════╬════╣
    ║  4 ║ A  ║    ║  1 ║ A  ║
    ║  2 ║ B  ║    ║  2 ║ B  ║
    ║  3 ║ C  ║    ║  3 ║ C  ║
    ╚════╩════╝    ╚════╩════╝

    The single net change there is a change of pk from 4 to 1 for the nonclustered index entry ak = ‘A’.  This is the magic performed by the split, sort, and collapse.  Notice in particular how the original changes to the index key (on the ‘ak’ column) have been transformed into an update of a non-key column (pk is included in the nonclustered index).  By not updating any nonclustered index keys, we are guaranteed to avoid transient key violations.

    The Execution Plans

    The estimated MERGE execution plan that produces the incorrect key-violation error looks like this (click to enlarge in a new window):

    MERGE plan

    The successful UPDATE execution plan is (click to enlarge in a new window):

    UPDATE execution plan

    The MERGE execution plan is a narrow (per-row) update.  The single Clustered Index Merge operator maintains both the clustered index and the filtered nonclustered index.  The UPDATE plan is a wide (per-index) update.  The clustered index is maintained first, then the Split, Filter, Sort, Collapse sequence is applied before the nonclustered index is separately maintained.

    There is always a wide update plan for any query that modifies the database. The narrow form is a performance optimization where the number of rows is expected to be relatively small, and is not available for all operations.  One of the operations that should disallow a narrow plan is maintaining a unique index where intermediate key violations could occur.

    Workarounds

    The MERGE can be made to work (producing a wide update plan with split, sort, and collapse) by:

    • Adding all columns referenced in the filtered index’s WHERE clause to the index key (INCLUDE is not sufficient); or
    • Executing the query with trace flag 8790 set e.g. OPTION (QUERYTRACEON 8790).

    Undocumented trace flag 8790 forces a wide update plan for any data-changing query (remember that a wide update plan is always possible).  Either change will produce a successfully-executing wide update plan for the MERGE that failed previously.
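
    A sketch of both options applied to the example tables (the first recreates the filtered index with the filter column as an index key; the second leaves the index alone and forces the plan shape):

    -- Option 1: add the filter column to the index key
    DROP INDEX uq1 ON #Target;

    CREATE UNIQUE INDEX uq1
    ON #Target (ak, status_code)
    WHERE status_code = 'a';

    -- Option 2: force a wide update plan with TF 8790
    MERGE #Target AS t
    USING #Changes AS c ON 
        c.pk = t.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code
    OPTION (QUERYTRACEON 8790);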

    Conclusion

    The optimizer fails to spot the possibility of transient unique key violations with MERGE under the conditions listed at the start of this post.  It incorrectly chooses a narrow plan for the MERGE, which cannot provide the protection of a split/sort/collapse sequence for the nonclustered index maintenance.

    The MERGE plan may fail at execution time depending on the order in which rows are processed, and the distribution of data in the database.  Worse, a previously solid MERGE query may suddenly start to fail unpredictably if a filtered unique index is added to the merge target table at any point.

    Connect bug filed here

    Bug reproduced on the following SQL Server versions (all x64 Developer Edition):

    2012 SP1 CU1 (build 11.0.3321 – November 2012)
    2008 R2 SP2 CU3 (build 10.50.4266 – October 2012)
    2008 SP3 CU8 (build 10.0.5828 – November 2012)

    © 2012 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Cardinality Estimation Bug with Lookups

    Estimated row counts on key or RID lookups where a filtering predicate is applied can be wrong in SSMS execution plans.  This error does not affect the optimizer’s ultimate plan selection, but it does look odd.  There are other cases where estimated row counts are inconsistent (for defensible reasons) but the behaviour shown in this post is certainly a bug.

    SQL Server 2005

    The following AdventureWorks sample query displays TransactionHistory information for ProductID #1 and the last three months of 2003 (if you are using a more recent version of AdventureWorks, you will need to change the year from 2003 to 2007):

    SELECT
        th.ProductID,
        th.TransactionID,
        th.TransactionDate
    FROM Production.TransactionHistory AS th 
    WHERE 
        th.ProductID = 1 
        AND th.TransactionDate BETWEEN '20030901' AND '20031231';

    The query plan is:

    Plain Plan

    The execution flow is pretty straightforward.  The plan seeks a non-clustered index on the ProductID column to find rows where ProductID = 1.  This non-clustered index is keyed on ProductID alone, but the index also includes the TransactionID clustering key (to point to the parent row in the base table).  The index does not cover the query (the TransactionDate column is not present in the index) so a Key Lookup is required for the TransactionDate column.  The nested loops join drives a lookup process such that each row from the Index Seek on ProductID = 1 results in a lookup in the clustered index to find the TransactionDate value for that row.  A final Filter operator passes only rows that meet the date range condition.

    The next diagram shows the same plan with cardinality estimates and extra details for the Key Lookup:

    SQL Server 2005 plan

    The cardinality estimate at the Index Seek is exactly correct.  The table is not very large and there are up-to-date statistics associated with the index, so this is as expected.  If you run a query to find all rows in the TransactionHistory table for ProductID #1, 45 rows will indeed be returned.

    The estimate for the Key Lookup is exactly correct.  Each lookup into the Clustered Index to find the Transaction Date is guaranteed to return exactly one row (each non-clustered index entry is associated with exactly one base table row).  The Key Lookup is expected to be executed 45 times (shown circled).

    The estimate for the Inner Join is exactly correct – 45 rows from the seek joining to one row each time from the lookup, gives a total of 45 rows.

    The Filter estimate is also good: SQL Server estimates that of the 45 rows entering the filter, 16.9951 rows will match the specified range of transaction dates.  In reality, only 11 rows are produced by this query, but that small difference in estimates is quite normal and certainly nothing to worry about here.  You might like to keep that estimate of 16.9951 rows in mind, however.

    SQL Server 2008 onward

    The same query executed against an identical copy of AdventureWorks on SQL Server 2008 (or R2, or 2012) produces a quite different execution plan:

    SQL Server 2008 plan

    Instead of an explicit Filter to find rows with the requested date range, the optimizer has pushed the date predicate down to the Key Lookup (the Key Lookup now has a predicate on Transaction Date).  This is a good optimization in general terms; it makes sense to filter rows as early as possible.  Unfortunately, it has made a bit of a mess of the cardinality estimates in the process.  The post-Filter estimate of 16.9951 rows seen in the 2005 plan has been moved to the Key Lookup.  Instead of estimating one row per lookup, the plan now suggests that 16.9951 rows will be produced by each clustered index lookup – clearly not right!

    To my way of thinking, the execution plan cardinality estimates should look something like this:

    image

    As shown, I personally prefer to see the inner side of a nested loops join estimate the number of rows per execution, but I understand I may be in a minority on this point.  If the estimate were aggregated over the expected number of executions, the inner-side estimate would also be 16.9951 (45 executions * 0.37767 per execution).

    Cause

    The query tree produced by the 2008+ optimizer looks much the same as the 2005 version – the explicit Filter is still present.  However, a post-optimization rewrite occurs in the ‘copy out’ phase that removes the Filter and incorporates it in the Key Lookup seek, resulting in a residual predicate on that operator.  This rewrite applies to regular scans and seeks too (so all residual predicates on scans and seeks are a result of this late rewrite).

    I would like to thank Dima Piliugin (twitter | blog) for introducing me to undocumented trace flag 9130, which disables the rewrite from Filter + (Scan or Seek) to (Scan or Seek) + Residual Predicate.  Enabling this flag retains the Filter in the final execution plan, resulting in a SQL Server 2008+ plan that mirrors the 2005 version, with correct estimates.

    The bug is that this rewrite does not correctly update cardinality estimates when the filter is pushed down to a lookup.  I should stress that the rewrite occurs after all cost-based decisions have been made, so the incorrect estimate just looks odd and makes plan analysis harder than it ought to be.  Cardinality estimates with regular scans and seeks appear to work correctly, as far as I can tell.  The bug applies to both RID lookups and Key Lookups where a residual predicate is applied.

    Workarounds

    Using the trace flag is not a workaround because (a) it is (very) undocumented and unsupported; and (b) it results in a less efficient plan where rows are filtered much later than is optimal.  One genuine workaround is to provide a covering non-clustered index (avoiding the lookup avoids the problem):

    CREATE INDEX nc1 
    ON Production.TransactionHistory (ProductID) 
    INCLUDE (TransactionDate);

    With the Transaction Date filter applied as a residual predicate in the same operator as the seek, the final estimate is again as expected:

    image

    We could also force the use of the ultimate covering index (the clustered one) with a suitable table index hint:

    SELECT
        th.ProductID,
        th.TransactionID,
        th.TransactionDate
    FROM Production.TransactionHistory AS th WITH (INDEX(1))
    WHERE 
        th.ProductID = 1 
        AND th.TransactionDate BETWEEN '20030901' AND '20031231';

    image

    Again, the estimate is 16.9951 as expected.

    Fixed in SQL Sentry Plan Explorer build 7.2.42.0

    After this post was published on October 15 2012 I was contacted by the SQL Sentry people to see if a good workaround could be incorporated in their free Plan Explorer product.  I was happy to provide a little testing and general feedback during a process that ultimately resulted in a new build being released on 31 October 2012.  If only SSMS limitations could be addressed so quickly!  Once you upgrade to the new version, the plan displayed for our test query is:

    image

    This is exactly the representation we would expect if the SQL Server bug did not exist (note that Plan Explorer aggregates estimated inner-side executions for you, and rounds to integer).  Well done to the SQL Sentry team, especially Brooke Philpott (@MacroMullet).

    Summary

    Providing a covering non-clustered index to avoid lookups for all possible queries is not always practical, and scanning the clustered index will rarely be the optimal choice either.  Nevertheless, these are the best workarounds we have today in SSMS.

    Watch out for poor cardinality estimates when a predicate is applied as part of a lookup.

    This particular cardinality estimation issue does not affect the final plan choice (the internal estimates are correct) but it does look odd and will confuse people when analysing query plans in SSMS.  If you think this situation should be improved, please vote for this Connect item.  It will be interesting to see how long it takes to catch up with Plan Explorer.

    © 2012 Paul White – All Rights Reserved

    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Why Doesn’t Partition Elimination Work?

    Given a partitioned table and a simple SELECT query that compares the partitioning column to a single literal value, why does SQL Server read all the partitions when it seems obvious that only one partition needs to be examined?

    Sample Data

    The following script creates a table, partitioned on the char(3) column ‘Div’, and populates it with 100,000 rows of data:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL
    )
    ON PS(Div);
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;

    Sample data:

    Partition Elimination Sample Data

    The ‘Div’ column is pseudo-random so your results will be slightly different, but this is the distribution across partitions I saw:

    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;

    Sample Data Distribution

    (The partition function defines a total of 36 partitions but only 16 are used by the sample data.)

    The Test Query

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';

    Since ‘Div’ is the partitioning column, the expectation is that only one partition will be scanned to count the rows that match the WHERE clause predicate.  However, the execution plan shows that query execution scans all 36 partitions (a full table scan):

    Execution Plan 1

    Why did SQL Server not just look at the one partition it knows the value ‘ABF’ must lie in?

    Here’s Why:

    The answer lies in the predicate:

    Predicate

    The value ‘ABF’ is nowhere to be seen; it has been replaced by [@1].  This means the query has been considered for simple parameterization by SQL Server to promote query plan reuse.  The presence of the [@1] does not mean that simple parameterization was successful (considered safe for all possible values by the optimizer).  We have to check the plan cache to see if an associated prepared plan exists and is reused for different ‘parameter’ values.  It turns out that this time the simple parameterization attempt is considered safe, so a prepared plan is created and is reused for different values.  The full parameterized text is:

    (@1 varchar(8000))SELECT COUNT_BIG(*) [RowCnt] FROM [dbo].[Test] [t] WHERE [t].[Div]=@1
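
    To check, we can run the query a few times with different literal values and then inspect the plan cache.  A quick illustrative query along these lines (my sketch, not part of the original post) shows the single prepared plan with its use count climbing:

    SELECT
        CP.objtype,
        CP.usecounts,
        ST.[text]
    FROM sys.dm_exec_cached_plans AS CP
    CROSS APPLY sys.dm_exec_sql_text(CP.plan_handle) AS ST
    WHERE
        CP.objtype = N'Prepared'
        AND ST.[text] LIKE N'%COUNT_BIG%Div%';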

    Static Partition Elimination

    The reason SQL Server cannot apply static partition elimination (determining the partition number at compile time) is that the plan now contains a parameter.  Subsequent executions that reuse the plan might specify a different value for the parameter, so static elimination would not be safe.  It is not possible to disable simple parameterization, but we can prevent it applying to this query in a number of ways.  One way is to incorporate an inequality predicate of the form expression <> constant:

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;

    The redundant 1 <> 2 predicate is completely removed by the optimizer, and the query still qualifies for TRIVIAL optimization as it did previously:

    Execution Plan 2

    There are a number of interesting things to notice here:

    • The query is no longer parameterized (the literal value ‘ABF’ appears)
    • Only one partition is touched (partition 11 – verified below)
    • This unindexed heap table has a seek predicate
    • The ‘ordered’ property has changed from False to True
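
    As a quick sanity check (my addition, reusing the $PARTITION function from the setup script), we can confirm which partition the literal value belongs to:

    -- Returns 11 for the partition function defined earlier
    SELECT PartitionID = $PARTITION.PF('ABF');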

    The problem with this plan is that it cannot be reused for different predicate values – so you could end up with a separate cached plan for each possible literal value.

    Adding OPTION (RECOMPILE) also defeats simple parameterization and therefore achieves static elimination.  In an ad-hoc SQL batch, OPTION (RECOMPILE) also means the query plan will not be cached (stored procedures are not so lucky).  The downside is a fresh trip through the optimizer on each call.  This query does qualify for trivial plan, so optimization time is minimal, but more complex queries with the same desire for partition elimination might well require full optimization.
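
    For reference, this is the recompile variant of the test query (it also appears in the complete script at the end of this post):

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
    OPTION (RECOMPILE);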

    Dynamic Partition Elimination

    Ideally, we would like a query plan that is reusable but still only touches one partition for any given string literal value.  What we need is dynamic partition elimination.  In case the concept is new to you, the optimizer can produce plans that determine the correct partition at execution time (rather than at compile-time, as for static elimination).  To achieve this, the optimizer builds a plan that uses an internal function called RangePartitionNew().  One might think the original simple-parameterized plan ought to have included this feature; the reason it didn’t is quite interesting.

    When SQL Server parameterizes a query using simple or forced parameterization, it has to decide on a data type for the parameter.  For string literals, it always chooses varchar(8000) for non-Unicode strings and nvarchar(4000) for Unicode strings.  If it chose a specific type like varchar(3), we could end up with the same plan cache bloating and non-reuse problem as for the non-parameterized case – and there are potentially 8000 different lengths for varchar.  Also, prior to SQL Server 2005 SP2, the parameter-declaration part of the query text was not included in the hash value used to decide in which cache bucket to store the plan.  This could result in hundreds of same-text plans (with different parameters) occupying the same hash bucket (this does not work well).

    Anyway, faced with the parameterized predicate Div = [@1], where Div is char(3) and [@1] is varchar(8000), dynamic partition elimination does not occur because the value of [@1] might be truncated.  One way to address this is to explicitly convert the to-be-parameterized string literal to the same type as the target column (char(3)):

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'CDC');

    Now we have a parameterized query plan that uses dynamic partition elimination:

    Dynamic Elimination 1

    This plan has all the interesting features of the static elimination plan, but the heap seek no longer selects a hard-coded partition ID.  The single partition searched is the one returned at runtime by the call to RangePartitionNew().  The explicit convert prevents data truncation when the parameterized value is compared with the partition boundary values (defined as char(3) in the partition function).  The extra CONVERT_IMPLICIT to char(3) is required for internal reasons, and only appears with non-Unicode partition function types.

    If we explicitly convert to char(4) instead of char(3), the possibility of truncation arises again and we are back to scanning all 36 partitions:

    No Partition Elimination

    Using varchar(3) instead of char(3) also results in dynamic elimination, which surprises some people:

    Dynamic Elimination 2

    Implicit Conversions and Precedence

    An alternative explanation I have seen for the original behaviour is that the string literal ‘ABF’ is implicitly typed as a varchar(3) by SQL Server and data type precedence means the char(3) ‘Div’ column has to be implicitly converted to varchar(3) to perform the comparison.  This is incorrect, but it is interesting to look at why that is so:

    Literal Types

    We can see the implicit type of the string literal ‘ABF’ using a query:

    SELECT
        BaseType = SQL_VARIANT_PROPERTY(V.v, 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(V.v, 'MaxLength')
    FROM (VALUES(CONVERT(sql_variant, 'ABF'))) AS V (v);

    Literal Data Type

    The implied type of ‘ABF’ does seem to be varchar(3), so that checks out.

    Type Precedence

    The data type precedence table in Books Online shows that varchar does indeed have a higher precedence than char:

    Data Type Precedence

    Also, from the Books Online topic Data Types (Transact-SQL):

    Precedence

    The example below uses a UNION ALL over char(5) and varchar(3) columns:

    DECLARE @T1 AS TABLE (col1 char(5) NOT NULL);
    DECLARE @T2 AS TABLE (col1 varchar(3) NOT NULL);
     
    INSERT @T1 VALUES ('12345');
    INSERT @T2 VALUES ('123');
     
    SELECT
        BaseType = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'MaxLength'),
        t.col1
    FROM
    (
        SELECT col1 FROM  @T1 
        UNION ALL
        SELECT col1 FROM @T2
    ) AS t;

    Result Data Type

    The result column ‘col1’ has to have a well-defined type.  As expected, the result is varchar(5) since varchar has a higher precedence than char and 5 is the maximum length of the result.  All good so far.

    Strings Are Special

    We get used to seeing (and preferably avoiding) implicit conversions in query plans where two mismatched types are compared:

    DECLARE @T1 AS TABLE (col1 integer NULL);
    DECLARE @v tinyint;
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    Mismatched Types Comparison

    The plan shows the tinyint variable being implicitly converted to the higher-precedence integer data type, as expected.  Now let’s do the same thing with varchar and char (perhaps expecting char to be converted to the higher-precedence varchar):

    DECLARE @T1 AS TABLE (col1 varchar(5) NULL);
    DECLARE @v char(3);
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    No Implicit Conversion

    Note the lack of an implicit conversion in the predicate.

    Important

    SQL Server can compare char with varchar without performing a conversion.  They are fundamentally the same type, it’s just that one is fixed-length and the other is variable-length.  The SQL standard requires padding for the character strings used in comparisons so that their lengths match before comparing, thereby removing the only distinction between char and varchar.  The same optimization can apply to a comparison between nchar and nvarchar (though not to Unicode versus non-Unicode comparisons, since they are genuinely different types of course).

    The present case involves a predicate “WHERE Div = ‘ABF’”.  This is a comparison that can use this optimization, so there is no implicit conversion.  This is the reason the explicit conversion to varchar(3) works in the main example – the precedence rules do not need to be invoked, and the data column is not converted from char(3) to varchar(3).
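
    A tiny demonstration (my example, not from the original post) of the padded comparison at work – the fixed-length value compares equal to the variable-length one despite its trailing spaces:

    -- char(5) pads 'ABF' to 'ABF  ', yet the comparison returns 'Equal'
    -- because trailing spaces are ignored when comparing character strings
    SELECT
        Result =
            CASE WHEN CONVERT(char(5), 'ABF') = CONVERT(varchar(3), 'ABF')
                 THEN 'Equal' ELSE 'Not equal'
            END;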

    Why Use a Partitioned Heap?

    In case you were wondering, the only reason I chose a partitioned heap for this post is because I could (it is more interesting in many ways).  All the demonstrations work just as well for a partitioned clustered table.  A complete script using a clustered partitioned table is below:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        ID      integer IDENTITY NOT NULL,
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL,
     
        CONSTRAINT PK__dbo_Test_Div_ID
        PRIMARY KEY CLUSTERED (Div, ID)
        ON PS(Div)
    );
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;
    GO
    -- Show the distribution of data
    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;
    GO
    -- Seek on all 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';
    GO
    -- Static elimination = 1 hard-coded partition ID
    -- Cached plan unlikely to be reused
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;
    GO
    -- Static elimination
    -- Compilation overhead
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
    OPTION (RECOMPILE);
    GO
    -- Dynamic elimination with convert to char(3)
    -- Reusable plan
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'ABF');
    GO
    -- Dynamic elimination with convert to varchar(3)
    -- Data type precedence not applied:
    -- varchar(3) convert does not cause the char(3)
    -- column Div to be converted (would prevent a seek)
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(varchar(3), 'ABF');
    GO
    -- char(4) means truncation could occur when
    -- comparing with the char(3) partition function
    -- No partition elimination, seek on 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(4), 'ABF');
    GO
    -- Clean up
    DROP TABLE dbo.Test;
    DROP PARTITION SCHEME PS;
    DROP PARTITION FUNCTION PF;

    This post applies to SQL Server 2008, 2008 R2 and 2012 only (partitioned table query execution was very different in 2005).  As far as I can tell, SQL Server 2005 did not attempt simple parameterization on a partitioned table.  I’m in two minds about whether the SQL Server 2008+ behaviour is a bug, an oversight, or an undesirable consequence of fixing something else…so I opened a Connect item for it.  Update: fixing this is under consideration for the next release.

    Sean Broadley raises an important practical point in the comments below – the issue highlighted in this post only occurs when a literal value is used.  By correctly parameterizing your queries (including application code and dynamic SQL) you will avoid simple parameterization, achieve dynamic partition elimination, and cache a more reusable query plan.
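
    For example (my sketch, not from Sean’s comment), explicitly parameterizing the statement with a char(3) parameter – matching both the column and the partition function type – produces a reusable plan that uses dynamic partition elimination:

    EXECUTE sys.sp_executesql
        N'SELECT RowCnt = COUNT_BIG(*) FROM dbo.Test AS t WHERE t.Div = @Div;',
        N'@Div char(3)',
        @Div = 'ABF';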

    Acknowledgement

    Thanks to Jonathan Kehayias (blog | twitter) who first alerted me to the MSDN forum question that prompted this post.

    Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    © 2012 Paul White – All rights reserved


  • Compute Scalars, Expressions and Execution Plan Performance

    Compute Scalar Operator

    The humble Compute Scalar is one of the least well-understood of the execution plan operators, and usually the last place people look for query performance problems.  It often appears in execution plans with a very low (or even zero) cost, which goes some way to explaining why people ignore it.

    Some readers will already know that a Compute Scalar can contain a call to a user-defined function, and that any T-SQL function with a BEGIN…END block in its definition can have truly disastrous consequences for performance (see this post by Rob Farley for details).  This post is not about those sorts of concerns.

    Compute Scalars

    The Books Online description of this operator is not very detailed:

    Books Online Description

    This leads people to think that Compute Scalar behaves like the majority of other operators: as rows flow through it, the results of whatever computations the Compute Scalar contains are added to the stream.  This is not generally true.  Despite the name, Compute Scalar does not always compute anything, and does not always contain a single scalar value (it can be a vector, an alias, or even a Boolean predicate, for example).  More often than not, a Compute Scalar simply defines an expression; the actual computation is deferred until something later in the execution plan needs the result.


    Compute Scalars are not the only operators that can define expressions. You can find expression definitions with labels like [Expr1008] in many query plan operators, including Constant Scans, Stream Aggregates and even Inserts.

    Deferred expression evaluation was introduced in SQL Server 2005 as a performance optimization.  The idea is that the number of rows tends to decrease in later parts of the plan due to the effects of filtering and joining (reducing the number of expression evaluations at runtime).  To be fair to the Books Online entry, it does mention this behaviour in a note:

    Books Online Note

    The takeaway from this first section is that a Compute Scalar is most often just a placeholder in the execution plan.  It defines an expression and assigns a label to it (e.g. [Expr1000]) but no calculations are performed at that point.  The defined expression may be referenced by multiple operators later in the execution plan.  This makes counting the number of ‘executions’ more complex than for a regular plan operator, and goes some way to explaining why Compute Scalar operators only ever show estimated row counts (even in an ‘actual’ execution plan).
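
    If you want to see one of these expression labels for yourself, a minimal sketch (assuming the AdventureWorks sample database) is to request the showplan XML and look at the DefinedValues entry for the Compute Scalar:

    SET SHOWPLAN_XML ON;
    GO
    -- The Compute Scalar in this plan defines a labelled expression
    -- such as [Expr1003] = UnitPrice * OrderQty, without evaluating it
    SELECT LineTotal = SOD.UnitPrice * SOD.OrderQty
    FROM Sales.SalesOrderDetail AS SOD;
    GO
    SET SHOWPLAN_XML OFF;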

    Execution Plans and the Expression Service

    The execution plans you see in Management Studio (or Plan Explorer) are a highly simplified representation that is nevertheless useful for many kinds of plan analysis.  The XML form of showplan has more detail than the graphical representation, but even that is a shallow depiction of the code that actually runs inside SQL Server when a query is executed.  It is easy to be seduced into thinking that what we see as an ‘execution plan’ is more complete than it actually is.  As a tool for high-level analysis, it is excellent, but do not make the mistake of taking it too literally.  More detail on this later.

    One more thing to appreciate before we get into a concrete example: the SQL Server engine contains a component known as the Expression Service.  This is used by components such as the query optimizer, query execution, and the storage engine to perform computations, conversions and comparisons.  It is the call to the expression service in the execution plan that can be deferred and performed by a plan operator other than the source Compute Scalar (or other expression-defining operator).

    An Example

    I came across a forum thread recently which demonstrated the concepts I want to explore in this post really well.  The topic of discussion was delimited string-splitting (an old favourite topic, for sure) but I want to emphasise that the particular code is not at all important – it is the query plan execution I want to focus on.  The particular string-splitting method employed was one I first saw written about by Brad Schulz; it replaces the delimiters with XML tags, and then uses XML shredding to return the results.  I should say that I am not a fan of this technique because it is (a) a strange use of XML; and (b) it does not work for all possible inputs.  Anyway, I said the example was not that important in itself, so let’s get on to the first example:

    Test One – XML Split

    DECLARE
        @ItemID bigint,
        @Item varchar(8000),
        @Input varchar(8000);
     
    SET @Input = REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32);
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>', 0
                )
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The code above creates a single comma-delimited string containing the letters of the alphabet, repeated 32 times (a total string length of 1632 characters).  The string-splitting SELECT statement replaces the commas with XML tags, fixes up the start and end of the string, and then uses the nodes() XML method to perform the actual splitting.  There is also a ROW_NUMBER call to number the elements returned in whatever order the nodes() method spits them out.  Finally, two variables are used to ‘eat’ the output so returning results to the client does not affect execution time.  The execution plan is as follows (click to enlarge):

    XML String Splitter Test 1

    The Constant Scan operator defines an expression labelled as [Expr1000] which performs the string replacement and conversion to the XML type.  There are four subsequent references to this expression in the query plan:

    • Two XML Reader table-valued functions (performing the nodes() shredding and value() functions)
    • The Filter operator (a start-up predicate checks that the expression is not NULL)
    • The Stream Aggregate (as part of a CASE expression that determines what the value() XML method should finally return)

    The query takes 3753ms to execute on my laptop.  This seems a rather long time to shred a single string containing just 801 elements.

    Test Two – The Empty String

    The following code makes one small modification to the test one script:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)  -- NEW
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The only difference in the execution plan is in the definition of [Expr1000] (click to enlarge):

    XML String Splitter Test 2

    This plan returns the same correct results as the first test, but it completes in 8ms (quite an improvement over 3753ms).

    Beyond the Execution Plan – Test One

    There is nothing in the query plan to explain why the second form of the query runs 500 times faster than the first.  To see what is going on under the covers of the execution engine when executing the slow query, we can attach a debugger and look closer at the code SQL Server is really running.  Using SQL Server 2012 and setting a breakpoint on the CXMLTypeProvider::ConvertStringToXMLForES method (as the name suggests, it converts a string to XML for the Expression Service), the first debugger break occurs with this stack trace:

    ConvertStringToXMLForES

    I know that not everyone reading this post will routinely analyse stack traces, so for this first example I will describe the interesting calls from the bottom up (the currently executing code is on top; each line below it shows the method that called the one above):

    1. There is nothing very interesting below CQueryScan::GetRow (it is just start-up code)
    2. CQueryScan::GetRow drives the plan as we are used to seeing it.  You might think of it as approximately the code behind the green SELECT query plan icon in a graphical plan.  It reads a row at a time from the first plan operator to its right
    3. The first visible plan operator to execute is the leftmost Nested Loops join (CQScanNLJoinTrivialNew::GetRow).  (The Compute Scalar to its left in the graphical plan simply defines an alias for an expression computed by the Stream Aggregate – it does not even exist in the code executed by the execution engine).
    4. The nested loops join operator asks the Sequence Project operator for a row (CQScanSeqProjectNew::GetRow)
    5. The sequence project operator asks the Segment operator for a row (CQScanSegmentNew::GetRow)
    6. The segment operator asks the second Nested Loops Join for a row (CQScanNLJoinTrivialNew::GetRow)
    7. The nested loops join has obtained a row from the Constant Scan and is now preparing to fetch a row from the inner side of the join
    8. The start-up Filter operator is executing its Open() method (CQScanStartupFilterNew::Open)
    9. The start-up Filter calls into the general entry point for the Expression Service to evaluate expression [Expr1000] defined by the Constant Scan
    10. CEsExec::GeneralEval4 calls an ES method to convert a string to XML (ConvertFromStringTypesAndXmlToXml)
    11. The XML type provider executes the code that actually converts the string to XML (ConvertStringToXMLForES)

    The stack trace illustrates two things nicely:

    • The iterative nature of plan operators (one row at a time, Open, GetRow, and Close methods)
    • An expression defined by one operator (the Constant Scan) being deferred for evaluation until a later operator (the start-up filter) requires the result

    Continuing execution with the debugger, the next break occurs when the Table-valued Function to the right of the filter is initialised:

    Table-valued Function Stack Trace

    Another break when the Table-valued Function on the lowest branch of the plan initialises:

    Table-valued Function Stack Trace 2

    And finally, a break from the Stream Aggregate, when it needs to evaluate [Expr1016] (which contains a reference to [Expr1000]):

    Stream Aggregate Stack Trace

    The last two breaks (lowest table-valued function and stream aggregate) occur 801 times each.  The reason for the slow performance of this query is now clear: [Expr1000] is being computed 1604 times in total (once each by the filter and the first table-valued function, plus 801 times each by the lowest function and the stream aggregate).  Yes, that means the REPLACE and CONVERT to XML really do run 1604 times on the same value – no wonder performance is so poor!

    If you are inclined to test it, the expression service method that performs the REPLACE is sqlTsEs!BhReplaceBhStrStr.  It turns out that the string REPLACE is very much faster than conversion to XML (remember XML is a LOB type and also has complex validation rules) so the dominant performance effect is the conversion to XML.

    Beyond the Execution Plan – Test Two

    Running test two with the debugger attached reveals that only one call is made to CXMLTypeProvider::ConvertStringToXMLForES:

    FillExprCache

    Reading down the stack trace, we see that execution of the plan (as we know it) has not started yet.  The conversion of the string to XML is invoked from InitForExecute() and SetupQueryScanAndExpression.  The conversion is evaluated once and cached in the execution context (CQueryExecContext::FillExprCache).  Once the iterative part of plan execution begins, the cached result can be retrieved directly without calling ConvertStringToXMLForES() 1604 times.

    Test Three – Column Reference

    Before you go off adding LEFT(@@DBTS, 0) to all your expressions (!), look what happens when we use a table to hold the input data rather than a variable:

    DECLARE
        @ItemID bigint,
        @Item varchar(8000);
     
    DECLARE @Table AS TABLE (Input varchar(8000) NOT NULL);
     
    INSERT @Table (Input)
    VALUES (REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32));
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    Instead of the Constant Scan we saw when using a variable, we now have a table scan and a Compute Scalar that defines the expression:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,'<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>'+
    substring(CONVERT_IMPLICIT(varchar(8),@@dbts,0),(1),(0)),0))

    XML String Splitter Test 3

    The other plan details are the same as before; the expression label [Expr1003] is still referenced by the Filter, two Table-valued functions and the Stream Aggregate.  The test three query takes 3758ms to complete.  Debugger analysis shows that CXMLTypeProvider::ConvertStringToXMLForES is again being called directly by the Filter, Table-valued Functions, and Stream Aggregate.

    Test Four – CRYPT_GEN_RANDOM

    This final test replaces @@DBTS with the CRYPT_GEN_RANDOM function:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(CRYPT_GEN_RANDOM(1), 0)    -- CHANGED
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The execution plan shows an extra Compute Scalar next to the Table Scan:

    [Expr1018] = Scalar Operator('<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>')

    The Compute Scalar to the left of that is the same one seen previously, though it now references the new expression in its definition:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,[Expr1018]+
    substring(CONVERT_IMPLICIT(varchar(8000),Crypt_Gen_Random((1)),0),(1),(0)),0))

    XML String Splitter Test 4

    The remainder of the query plan is unaffected; the same operators reference [Expr1003] as before.  As you might have been expecting, this query executes in 8ms again.

    Explanation

    In test one, there is nothing to stop the query processor deferring execution of the expression, which results in the same expensive computation being performed 1604 times (once each for the filter and first table-valued function, and 801 times each for the second function and the stream aggregate).  As an interesting aside, notice that even with OPTION (RECOMPILE) specified, constant-folding only applies to the REPLACE and string concatenation; constant-folding cannot produce a LOB type (including XML).

    In test two, we introduced the non-deterministic function @@DBTS.  This is one of the functions (like RAND and GETDATE) that are implemented to always return the same value per query (despite being non-deterministic, see the links in the Further Reading section below for more details).  These runtime constants are evaluated before the query starts processing rows and the result is cached.  There are some hints (ConstExpr labels) in the execution plan that may indicate a runtime constant but I have not found this to be reliable.
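
    A quick illustration (my example, not from the original tests) of the runtime-constant behaviour: RAND() without a seed is evaluated once per query, so every row sees the same value, whereas the side-effecting NEWID() is evaluated for every row:

    -- The same value is repeated on every row (runtime constant)...
    SELECT TOP (5) RuntimeConstant = RAND()
    FROM sys.objects;
     
    -- ...but NEWID() produces a different value on every row
    SELECT TOP (5) PerRow = NEWID()
    FROM sys.objects;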

    In test three, we changed the game again by replacing the reference to a variable with a column reference from a table.  The value of this column reference could change for each table row so the engine can no longer treat the expression as a runtime constant.  As a consequence, the result is no longer cached and we see the same behaviour (and slow performance) as in test one.

    In test four, we replaced the @@DBTS function with the side-effecting function CRYPT_GEN_RANDOM (the other such function is NEWID).  One of the effects of this is to disable the deferred evaluation of the expression defined by the Compute Scalar.  The result is an XML LOB, and a second optimization occurs where handles to the same LOB (evaluated by the defining operator) can be shared by other operators that reference the LOB.  This combination of effects means the expensive computation is only performed once in test four, even though the runtime constant caching mechanism is not used.  By the way, there was a bug with CRYPT_GEN_RANDOM which meant it was incorrectly assessed by the optimizer in early versions of SQL Server 2008.  If you encounter this issue when running the scripts above, apply a later service pack (or use the NEWID function for SQL Server 2005).

    Final Thoughts

    • There is a lot more going on in query execution than we can see in execution plans
    • In general, there are no guarantees about the order of execution of expressions, or how many times they will be evaluated
    • Please do not start using @@DBTS or CRYPT_GEN_RANDOM for their undocumented effects
    • It might be nice if showplan output indicated whether a computation was deferred or cached, but it does not today
    • In many cases, we can work around the potential for poor performance caused by repeated evaluation by explicitly materializing the result of the expression using a variable, table, or non-inline function (a sketch follows this list)…though not always
    • Please do not suggest improvements to the XML string splitter unless it handles all input characters and can accept a semicolon as a delimiter ;c)
    • I always use SQLCLR for string splitting
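
    To illustrate the materialization workaround (a minimal sketch using the same test string as earlier; assumes SQL Server 2008 or later for the inline DECLARE assignments):

    DECLARE @Input varchar(8000) =
        REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32);
     
    -- Materialize the expensive conversion once, into a variable...
    DECLARE @x xml =
        CONVERT(xml, '<r>' + REPLACE(@Input, ',', '</r><r>') + '</r>', 0);
     
    -- ...then shred the variable; the conversion cannot be re-evaluated per row
    SELECT
        ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        Item = x.i.value('./text()[1]', 'varchar(8000)')
    FROM @x.nodes('./r') AS x(i);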

    Acknowledgements

    I would like to particularly thank Usman Butt and Chris Morris from SQL Server Central for their contributions and thoughts which helped me write this entry.

    Further Reading

    When a Function is indeed a Constant – Andrew Kelly
    Runtime Constant Functions – Conor Cunningham
    RAND() and other runtime constant functions – Conor again
    Scalar Operator Costing – Guess who

    © 2012 Paul White
    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

