The SQL Server Blog Spot on the Web

Page Free Space

See also my SQL Server technical articles.

  • Mastering Execution Plan Analysis - Lisbon, Portugal 15 May 2015


    I will be presenting my full-day workshop, “Mastering Execution Plan Analysis” on Friday 15 May 2015 as part of SQL Saturday #369 in Lisbon, Portugal.

    The early-bird period has ended now, but regular-rate tickets at €150 are still available. Tickets can be purchased via EventBrite, where the full details of the workshop are also listed.

    Thanks to my friends at SQL Sentry, one lucky workshop attendee will also win a licence for the PRO version of SQL Sentry Plan Explorer, worth US$295 (featuring deadlock analysis, automatic wait statistics, version history, custom plan layouts, and more).

    The EventBrite page is (mostly) in Portuguese, but my full-day workshop (and all the SQL Saturday sessions) will be presented in English.

    For those of you in the UK, Lisbon is just 2½ hours from London by air, with various carriers offering return flights from £69 as of the time of writing. Seems like a great way to combine a short break in beautiful Lisbon with a full-day workshop, and all the quality free content available at the main SQL Saturday event!

    SQL Saturday Portugal is a fantastic “mini-conference” event that always attracts great speakers. Check out this short video of the 2013 event courtesy of Paulo Benittes:


  • Mastering Execution Plan Analysis - Melbourne 6 Feb 2015

    SQL Saturday 365

    Over the years I have spent working with SQL Server, the personal time investment that has repaid itself more than any other is becoming intimately familiar with execution plans and how they can be used to diagnose and correct performance problems.

    There is nothing I enjoy more than talking about execution plans, the query optimizer, and how internals knowledge can be applied to solve real problems. Living in New Zealand, I don’t get to speak on this topic as often as I would like, primarily due to the time and travel costs involved in getting to the places most SQL people live and work.

    So, I am very pleased to announce I will be presenting a Pre-Conference Training Day on Mastering Execution Plan Analysis in Melbourne, Australia, on Friday 6 February, 2015 as part of the SQL Saturday #365 event, being held at Monash University’s Caulfield Campus.

    The course runs from 8:30am to 5pm, beginning with an overview of the SQL Server Query Optimizer, followed by comprehensive coverage of execution plans and plan analysis techniques. Supported by heaps of heavily-commented T-SQL code examples, the course packs an enormous amount of information into one fast-paced day. Though not suitable for people completely new to SQL Server, developers and database professionals with a few years’ experience working with SQL Server will find many new insights, internal details, and practical tips that they will refer to time and time again.

    The standard ticket price is AUD 315. Tickets are available until 2 February 2015 AEDT.

    I hope to see you there!

  • SQL Intersection Conference, Las Vegas MGM Grand 10-13 November 2014

    I am very pleased to announce that I will be speaking at the SQL Intersection conference in Las Vegas again this year.

    This time around, I am giving a full-day workshop, "Mastering SQL Server Execution Plan Analysis" as well as a two-part session, "Parallel Query Execution" during the main conference.

    The workshop is a pre-conference event, held on Sunday 9 November (straight after this year's PASS Summit). Being on a Sunday gives you the whole of Monday to recover before the main conference kicks off on Tuesday.

    The workshop itself is an updated and enhanced version of the (record-setting) pre-conference full day session I presented at PASS Summit 2013 in Charlotte. If you weren't one of the 180+ people that came along that day, this is a great chance to learn about query optimization and execution plan analysis to an advanced level.

    The seats are (truly!) very limited, so you are guaranteed a more personal experience than you would have had at Summit 2013. Please also note that this is a level 300-500 workshop, so it is not really suitable for people completely new to SQL Server. If you are a regular reader of my articles, you will already know what level of material to expect :)

    There is an option to attend just the workshop ($499), or you can combine it with entry to the full six-in-one conference event (and/or additional pre- or post-conference workshops):



    Speaking of the main event, check out the list of SQL Intersection speakers:

    • Brent Ozar
    • Kimberly L. Tripp
    • Joe Sack
    • Jonathan Kehayias
    • Tim Chapman
    • Kevin Kline
    • Richard Campbell
    • Paul S. Randal
    • Andrew J. Kelly
    • Kendra Little
    • Glenn Berry
    • Jeremiah Peschka
    • Bob Ward
    • Aaron Bertrand
    • David Pless

    You can find more details (including a full schedule) on the conference website. There are early bird and other special offers available if you register by August 18, 2014.

    Hope to see you there!

    Paul White

    Twitter: @SQL_Kiwi

  • Cardinality Estimation for Disjunctive Predicates in SQL Server 2014


    Back in January 2014, I wrote an article describing the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators. The article covers the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags.

    I was able to describe the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected by AND (conjunctive predicates), which was already relatively well-known. Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by OR. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.

    I gave up on it until a few weeks ago, when I was invited to contribute to a technical review of a white paper Joe Sack was writing with assistance from Microsoft. One of the small contributions I was able to make was to point out that exponential backoff is used for disjunctions as well as conjunctions. The paper has now been released, so I can write about the detail behind the statement it contains concerning cardinality estimation for disjunctive predicates (thank you, Joe, for pointing this out to me):


    A Bit Of Boolean

    The key bit of new information is the second sentence. As soon as I read it, a vague memory from my patchy education came back to me, as well as a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:


    This gives us a way to turn a disjunction into a conjunction. It is a short step from there to understand that exponential backoff for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws. Converting the troublesome OR to an AND (with the appropriate negations) allows us to reuse the AND exponential backoff calculation for OR queries!

    The Theory

    Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:


    The interesting transformation from disjunction to conjunction can be expressed as:

    not (A or B) is the same as (not A) and (not B)

    This does not quite give us a direct translation from OR to AND. We need an extra step of observing:

    (A or B) is the same as not (not (A or B))

    This trivial double-negative means we can now directly substitute the not (A or B) above, to get:

    (A or B) is the same as not ((not A) and (not B))

    This gives us OR expressed in terms of NOT and AND.

    The very last bit of theory is to remember that A and B are probabilities between 0 and 1, so:

    (not X) = (1 – X)
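    Since the selectivities are treated as probabilities, it is worth checking that the formula 1 – ((1 – A) * (1 – B)) really does behave like OR at the extremes. The following T-SQL sketch (an illustration only, not anything SQL Server itself runs) evaluates both forms over all four combinations of truth values:

```sql
-- Check that Boolean (A or B) matches 1 - ((1 - A) * (1 - B))
-- for every combination of truth values 0 and 1
SELECT
    V.A,
    V.B,
    BooleanOr = CASE WHEN V.A = 1 OR V.B = 1 THEN 1 ELSE 0 END,
    DeMorgan  = 1 - ((1 - V.A) * (1 - V.B))
FROM (VALUES (0, 0), (0, 1), (1, 0), (1, 1)) AS V (A, B);
```

    The two computed columns agree on every row, which is what lets the probability form stand in for the Boolean OR.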

    Worked Example

    Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014:

    SELECT
        COUNT_BIG(*) AS NumRows
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.TransactionID BETWEEN 100000 AND 168412
        OR TH.TransactionDate BETWEEN '20070901' AND '20080313';

    There are two predicates in this query, connected by OR.

    The selectivity of the predicate on Transaction ID can be derived directly from the statistics histogram:



    This very compact histogram gives us all the information we need to say that the range of 68,413 Transaction ID values can be expected to produce 68,413 rows.

    Expressed as a fraction of the 113,443 rows in the table, the selectivity of this predicate is 68413 / 113443 = 0.603061. We will call this value selectivity A.

    The second predicate (on TransactionDate) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here). The histogram calculation results in a cardinality estimate for this predicate that is also 68,413 (because I deliberately chose the predicates to qualify exactly the same physical rows). The selectivity is likewise 0.603061.  We will call this value selectivity B.

    Enter DeMorgan

    We know from our work with DeMorgan that:

    (A or B) is the same as not ((not A) and (not B))

    So we can rewrite our (A or B) predicate immediately as not ((not A) and (not B)), enabling us to avoid worrying about the OR.

    Note this transform is done purely for cardinality estimation purposes. The optimizer tree is not affected by this work.

    Doing the Math

    Now, given that selectivities A and B are both equal to 0.603061 in this contrived example, our expanded expression becomes:

    not ((not 0.603061) and (not 0.603061))

    We also know:

    (not X) = (1 – X)

    Expanding the inner nots means the computation becomes:

    not ((1 - 0.603061) and (1 - 0.603061))

    We can perform the numeric math inside the parentheses to get:

    not (0.396939 and 0.396939)

    Now we use exponential backoff to compute the and selectivity:

    not (0.396939 * SQRT(0.396939))

    And replace the outer not just as we did before:

    1 - (0.396939 * SQRT(0.396939))

    Now we have just numbers. The final selectivity is 0.749916.

    The Result

    The very final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:

    0.749916 * 113443 = 85073 rounded up
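    The whole calculation can be reproduced in T-SQL. This is just a sketch of the formula derived above, using the selectivities from the worked example; the variable names are mine, not anything SQL Server exposes:

```sql
-- Disjunctive exponential backoff via DeMorgan, as derived above
DECLARE @A    float = 0.603061; -- selectivity of the TransactionID predicate
DECLARE @B    float = 0.603061; -- selectivity of the TransactionDate predicate
DECLARE @Card float = 113443;   -- base table cardinality

-- (A or B) = not ((not A) and (not B)); the AND uses exponential backoff
DECLARE @Sel float = 1e0 - ((1e0 - @A) * SQRT(1e0 - @B));

SELECT
    Selectivity = @Sel,                  -- approximately 0.749916
    Estimate    = CEILING(@Sel * @Card); -- 85073
```

    With more than two predicates, the backoff multiplies each successive (sorted) selectivity by an additional square root, exactly as in the conjunctive case.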

    The 2014 execution plan for this query, when using the new cardinality estimator, is:


    The cardinality estimate is 85,073 rows, just as we predicted!

    Sadly, the actual number of rows returned by this query is 68,413 rows (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.


    © 2014 Paul White
    twitter: @SQL_Kiwi (note the underscore!)

  • Nested Loops Prefetching

    Nested loops join query plans can be a lot more interesting (and complicated!) than is commonly realized.

    One query plan area I get asked about a lot is prefetching. It is not documented in full detail anywhere, so this seems like a good topic to address in a blog post. The examples used in this article are based on questions asked by Adam Machanic.

    Test Query

    The following query uses the AdventureWorks sample database (as usual). Run it a couple of times if necessary so you see the query plan when all the data needed by the query is in memory (the “warm cache” case):

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID,
        TH.TransactionDate
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = P.ProductID
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY
        P.Name,
        TH.TransactionID;

    The post-execution (“actual”) plan is shown below:

    Query Plan

    Plan Execution

    Some of you will already know how query plans are executed in detail, but it will be very important in a moment so please take the time to read this next bit, even if it is just as a refresher:

    Query execution is an iterative (cursor-like) process. It starts at the far left (the green SELECT icon) and control passes first to the Top iterator, which requests a single row from the Nested Loops join. The join reads one row from its outer input (using the Index Seek on the Product table) and then looks for the first match on its inner input (the Index Seek on Transaction History).

    The single-row result is passed back to the Top operator, and ultimately queued for transmission to the client in a TDS packet. The Top operator then asks the join for another row. The join looks for the next match on its inner input (it does not access the outer input this time). If a row is found, it is passed back to the Top (and on to the client) and so on.

    At some point, the join will run out of matches from the inner input (for the current value of Product ID) so it switches back to its outer input to get the next Product ID from the Product table. The new Product ID is used to reinitialize (rebind) the inner input, which returns the first match from the Transaction History table index seek using the new Product ID.

    Eventually, the whole iterative process results in the 1,000th row passing through the Top operator, and the process comes to an end (because the query specifies TOP (1000)). There is also the possibility that the query tree below (i.e. to the right of) the Top operator produces fewer than 1,000 rows in total. Execution would also come to an end in that case, but that is not what happens here.

    The execution plan does not show all this detail (and in particular it aggregates over all the inner-side executions) but it does provide a summary of what happened:

    • A total of four rows were read (at different times) from the Product table
    • The Transaction History table seek was executed four times (with a different Product ID each time)
    • A total of 1,000 rows were returned to the client

    The particular detail I want to draw your attention to is highlighted in the query plan above: the inner-side index seek is executed four times, once for each row produced by the outer input to the join. Given our current understanding of query execution, the number of rows produced by the nested loops join outer input should always match the number of executions reported for the inner input. Right? Wrong.

    Cold Cache Test

    The next script runs exactly the same query as before, but with a cold cache:

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID,
        TH.TransactionDate
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = P.ProductID
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY
        P.Name,
        TH.TransactionID;

    Query Plan 2

    Well that’s odd: 35 rows are read from the outer (Product) input but the inner side of the join is only executed 4 times. What happened to the extra 31 outer side rows? Where did they go?

    Nested Loops Prefetching

    The SQL Server query processor contains a number of optimizations for nested loops joins, primarily aimed at making I/O more efficient.

    I/O efficiency on the outer side of the join is not usually a problem. Regular (storage engine) read-ahead is enough to ensure that pages are brought into memory before they are needed. The same read-ahead mechanism also applies to inner side executions, but the number of rows returned for each outer row is often quite small, so the benefit can be limited or even non-existent.

    Remember that the inner side of a nested loops join rebinds (restarts) for each new correlated value from the outer side. As a result, a common access pattern for a nested loops join is a number of small range scans on the inner side, each starting at a different point in the same index.

    One of the available SQL Server optimizations is nested loops prefetching. The general idea is to issue asynchronous I/O for index pages that will be needed by the inner side – not just for the current correlated join parameter value, but for future values too. To be clear: say the outer-side Product table seek produces rows with ProductID values (1, 17, 60, 99). Nested loops prefetching will try to bring pages into memory for the inner-side Transaction History index (which has ProductID as its leading key) for the values (1, 17, 60, 99) before the inner side starts working on ProductID 1. When this mechanism is effective, it can produce a significant performance boost for nested loops joins by overlapping prefetch I/O with ongoing nested loops execution.

    The way this works is not really discussed anywhere in detail, so let us look at it now. It will also explain why our cold-cache test query plan shows 35 rows on the outer input but only 4 executions on the inner side.

    The Missing Node

    The execution plan shows a good deal of useful summary information, but it does not contain every important detail. One indication that there is something missing from the XML or graphical showplan information comes from looking at the node IDs of the iterators in the plan (if you are using SSMS, the node IDs are shown in the tooltip for each plan iterator):


    Query plan iterators (nodes) are numbered (starting with zero, in geek tradition) from the far left. A discontinuity in the numbering is often a sign that something has been replaced, or is being hidden. In our test plan, nodes (0, 1, 3, 4) are shown, but node 2 is conspicuously missing. We can sometimes see extra information about query tree nodes using undocumented trace flag 7352 (output shown below, with node IDs in parentheses):

    Trace flag 7352 output

    The output reveals the mystery node 2, but the trace flag is unable to give a name to it. The output also shows that the mystery node uses an expression labelled [Expr1004] with a type of binary(24). This expression is shown in regular showplan output as an outer reference (correlated parameter) of the nested loops join:

    Outer references

    No definition is provided for this expression, and no other explicit reference is made to it anywhere else in the plan.
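    For reference, the tree output above can be produced for a single query along the following lines (trace flag 3604 redirects trace output to the client session; the select list here is illustrative, and both flags are undocumented, so treat this as test-system-only):

```sql
DBCC TRACEON (3604); -- send trace output to the client session
GO
SELECT TOP (1000)
    P.Name
FROM Production.Product AS P
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
WHERE P.Name LIKE N'[K-P]%'
OPTION (QUERYTRACEON 7352); -- per-query form of trace flag 7352
```

    The QUERYTRACEON hint applies a trace flag for the compilation of a single statement, which avoids changing behaviour for the rest of the instance.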

    The Real Execution Plan

    If showplan output included the missing node 2, it could look like this:

    Node 2 Plan

    I have labelled the new iterator, “Delayed Prefetch” and used the standard catch-all iterator icon for it because I am too lazy to draw something appropriate (I do not have the illustrative skills of Kendra Little or Michael Swart).

    Anyway, the execution engine class that implements this iterator is QScanRangePrefetchDelayNew. It uses the correlated join parameter values (ProductID here) to issue asynchronous scatter/gather I/O for the inner side (Transaction History) index, as described earlier. We have to attach the debugger to see this hidden iterator in action:

    Prefetch Delay

    That call stack shows delayed prefetch for the nested loops join resulting in an asynchronous scatter-read call to the operating system (right at the top of the capture). The prefetch operator can issue multiple asynchronous reads based on outer-input values. It uses a small memory buffer to keep track of them, and to save index leaf page information in particular. A reference to this information was seen earlier as the outer reference labelled Expr1004, with type binary(24). My understanding is that the leaf page information is used on the inner side of the join as hints to the storage engine, to prevent duplicated effort by the normal read-ahead mechanisms. It may be used for other purposes as well.

    The delayed prefetch iterator also checks for completion of previously-issued asynchronous I/Os:

    IO Completions

    …and completes SQL Server processing for any asynchronous I/Os that are found to have completed (the example below shows page checksums being validated):

    Page Checksum

    Details and Explanations

    The delayed prefetch iterator only reads extra rows from its input when it encounters a page that requires a physical I/O. Scatter reads can bring in multiple discontinuous pages in one I/O request, so extra outer side rows are read to make better use of the scatter read mechanism. If we are going to issue an I/O call, we might as well grab as much as we can.

    If an inner-side index page checked by delayed prefetch is already present in the buffer pool, no physical I/O is issued (because there is nothing to do), but a logical read is counted (because prefetch touched the buffer pool page to check it was there). Control then returns to the nested loops join iterator to continue its normal processing. If we know the inner-side page is in memory, we might as well get on with processing it – there is no need to read extra rows from the outer input.

    We can now fully explain the query plan differences we saw earlier.

    In the cold-cache test, the delayed prefetch iterator tests inner-side index pages, finds they are not in cache, and issues asynchronous scatter reads. A total of 35 rows are read by the delayed prefetch iterator in this test. The 1,000 rows needed for the query are produced by the nested loops join from just 4 of these rows, so we see only 4 executions on the inner side. (Remember the row-at-a-time, cursor-like execution of query plans.)

    In the warm-cache test, the delayed prefetch iterator does not encounter any pages that are not in memory, so no asynchronous I/Os are issued, and no extra rows are read. Prefetching just adds a few logical reads (of pages that were already in memory) to the reported cost of the query.

    Ordered and Unordered Prefetching

    The prefetching mode is shown on the nested loops join iterator. If prefetching is used, either the WithUnorderedPrefetch or WithOrderedPrefetch attributes will be shown set to true (SSMS users will need to click on the nested loops join and inspect the properties window to see these properties). If prefetching is not used, neither attribute will be present.

    An unordered prefetch allows the inner side of the join to proceed using data from whichever I/Os happened to complete first.

    An ordered prefetch preserves key order on the inner side of the nested loops join, so SQL Server ensures that asynchronous I/Os are processed in the correct order.

    Unordered prefetching may therefore be somewhat more efficient than ordered prefetching. That said, the optimizer may choose to use ordered prefetching where it can make use of the preserved-order guarantee to avoid a more expensive join type (e.g. hash join) or where it can avoid an explicit sort later in the execution plan.
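    One way to see which of your cached plans use either prefetch flavour is to query the plan cache for these showplan attributes directly. This is just a sketch (the attribute names are the ones described above; interrogating the whole plan cache can be expensive, so run it on a test system):

```sql
-- Find cached plans containing a nested loops join that uses prefetching
WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    cp.usecounts,
    qt.[text],
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS qt
WHERE qp.query_plan.exist(
    '//RelOp/NestedLoops[@WithUnorderedPrefetch="true"
        or @WithOrderedPrefetch="true"]') = 1;
```

    The same `exist()` predicate can be pointed at a single plan captured from SSMS if you only want to check one query.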

    Our test query features an ORDER BY clause for which the optimizer can avoid an explicit sort by using ordered prefetch:

    Plan Tree

    Second Example – Function Calls

    Regular readers will know that I am not a fan of scalar T-SQL functions, but I am going to use one here to illustrate another important aspect of prefetching. The function in question does nothing except return its input value (I have deliberately created the function without schema-binding, so SQL Server will treat it as non-deterministic):

    CREATE FUNCTION dbo.F
        (@ProductID integer)
    RETURNS integer
    AS
    BEGIN
        RETURN @ProductID;
    END;

    The next script uses the function in our test query as part of the join predicate:

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID,
        TH.TransactionDate
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = dbo.F(P.ProductID) -- Changed!
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY
        P.Name,
        TH.TransactionID;

    The execution plan (with a warm cache) shows 4 rows read from the outer input, and 4 executions of the inner side of the nested loops join:

    Warm cache plan

    The Seek Predicate for the inner-side seek now references the function. The function is not mentioned anywhere else in the query plan.

    How many times is the function executed?

    If you said four, you were wrong!

    The number of function invocations can be monitored using the Profiler event SP:StmtCompleted (despite the name and documentation suggesting it applies only to stored procedures). Profiler shows the function was called eight times:

    Profiler output

    Four calls correspond to the four executions of the inner-side index seek. The other four calls are caused by the hidden delayed prefetch iterator. Remember this is the warm cache case, where no prefetching is necessary.

    The delayed prefetch iterator still tries to check if the index page is in the buffer pool. In order to find the inner-side index page for the current value of ProductID, the prefetch iterator must execute the function. It needs the result of the function to seek into the inner-side index.

    The debugger call stack below shows the UDF being called from the delayed prefetch iterator:

    UDF call

    Cold Cache

    When pages need to be prefetched, the effect is even more dramatic:

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    SELECT TOP (1000)
        P.Name,
        TH.TransactionID,
        TH.TransactionDate
    FROM Production.Product AS P
    JOIN Production.TransactionHistory AS TH
        ON TH.ProductID = dbo.F(P.ProductID) -- Changed!
    WHERE
        P.Name LIKE N'[K-P]%'
    ORDER BY
        P.Name,
        TH.TransactionID;

    The outer input shows 35 rows (as we now expect). There are (as usual) only 4 executions of the inner side:

    Query plan

    Profiler shows the function is executed 39 times – 4 times by the inner side seek, and 35 times by prefetch!

    Profiler output 2

    Final Words

    The message here is not that prefetching is bad (it is generally a good thing), but that it can have unexpected side-effects, and can make interpreting nested loops joins more challenging. Never use T-SQL scalar functions! ;-)
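    As an aside on that last point, a common workaround is to rewrite the scalar function as an inline table-valued function, which the optimizer expands into the calling query instead of invoking it once per row. This is a sketch only, with a hypothetical name:

```sql
-- Inline table-valued equivalent of dbo.F (hypothetical name);
-- inline TVFs are expanded into the calling query plan, so they
-- should avoid the per-row invocation cost shown above
CREATE FUNCTION dbo.F_Inline (@ProductID integer)
RETURNS TABLE
AS
RETURN
    SELECT ProductID = @ProductID;
```

    The join can then use CROSS APPLY dbo.F_Inline(P.ProductID) AS F with ON TH.ProductID = F.ProductID in place of the scalar call.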

    © Paul White 2013 – All Rights Reserved
    Twitter: @SQL_Kiwi

    Additional Reading:

    Prefetch and Query Performance – Fabiano Amorim
    Random Prefetching – Craig Freedman
    Optimized Nested Loops Joins – Craig Freedman
    Optimizing I/O Performance by Sorting – Craig Freedman

  • Incorrect Results Caused By Adding an Index

    It’s not often that a SQL Server bug really surprises me or makes me wonder how it was never spotted before, but this is one of those occasions.

    To get right into it, say you have the following two tables:

    CREATE PARTITION FUNCTION PF (integer)
    AS RANGE RIGHT
    FOR VALUES (1000, 2000, 3000, 4000, 5000);

    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);

    -- Partitioned
    CREATE TABLE dbo.T1
    (
        T1ID    integer NOT NULL,
        SomeID  integer NOT NULL,

        CONSTRAINT [PK dbo.T1 T1ID]
            PRIMARY KEY CLUSTERED (T1ID)
            ON PS (T1ID)
    );

    -- Not partitioned
    CREATE TABLE dbo.T2
    (
        T2ID    integer IDENTITY (1,1) NOT NULL,
        T1ID    integer NOT NULL,

        CONSTRAINT [PK dbo.T2 T2ID]
            PRIMARY KEY CLUSTERED (T2ID)
            ON [PRIMARY]
    );

    Load table T1 with 4,999 rows. All of the rows have a SomeID value of 1234, and the T1ID primary key is sequentially numbered from 1 to 4,999:

    INSERT dbo.T1 (T1ID, SomeID)
    SELECT N.n, 1234
    FROM dbo.Numbers AS N
    WHERE N.n BETWEEN 1 AND 4999;

    Table T2 gets 999 rows, generated by adding T1ID values from T1 that divide exactly by 5:

    INSERT dbo.T2 (T1ID)
    SELECT T1ID
    FROM dbo.T1
    WHERE T1ID % 5 = 0;

    More visually, T1 looks like this (T1ID goes up to 4,999):

    T1 data

    And T2 looks like this (T2ID goes up to 999):

    T2 data

    The test query simply counts the rows that match between the two tables when joined on T1ID:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.T1ID = T1.T1ID
    WHERE T1.SomeID = 1234;

    The execution plan features a merge join:

    Merge Join Execution Plan

    The correct result (999) is returned and everyone is happy:

    Correct Result

    Enter the Index

    Now someone comes along and adds a new index to table T1:

    CREATE NONCLUSTERED INDEX IX_T1_SomeID
    ON dbo.T1 (SomeID DESC);

    This is a perfectly reasonable index, apparently essential for some crucial query or other. Let’s run our COUNT_BIG(*) query again:

    SELECT COUNT_BIG(*)
    FROM dbo.T1 AS T1
    JOIN dbo.T2 AS T2
        ON T2.T1ID = T1.T1ID
    WHERE T1.SomeID = 1234;

    The execution plan looks similar:

    Execution Plan with New Index

    But the result is wrong! There are still 999 matches in the underlying data.

    Wrong Result

    An Even Simpler Query

    With the new index still in place, we run this query:

    SELECT T1ID
    FROM dbo.T1
    WHERE SomeID = 1234
    ORDER BY T1ID ASC;

    This query should obviously return all the T1IDs from 1 to 4,999 in ascending order. Instead, we get:

    Wrongly Ordered Result Part 1

    The list starts at 4000, not 1! Out-of-order rows are also found further down:

    Wrongly Ordered Result Part 2

    The results are not ordered by T1ID, despite the ORDER BY T1ID ASC clause. Quite astonishing.


    Both problems are caused by a bug in the query optimizer, which is present in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. The bug produces a query plan that does not provide the ordering guarantees the optimizer thinks it does, leading to incorrect results. A partitioned table is required to reproduce the bug.

    The sneaky aspect to it is that the index which causes the problem could be added at any time, without the original query-writer’s knowledge. Equally, data changes could mean that a query plan that used to use a hash or nested loops join suddenly recompiles to choose a merge join. Since a merge join requires sorted input, the opportunity for suddenly incorrect (incomplete!) results is obvious (and an example was shown above).

    There is no trace flag that I am aware of that fixes this issue.

    I have opened a Connect item for this bug, and written a more detailed explanation in a guest post.


    The fix for this issue is now available and documented in a Knowledge Base article. Please note the fix requires a code update and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect-results bug to be fixed under 4199. I asked for clarification on that and the response was:

    Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.
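    Once the update is applied, one way to test the fix without enabling trace flag 4199 server-wide is the per-query QUERYTRACEON hint (sysadmin permission, or a plan guide, is required; the query below is the test query from earlier):

```sql
-- Enable query processor hotfixes (trace flag 4199) for this query only
SELECT COUNT_BIG(*)
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE T1.SomeID = 1234
OPTION (QUERYTRACEON 4199);
```

    Comparing the result with and without the hint (on a patched build) is a convenient way to confirm the behaviour change before enabling the flag more broadly.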

    Paul White

    Twitter : @SQL_Kiwi

    © 2013 All Rights Reserved

  • Improving Partitioned Table Join Performance

    The query optimizer does not always choose an optimal strategy when joining partitioned tables. This post looks at an example, showing how a manual rewrite of the query can almost double performance, while reducing the memory grant to almost nothing.

    Test Data

    The two tables in this example share a common partition scheme. The partition function uses 41 equal-size partitions:

    CREATE PARTITION FUNCTION PFT (integer)
    AS RANGE RIGHT
    FOR VALUES
        (
            125000, 250000, 375000, 500000,
            625000, 750000, 875000, 1000000,
            1125000, 1250000, 1375000, 1500000,
            1625000, 1750000, 1875000, 2000000,
            2125000, 2250000, 2375000, 2500000,
            2625000, 2750000, 2875000, 3000000,
            3125000, 3250000, 3375000, 3500000,
            3625000, 3750000, 3875000, 4000000,
            4125000, 4250000, 4375000, 4500000,
            4625000, 4750000, 4875000, 5000000
        );

    CREATE PARTITION SCHEME PST
    AS PARTITION PFT
    ALL TO ([PRIMARY]);

    The two tables are:

    CREATE TABLE dbo.T1
    (
        TID integer NOT NULL IDENTITY(0,1),
        Column1 integer NOT NULL,
        Padding binary(100) NOT NULL DEFAULT 0x,

        CONSTRAINT PK_T1
            PRIMARY KEY CLUSTERED (TID)
            ON PST (TID)
    );

    CREATE TABLE dbo.T2
    (
        TID integer NOT NULL,
        Column1 integer NOT NULL,
        Padding binary(100) NOT NULL DEFAULT 0x,

        CONSTRAINT PK_T2
            PRIMARY KEY CLUSTERED (TID, Column1)
            ON PST (TID)
    );

    The next script loads 5 million rows into T1 with a pseudo-random value between 1 and 5 for Column1. The table is partitioned on the IDENTITY column TID:

    INSERT dbo.T1 WITH (TABLOCKX)
        (Column1)
    SELECT
        (ABS(CHECKSUM(NEWID())) % 5) + 1
    FROM dbo.Numbers AS N
    WHERE n BETWEEN 1 AND 5000000;

    In case you don’t already have an auxiliary table of numbers lying around, here’s a script to create one with 10 million rows:

    CREATE TABLE dbo.Numbers (n bigint PRIMARY KEY);

    WITH
        L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
        L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
        L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
        L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
        L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
        L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
        Nums AS (SELECT n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM L5)
    INSERT dbo.Numbers WITH (TABLOCKX)
    SELECT TOP (10000000) n 
    FROM Nums 
    ORDER BY n;

    Table T1 contains data like this:

    T1 sample

    Next we load data into table T2. The relationship between the two tables is that table 2 contains ‘n’ rows for each row in table 1, where ‘n’ is determined by the value in Column1 of table T1. There is nothing particularly special about the data or distribution, by the way.

        INSERT dbo.T2 WITH (TABLOCKX)
            (TID, Column1)
        SELECT
            T.TID,
            N.n
        FROM dbo.T1 AS T
        JOIN dbo.Numbers AS N
            ON N.n >= 1
            AND N.n <= T.Column1;

    Table T2 ends up containing about 15 million rows:

    T2 sample

    The primary key for table T2 is a combination of TID and Column1. The data is partitioned according to the value in column TID alone.

    Partition Distribution

    The following query shows the number of rows in each partition of table T1:

        SELECT
            PartitionID = CA1.P,
            NumRows = COUNT_BIG(*)
        FROM dbo.T1 AS T
        CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P)
        GROUP BY CA1.P
        ORDER BY CA1.P;

    Partition Distribution T1

    There are 40 partitions containing 125,000 rows (40 * 125k = 5m rows). The rightmost partition remains empty. The next query shows the distribution for table 2:

        SELECT
            PartitionID = CA1.P,
            NumRows = COUNT_BIG(*)
        FROM dbo.T2 AS T
        CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P)
        GROUP BY CA1.P
        ORDER BY CA1.P;

    There are roughly 375,000 rows in each partition (the rightmost partition is also empty):

    Partition Distribution T2

    Ok, that’s the test data done.

    Test Query and Execution Plan

    The task is to count the rows resulting from joining tables 1 and 2 on the TID column:

        DECLARE @s datetime2 = SYSUTCDATETIME();

        SELECT
            COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID;

        SELECT DATEDIFF(MILLISECOND, @s, SYSUTCDATETIME());

    The optimizer chooses a plan using parallel hash join, and partial aggregation:

    Parallel Hash Join Plan

    The Plan Explorer plan tree view shows accurate cardinality estimates and an even distribution of rows across threads (click to enlarge the image):

    Plan Explorer Plan Tree View 1

    With a warm data cache, the STATISTICS IO output shows that no physical I/O was needed, and all 41 partitions were touched:


    Running the query without collecting an actual execution plan or STATISTICS IO output (for maximum performance), the query completes in around 2600ms.

    Execution Plan Analysis

    The first step toward improving on the execution plan produced by the query optimizer is to understand how it works, at least in outline.

    The two parallel Clustered Index Scans use multiple threads to read rows from tables T1 and T2. Parallel scan uses a demand-based scheme where threads are given page(s) to scan from the table as needed. This arrangement has certain important advantages, but does result in an unpredictable distribution of rows amongst threads. The point is that multiple threads cooperate to scan the whole table, but it is impossible to predict which rows end up on which threads.

    For correct results from the parallel hash join, the execution plan has to ensure that rows from T1 and T2 that might join are processed on the same thread. For example, if a row from T1 with join key value ‘1234’ is placed in thread 5’s hash table, the execution plan must guarantee that any rows from T2 that also have join key value ‘1234’ probe thread 5’s hash table for matches.

    The way this guarantee is enforced in this parallel hash join plan is by repartitioning rows to threads after each parallel scan. The two repartitioning exchanges route rows to threads using a hash function over the hash join keys. The two repartitioning exchanges use the same hash function so rows from T1 and T2 with the same join key must end up on the same hash join thread.

    Expensive Exchanges

    This business of repartitioning rows between threads can be very expensive, especially if a large number of rows is involved. The execution plan selected by the optimizer moves 5 million rows through one repartitioning exchange and around 15 million across the other.

    As a first step toward removing these exchanges, consider the execution plan selected by the optimizer if we join just one partition from each table, disallowing parallelism:

        SELECT
            COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE
            $PARTITION.PFT(T1.TID) = 1
            AND $PARTITION.PFT(T2.TID) = 1
        OPTION (MAXDOP 1);

    Single Partition Plan

    The optimizer has chosen a (one-to-many) merge join instead of a hash join. The single-partition query completes in around 100ms. If everything scaled linearly, we would expect that extending this strategy to all 40 populated partitions would result in an execution time around 4000ms. Using parallelism could reduce that further, perhaps to be competitive with the parallel hash join chosen by the optimizer.

    This raises a question. If the most efficient way to join one partition from each of the tables is to use a merge join, why does the optimizer not choose a merge join for the full query?

    Forcing a Merge Join

    Let’s force the optimizer to use a merge join on the test query using a hint:

        SELECT
            COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        OPTION (MERGE JOIN);

    This is the execution plan selected by the optimizer:

    Optimizer Merge Join Plan

    This plan results in the same number of logical reads reported previously, but instead of 2600ms the query takes 5000ms. The natural explanation for this drop in performance is that the merge join plan is only using a single thread, whereas the parallel hash join plan could use multiple threads.

    Parallel Merge Join

    We can get a parallel merge join plan using the same query hint as before, and adding trace flag 8649:

        SELECT
            COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        OPTION (MERGE JOIN, QUERYTRACEON 8649);

    The execution plan is:

    Optimizer Parallel Merge Join

    This looks promising. It uses a similar strategy to distribute work across threads as seen for the parallel hash join. In practice though, performance is disappointing. On a typical run, the parallel merge plan runs for around 8400ms; slower than the single-threaded merge join plan (5000ms) and much worse than the 2600ms for the parallel hash join. We seem to be going backwards!

    The logical reads for the parallel merge are still exactly the same as before, with no physical IOs. The cardinality estimates and thread distribution are also still very good (click to enlarge):

    Plan Explorer Tree View 2

    A big clue to the reason for the poor performance is shown in the wait statistics (captured by Plan Explorer Pro):

    Plan Explorer Pro Waits

    CXPACKET waits require careful interpretation, and are most often benign, but in this case excessive waiting occurs at the repartitioning exchanges. Unlike the parallel hash join, the repartitioning exchanges in this plan are order-preserving ‘merging’ exchanges (because merge join requires ordered inputs):

    Repartition Streams Operator Properties

    Parallelism works best when threads can just grab any available unit of work and get on with processing it. Preserving order introduces inter-thread dependencies that can easily lead to significant waits. In extreme cases, these dependencies can result in an intra-query deadlock, though exploring that will have to wait for another time.

    The potential for waits and deadlocks leads the query optimizer to cost parallel merge join relatively highly, especially as the degree of parallelism (DOP) increases. This high costing resulted in the optimizer choosing a serial merge join rather than parallel in this case. The test results certainly confirm its reasoning.

    Collocated Joins

    The optimizer has another available strategy when joining tables that share a common partition scheme. This strategy is a collocated join, also known as a per-partition join. It can be applied in both serial and parallel execution plans, though it is limited to 2-way joins in the current optimizer. Whether the optimizer chooses a collocated join or not depends on cost estimation. The primary benefits of a collocated join are that it eliminates an exchange and requires less memory, as we will see next.

    Costing and Plan Selection

    The query optimizer did consider a collocated join for our original query, but it was rejected on cost grounds. The parallel hash join with repartitioning exchanges appeared to be a cheaper option. There is no query hint to force a collocated join, so we have to mess with the costing framework to produce one for our test query. Pretending that IOs cost 50 times more than usual is enough to convince the optimizer to use collocated join with our test query:

        -- Pretend IOs are 50x cost temporarily
        DBCC SETIOWEIGHT (50);

        -- Co-located hash join
        SELECT
            COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID;

        -- Reset IO costing
        DBCC SETIOWEIGHT (1);

    Collocated Join Plan

    The estimated execution plan for the collocated join is:

    Collocated Hash Join

    The Constant Scan contains one row for each partition of the shared partitioning scheme, from 1 to 41. The hash repartitioning exchanges seen previously are replaced by a single Distribute Streams exchange using Demand partitioning. Demand partitioning means that the next partition id is given to the next parallel thread that asks for one.

    My test machine has eight logical processors, and all are available for SQL Server to use. As a result, there are eight threads in the single parallel branch in this plan, each processing one partition from each table at a time. Once a thread finishes processing a partition, it grabs a new partition number from the Distribute Streams exchange…and so on until all partitions have been processed.

    It is important to understand that the parallel scans in this plan are quite different from those in the parallel hash join plan. Although the scans have the same parallelism icon, tables T1 and T2 are not being co-operatively scanned by multiple threads in the same way. Each thread reads a single partition of T1 and performs a hash match join with the same partition from table T2. The properties of the two Clustered Index Scans show a Seek Predicate (unusual for a scan!) limiting the rows to a single partition:

    Clustered Index Scan Properties

    The crucial point is that the join between T1 and T2 is on TID, and TID is the partitioning column for both tables. A thread that processes partition ‘n’ is guaranteed to see all rows that can possibly join on TID for that partition. In addition, no other thread will see rows from that partition, so this removes the need for repartitioning exchanges.

    CPU and Memory Efficiency Improvements

    The collocated join has removed two expensive repartitioning exchanges and added a single exchange processing 41 rows (one for each partition id). Remember, the parallel hash join plan exchanges had to process 5 million and 15 million rows. The amount of processor time spent on exchanges will be much lower in the collocated join plan.

    In addition, the collocated join plan has a maximum of 8 threads processing single partitions at any one time. The 41 partitions will all be processed eventually, but a new partition is not started until a thread asks for it. Threads can reuse hash table memory for the new partition.

    The parallel hash join plan also had 8 hash tables, but with all 5,000,000 build rows loaded at the same time. The collocated plan needs memory for only 8 * 125,000 = 1,000,000 rows at any one time.

    Collocated Hash Join Performance

    The collocated join plan has disappointing performance in this case. The query runs for around 25,300ms despite the same IO statistics as usual. This is much the worst result so far, so what went wrong?

    It turns out that cardinality estimation for the single partition scans of table T1 is slightly low. The properties of the Clustered Index Scan of T1 (graphic immediately above) show the estimation was for 121,951 rows. This is a small shortfall compared with the 125,000 rows actually encountered, but it was enough to cause the hash join to spill to physical tempdb:

    Collocated Hash Join Performance

    A level 1 spill doesn’t sound too bad, until you realize that the spill to tempdb probably occurs for each of the 41 partitions.

    As a side note, the cardinality estimation error is a little surprising because the system tables accurately show there are 125,000 rows in every partition of T1. Unfortunately, the optimizer uses regular column and index statistics to derive cardinality estimates here rather than system table information (e.g. sys.partitions).
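    As an aside, the exact per-partition counts really are sitting in the metadata. A quick sketch using the objects created earlier in this post (note the `rows` column in sys.partitions is itself only approximate for heaps under some operations, but is exact here):

    ```sql
    -- sys.partitions already knows there are exactly 125,000 rows per
    -- populated partition of T1, but the optimizer does not use this
    -- information for cardinality estimation
    SELECT
        p.partition_number,
        p.rows
    FROM sys.partitions AS p
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.T1', N'U')
        AND p.index_id = 1
    ORDER BY p.partition_number;
    ```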

    Collocated Merge Join

    We will never know how well the collocated parallel hash join plan might have worked without the cardinality estimation error (and the resulting 41 spills to tempdb) but we do know:

    • Merge join does not require a memory grant; and
    • Merge join was the optimizer’s preferred join option for a single partition join

    Putting this all together, what we would really like to see is the same collocated join strategy, but using merge join instead of hash join. Unfortunately, the current query optimizer (SQL Server 2008 and later) cannot produce a collocated merge join; it only knows how to produce a collocated hash join. SQL Server 2005 used an APPLY pattern to join partitioned tables, so a collocated merge join was possible in that version.

    So where does this leave us?

    CROSS APPLY sys.partitions

    We can try to write our own collocated join query. We can use sys.partitions to find the partition numbers, and CROSS APPLY to get a count per partition, with a final step to sum the partial counts. The following query implements this idea:

        SELECT
            row_count = SUM(Subtotals.cnt)
        FROM
        (
            -- Partition numbers
            SELECT p.partition_number
            FROM sys.partitions AS p 
            WHERE
                p.[object_id] = OBJECT_ID(N'T1', N'U')
                AND p.index_id = 1
        ) AS P
        CROSS APPLY
        (
            -- Count per collocated join
            SELECT
                cnt = COUNT_BIG(*)
            FROM dbo.T1 AS T1
            JOIN dbo.T2 AS T2
                ON T2.TID = T1.TID
            WHERE
                $PARTITION.PFT(T1.TID) = P.partition_number
                AND $PARTITION.PFT(T2.TID) = P.partition_number
        ) AS SubTotals;

    The estimated plan is:

    Partition Apply

    The cardinality estimates aren’t all that good here, especially the estimate for the scan of the system table underlying the sys.partitions view.

    Nevertheless, the plan shape is heading toward where we would like to be. Each partition number from the system table results in a per-partition scan of T1 and T2, a one-to-many Merge Join, and a Stream Aggregate to compute the partial counts. The final Stream Aggregate just sums the partial counts.

    Execution time for this query is around 3,500ms, with the same IO statistics as always. This compares favourably with 5,000ms for the serial plan produced by the optimizer with the OPTION (MERGE JOIN) hint. This is another case of the sum of the parts being less than the whole – summing 41 partial counts from 41 single-partition merge joins is faster than a single merge join and count over all partitions.

    Even so, this single-threaded collocated merge join is not as quick as the original parallel hash join plan, which executed in 2,600ms.

    On the positive side, our collocated merge join uses only one logical processor and requires no memory grant. The parallel hash join plan used 16 threads and reserved 569 MB of memory:


    Using a Temporary Table

    Our collocated merge join plan should benefit from parallelism. The reason parallelism is not being used is that the query references a system table. We can work around that by writing the partition numbers to a temporary table (or table variable):

        DECLARE @s datetime2 = SYSUTCDATETIME();

        CREATE TABLE #P
            (partition_number integer PRIMARY KEY);

        INSERT #P (partition_number)
        SELECT p.partition_number
        FROM sys.partitions AS p 
        WHERE
            p.[object_id] = OBJECT_ID(N'T1', N'U')
            AND p.index_id = 1;

        SELECT
            row_count = SUM(Subtotals.cnt)
        FROM #P AS p
        CROSS APPLY
        (
            SELECT
                cnt = COUNT_BIG(*)
            FROM dbo.T1 AS T1
            JOIN dbo.T2 AS T2
                ON T2.TID = T1.TID
            WHERE
                $PARTITION.PFT(T1.TID) = p.partition_number
                AND $PARTITION.PFT(T2.TID) = p.partition_number
        ) AS SubTotals;

        DROP TABLE #P;

        SELECT DATEDIFF(MILLISECOND, @s, SYSUTCDATETIME());

    Using the temporary table adds a few logical reads, but the overall execution time is still around 3500ms, indistinguishable from the same query without the temporary table. The problem is that the query optimizer still doesn’t choose a parallel plan for this query, though the removal of the system table reference means that it could if it chose to:

    Temporary Table Plan

    In fact the optimizer did enter the parallel plan phase of query optimization (running search 1 for a second time):

    Optimizer Search Phases

    Unfortunately, the parallel plan found seemed to be more expensive than the serial plan. This is a crazy result, caused by the optimizer’s cost model not reducing operator CPU costs on the inner side of a nested loops join. Don’t get me started on that, we’ll be here all night.

    In this plan, everything expensive happens on the inner side of a nested loops join. Without a CPU cost reduction to compensate for the added cost of exchange operators, candidate parallel plans always look more expensive to the optimizer than the equivalent serial plan.

    Parallel Collocated Merge Join

    We can produce the desired parallel plan using trace flag 8649 again:

        SELECT
            row_count = SUM(Subtotals.cnt)
        FROM #P AS p
        CROSS APPLY
        (
            SELECT
                cnt = COUNT_BIG(*)
            FROM dbo.T1 AS T1
            JOIN dbo.T2 AS T2
                ON T2.TID = T1.TID
            WHERE
                $PARTITION.PFT(T1.TID) = p.partition_number
                AND $PARTITION.PFT(T2.TID) = p.partition_number
        ) AS SubTotals
        OPTION (QUERYTRACEON 8649);

    The actual execution plan is:

    Parallel Collocated Merge Join

    One difference between this plan and the collocated hash join plan is that a Repartition Streams exchange operator is used instead of Distribute Streams. The effect is similar, though not quite identical. The Repartition uses round-robin partitioning, meaning the next partition id is pushed to the next thread in sequence. The Distribute Streams exchange seen earlier used Demand partitioning, meaning the next partition id is pulled across the exchange by the next thread that is ready for more work. There are subtle performance implications for each partitioning option, but going into that would again take us too far off the main point of this post.


    The important thing is the performance of this parallel collocated merge join – just 1350ms on a typical run.

    The list below shows all the alternatives from this post (all timings include creation, population, and deletion of the temporary table where appropriate) from quickest to slowest:

    • Collocated parallel merge join: 1350ms
    • Parallel hash join: 2600ms
    • Collocated serial merge join: 3500ms
    • Serial merge join: 5000ms
    • Parallel merge join: 8400ms
    • Collocated parallel hash join: 25,300ms (hash spill per partition)

    The parallel collocated merge join requires no memory grant (aside from a paltry 1.2MB used for exchange buffers). This plan uses 16 threads at DOP 8, but 8 of those are (rather pointlessly) allocated to the parallel scan of the temporary table. These are minor concerns, but it turns out there is a way to address them if they bother you.

    Parallel Collocated Merge Join with Demand Partitioning

    This final tweak replaces the temporary table with a hard-coded list of partition ids (dynamic SQL could be used to generate this query from sys.partitions):

        SELECT
            row_count = SUM(Subtotals.cnt)
        FROM
        (
            VALUES
                (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),
                (15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),
                (29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41)
        ) AS P (partition_number)
        CROSS APPLY
        (
            SELECT
                cnt = COUNT_BIG(*)
            FROM dbo.T1 AS T1
            JOIN dbo.T2 AS T2
                ON T2.TID = T1.TID
            WHERE
                $PARTITION.PFT(T1.TID) = P.partition_number
                AND $PARTITION.PFT(T2.TID) = P.partition_number
        ) AS SubTotals
        OPTION (QUERYTRACEON 8649);

    The actual execution plan is:

    Parallel Collocated Merge Join - Demand

    The parallel collocated hash join plan is reproduced below for comparison:

    Parallel Collocated Hash Join

    The manual rewrite has another advantage that has not been mentioned so far: the partial counts (per partition) can be computed earlier than the partial counts (per thread) in the optimizer’s collocated join plan. The earlier aggregation is performed by the extra Stream Aggregate under the nested loops join.

    The performance of the parallel collocated merge join is unchanged at around 1350ms.
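    As mentioned earlier, dynamic SQL is one way to generate the hard-coded VALUES list from sys.partitions. A rough sketch, using the object names from this post and the common FOR XML PATH concatenation idiom (SQL Server 2008 and later):

    ```sql
    -- Sketch: build the (1),(2),...,(41) row constructor list from sys.partitions
    DECLARE @list nvarchar(max) =
        STUFF(
        (
            SELECT N',(' + CONVERT(nvarchar(10), p.partition_number) + N')'
            FROM sys.partitions AS p
            WHERE p.[object_id] = OBJECT_ID(N'dbo.T1', N'U')
                AND p.index_id = 1
            ORDER BY p.partition_number
            FOR XML PATH('')
        ), 1, 1, N'');

    DECLARE @sql nvarchar(max) =
        N'SELECT row_count = SUM(Subtotals.cnt) '
      + N'FROM (VALUES ' + @list + N') AS P (partition_number) '
      + N'CROSS APPLY (SELECT cnt = COUNT_BIG(*) '
      + N'FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID '
      + N'WHERE $PARTITION.PFT(T1.TID) = P.partition_number '
      + N'AND $PARTITION.PFT(T2.TID) = P.partition_number) AS SubTotals '
      + N'OPTION (QUERYTRACEON 8649);';

    EXECUTE sys.sp_executesql @sql;
    ```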

    Final Words

    It is a shame that the current query optimizer does not consider a collocated merge join (Connect item closed as Won’t Fix). The example used in this post showed an improvement in execution time from 2600ms to 1350ms using a modestly-sized data set and limited parallelism. In addition, the memory requirement for the query was almost completely eliminated  – down from 569MB to 1.2MB.

    The problem with the parallel hash join selected by the optimizer is that it attempts to process the full data set all at once (albeit using eight threads). It requires a large memory grant to hold all 5 million rows from table T1 across the eight hash tables, and does not take advantage of the divide-and-conquer opportunity offered by the common partitioning.

    The great thing about the collocated join strategies is that each parallel thread works on a single partition from both tables, reading rows, performing the join, and computing a per-partition subtotal, before moving on to a new partition.

    Note that the general technique is not limited to partitioned tables. Any table that can be naturally segmented using a driver table of values can be written to use a collocated merge join, by substituting a driver value for the partition id.
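    To illustrate the shape of a non-partitioned variant (a sketch only: the driver table dbo.Segments, its val column, and the SegmentCol columns on the joined tables are hypothetical stand-ins for the partition ids and $PARTITION predicates used above):

    ```sql
    -- Sketch: a driver table of segment values replaces the partition ids;
    -- each APPLY iteration joins one naturally-segmented slice of both tables
    SELECT
        row_count = SUM(Subtotals.cnt)
    FROM dbo.Segments AS S
    CROSS APPLY
    (
        SELECT
            cnt = COUNT_BIG(*)
        FROM dbo.T1 AS T1
        JOIN dbo.T2 AS T2
            ON T2.TID = T1.TID
        WHERE
            T1.SegmentCol = S.val
            AND T2.SegmentCol = S.val
    ) AS Subtotals;
    ```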

    From a thread’s point of view…

    If you have trouble visualizing what is happening from just looking at the parallel collocated merge join execution plan, let’s look at it again, but from the point of view of just one thread operating between the two Parallelism (exchange) operators.


    Our thread picks up a single partition id from the Distribute Streams exchange, and starts a merge join using ordered rows from partition 1 of table T1 and partition 1 of table T2. By definition, this is all happening on a single thread. As rows join, they are added to a (per-partition) count in the Stream Aggregate immediately above the Merge Join. Eventually, either T1 (partition 1) or T2 (partition 1) runs out of rows and the merge join stops.

    The per-partition count from the aggregate passes on through the Nested Loops join to another Stream Aggregate, which is maintaining a per-thread subtotal. Our same thread now picks up a new partition id from the exchange (say it gets id 9 this time). The count in the per-partition aggregate is reset to zero, and the processing of partition 9 of both tables proceeds just as it did for partition 1, and on the same thread.

    Each thread picks up a single partition id and processes all the data for that partition, completely independently from other threads working on other partitions. One thread might eventually process partitions (1, 9, 17, 25, 33, 41) while another is concurrently processing partitions (2, 10, 18, 26, 34) and so on for the other six threads at DOP 8.

    The point is that all 8 threads can execute independently and concurrently, continuing to process new partitions until the wider job (of which the thread has no knowledge!) is done. This divide-and-conquer technique can be much more efficient than simply splitting the entire workload across eight threads all at once.


    Related Reading

    Understanding and Using Parallelism in SQL Server

    Parallel Execution Plans Suck

    © 2013 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi

  • Hello Operator, My Switch Is Bored

    This is a post for T-SQL Tuesday #43 hosted by my good friend Rob Farley. The topic this month is Plan Operators. I haven’t taken part in T-SQL Tuesday before, but I do like to write about execution plans, so this seemed like a good time to start.

    This post is in two parts. The first part is primarily an excuse to use a pretty bad play on words in the title of this blog post (if you’re too young to know what a telephone operator or a switchboard is, I hate you). The second part of the post looks at an invisible query plan operator (so to speak).

    1. My Switch Is Bored

    Allow me to present the rare and interesting execution plan operator, Switch:

    Switch Showplan Operator

    Books Online has this to say about Switch:

    BOL description

    Following that description, I had a go at producing a Fast Forward Cursor plan that used the TOP operator, but had no luck. That may be due to my lack of skill with cursors, I’m not too sure.

    The only application of Switch in SQL Server 2012 that I am familiar with requires a local partitioned view:

        CREATE TABLE dbo.T1 (c1 int NOT NULL CHECK (c1 BETWEEN 00 AND 24));
        CREATE TABLE dbo.T2 (c1 int NOT NULL CHECK (c1 BETWEEN 25 AND 49));
        CREATE TABLE dbo.T3 (c1 int NOT NULL CHECK (c1 BETWEEN 50 AND 74));
        CREATE TABLE dbo.T4 (c1 int NOT NULL CHECK (c1 BETWEEN 75 AND 99));
        GO
        CREATE VIEW dbo.V1
        AS
        SELECT c1 FROM dbo.T1
        UNION ALL
        SELECT c1 FROM dbo.T2
        UNION ALL
        SELECT c1 FROM dbo.T3
        UNION ALL
        SELECT c1 FROM dbo.T4;

    Not only that, but it needs an updatable local partitioned view. We’ll need some primary keys to meet that requirement:

        ALTER TABLE dbo.T1
        ADD CONSTRAINT PK_T1
        PRIMARY KEY (c1);

        ALTER TABLE dbo.T2
        ADD CONSTRAINT PK_T2
        PRIMARY KEY (c1);

        ALTER TABLE dbo.T3
        ADD CONSTRAINT PK_T3
        PRIMARY KEY (c1);

        ALTER TABLE dbo.T4
        ADD CONSTRAINT PK_T4
        PRIMARY KEY (c1);

    We also need an INSERT statement that references the view. Even more specifically, to see a Switch operator, we need to perform a single-row insert (multi-row inserts use a different plan shape):

    INSERT dbo.V1 (c1)
    VALUES (1);

    And now…the execution plan:

    Switch Execution Plan

    The Constant Scan manufactures a single row with no columns. The Compute Scalar works out which partition of the view the new value should go in. The Assert checks that the computed partition number is not null (if it is, an error is returned). The Nested Loops Join executes exactly once, with the partition id as an outer reference (correlated parameter).

    The Switch operator checks the value of the parameter and executes the corresponding input only. If the partition id is 0, the uppermost Clustered Index Insert is executed, adding a row to table T1. If the partition id is 1, the next lower Clustered Index Insert is executed, adding a row to table T2…and so on.

    In case you were wondering, here’s a query and execution plan for a multi-row insert to the view:

    INSERT dbo.V1 (c1)
    VALUES (1), (2);

    Multi-row partitioned view insert

    Yuck! An Eager Table Spool and four Filters! I prefer the Switch plan.

    My guess is that almost all the old strategies that used a Switch operator have been replaced over time, using things like a regular Concatenation Union All combined with Start-Up Filters on its inputs. Other new (relative to the Switch operator) features like table partitioning have specific execution plan support that doesn’t need the Switch operator either. This feels like a bit of a shame, but perhaps it is just nostalgia on my part, it’s hard to know.

    Please do let me know if you encounter a query that can still use the Switch operator in 2012 – it must be very bored if this is the only possible modern usage!

    2. Invisible Plan Operators

    The second part of this post uses an example based on a question Dave Ballantyne asked using the SQL Sentry Plan Explorer plan upload facility. If you haven’t tried that yet, make sure you’re on the latest version of the (free) Plan Explorer software, and then click the Post to button. That will create a site question with the query plan attached (which can be anonymized if the plan contains sensitive information). Aaron Bertrand and I keep a close eye on questions there, so if you have ever wanted to ask a query plan question of either of us, that’s a good way to do it.

    The problem

    The issue I want to talk about revolves around a query issued against a calendar table. The script below creates a simplified version and adds 100 years of per-day information to it:

    USE tempdb;
    CREATE TABLE dbo.Calendar
    (
        dt          date     NOT NULL,
        isWeekday   bit      NOT NULL,
        theYear     smallint NOT NULL,

        CONSTRAINT PK__dbo_Calendar_dt
            PRIMARY KEY CLUSTERED (dt)
    );
    GO
    -- Monday is the first day of the week for me
    SET DATEFIRST 1;

    -- Add 100 years of data
    INSERT dbo.Calendar WITH (TABLOCKX)
        (dt, isWeekday, theYear)
    SELECT
        CA.dt,
        isWeekday =
            CASE
                WHEN DATEPART(WEEKDAY, CA.dt) IN (6, 7)
                THEN 0
                ELSE 1
            END,
        theYear = YEAR(CA.dt)
    FROM Sandpit.dbo.Numbers AS N
    CROSS APPLY
    (
        VALUES (DATEADD(DAY, N.n - 1, CONVERT(date, '01 Jan 2000', 113)))
    ) AS CA (dt)
    WHERE 
        N.n BETWEEN 1 AND 36525;

    The following query counts the number of weekend days in 2013:

    SELECT
        Days = COUNT_BIG(*)
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0;

    It returns the correct result (104) using the following execution plan:

    Original Plan

    The query optimizer has managed to estimate the number of rows returned from the table exactly, based purely on the default statistics created separately on the two columns referenced in the query’s WHERE clause. (Well, almost exactly, the unrounded estimate is 104.289 rows.)

    There is already an invisible operator in this query plan – a Filter operator used to apply the WHERE clause predicates. We can see it by re-running the query with the enormously useful (but undocumented) trace flag 9130 enabled:
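    A sketch of that re-run (the same query as before, with the trace flag applied at the statement level via the QUERYTRACEON hint):

    ```sql
    -- Re-run the count with undocumented trace flag 9130 to surface
    -- the Filter that is normally pushed into the scan as a residual
    SELECT
        Days = COUNT_BIG(*)
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0
    OPTION (QUERYTRACEON 9130);
    ```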

    TF 9130 plan

    Now we can see the full picture. The whole table is scanned, returning all 36,525 rows, before the Filter narrows that down to just the 104 we want. Without the trace flag, the Filter is incorporated in the Clustered Index Scan as a residual predicate. It is a little bit more efficient than using a separate operator, but residual predicates are still something you will want to avoid where possible. The estimates are still spot on though:

    Plan Explorer Actual vs Estimate

    Anyway, looking to improve the performance of this query, Dave added the following filtered index to the Calendar table:

    CREATE INDEX Weekends
    ON dbo.Calendar (theYear)
    WHERE isWeekday = 0;

    The original query now produces a much more efficient plan:

    Filtered index plan

    Unfortunately, the estimated number of rows produced by the seek is now wrong (365 instead of 104):

    Wrong Estimate!

    What’s going on? The estimate was spot on before we added the index!


    You might want to grab a coffee for this bit.

    Using another trace flag or two (8606 and 8612) we can see that the cardinality estimates were exactly right initially:

    8606 and 8612

    The highlighted information shows the initial cardinality estimates for the base table (36,525 rows), the result of applying the two relational selects in our WHERE clause (104 rows), and after performing the COUNT_BIG(*) group by aggregate (1 row). All of these are correct, but that was before cost-based optimization got involved :)

    Cost-based optimization

    When cost-based optimization starts up, the logical tree above is copied into a structure (the ‘memo’) that has one group per logical operation (roughly speaking). The logical read of the base table (LogOp_Get) ends up in group 7; the two predicates (LogOp_Select) end up in group 8 (with the details of the selections in subgroups 0-6). These two groups still have the correct cardinalities as trace flag 8608 output (initial memo contents) shows:


    During cost-based optimization, a rule called SelToIdxStrategy runs on group 8. Its job is to match logical selections to indexable expressions (SARGs). It successfully matches the selections (theYear = 2013, isWeekday = 0) to the filtered index, and writes a new alternative into the memo structure. The new alternative is entered into group 8 as option 1 (option 0 was the original LogOp_Select):

    Rule result

    The new alternative is to do nothing (PhyOp_NOP = no operation), but to instead follow the new logical instructions listed below the NOP.

    The LogOp_GetIdx (full read of an index) goes into group 21, and the LogOp_SelectIdx (selection on an index) is placed in group 22, operating on the result of group 21. The definition of the comparison ‘theYear = 2013’ (ScaOp_Comp downwards) was already present in the memo starting at group 2, so no new memo groups are created for that.

    New Cardinality Estimates

    The new memo groups require two new cardinality estimates to be derived. First, LogOp_GetIdx (full read of the index) gets a predicted cardinality of 10,436. This number comes from the filtered index statistics:

    DBCC SHOW_STATISTICS (Calendar, Weekends) 

    Filtered Index Header

    The second new cardinality derivation is for the LogOp_SelectIdx applying the predicate (theYear = 2013). To get a number for this, the cardinality estimator uses statistics for the column ‘theYear’, producing an estimate of 365 rows (there are 365 days in 2013!):

    DBCC SHOW_STATISTICS (Calendar, theYear)

    365 row estimate

    This is where the mistake happens. Cardinality estimation should have used the filtered index statistics here, to get an estimate of 104 rows:

    DBCC SHOW_STATISTICS (Calendar, Weekends)

    Filtered stats histogram

    Unfortunately, the logic has lost sight of the link between the read of the filtered index (LogOp_GetIdx) in group 21, and the selection on that index (LogOp_SelectIdx) that it is deriving a cardinality estimate for, in group 22. The correct cardinality estimate (104 rows) is still present in the memo, attached to group 8, but that group now has a PhyOp_NOP implementation.

    Skipping over the rest of cost-based optimization (in a belated attempt at brevity) we can see the optimizer’s final output using trace flag 8607:


    This output shows the (incorrect, but understandable) 365 row estimate for the index range operation, and the correct 104 estimate still attached to its PhyOp_NOP. This tree still has to go through a few post-optimizer rewrites and ‘copy out’ from the memo structure into a tree suitable for the execution engine. One step in this process removes PhyOp_NOP, discarding its 104-row cardinality estimate as it does so.

    To finish this section on a more positive note, consider what happens if we add an OVER clause to the query aggregate. This isn’t intended to be a ‘fix’ of any sort, I just want to show you that the 104 estimate can survive and be used if later cardinality estimation needs it:

    SELECT
        Days = COUNT_BIG(*) OVER ()
    FROM dbo.Calendar AS C
    WHERE
        theYear = 2013
        AND isWeekday = 0;

    The estimated execution plan is:

    OVER plan

    Note the 365 estimate at the Index Seek, but the 104 lives again at the Segment! We can imagine the lost predicate ‘isWeekday = 0’ as sitting between the seek and the segment in an invisible Filter operator that drops the estimate from 365 to 104.

    Even though the NOP group is removed after optimization (so we don’t see it in the execution plan) bear in mind that all cost-based choices were made with the 104-row memo group present, so although things look a bit odd, it shouldn’t affect the optimizer’s plan selection. I should also mention that we can work around the estimation issue by including the index’s filtering columns in the index key:

CREATE INDEX Weekends
ON dbo.Calendar(theYear, isWeekday)
WHERE isWeekday = 0
WITH (DROP_EXISTING = ON);

    There are some downsides to doing this, including that changes to the isWeekday column may now require Halloween Protection, but that is unlikely to be a big problem for a static calendar table ;)  With the updated index in place, the original query produces an execution plan with the correct cardinality estimation showing at the Index Seek:

    Fixed plan

    That’s all for today, remember to let me know about any Switch plans you come across on a modern instance of SQL Server!

    Finally, here are some other posts of mine that cover other plan operators:

    Segment and Sequence Project

    Common Subexpression Spools

    Why Plan Operators Run Backwards

    Row Goals and the Top Operator

    Hash Match Flow Distinct

    Top N Sort

    Index Spools and Page Splits

    Singleton and Range Seeks


    Hash Join Performance

    Compute Scalar

    © 2013 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi

  • SQL Intersection Conference, Las Vegas MGM Grand 8-11 April 2013


    You might have noticed the SQL Intersection logo appearing in my recent blog posts. This is a brand new event focussed on tuning and troubleshooting the SQL Server core engine (including Windows Azure SQL Database). If you read and enjoy at least some of the things I write about, this conference is one you should seriously consider attending.

    For my part, I’ll be presenting three sessions: two on Interpreting Execution Plans, and one on Parallel Query Execution. I have to promote my sessions, because just look at the other speakers you have to choose from:

    SQL Speakers

    As well as speaking, all these people will be around to chat to over the four conference days, and some are offering complete days of training at pre- and post-conference workshops as well.

    Personally, I’m particularly excited about talking to the following people and attending their sessions:

    • Bob Ward
    • Conor Cunningham
    • Paul Randal
    • Kimberly Tripp
    • Brent Ozar
    • Kendra Little
    • Grant Fritchey
    • Joe Sack
    • Jeremiah Peschka
    • Aaron Bertrand
    • Steve Jones

    Main Conference Overview

    Each of the main conference days has three tracks to choose from:


    Tuesday Schedule


    Wednesday Schedule


    Thursday Schedule

    The full schedule is available here as a PDF document, or you can visit the conference website for all the details.


    The conference pricing is very competitive (especially so, considering the location and speaker line-up):

    Pricing Grid

    Discount Code

    When registering, quote the Discount Code WHITE to receive an extra $50 off these prices.

    I really hope to see you in Las Vegas – please feel free to come and talk to me about anything concerning execution plans or the query optimizer :)

    Paul White

    SQL Server MVP 2011, 2012

  • Execution Plan Analysis: The Mystery Work Table


    I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.

    The Execution Plan

    Original Query Plan

    This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.

    High-Cost Operators

    There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:

    100% operator cost

    There is no doubt that this seek is a busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.

    Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.

    I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.

    Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.

    Execution Warnings

    Sort Warning

    This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.
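As a side note, workspace memory reservations for sorts and hashes can be observed while a query runs. One way (a sketch of my own, not from the original investigation) is the memory grants DMV:

```sql
-- Compare requested vs. granted vs. used workspace memory for
-- currently executing queries; a sort that spills will have used
-- all of its grant before writing runs to tempdb.
SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.used_memory_kb
FROM sys.dm_exec_query_memory_grants AS mg;
```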

    The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:

    Plan subtree replaced with a temp table
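The split described above can be sketched as follows. All table and column names here are hypothetical placeholders, since the real query’s schema is not shown:

```sql
-- Hypothetical illustration of the split: #Interim stands in for the
-- output of the plan up to and including the Filter operator.
SELECT f.key_col, f.payload
INTO #Interim
FROM dbo.FilteredSource AS f;   -- placeholder for the first part of the query

-- Clustered index keys chosen to match the original Sort operator's
-- order, so the second part of the query reads the rows pre-sorted,
-- and the sort is performed with a known input cardinality.
CREATE CLUSTERED INDEX c ON #Interim (key_col);
```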

    I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.

    I/O Information

    The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:

    I/O data - fast

    Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (but certainly not 100% of the total, ha ha), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.

    The second set of I/O data is from a slow run (18 minutes or so):

    I/O data - slow

    The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?

    My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.

    Execution Outline

    Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:

    Nested Loop Lookups

    The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).

    Reproducing the problem

    Table Creation

    The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2048 single-byte-character string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.

    CREATE TABLE dbo.T1 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T2 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T3 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T4 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T5 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T6 (id integer IDENTITY PRIMARY KEY, d char(2048));
    INSERT dbo.T1 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    INSERT dbo.T2 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    INSERT dbo.T3 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    INSERT dbo.T4 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    INSERT dbo.T5 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    INSERT dbo.T6 (d) SELECT REPLICATE('X', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;

    The next step is to ensure that each lookup table is optimally organized for read-ahead:
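A minimal way to do this (a sketch, assuming a simple rebuild of each clustered index suffices to remove logical fragmentation):

```sql
-- Rebuild each test table's clustered index so pages are in logical
-- order, maximizing the benefit of sequential read-ahead.
-- Repeat for dbo.T2 through dbo.T6.
ALTER INDEX ALL ON dbo.T1 REBUILD;
```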


    Test Query

    The original query translates into our simplified test rig as:

    DECLARE @d nvarchar(max) = NCHAR(10000);
    SELECT n.n,
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T1 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T2 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T3 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T4 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T5 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T6 AS t WHERE = n.n)
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    The broad idea is to concatenate our 2048-character column to itself five times, and to include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final output concatenates all the intermediate results. The query hints are necessary to get the right plan shape, just because my test rig tables are so much smaller than the real ones.

    Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):

    Test query execution plan

    I should also stress that the CONCAT operator (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.

    DECLARE @d nvarchar(max) = NCHAR(10000);
    SELECT n.n,
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T1 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T2 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T3 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T4 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T5 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T6 AS t WHERE = n.n)
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    Warm cache results

    With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data. A typical set of I/O statistics follows:


    Nothing too exciting to see there, but this is just our baseline.

    Cold cache results


    With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:


    The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:



    There are a number of factors in play here, which we will look at in turn.

    Nested Loops Prefetching

    One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.

    A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.

    SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.

    This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.

    In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:

    Ordered Prefetch

    The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.
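Prefetching can also be spotted programmatically. The sketch below (my own, not from the post) searches the plan cache for the Nested Loops prefetch attributes mentioned above:

```sql
-- Find cached plans whose Nested Loops operators carry either
-- prefetch attribute in the showplan XML.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT cp.plan_handle, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE qp.query_plan.exist(
    '//NestedLoops[@WithOrderedPrefetch or @WithUnorderedPrefetch]') = 1;
```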

    Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.
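For completeness, the flags can be enabled at session scope only while experimenting, rather than globally:

```sql
-- Session scope: disable Nested Loops prefetching for testing only.
DBCC TRACEON (8744);
-- ... run the query under investigation and compare timings ...
DBCC TRACEOFF (8744);
```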

    I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.

    Manufactured LOBs

    The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:

    Manufactured LOB

    By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.
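The type derivation is easy to confirm. The snippet below is my own sketch: because one argument of the CONCAT is an nvarchar(max) variable, the whole expression is derived as nvarchar(max) – a LOB type – even though the column input is a fixed-width char:

```sql
-- sys.dm_exec_describe_first_result_set (SQL Server 2012+) reports the
-- derived type of the expression; system_type_name shows nvarchar(max).
SELECT r.name, r.system_type_name
FROM sys.dm_exec_describe_first_result_set(
    N'DECLARE @d nvarchar(max) = NCHAR(10000);
      SELECT val = CONCAT(CONVERT(char(2048), ''A''), @d);', NULL, 0) AS r;
```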

    The Outer Join

    The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.

    Memory Allocation

    When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.

    However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.

    As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.

    The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.


    We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.

    DECLARE @d nvarchar(max) = NCHAR(10000);
    SELECT n.n,
                CONCAT(oa1.i0, oa1.i1, oa1.i2, oa1.i3, oa1.i4, oa1.i5, oa1.i6),
                CONCAT(oa2.i0, oa2.i1, oa2.i2, oa2.i3, oa2.i4, oa2.i5, oa2.i6),
                CONCAT(oa3.i0, oa3.i1, oa3.i2, oa3.i3, oa3.i4, oa3.i5, oa3.i6),
                CONCAT(oa4.i0, oa4.i1, oa4.i2, oa4.i3, oa4.i4, oa4.i5, oa4.i6),
                CONCAT(oa5.i0, oa5.i1, oa5.i2, oa5.i3, oa5.i4, oa5.i5, oa5.i6),
                CONCAT(oa6.i0, oa6.i1, oa6.i2, oa6.i3, oa6.i4, oa6.i5, oa6.i6)
    FROM dbo.Numbers AS n
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE = n.n) AS oa1
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE = n.n) AS oa2
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE = n.n) AS oa3
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE = n.n) AS oa4
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE = n.n) AS oa5
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE = n.n) AS oa6
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    The execution plan for the rewritten query looks visually similar to the problematic one:

    Rewritten query plan

    However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:


    All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:


    Warm cache results

    With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):


    Cold cache results

    When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:


    The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!)


    Final Thoughts

    Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.

    I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.

    I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)

    The real world does not always make these things possible, of course, but it is good to have goals :)

    © 2013 Paul White – All Rights Reserved
    twitter: @SQL_Kiwi

    Screenshots acquired using SnagIt by TechSmith
    Query plan details obtained using Plan Explorer PRO by SQLSentry

  • Halloween Protection – The Complete Series


    I have just published a four-part series on the Halloween Problem. Some of you will never have heard of this issue, and those that have might associate it only with T-SQL UPDATE queries. In fact, the Halloween problem affects execution plans for INSERT, UPDATE, DELETE and MERGE statements.

    This is a topic I have been meaning to write about properly for years, ever since I read Craig Freedman’s 2008 blog post on the topic, which ended with the cryptic comment:

    “…although I've used update statements for all of the examples in this post, some insert and delete statements also require Halloween protection, but I'll save that topic for a future post.”

    That future post never materialized, sadly, so I thought I would have a go. The four parts of the series are summarized and linked below; I hope you find the material interesting.

    Part 1 – The Halloween Problem and UPDATE statements

    • The SQL standard and three-phase separation
    • Logical update processing
    • Pipelined execution
    • The Halloween problem
    • Avoiding the problem in UPDATE statements

    Part 2 – The Halloween Problem in INSERT and DELETE queries

    • INSERT examples
    • DELETE examples
    • Constraint checking and phase separation

    Part 3 – Halloween Problem optimizations for MERGE

    • MERGE contains several optimizations the other DML statements do not
    • Hole-filling with merge join
    • Hole-filling with nested loops
    • Avoiding an extra B-tree navigation
    • Avoiding the join

    Part 4 – The Halloween Problem and the Query Optimizer

    • Early optimization approaches
    • The SQL Server optimizer approach
    • The case of the redundant sort
    • HP levels and properties
    • Plan changes for Halloween Protection
    • Non-spool options
    • Row versioning
    • Heaps and forwarded records
    • T-SQL functions

    As always, I appreciate your comments and feedback.

    Paul White


  • Incorrect Results with Indexed Views

    Summary: If you use MERGE, indexed views and foreign keys, your queries can return incorrect results.

    Microsoft have released a fix for incorrect results returned when querying an indexed view. The problem applies to:

    • SQL Server 2012
    • SQL Server 2008 R2
    • SQL Server 2008

    The Knowledge Base article does not go into much detail, or provide a reproduction script, but this blog entry does :)

    The KB does say that reproducing the bug requires these features:

    • An indexed view on two tables that have a foreign key relationship
    • An update performed against the base tables
    • A query executed against the indexed view using a NOEXPAND hint

    There are two important details I would like to add right away:

    • The NOEXPAND hint is not required to reproduce the bug on Enterprise Edition
    • The update must be performed by a MERGE statement
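For context, a NOEXPAND query against the repro view created later in this post looks like this (my illustration; the KB article does not give a query):

```sql
-- Read directly from the view's clustered index rather than expanding
-- the view definition (the hint is required for the optimizer to use a
-- view index at all outside Enterprise Edition).
SELECT v.child_id
FROM dbo.ParentsAndChildren AS v WITH (NOEXPAND);
```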

    The fix is available in the following cumulative update packages:

    Cumulative Update 2 for SQL Server 2012 SP1 [build 11.0.3339]
    Cumulative Update 5 for SQL Server 2012 RTM [build 11.0.2395]
    Cumulative Update 4 for SQL Server 2008 R2 SP2 [build 10.50.4270]
    Cumulative Update 10 for SQL Server 2008 R2 SP1 [build 10.50.2868]
    Cumulative Update 8 for SQL Server 2008 SP3 [build 10.00.5828]

    No service pack contains this fix; you must apply one of the hotfix packages above.
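    To check whether an instance is already on one of the fixed builds listed above, you can compare the server build number. This is a standard SERVERPROPERTY query (not part of the original repro script):

    ```sql
    -- Compare the build against the fixed builds listed above,
    -- e.g. 11.0.3339 for SQL Server 2012 SP1 CU2
    SELECT
        SERVERPROPERTY('ProductVersion') AS build,
        SERVERPROPERTY('ProductLevel') AS level,
        SERVERPROPERTY('Edition') AS edition;
    ```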

    Steps to Reproduce

    The first thing we will need is two tables:

    CREATE TABLE dbo.Parent 
    (
        parent_id integer IDENTITY NOT NULL,
        value varchar(20) NOT NULL,
        CONSTRAINT PK_Parent_id 
            PRIMARY KEY CLUSTERED (parent_id)
    );

    CREATE TABLE dbo.Child
    (
        child_id integer IDENTITY NOT NULL,
        parent_id integer NOT NULL,
        CONSTRAINT PK_Child_id 
            PRIMARY KEY CLUSTERED (child_id)
    );

    And a few rows of data:

    INSERT dbo.Child (parent_id)
    SELECT New.parent_id 
    FROM 
    (
        INSERT dbo.Parent (value)
        OUTPUT inserted.parent_id 
        VALUES ('Apple'), ('Banana'), ('Cherry')
    ) AS New;

    The tables now look like this (parent first):

    Original Data

    We can now add the required FOREIGN KEY relationship:

    ALTER TABLE dbo.Child
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (parent_id)
    REFERENCES dbo.Parent (parent_id);

    Next, a very simple indexed view that joins the two tables (the view could contain more complex features like aggregates):

    CREATE VIEW dbo.ParentsAndChildren
    WITH SCHEMABINDING
    AS
    SELECT p.parent_id, p.value, c.child_id
    FROM dbo.Parent AS p
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq 
    ON dbo.ParentsAndChildren (child_id);

    The final step is to use a MERGE statement to make some changes to the Parent table:

    DECLARE @ParentMerge AS TABLE 
    (
        parent_id integer PRIMARY KEY, 
        value varchar(20) NOT NULL
    );

    INSERT @ParentMerge
        (parent_id, value)
    VALUES
        (1, 'Kiwi Fruit'),
        (4, 'Dragon Fruit');

    MERGE dbo.Parent AS p
    USING @ParentMerge AS s ON 
        s.parent_id = p.parent_id
    WHEN MATCHED THEN
        UPDATE SET value = s.value
    WHEN NOT MATCHED THEN
        INSERT (value) VALUES (s.value)
    OUTPUT 
        $action,
        deleted.value AS old_value, 
        inserted.value AS new_value;

    This MERGE performs two actions:

    1. Updates the value column of parent row 1 from Apple to Kiwi Fruit
    2. Adds a new parent row 4 for Dragon Fruit

    The statement includes an OUTPUT clause to show the changes it makes (this is not required for the repro):

    MERGE changes

    This confirms that the changes have been made as we requested: parent row 1 has changed; and row 4 has been added. The changes are reflected in the base tables:

    SELECT * FROM dbo.Parent AS p;
    SELECT * FROM dbo.Child AS c;

    Updated Base Table Data

    As highlighted, row 1 has changed from Apple to Kiwi Fruit and row 4 has been added.

    We do not expect to see row 4 in the indexed view because there are no child records for that row, and the indexed view uses an inner join. Checking the indexed view using the NOEXPAND table hint (required in non-Enterprise SKUs to use indexes on a view):

    SELECT *
    FROM dbo.ParentsAndChildren WITH (NOEXPAND);

    Incorrect indexed view data

    The results are incorrect. They show the old value of the data for parent row 1.

    Now we try using the EXPAND VIEWS query hint to force SQL Server to access the base tables rather than reading view indexes:

    SELECT *
    FROM dbo.ParentsAndChildren
    OPTION (EXPAND VIEWS);

    Expand Views Hint

    This query produces correct results.

    On SQL Server Enterprise Edition, the optimizer chooses whether to access the indexed view or the base tables. For the following query, without any hints, the optimizer chooses not to expand the view. It reads the view index and produces incorrect results:

    -- Enterprise Edition ONLY
    SELECT * 
    FROM dbo.ParentsAndChildren;

    Indexed View Matching

    Perhaps adding a child row to match the new parent row 4 will somehow fix things up?

    INSERT dbo.Child (parent_id) VALUES (4);
    SELECT * FROM dbo.ParentsAndChildren WITH (NOEXPAND);
    SELECT * FROM dbo.ParentsAndChildren OPTION (EXPAND VIEWS);

    After Insert

    No. The query plan that accesses the view index still returns an incorrect value for row 1. It seems MERGE has corrupted our indexed view.

    Analysis using DBCC CHECKTABLE

    Checking the view with DBCC CHECKTABLE returns no errors:

    DBCC CHECKTABLE (ParentsAndChildren);

    DBCC output 1

    Unless we use the EXTENDED_LOGICAL_CHECKS option:

    DBCC CHECKTABLE (ParentsAndChildren) WITH EXTENDED_LOGICAL_CHECKS;


    DBCC output 2

    The damage is repairable:

    DBCC CHECKTABLE (ParentsAndChildren, REPAIR_REBUILD) WITH EXTENDED_LOGICAL_CHECKS;


    DBCC repair

    You probably do not want to set your database to SINGLE_USER mode and run a DBCC repair after every MERGE statement, however. We could also rebuild the indexed view’s clustered index manually, of course.


    For the MERGE statement above, the query optimizer builds a plan that does not update the indexed view (click to enlarge):

    Incorrect Update Plan

    In a version of SQL Server with the fix applied, the same MERGE statement produces a plan that does maintain the indexed view:

    Correct Update Plan

    The plan operators used to keep the view index in step with the base tables are highlighted. Without these operators, the changes to the base table are not correctly written to any indexes defined on the view. The root cause is related to the same simplification that allows the optimizer to remove the reference to the Parent table in this query:

    SELECT c.child_id, c.parent_id
    FROM dbo.Parent AS p 
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;

    Simplification Example Plan

    The FOREIGN KEY relationship and NOT NULL constraints on the referencing column together mean that the join to Parent cannot affect the result of the query, so the join is simplified away. In SQL Server 2012, we can see when this simplification is performed because the following message appears when undocumented trace flags 8619 and 3604 are enabled during compilation:

    Trace Flag 8619 output
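    As a sketch of how this can be observed (undocumented trace flags, so suitable for test systems only; the exact query shown is illustrative):

    ```sql
    -- TF 3604 redirects DBCC output to the client;
    -- TF 8619 reports the transformation rules the optimizer applies.
    DBCC TRACEON (3604, 8619);

    SELECT c.child_id
    FROM dbo.Parent AS p
    JOIN dbo.Child AS c ON c.parent_id = p.parent_id
    OPTION (RECOMPILE);   -- force a fresh compilation so the message is emitted

    DBCC TRACEOFF (3604, 8619);
    ```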

    The same message is emitted when a MERGE statement contains a WHEN MATCHED THEN UPDATE clause and either a WHEN NOT MATCHED THEN INSERT or WHEN MATCHED … THEN DELETE clause. These conditions combine such that the optimizer incorrectly concludes that a table reference can be removed, when in fact it is needed later on when the update side of the plan is built.

    Other details of the query and database can affect whether the simplification can be misapplied. For example, if the FOREIGN KEY constraint contains an ON DELETE CASCADE clause, and the MERGE contains a DELETE clause, the simplification is not performed, the TF 8619 message does not appear, and the bug does not manifest.

    The key to determining whether a particular query is vulnerable to this bug (TF 8619 aside) is to check whether the query plan includes operators to maintain the indexed view. At a minimum, you should see an update operator for the view:

    View Update Operator

    SQL Sentry Plan Explorer identifies the operator as applying to a view explicitly; in SSMS you need to click on the graphical operator and inspect the Properties window.


    The updated conditions for incorrect results are:

    • An indexed view that joins tables
    • Two of the tables have a single-column FOREIGN KEY constraint
    • A MERGE statement contains an UPDATE action that affects one of the tables
    • The MERGE statement also contains an INSERT or DELETE action (or both)
    • The optimizer applies a simplification that removes a table reference based on the FK relationship and other metadata
    • As a result, the MERGE execution plan does not contain the operators necessary to correctly maintain the indexed view
    • A subsequent query plan accesses an index on the view, either explicitly or via indexed-view matching (Enterprise Edition)


    Two further notes:

    • The simplification is not applied in tempdb
    • The simplification is not applied to multi-column foreign key constraints

    Under these conditions, the indexes on the view do not reflect the state of the base tables and incorrect results are returned. Once the hot fix is applied, the optimizer does not misapply the simplification so the correct indexed view maintenance features are built into execution plans.

    Update: I am adding this based on Ian Yates’ great question in the comments: my expectation is that applying this hotfix will not remove any existing indexed view corruption. You would need to test the hotfix as usual, apply it, and then either rebuild all affected indexed views manually, or run DBCC CHECKDB with the EXTENDED_LOGICAL_CHECKS option (which could take a long time).
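    For example, a manual rebuild of the indexed view's indexes (the view name here follows the demo script above; substitute your own) might look like:

    ```sql
    -- Rebuild all indexes on the view, recreating their content from the base tables
    ALTER INDEX ALL ON dbo.ParentsAndChildren REBUILD;
    ```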

    © 2013 Paul White – All Rights Reserved


    Twitter: @SQL_Kiwi

  • A creative use of IGNORE_DUP_KEY


    Let’s say you have a big table with a clustered primary key, and an application that inserts batches of rows into it. The nature of the business is that the batch will inevitably sometimes contain rows that already exist in the table.

    The default SQL Server INSERT behaviour for such a batch is to throw error 2627 (primary key violation), terminate the statement, roll back all the inserts (not just the rows that conflicted) and keep any active transaction open:

    CREATE TABLE #Big (pk int NOT NULL PRIMARY KEY);

    -- Start with values 1 & 5 in the table
    INSERT #Big (pk) VALUES (1), (5);

    -- Our batch transaction
    BEGIN TRANSACTION;
        -- Insert a batch containing some pre-existing rows
        INSERT #Big (pk) VALUES (1), (2), (3), (4), (5);
        -- Show the contents of the table after the insert statement
        SELECT pk FROM #Big;
        -- Show the transaction count
        SELECT tran_count = @@TRANCOUNT;
    -- Rollback
    ROLLBACK TRANSACTION;

    -- Final table contents
    SELECT pk FROM #Big;

    -- Tidy up
    DROP TABLE #Big;

    The output is:

    INSERT with duplicate values

    Ignoring Duplicates

    Let us further imagine that the desired behaviour is that new rows in a batch should be inserted, and any duplicates silently rejected. Most importantly, no error messages should be returned to the client. Ideally, this would be achieved without immediate application changes, and without impacting concurrent users of the table (the instance is Enterprise Edition, so online operations are available).

    This seems like an ideal use for the IGNORE_DUP_KEY option:

    CREATE TABLE dbo.Big 
    (
        pk int NOT NULL,
        CONSTRAINT PK_Big 
            PRIMARY KEY (pk)
            WITH (IGNORE_DUP_KEY = ON)
    );

    -- Unique values
    INSERT dbo.Big (pk)
    VALUES (1), (3), (5);

    -- key 3 already exists
    INSERT dbo.Big (pk)
    VALUES (2), (3), (4);

    That script executes successfully with just a warning (not an error!) about the duplicate key:

    IGNORE_DUP_KEY output

    The problem

    We would like to add the IGNORE_DUP_KEY option to the existing primary key constraint, but the ALTER TABLE command does not have an ALTER CONSTRAINT clause. We do not want to drop the existing primary key and recreate it with the new option, because the table is large and we want to avoid disrupting concurrent users. Dropping and recreating would result in rebuilding the entire table first as a heap and then as a clustered table again.

    Although there is no ALTER CONSTRAINT syntax, we do know that certain constraint modifications can be performed using ALTER INDEX REBUILD on the index used to enforce the constraint. We can, for example, change the type of compression used, the index fill factor, and whether row locking is enabled:

    INSERT dbo.Big (pk)
    VALUES (1), (2), (3), (4), (5);

    ALTER INDEX PK_Big ON dbo.Big
        REBUILD WITH (DATA_COMPRESSION = ROW, ONLINE = ON);

    ALTER INDEX PK_Big ON dbo.Big
        REBUILD WITH (FILLFACTOR = 90, ALLOW_ROW_LOCKS = ON, ONLINE = ON);

    DROP TABLE dbo.Big;

    Unfortunately, we cannot use the same trick to add the IGNORE_DUP_KEY option to the underlying index for the primary key:

    ALTER INDEX PK_Big ON dbo.Big
        REBUILD WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);


    The same error message results even if the ONLINE = ON option is not specified. There will be a range of views about how accurate the error message text is here, but we cannot avoid the fact that the operation we want to perform is not supported by SQL Server.

    A creative workaround

    One idea is to add a new UNIQUE constraint (or index) on the same columns as the primary key, but with the IGNORE_DUP_KEY option added. This operation can be performed ONLINE:

    -- Existing table
    CREATE TABLE dbo.Big 
    (
        pk int NOT NULL,
        CONSTRAINT PK_Big PRIMARY KEY (pk)
    );

    -- Existing data
    INSERT dbo.Big (pk) VALUES (1), (2), (3);

    -- New constraint (or index) with IGNORE_DUP_KEY, added ONLINE
    ALTER TABLE dbo.Big
        ADD CONSTRAINT UQ_Big
        UNIQUE NONCLUSTERED (pk)
        WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);

    --CREATE UNIQUE NONCLUSTERED INDEX UQ_Big
    --ON dbo.Big (pk)
    --WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);

    -- Key 3 is a duplicate, just a warning now
    INSERT dbo.Big (pk) VALUES (3), (4), (5);

    SELECT pk FROM dbo.Big;

    The new arrangement results in the correct final state of the database, without throwing an error. The effect is the same as if we had been able to modify the existing primary key constraint to add IGNORE_DUP_KEY:

    Output with duplicate unique index

    Execution plan analysis

    There are some drawbacks to this idea, which we will explore in some detail. The drawbacks are significant enough that adding the extra index with IGNORE_DUP_KEY is only really suitable as a temporary solution. The first INSERT in the previous script (without the extra constraint or index with IGNORE_DUP_KEY) is pretty trivial; it just shows the constants from the VALUES clause being inserted into the clustered index:

    Trivial INSERT

    The second INSERT (with the IGNORE_DUP_KEY index) is rather more complex (click to enlarge):


    An extra index to maintain

    One fairly obvious consequence of adding the new index is that the Clustered Index Insert operator now has to maintain the new nonclustered index too. I used Plan Explorer above because it shows the per-row nonclustered index insert more explicitly than SSMS, where you have to dig into the Object node in the Properties window with the relevant graphical operator selected:

    SSMS Object Node

    Another way to see the nonclustered index maintenance explicitly is to run the query again with undocumented trace flag 8790 to produce a wide update plan:

    Wide update plan

    Besides maintaining the extra index, the IGNORE_DUP_KEY plans seem to be doing a lot of extra work: there are lots of new operators compared with the simple insert. As you would expect, all the new operators are associated with ignoring duplicate keys, and there are two distinct cases to consider.

    Rows that already exist

    The first case relates to checking for INSERT rows that already exist in the base table. This checking is implemented in the execution plan by the Left Semi Join, Index Seek, and Assert operators:

    Plan Right Fragment

    The Index Seek looks for the key value of the current row we are looking to insert. The Semi Join cannot be an Inner Join because that would reject rows where a match was not found (and we need to insert a row in that case). Nevertheless, the query processor needs to know if an existing row was found with the same key value. The Semi Join sets a value to indicate this, which is stored in the Probe Column:

    Nested Loops Operator Properties

    The Assert operator (known internally as a Stream Check) tests a condition and raises an error if the test fails. Assert operators are often seen in plans that affect columns with CHECK constraints, for example. In this case, however, the Assert does not raise an error, it emits the ‘Duplicate key was ignored.’ message instead, and only passes a row on to its parent (the Sort operator) if the condition passes. The check performed by the Assert is based on the Probe value stored in the expression labelled [Expr1013] above:

    Assert Operator Tooltip

    The Assert passes on rows where the predicate evaluates to NULL. If a match was found by the Index Seek (so the Probe Column, Expr1013, is not NULL) the Assert does not pass on the row and raises a warning instead. The following TF 8607 output fragment for the INSERT statement shows the option (there is nothing equivalent in regular show plan output, unfortunately):


    Duplicates within the Insert Set

    The first check only looks for rows from the insert set that already exist in the base table. The query processor also needs to check for duplicates within the set of values we are inserting. The Segment and Top operators in the plan combine to meet this requirement:

    Plan Middle Fragment

    The Segment requires a stream ordered by the index keys, and adds a flag to indicate the start of a new group of key values. The Top operator returns only one row from each group. The overall effect is to remove rows with duplicate index key values.
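    The effect of the Segment and Top pair can be sketched in T-SQL. This is only an analogy, not what the engine actually executes: keep one row per key value from the insert set:

    ```sql
    -- Analogy only: one row per duplicate-key group,
    -- as the Segment/Top pair achieves in the execution plan
    WITH InsertSet AS
    (
        SELECT v.pk,
               rn = ROW_NUMBER() OVER (PARTITION BY v.pk ORDER BY v.pk)
        FROM (VALUES (1), (1), (2), (3)) AS v (pk)
    )
    SELECT pk
    FROM InsertSet
    WHERE rn = 1;   -- returns 1, 2, 3
    ```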

    The Top and Segment execution operators together implement a single physical query processor operator called a “Group-By Top”. I mention this detail because we need to be familiar with the internal name to understand the TF 8607 output, which indicates that this operator also emits a ‘Duplicate key was ignored.’ warning when it encounters a group with more than one row:



    Adding the new index means that locks will now be taken on the nonclustered index when checking for existing rows:

    Index Seek Locks

    It might surprise you to learn that the index seek acquires Range S-U locks on the nonclustered index (extract from the Books Online link below):

    Range Locks

    Key-range locks like this are only taken under the SERIALIZABLE isolation level, but our INSERT statement uses them whatever isolation level the user runs the query under (for example, range locks are still taken if the session is running under the default READ COMMITTED isolation level).

    SQL Server raises the effective isolation level to SERIALIZABLE for the Index Seek (and just that operator) because it needs to be sure that if it does not find a match in the index, that situation remains the same until it inserts the new row. The Index Seek is looking for a key value that does not exist, so a range lock is necessary to prevent a concurrent transaction inserting a row with that key value before we do. There is no existing key in the index to lock (because the row does not exist) so a range lock is the only option. If SQL Server did not do this, our INSERT query might fail with a duplicate key error despite the IGNORE_DUP_KEY setting (we check for a row, don’t find it, someone else inserts it, then we try to).
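    One way to observe the key-range locks is to run the insert inside an open transaction and inspect the lock DMV before rolling back. A sketch, using the demo table from above (look for RangeS-U in the output):

    ```sql
    BEGIN TRANSACTION;

        -- Insert a new key; the existence check takes key-range locks
        INSERT dbo.Big (pk) VALUES (99);

        -- Inspect locks held by this session
        SELECT resource_type, request_mode, request_status
        FROM sys.dm_tran_locks
        WHERE request_session_id = @@SPID;

    ROLLBACK TRANSACTION;
    ```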

    The optimizer adds a SERIALIZABLE hint for the Index Seek operator (internally, a physical Range operator) as we can see using another extract from the TF 8607 output:

    Query Processor Hints

    The execution plan is also forced to use the nonclustered index with the IGNORE_DUP_KEY setting for this seek (FORCEDINDEX) and an update lock (UPDLOCK) hint is also given to help prevent conversion deadlocks. Nevertheless, you may find increased deadlocking if you choose to add an extra IGNORE_DUP_KEY index like this.

    Permanent solutions

    Adding a nonclustered unique index with the IGNORE_DUP_KEY option and the same key as the clustered primary key allowed us to solve an immediate problem without code changes, while keeping the table online, but it does come at a price. The execution plan is much more complex (and expensive) than the original INSERT, and there is a chance of deadlocks. The biggest performance impact of adding the extra index is of course the cost of maintaining it, meaning we need to look at a more permanent solution via a code change.


    The (trivial) test I am going to run inserts 5000 unique rows from my table of numbers into the table we have been using so far. To establish a baseline, we will first look at the execution plan for an INSERT to the table with a nonclustered IGNORE_DUP_KEY index:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000;


    This plan is very similar to the one we analysed earlier, though a Merge is used to perform the Left Semi Join, and a wide update plan has been chosen, making the nonclustered index insert easier to see. The other main feature is an Eager Table Spool, required for Halloween Protection. The estimated cost of this execution plan is 0.363269 cost units on my installation.

    Modified INSERT

    The first code alternative to the extra index is to modify the INSERT statement to check that rows do not already exist in the destination. There are a number of SQL syntaxes for this, each with different characteristics and performance in different circumstances. To keep things simple, and because I only want to make a couple of specific points here, I have chosen my simple example to produce the same execution plans for all the common syntax variants. The first thing to do is to drop our extra constraint:

    ALTER TABLE dbo.Big
        DROP CONSTRAINT UQ_Big;

    Now we can evaluate the modified INSERT statement:

    -- I prefer this syntax
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS (SELECT * FROM dbo.Big WHERE pk = n);
    -- With EXCEPT
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    EXCEPT
    SELECT pk FROM dbo.Big;
    -- Not recommended
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND n NOT IN (SELECT pk FROM dbo.Big);

    All three produce this execution plan:

    Modified INSERT test

    This has an estimated cost of 0.0981188 units – much cheaper than the 0.363269 cost seen previously. The Halloween Protection spool still features in the plan, but there is a weakness in our queries that you might have already spotted. We are not doing anything to protect against duplicate key violation errors in case someone else inserts a row after we have checked to see if it exists, but before we insert it. The query optimizer added a SERIALIZABLE hint when it added an existence check, so if avoiding errors is important to us, we need to do the same:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS 
    (
        SELECT * FROM dbo.Big WITH (SERIALIZABLE) 
        WHERE pk = n
    );

    We do not need the UPDLOCK hint for two reasons. First, the engine automatically takes update locks for us when reading the source table. Second, we are reading from the same index we are inserting to (not reading from a nonclustered index and inserting to the clustered index) so the previous deadlock scenario is not applicable.

    Using MERGE

    Another option is to use the WHEN NOT MATCHED feature of the Merge statement. This time we will add the necessary SERIALIZABLE hint up front:

    MERGE dbo.Big WITH (SERIALIZABLE) AS b
    USING (SELECT n FROM dbo.Numbers WHERE n BETWEEN 1 AND 5000) AS s
    ON s.n = b.pk
    WHEN NOT MATCHED THEN
        INSERT (pk) VALUES (s.n);

    MERGE test

    This plan has an estimated cost of 0.0950127 units – slightly less than the 0.0981188 units for the modified INSERT plans. Some of this improvement is due to the lack of a Halloween Protection spool, for interesting reasons I cover in depth in a short series of articles to be published shortly (update: part one is out).

    These are not meant to be performance tests by any stretch of the imagination. Any number of subtle factors will affect the execution plans and run times for different numbers of rows, different data distributions, and so on. I should stress that I usually find MERGE plans perform less well than separate INSERT/UPDATE/DELETE statements. Anyway, in case you are interested, typical performance results on my machine for this specific test are (INSERT first, then MERGE), timings in milliseconds:

    Performance results

    IGNORE_DUP_KEY and Clustered Indexes

    When IGNORE_DUP_KEY is specified for a unique clustered index (primary key or otherwise), duplicates are handled by the storage engine rather than the query processor. For example:

    CREATE TABLE T 
    (
        pk int NOT NULL,
        CONSTRAINT PK_T
            PRIMARY KEY CLUSTERED (pk)
            WITH (IGNORE_DUP_KEY = ON)
    );

    INSERT T VALUES (1), (1), (2), (3);

    The execution plan shows none of the complexity of the nonclustered index case:


    The TF 8607 output contains none of the query processor hints to ignore duplicates and raise warnings:


    The IGNORE_DUP_KEY side of things is handled entirely by the storage engine – if it finds a row in the unique index where it was going to insert one, it flags a warning but does not try to insert (which would cause an error). The estimated cost of this query plan is identical whether the unique clustered index has IGNORE_DUP_KEY or not – there is no extra work for the query processor.

    If you are wondering why the same mechanism is not used for a nonclustered unique index with the IGNORE_DUP_KEY option, my understanding is that this is because the clustered index is always updated before the nonclustered ones (even in a narrow plan). By the time the storage engine detected a duplicate in a nonclustered index, it would already have added the row to the base table. So, the query processor handles IGNORE_DUP_KEY for nonclustered indexes.

    This is my understanding of the mechanics of IGNORE_DUP_KEY; I don’t know for sure, and will happily correct any details I may have wrong. Any storage engine experts reading this, please contact me if so.


    This post was inspired by a post by Daniel Adeniji and our subsequent discussion of it. My thanks to him for permission to analyse the issue further here, and for allowing me to reference his original. The issue described here may not accurately reflect Daniel’s original problem – I used a certain amount of artistic licence. Even if the specific issue is not particularly interesting to you, I hope you enjoyed some aspects of the analysis and maybe picked up some new information along the way.

    © 2013 Paul White – All rights reserved
    twitter: @SQL_Kiwi

  • Optimizing T-SQL queries that change data


    Most tuning efforts for data-changing operations concentrate on the SELECT side of the query plan. Sometimes people will also look at important storage engine considerations like locking and transaction log throughput that can have dramatic effects. As a consequence, a number of common practices have emerged, such as avoiding large numbers of row locks and lock escalation, splitting large changes into smaller batches of a few thousand rows and combining a number of small changes into a single transaction in order to optimize log flushes.

    This is all good of course, but what about the data-changing side of the query plan – the INSERT, UPDATE, DELETE or MERGE operation itself – are there any query processor considerations we should take into account? The short answer is yes. The query optimizer considers different plan options for the write-side of an I/U/D/M query, though there isn’t a huge amount of T-SQL language support that allows us to affect these choices directly. Nevertheless, there are things to be aware of, and things we can look to change.

    Side note: In this post I am going to use the term update (lower case) to apply to any operation that changes the state of the database (INSERT, UPDATE, DELETE and MERGE in T-SQL). This is a common practice in the literature, and is used inside SQL Server too as we will see.

    The Three Basic Update Forms

    First a bit of background. All query plans execute using a demand-driven iterator model, and updates are no exception. Parent operators drive child operators (to the right of the parent) to do work by asking them for a row at a time. Take the following simple AdventureWorks INSERT for example:

    DECLARE @T AS TABLE (ProductID int);
    INSERT @T (ProductID) 
    SELECT p.ProductID 
    FROM Production.Product AS p;

    Query Plan Execution Order Example

    Plan execution starts at the far left, where you can think of the green T-SQL INSERT icon as representing rows returned to the client. This root node asks its immediate child operator (the Table Insert) for a row. The Table Insert requests a row from the Index Scan, which provides one. This row is inserted into the heap, and an empty row is returned to the root node (if the INSERT query contained an OUTPUT clause, the returned row would contain data). This row-by-row process continues until all source rows have been processed. Notice that the XML show plan output shows the Heap Table Insert is performed by an Update operator internally. Ok, on to the matter at hand:

    1. Wide, per-index updates

    Wide per-index update plans have a separate update operator for each clustered and nonclustered index. It is always possible for the optimizer to produce this type of plan; the alternatives described later either cannot be used, or have not been implemented, when certain engine features are present. For example, if the update target has an indexed view defined on it, a wide update plan has to be used to maintain the indexed view. An example per-index update plan is shown below:

    Wide update plan

    This plan updates the base table using a Clustered Index Delete operator. This operator may also read and output extra column values necessary to find and delete rows from nonclustered indexes. The iterative delete activity on the base table is driven by the Eager Table Spool on the top branch. The spool is eager because it stores all the rows from its input in a worktable before returning the first row to its parent operator (the Index Delete on the top branch). The effect is that all qualifying base table rows are deleted before any nonclustered index changes occur.

    The spool in this plan is a common subexpression spool. It is populated once and then acts as a row source for multiple consumers. In this case, the contents of the spool are consumed first by the top-branch Index Delete, which removes index entries from one of the nonclustered indexes. When the Sequence operator switches to asking for rows from the lower branch, the spool is rewound and replays its rows for the second nonclustered index. Note that the spool contains the UNION of all columns required by nonclustered index changes.

    2. Narrow, per-row updates

    Narrow per-row updates have a single update operator which maintains the base table (heap or clustered) and one or more nonclustered indexes. Each row arriving at the update operator updates all indexes associated with the operator before processing the next row. An example:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int UNIQUE, col2 int UNIQUE);
    DECLARE @S AS TABLE (pk int PRIMARY KEY, col1 int, col2 int);

    INSERT @T (pk, col1)
    SELECT s.pk, s.col1
    FROM @S AS s;

    The execution plan viewed using SQL Sentry Plan Explorer shows the nonclustered index maintenance explicitly (hover over the Clustered Index Insert for a tooltip showing the names of the nonclustered indexes involved):

    Plan Explorer Narrow Update Plan

    In SQL Server Management Studio, there is no information about the nonclustered index maintenance in the graphical plan or tooltips, we need to click on the update operator (Clustered Index Insert) and then check the Properties window:

    SSMS narrow index updates

    This shows the clustered index and two nonclustered indexes being maintained. Because all indexes are updated for each row streaming through the update operator, this plan shape is also known as a per-row update plan.

    The query optimizer uses cost-based reasoning to decide whether to update each nonclustered index separately (per-index) or as part of the base table changes (per-row). An example that updates one nonclustered index per-row, and another per-index is shown below:

    CREATE TABLE #T (pk int IDENTITY NOT NULL, col1 int, col2 varchar(8000) DEFAULT '');
    CREATE INDEX nc1 ON #T (col1);
    CREATE INDEX nc2 ON #T (col1) INCLUDE (col2);

    The combination strategy can be seen with Plan Explorer (click to expand in a new tab):

    Combination index update

    The details are also displayed in the SSMS Properties window when the Clustered Index Insert operator is selected as well.

    3. Single-operator updates

    The third form of update is an optimization of the per-row update plan for very simple operations. The cost-based optimizer still emits a per-row update plan, but a rewrite is subsequently applied to collapse the reading and writing operations into a single operator:

    -- Simple operations
    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    INSERT @T (pk, col1) VALUES (1, 1000);
    UPDATE @T SET col1 = 1234 WHERE pk = 1;
    DELETE @T WHERE pk = 1;

    The complete execution plans (and extracts from the XML plans) are shown below:

    Single operator update plans

    The UPDATE and DELETE plans have been collapsed to a Simple Update operation, while the INSERT is simplified to a Scalar Insert. We can see the original optimizer output tree with trace flags 3604 and 8607:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    DBCC TRACEON (3604, 8607);
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DBCC TRACEOFF (3604, 8607);

    The trace output for the INSERT statement shown above is:

    Optimizer output tree

    Notice the two physical operators: a Constant Scan (an in-memory table) containing the literal values specified in the query, and a Stream Update operator that performs the database changes per row. All three statements produce a similar optimizer output tree (and all use a stream update). We can prevent the single-operator optimization from being applied with undocumented trace flag 8758:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    DBCC TRACEON (8758);
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DBCC TRACEOFF (8758);

    This exposes the expanded per-row update plans produced by the optimizer before the single-operator rewrite:

    Expanded single operator plan

    Single operator plans can be significantly more efficient than the original multiple-operator form in the right circumstances, for example if the plan is executed very frequently.

    Update plan optimizations

    From this point forward, I’m going to use simple AdventureWorks DELETE examples, but the general points apply equally to INSERT, UPDATE and MERGE. The examples do not have a complicated SELECT component, because I want to concentrate on the writing side of the plan rather than the reading side.

    Per-row updates

    It’s worth emphasising again that per-row update plans are an optimization that is not available for every update query. The optimizer favours a per-row plan if the expected number of rows at the update operator is low. In the example below, the optimizer expects to delete 25 rows:

    Per-row updates

    This plan updates the base table (a clustered index in this case) and all nonclustered indexes in sequence per row. The row source here is an ordered scan of a nonclustered index, which means rows will not generally be presented in clustered index key order to the update operator.

    An unordered stream like this tends to result in random I/O (assuming physical I/O is required at all, of course). If the plan is expected to process only a small number of rows, the optimizer decides it is not worth adding extra operations to encourage sequential I/O.
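    The 25-row estimate in the example above can be produced using the TOP/OPTIMIZE FOR pattern covered in more detail later in the post. As a sketch, following that general form:

    -- Set a 25-row estimate at the update operator via a hinted TOP
    DECLARE @n bigint = 9223372036854775807;

    DELETE TOP (@n)
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 25));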

    Unordered Prefetch

    If the expected number of rows is a bit larger, the optimizer may decide to build a plan that applies prefetching to one or more update operators. The basic idea is the same as ordinary prefetching (a.k.a. read-ahead) for scans and range seeks: the engine looks ahead at the keys of the incoming stream and issues asynchronous I/Os for pages the update operator will need soon. Prefetching helps reduce the impact of the random I/O mentioned a moment ago. An example of prefetching on the update operator is shown below for the same query, with an expected cardinality of 50 rows:


    The prefetching is unordered from the perspective of the clustered index. Neither SSMS nor Plan Explorer show the prefetch information prominently. In Plan Explorer, we need to have the With Unordered Prefetch optional column selected. To do this, switch to the Plan Tree tab, open the Column Chooser, and drag the column to the grid. The unordered prefetch property is ticked for the Clustered Index Delete operator in the screenshot below:

    Unordered Prefetch in Plan Explorer

    In SSMS, click on the Clustered Index Delete icon in the graphical plan, and then look in the Properties window:

    SSMS Unordered Prefetch

    If the query plan were reading from the clustered index, the read operation would issue regular read-ahead, so there would be no point prefetching from the Clustered Index Delete as well. The example below shows the expected cardinality increased to 100, where the optimizer switches from scanning the nonclustered index with unordered prefetching to scanning the clustered index. No prefetching occurs on the update operator in this plan:

    No clustered index prefetch

    Where the plan property With Unordered Prefetch is not set, it is simply omitted – it does not appear set to False as you might expect.

    Ordered Prefetch

    Where rows are expected to arrive at an update operator in non-key order, the optimizer might consider adding an explicit sort. The idea is that presenting rows in key order will encourage sequential I/O for the index pages. The optimizer weighs the cost of the sort against the expected benefits of avoiding random I/O. The execution plan below shows an example of a sort being added for this reason:

    Sort to reduce random I/O

    The sort is on Transaction ID, the clustering key for this table. With the incoming stream now sorted, the update operator can use ordered prefetching. The plan is still scanning the nonclustered index on the read side, so read-ahead issued by that operator does not bring in the clustered index pages needed by the update operator. Ordered prefetching tends to result in sequential I/O, compared with the mostly random I/O of unordered prefetching. The With Ordered Prefetch column is also not shown by default in Plan Explorer, so it needs to be added as we did before:

    Plan Explorer Ordered Prefetch

    Not every update operator with ordered prefetching requires a sort. If the optimizer finds a plan that naturally presents keys in nonclustered index order, an update operator may still use ordered prefetching.

    DML Request Sort

    When the optimizer decides to explore a plan alternative that requires ordered input to an update operator, it will set the DML Request Sort property for that operator. Plan Explorer shows this setting in the update operator tooltip:

    Plan Explorer DML Request Sort

    SSMS shows With Ordered Prefetch and DML Request Sort in the Properties window:

    SSMS DMLRequestSort and WithOrderedPrefetch

    Nonclustered index updates

    Sorting and ordered prefetching may also be considered for the nonclustered indexes in a per-index update plan.

    Sort for nonclustered index

    This plan shows all three update operators with DML Request Sort set. The Clustered Index Delete does not require an explicit sort because rows arrive (with internal ordering guarantees) from the Clustered Index Scan. The two nonclustered indexes do require explicit sorts, however. Both nonclustered index update operators also use ordered prefetching; the clustered index update does not, because the scan automatically invokes read-ahead for that index.

    The next example shows that each update operator is considered separately for the various optimizations:

    Indexes optimized separately

    That plan features unordered prefetching on the Clustered Index Delete (because the row source is a scan of a nonclustered index), an explicit sort with ordered prefetching on the top branch Index Delete, and unordered prefetching on the bottom branch Index Delete.

    The per-index and per-row update plan

    Is the following a per-index or per-row update plan?

    Per-index and per-row plan

    It appears to be a per-index plan because there are two separate update operators, but there is no blocking operator (eager spool or sort) between them. This plan is a pipeline: each row returned from the seek is processed first by the Clustered Index Delete and then by the Index Delete. In that sense, both updates (clustered and nonclustered) are certainly per-row.

    The important thing is not the terminology; it is being able to interpret the plan. Now that we know what the various optimizations are for, we can see why the optimizer chose the plan above over the seemingly equivalent, but more efficient-looking, narrow plan:

    Narrow plan

    The narrow plan has no prefetching. The Seek provides read-ahead for the clustered index pages, but the single nonclustered index has no mechanism to prefetch any pages it might need from disk. By contrast, the previous wide update plan has two update operators: the Clustered Index Delete has no prefetching for the same reason as in the narrow plan, but the Index Delete has unordered prefetching enabled.

    So, although the smaller narrow plan looks like it ought to be more efficient, it might perform less well if nonclustered index pages are required from disk and unordered prefetching proves effective. On the other hand, if all required nonclustered index pages are in memory at the time the query is executed, the wide plan might perform less well since it features an extra operator and has to perform all the work associated with prefetching (even though ultimately no physical reads are issued).

    Performance Impacts

    You may have noticed a lack of performance numbers (I/O statistics, elapsed times and so on) in this post. There is a good reason for this. The optimizations presented here are quite dependent on SQL Server configuration and hardware. I don’t want you drawing general conclusions based on the performance of the rather ordinary single SATA drive in my laptop, the 256MB of RAM I have allowed for my SQL Server 2012 instance, and some rather trivial AdventureWorks examples.

    Which Optimizations are Good?

    It depends, naturally.

    If the data structures you are changing are very likely to be in memory, and/or if you have a very fast random I/O system, the optimizer’s goal of minimizing random I/O and enhancing sequential I/O may not make much sense for you. The strategies the optimizer employs may well end up costing you much more than they save.

    On the other hand, if your system has a much smaller buffer pool than database working set size, and your I/O system works very much faster with large sequential I/O than with smaller random I/O requests, you should generally find the optimizer makes reasonable choices for you, subject to the usual caveats about useful up-to-date statistics and accurate cardinality estimation, of course.

    If your system fits the optimizer’s model, at least to a first approximation, you will usually find that narrow update plans are best for low-cardinality update operations, unordered prefetching helps a bit when more rows are involved, ordered prefetching helps more, and explicit sorts before nonclustered index updates help most of all for the largest sets. That is the theory, anyway.

    What are the Problems?

    The problem I see most often is with the optimization that is supposed to help most: the explicit sort before a nonclustered index update. The idea is that sorting the input to the index update in key order promotes sequential I/O, and often it does. The problem occurs if the workspace memory allocated to the sort at execution time proves to be too small to perform the sort entirely in memory. The memory grant is fixed based on cardinality estimation and row size estimates, and may be reduced if the server is under memory pressure at the time.

    In any case, however it occurs, a sort that runs out of memory spills whole sort runs to physical tempdb disk, often repeatedly. Unsurprisingly, this is not at all good for performance, and can result in queries that complete very much slower than if the sort had not been introduced at all. SQL Server reuses the general Sort operator for sorting index keys – it does not have a sort specially optimized for index updates which could make a best effort to sort within the memory allocated, but never spill to disk. Perhaps I should suggest this on Connect :)

    The narrow plan optimization can cause problems where it is selected due to a low cardinality estimate, where several nonclustered indexes need to be maintained, the necessary pages are not in memory, and the I/O system is slow at random reads.

    The ordered and unordered prefetch options cause problems much more rarely. For a system with all data in memory, there is a small but possibly measurable impact due to the prefetch overhead (finding pages to consider read-ahead for, checking to see if they are already in the buffer pool, and so on). Even if no asynchronous I/O is ever issued by the worker, it still spends time on the task that it could spend doing real work.

    What options are there?

    The usual answer to deal with poor execution plan choices due to a system’s configuration or performance characteristics not matching the optimizer’s model is to try different T-SQL syntax, and/or to try query hints. There are times when it is possible to rewrite the query to get a plan that performs better, and other times where rethinking the problem a bit can pay dividends. Various creative uses of partitioning, minimal logging, and bulk loading are possible in some cases, for example.

    There are very few hints that can help with the update side of a plan that does not respond to the usual tricks, however. The two I use most often affect the optimizer’s cardinality estimation - OPTION (FAST n) and a trick involving TOP and the OPTIMIZE FOR query hint. OPTION (FAST n) affects cardinality estimates in a plan by setting a final row goal, but its effects can be difficult to predict, and may not have much of an effect if the plan contains blocking operators. Anyway, the general idea is to vary the value of ‘n’ in the hint until the optimizer chooses the desired plan shape or optimization options as described in this post. Best of luck.
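    As a sketch of the FAST approach (the table is from the examples in this post, but the predicate is my own illustration, and the appropriate value of ‘n’ has to be found by experiment):

    -- Hypothetical example: set a low row goal with FAST n
    -- to influence cardinality estimates on the update side
    DELETE Production.TransactionHistory
    WHERE TransactionDate < '20071001'
    OPTION (FAST 10);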

    The idea with TOP is similar, but tends to work better in my experience. The trick is to declare a bigint variable with the number of rows the query should return (or a very large value such as the maximum value of a bigint if all rows are required), use that as the parameter to a TOP, and then use the OPTIMIZE FOR hint to set a value for ‘n’ that the optimizer should use when considering alternative plans. This option is particularly useful for encouraging a narrow plan.

    Most of the examples in this post used the TOP trick in the following general form:

    DECLARE @n bigint = 9223372036854775807;
    DELETE TOP (@n) 
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 50));

    No magic trace flags?

    Sure there are trace flags that affect the optimizations in this post. They are undocumented and unsupported of course, and may only work on some versions, but they can be handy to validate a performance analysis, or to generate a plan guide for a particularly crucial query. They can also be fun to play with on a personal system to explore their effects. The main ones that affect the optimizations described here are:

    2332 : Force DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    8633 : Enable prefetch (CUpdUtil::FPrefetchAllowedForDML and CPhyOp_StreamUpdate::FDoNotPrefetch)

    8744 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)

    8758 : Disable rewrite to a single operator plan (CPhyOp_StreamUpdate::PqteConvert)

    8790 : Force a wide update plan (CUpdUtil::PexprBuildWideUpdatePlan)

    8795 : Disable DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    9115 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)



    These trace flags can all be manipulated using the usual DBCC commands or enabled temporarily for a particular query using the OPTION (QUERYTRACEON xxxx) hint.
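    For example, to force a wide update plan for a single statement using flag 8790 from the list above (a sketch against the same AdventureWorks table; QUERYTRACEON generally requires sysadmin permissions):

    DELETE TOP (50)
    FROM Production.TransactionHistory
    OPTION (QUERYTRACEON 8790);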

    Final Thoughts

    The optimizer has a wide range of choices when building the writing side of an update plan. It may choose to update one or more indexes separately, or as part of the base table update. It may choose to explicitly sort rows as part of a per-index update strategy, and may elect to perform unordered or ordered prefetching as well. As usual, these decisions are made based on costs computed using the optimizer’s model. This model may not always produce plans that are optimal for your hardware (and as usual the optimizer’s choices are only as good as the information you give it).

    If a particular update query is performance critical for you, make sure cardinality estimates are reasonably accurate. Test with alternate syntax and/or trace flags to see if an alternative plan would perform significantly better in practice. It is usually possible to use documented techniques like TOP clauses and OPTIMIZE FOR hints to produce an execution plan that performs well. Where that is not possible, more advanced tricks and techniques like plan guides may be called for.

    Thanks for reading today, I hope this post was interesting and provided some new insights into update query optimization.

    © 2013 Paul White – All rights reserved
    twitter: @SQL_Kiwi

  • MERGE Bug with Filtered Indexes

    A MERGE statement can fail, and incorrectly report a unique key violation when:

    • The target table uses a unique filtered index; and
    • No key column of the filtered index is updated; and
    • A column from the filtering condition is updated; and
    • Transient key violations are possible

    Example Tables

    Say we have two tables, one that is the target of a MERGE statement, and another that contains updates to be applied to the target.  The target table contains three columns, an integer primary key, a single character alternate key, and a status code column.  A filtered unique index exists on the alternate key, but is only enforced where the status code is ‘a’:

    CREATE TABLE #Target 
    (
        pk          integer NOT NULL, 
        ak          character(1) NOT NULL,
        status_code character(1) NOT NULL,
        PRIMARY KEY (pk)
    );

    CREATE UNIQUE NONCLUSTERED INDEX uq1
    ON #Target (ak)
    INCLUDE (status_code)
    WHERE status_code = 'a';

    The changes table contains just an integer primary key (to identify the target row to change) and the new status code:

    CREATE TABLE #Changes
    (
        pk          integer NOT NULL,
        status_code character(1) NOT NULL,
        PRIMARY KEY (pk)
    );

    Sample Data

    The sample data for the example is:

    INSERT #Target 
        (pk, ak, status_code)
    VALUES
        (1, 'A', 'a'),
        (2, 'B', 'a'),
        (3, 'C', 'a'),
        (4, 'A', 'd');

    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'd'),
        (4, 'a');

             Target                     Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ a           ║    ║  1 ║ d           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ d           ║

    The target table’s alternate key (ak) column is unique, for rows where status_code = ‘a’.  Applying the changes to the target will change row 1 from status ‘a’ to status ‘d’, and row 4 from status ‘d’ to status ‘a’.  The result of applying all the changes will still satisfy the filtered unique index, because the ‘A’ in row 1 will be deleted from the index and the ‘A’ in row 4 will be added.

    Merge Test One

    Let’s now execute a MERGE statement to apply the changes:

    MERGE #Target AS t
    USING #Changes AS c ON = c.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    The MERGE changes the two target rows as expected.  The updated target table now contains:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║ <—changed from ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║ <—changed from ‘d’
    ╚════╩════╩═════════════╝

    Merge Test Two

    Now let’s repopulate the changes table to reverse the updates we just performed:

    TRUNCATE TABLE #Changes;
    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'a'),
        (4, 'd');

    This will change row 1 back to status ‘a’ and row 4 back to status ‘d’.  As a reminder, the current state of the tables is:

             Target                        Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ d           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ a           ║
    ╚════╩════╩═════════════╝

    We execute the same MERGE statement:

    MERGE #Target AS t
    USING #Changes AS c ON = c.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    However this time we receive the following message:

    Msg 2601, Level 14, State 1, Line 1
    Cannot insert duplicate key row in object 'dbo.#Target' with unique index 'uq1'. The duplicate key value is (A).
    The statement has been terminated.

    Applying the changes using UPDATE

    Let’s now rewrite the MERGE to use UPDATE instead:

    UPDATE t
    SET status_code = c.status_code
    FROM #Target AS t
    JOIN #Changes AS c ON = c.pk
        AND c.status_code <> t.status_code;

    This query succeeds where the MERGE failed.  The two rows are updated as expected:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ a           ║ <—changed back to ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ d           ║ <—changed back to ‘d’
    ╚════╩════╩═════════════╝

    What went wrong with the MERGE?

    In this test, the MERGE query execution happens to apply the changes in the order of the ‘pk’ column.

    In test one, this was not a problem: row 1 is removed from the unique filtered index by changing status_code from ‘a’ to ‘d’ before row 4 is added.  At no point does the table contain two rows where ak = ‘A’ and status_code = ‘a’.

    In test two, however, the first change applied was to row 1, changing its status from ‘d’ to ‘a’.  At that moment there would be two rows in the filtered unique index where ak = ‘A’, because both row 1 and row 4 would meet the index filtering criterion (status_code = ‘a’).

    The storage engine does not allow the query processor to violate a unique key (unless IGNORE_DUP_KEY is ON, but that is a different story, and doesn’t apply to MERGE in any case).  This strict rule applies regardless of the fact that if all changes were applied, there would be no unique key violation (row 4 would eventually be changed from ‘a’ to ‘d’, removing it from the filtered unique index, and resolving the key violation).

    Why it went wrong

    The query optimizer usually detects when this sort of temporary uniqueness violation could occur, and builds a plan that avoids the issue.  I wrote about this a couple of years ago in my post Beware Sneaky Reads with Unique Indexes (you can read more about the details on pages 495-497 of Microsoft SQL Server 2008 Internals or in Craig Freedman’s blog post on maintaining unique indexes).  To summarize though, the optimizer introduces Split, Filter, Sort, and Collapse operators into the query plan to:

    1. Split each row update into a delete followed by an insert
    2. Filter out rows that would not change the index (due to the filter on the index, or a non-updating update)
    3. Sort the resulting stream by index key, with deletes before inserts
    4. Collapse delete/insert pairs on the same index key back into an update

    The effect of all this is that only net changes are applied to an index (as one or more insert, update, and/or delete operations).  In this case, the net effect is a single update of the filtered unique index: changing the row for ak = ‘A’ from pk = 4 to pk = 1.  In case that is less than 100% clear, let’s look at the operation in test two again:

             Target                     Changes                   Result
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ d           ║    ║  1 ║ A  ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║                            ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝                            ╚════╩════╩═════════════╝

    From the filtered index’s point of view (filtered for status_code = ‘a’ and shown in nonclustered index key order) the overall effect of the query is:

      Before           After
    ╔════╦════╗    ╔════╦════╗
    ║ pk ║ ak ║    ║ pk ║ ak ║
    ╠════╬════╣    ╠════╬════╣
    ║  4 ║ A  ║    ║  1 ║ A  ║
    ║  2 ║ B  ║    ║  2 ║ B  ║
    ║  3 ║ C  ║    ║  3 ║ C  ║
    ╚════╩════╝    ╚════╩════╝

    The single net change there is a change of pk from 4 to 1 for the nonclustered index entry ak = ‘A’.  This is the magic performed by the split, sort, and collapse.  Notice in particular how the original changes to the index key (on the ‘ak’ column) have been transformed into an update of a non-key column (pk is included in the nonclustered index).  By not updating any nonclustered index keys, we are guaranteed to avoid transient key violations.

    The Execution Plans

    The estimated MERGE execution plan that produces the incorrect key-violation error looks like this (click to enlarge in a new window):

    MERGE plan

    The successful UPDATE execution plan is (click to enlarge in a new window):

    UPDATE execution plan

    The MERGE execution plan is a narrow (per-row) update.  The single Clustered Index Merge operator maintains both the clustered index and the filtered nonclustered index.  The UPDATE plan is a wide (per-index) update.  The clustered index is maintained first, then the Split, Filter, Sort, Collapse sequence is applied before the nonclustered index is separately maintained.

    There is always a wide update plan for any query that modifies the database. The narrow form is a performance optimization where the number of rows is expected to be relatively small, and is not available for all operations.  One of the operations that should disallow a narrow plan is maintaining a unique index where intermediate key violations could occur.


    The MERGE can be made to work (producing a wide update plan with split, sort, and collapse) by:

    • Adding all columns referenced in the filtered index’s WHERE clause to the index key (INCLUDE is not sufficient); or
    • Executing the query with trace flag 8790 set e.g. OPTION (QUERYTRACEON 8790).

    Undocumented trace flag 8790 forces a wide update plan for any data-changing query (remember that a wide update plan is always possible).  Either change will produce a successfully-executing wide update plan for the MERGE that failed previously.
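    A sketch of the first workaround, rebuilding the filtered index with the filter column moved into the key (the index name uq1 is reconstructed from the error message shown earlier):

    -- Move the filter column into the index *key* (INCLUDE is not
    -- sufficient) so the optimizer builds a wide update plan with
    -- the protective split/sort/collapse sequence
    DROP INDEX uq1 ON #Target;

    CREATE UNIQUE NONCLUSTERED INDEX uq1
    ON #Target (ak, status_code)
    WHERE status_code = 'a';

    Alternatively, re-run the failing MERGE unchanged, adding OPTION (QUERYTRACEON 8790) to force the wide plan directly.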


    The optimizer fails to spot the possibility of transient unique key violations with MERGE under the conditions listed at the start of this post.  It incorrectly chooses a narrow plan for the MERGE, which cannot provide the protection of a split/sort/collapse sequence for the nonclustered index maintenance.

    The MERGE plan may fail at execution time depending on the order in which rows are processed, and the distribution of data in the database.  Worse, a previously solid MERGE query may suddenly start to fail unpredictably if a filtered unique index is added to the merge target table at any point.

    Connect bug filed here

    Bug reproduced on the following SQL Server versions (all x64 Developer Edition):

    2012 SP1 CU1 (build 11.0.3321 – November 2012)
    2008 R2 SP2 CU3 (build 10.50.4266 – October 2012)
    2008 SP3 CU8 (build 10.0.5828 – November 2012)

    © 2012 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi
