THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Page Free Space

See also my SQL Server technical articles on

  • Cardinality Estimation for Disjunctive Predicates in SQL Server 2014

  • Nested Loops Prefetching

  • Incorrect Results Caused By Adding an Index

  • Improving Partitioned Table Join Performance

  • Hello Operator, My Switch Is Bored

  • Execution Plan Analysis: The Mystery Work Table


    I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.

    The Execution Plan

    Original Query Plan

    This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.

    High-Cost Operators

    There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:

    100% operator cost

    There is no doubt that this seek is busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.

    Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.

    I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.

    Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.

    Execution Warnings

    Sort Warning

    This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.

    The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:

    Plan subtree replaced with a temp table

    I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.

    I/O Information

    The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:

    I/O data - fast

    Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (but certainly not 100% of the total, ha ha), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.

    The second set of I/O data is from a slow run (18 minutes or so):

    I/O data - slow

    The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?

    My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.

    Execution Outline

    Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:

    Nested Loop Lookups

    The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).

    Reproducing the problem

    Table Creation

    The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2048 single-byte-character string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.

    CREATE TABLE dbo.T1 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T2 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T3 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T4 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T5 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T6 (id integer IDENTITY PRIMARY KEY, d char(2048));
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;

    The next step is to ensure that each lookup table is optimally organized for read-ahead:


    Test Query

    The original query translates into our simplified test rig as:

    DECLARE @d nvarchar(max) = NCHAR(10000);
    SELECT n.n,
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T1 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T2 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T3 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T4 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T5 AS t WHERE = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T6 AS t WHERE = n.n)
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    The broad idea there is to concatenate our 2048-character column to itself five times and include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final result is the result of concatenating all the intermediate results. The query hints are necessary to get the right plan shape, just because my test rig tables are so much smaller than the real ones.

    Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):

    Test query execution plan

    I should also stress that the CONCAT operator (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.

    DECLARE @d nvarchar(max) = NCHAR(10000);
    SELECT n.n,
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T1 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T2 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T3 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T4 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T5 AS t WHERE = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T6 AS t WHERE = n.n)
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    Warm cache results

    With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data. A typical set of I/O statistics follows:


    Nothing too exciting to see there, but this is just our baseline.

    Cold cache results


    With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:


    The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:



    The are a number of factors in play here that we will look at in turn.

    Nested Loops Prefetching

    One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.

    A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.

    SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.

    This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.

    In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:

    Ordered Prefetch

    The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.

    Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.

    I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.

    Manufactured LOBs

    The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:

    Manufactured LOB

    By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.

    The Outer Join

    The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.

    Memory Allocation

    When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.

    However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.

    As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.

    The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.


    We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.

    DECLARE @d nvarchar(max) = NCHAR(10000);
                CONCAT(oa1.i0, oa1.i1, oa1.i2, oa1.i3, oa1.i4, oa1.i5, oa1.i6),
                CONCAT(oa2.i0, oa2.i1, oa2.i2, oa2.i3, oa2.i4, oa2.i5, oa2.i6),
                CONCAT(oa3.i0, oa3.i1, oa3.i2, oa3.i3, oa3.i4, oa3.i5, oa3.i6),
                CONCAT(oa4.i0, oa4.i1, oa4.i2, oa4.i3, oa4.i4, oa4.i5, oa4.i6),
                CONCAT(oa5.i0, oa5.i1, oa5.i2, oa5.i3, oa5.i4, oa5.i5, oa5.i6),
                CONCAT(oa6.i0, oa6.i1, oa6.i2, oa6.i3, oa6.i4, oa6.i5, oa6.i6)
    FROM dbo.Numbers AS n
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE = n.n) AS oa1
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE = n.n) AS oa2
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE = n.n) AS oa3
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE = n.n) AS oa4
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE = n.n) AS oa5
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE = n.n) AS oa6
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n

    The execution plan for the rewritten query looks visually similar to the problematic one:

    Rewritten query plan

    However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:


    All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:


    Warm cache results

    With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):


    Cold cache results

    When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:


    The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!)


    Final Thoughts

    Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.

    I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.

    I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)

    The real world does not always make these things possible, of course, but it is good to have goals :)

    © 2013 Paul White – All Rights Reserved
    twitter: @SQL_Kiwi

    Screenshots acquired using SnagIt by TechSmith
    Query plan details obtained using Plan Explorer PRO by SQLSentry

  • Halloween Protection – The Complete Series

  • Incorrect Results with Indexed Views

    Summary: If you use MERGE, indexed views and foreign keys, your queries can return incorrect results.

    Microsoft have released a fix for incorrect results returned when querying an indexed view. The problem applies to:

    • SQL Server 2012
    • SQL Server 2008 R2
    • SQL Server 2008

    The Knowledge Base article does not go into much detail, or provide a reproduction script, but this blog entry does :)

    The KB does say that reproducing the bug requires these features:

    • An indexed view on two tables that have a foreign key relationship
    • An update performed against the base tables
    • A query executed against the indexed view using a NOEXPAND hint

    There are two important details I would like to add right away:

    • The NOEXPAND hint is not required to reproduce the bug on Enterprise Edition
    • The update must be performed by a MERGE statement

    The fix is available in the following cumulative update packages:

    Cumulative Update 2 for SQL Server 2012 SP1 [build 11.0.3339]
    Cumulative Update 5 for SQL Server 2012 RTM [build 11.0.2395]
    Cumulative Update 4 for SQL Server 2008 R2 SP2 [build 10.50.4270]
    Cumulative Update 10 for SQL Server 2008 R2 SP1 [build 10.50.2868]
    Cumulative Update 8 for SQL Server 2008 SP3 [build 10.00.5828]

    No service pack contains this fix, you must apply one of the hotfix packages above.

    Steps to Reproduce

    The first thing we will need is two tables:

    CREATE TABLE dbo.Parent 
        parent_id integer IDENTITY NOT NULL,
        value varchar(20) NOT NULL,
        CONSTRAINT PK_Parent_id 
            PRIMARY KEY CLUSTERED (parent_id)
    CREATE TABLE dbo.Child
        child_id integer IDENTITY NOT NULL,
        parent_id integer NOT NULL,
        CONSTRAINT PK_Child_id 
            PRIMARY KEY CLUSTERED (child_id)

    And a few rows of data:

    INSERT dbo.Child (parent_id)
    SELECT New.parent_id 
        INSERT Parent 
        OUTPUT inserted.parent_id 
        ) AS New;

    The tables now look like this (parent first):

    Original Data

    We can now add the required FOREIGN KEY relationship:

    ALTER TABLE dbo.Child
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (parent_id)
    REFERENCES dbo.Parent (parent_id);

    Next, a very simple indexed view that joins the two tables (the view could contain more complex features like aggregates):

    CREATE VIEW dbo.ParentsAndChildren
    FROM dbo.Parent AS p
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;
    ON dbo.ParentsAndChildren (child_id);

    The final step is to use a MERGE statement to make some changes to the Parent table:

    DECLARE @ParentMerge AS TABLE
        parent_id integer PRIMARY KEY, 
        value varchar(20) NOT NULL
    INSERT @ParentMerge
        (parent_id, value)
        (1, 'Kiwi Fruit'),
        (4, 'Dragon Fruit');
    MERGE dbo.Parent AS p
    USING @ParentMerge AS s ON 
        s.parent_id = p.parent_id
        UPDATE SET value = s.value
        INSERT (value) VALUES (s.value)
        deleted.value AS old_value, 
        inserted.value AS new_value;

    This MERGE performs two actions:

    1. Updates the value column of parent row 1 from Apple to Kiwi Fruit
    2. Adds a new parent row 4 for Dragon Fruit

    The statement includes an OUTPUT clause to show the changes it makes (this is not required for the repro):

    MERGE changes

    This confirms that the changes have been made as we requested: parent row 1 has changed; and row 4 has been added. The changes are reflected in the base tables:

    SELECT * FROM dbo.Parent AS p;
    SELECT * FROM dbo.Child AS c;

    Updated Base Table Data

    As highlighted, row 1 has changed from Apple to Kiwi Fruit and row 4 has been added.

    We do not expect to see row 4 in the indexed view because there are no child records for that row, and the indexed view uses an inner join. Checking the indexed view using the NOEXPAND table hint (required in non-Enterprise SKUs to use indexes on a view):

    SELECT *
    FROM dbo.ParentsAndChildren 

    Incorrect indexed view data

    The results are incorrect. They show the old value of the data for parent row 1.

    Now we try using the EXPAND VIEWS query hint to force SQL Server to access the base tables rather than reading view indexes:

    SELECT *
    FROM dbo.ParentsAndChildren

    Expand Views Hint

    This query produces correct results.

    On SQL Server Enterprise Edition, the optimizer chooses whether to access the indexed view or the base tables. For following query, without any hints, the optimizer chooses not to expand the view. It reads the view index and produces incorrect results:

    -- Enterprise Edition ONLY
    SELECT * 
    FROM dbo.ParentsAndChildren;

    Indexed View Matching

    Perhaps adding a child row to match the new parent row 4 will somehow fix things up?

    INSERT dbo.Child (parent_id) VALUES (4);
    SELECT * FROM dbo.ParentsAndChildren WITH (NOEXPAND);
    SELECT * FROM dbo.ParentsAndChildren OPTION (EXPAND VIEWS);

    After Insert

    No. The query plan that accesses the view index still returns an incorrect value for row 1. It seems MERGE has corrupted our indexed view.

    Analysis using DBCC CHECKTABLE

    Checking the view with DBCC CHECKTABLE returns no errors:

    DBCC CHECKTABLE (ParentsAndChildren);

    DBCC output 1

    Unless we use the EXTENDED_LOGICAL_CHECKS option:


    DBCC output 2

    The damage is repairable:


    DBCC repair

    You probably do not want to set your database to SINGLE_USER mode and run a DBCC repair after every MERGE statement, however. We could also rebuild the indexed view’s clustered index manually, of course.


    For the MERGE statement above, the query optimizer builds a plan that does not update the indexed view (click to enlarge):

    Incorrect Update Plan

    In a version of SQL Server with the fix applied, the same MERGE statement produces a plan that does maintain the indexed view:

    Correct Update Plan

    The plan operators used to keep the view index in step with the base tables are highlighted. Without these operators, the changes to the base table are not correctly written to any indexes defined on the view. The root cause is related to the same simplification that allows the optimizer to remove the reference to the Parent table in this query:

    FROM dbo.Parent AS p 
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;

    Simplification Example Plan

    The FOREIGN KEY relationship and NOT NULL constraints on the referencing column together mean that the join to Parent cannot affect the result of the query, so the join is simplified away. In SQL Server 2012, we can see when this simplification is performed because the following message appears when undocumented trace flags 8619 and 3604 are enabled during compilation:

    Trace Flag 8619 output

    The same message is emitted when a MERGE statement contains a WHEN MATCHED THEN UPDATE clause and either a WHEN NOT MATCHED THEN INSERT or WHEN MATCHED … THEN DELETE clause. These conditions combine such that the optimizer incorrectly concludes that a table reference can be removed, when in fact it is needed later on when the update side of the plan is built.

    Other details of the query and database can affect whether the simplification can be misapplied. For example, if the FOREIGN KEY constraint contains an ON DELETE CASCADE clause, and the MERGE contains a DELETE clause, the simplification is not performed, the TF 8619 message does not appear, and the bug does not manifest.

    The key to determining whether a particular query is vulnerable to this bug (TF 8619 aside) is to check whether the query plan includes operators to maintain the indexed view. At a minimum, you should see a update operator for the view:

    View Update Operator

    SQL Sentry Plan Explorer identifies the operator as applying to a view explicitly, in SSMS you need to click on the graphical operator and inspect the Properties window.


    The updated conditions for incorrect results are:

    • An indexed view that joins tables
    • Two of the tables have a single-column FOREIGN KEY constraint
    • A MERGE statement contains an UPDATE action that affects one of the tables
    • The MERGE statement also contains an INSERT or DELETE action (or both)
    • The optimizer applies a simplification that removes a table reference based on the FK relationship and other metadata
    • As a result, the MERGE execution plan does not contain the operators necessary to correctly maintain the indexed view
    • A subsequent query plan accesses an index on the view, either explicitly or via indexed-view matching (Enterprise Edition)


    • The simplification is not applied in tempdb
    • The simplification is not applied to multi-column foreign key constraints

    Under these conditions, the indexes on the view do not reflect the state of the base tables and incorrect results are returned. Once the hot fix is applied, the optimizer does not misapply the simplification so the correct indexed view maintenance features are built into execution plans.

    Update: I am adding this based on Ian Yates’ great question in the comments: my expectation is that applying this hotfix will not remove any existing indexed view corruption. You would need to test the hotfix as usual, apply it, and then either rebuild all affected indexed views manually, or run DBCC CHECKDB with the EXTENDED_LOGICAL_CHECKS option (which could take a long time).

    © 2013 Paul White – All Rights Reserved


    Twitter: @SQL_Kiwi

  • A creative use of IGNORE_DUP_KEY

  • Optimizing T-SQL queries that change data


    Most tuning efforts for data-changing operations concentrate on the SELECT side of the query plan. Sometimes people will also look at important storage engine considerations like locking and transaction log throughput that can have dramatic effects. As a consequence, a number of common practices have emerged, such as avoiding large numbers of row locks and lock escalation, splitting large changes into smaller batches of a few thousand rows and combining a number of small changes into a single transaction in order to optimize log flushes.

    This is all good of course, but what about the data-changing side of the query plan – the INSERT, UPDATE, DELETE or MERGE operation itself – are there any query processor considerations we should take into account? The short answer is yes. The query optimizer considers different plan options for the write-side of an I/U/D/M query, though there isn’t a huge amount of T-SQL language support that allows us to affect these choices directly. Nevertheless, there are things to be aware of, and things we can look to change.

    Side note: In this post I am going to use the term update (lower case) to apply to any operation that changes the state of the database (INSERT, UPDATE, DELETE and MERGE in T-SQL). This is a common practice in the literature, and is used inside SQL Server too as we will see.

    The Three Basic Update Forms

    First a bit of background. All query plans execute using a demand-driven iterator model, and updates are no exception. Parent operators drive child operators (to the right of the parent) to do work by asking them for a row at a time. Take the following simple AdventureWorks INSERT for example:

    DECLARE @T AS TABLE (ProductID int);
    INSERT @T (ProductID) 
    SELECT p.ProductID 
    FROM Production.Product AS p;

    Query Plan Execution Order Example

    Plan execution starts at the far left, where you can think of the green T-SQL INSERT icon as representing rows returned to the client. This root node asks its immediate child operator (the Table Insert) for a row. The Table Insert requests a row from the Index Scan, which provides one. This row is inserted into the heap, and an empty row is returned to the root node (if the INSERT query contained an OUTPUT clause, the returned row would contain data). This row-by-row process continues until all source rows have been processed. Notice that the XML show plan output shows the Heap Table Insert is performed by an Update operator internally. Ok, on to the matter at hand:

    1. Wide, per-index updates

    Wide per-index update plans have a separate update operator for each clustered and nonclustered index. It is always possible for the optimizer to produce this type of plan; the options described later have not been (or cannot be) implemented when certain engine features are used. For example, if the update target has an indexed view defined on it, a wide update plan has to be used to maintain the indexed view. An example per-index update plan is shown below:

    Wide update plan

    This plan updates the base table using a Clustered Index Delete operator. This operator may also read and output extra column values necessary to find and delete rows from nonclustered indexes. The iterative delete activity on the base table is driven by the Eager Table Spool on the top branch. The spool is eager because it stores all the rows from its input in a worktable before returning the first row to its parent operator (the Index Delete on the top branch). The effect is that all qualifying base table rows are deleted before any nonclustered index changes occur.

    The spool in this plan is a common subexpression spool. It is populated once and then acts as a row source for multiple consumers. In this case, the contents of the spool are consumed first by the top-branch Index Delete, which removes index entries from one of the nonclustered indexes. When the Sequence operator switches to asking for rows from the lower branch, the spool is rewound and replays its rows for the second nonclustered index. Note that the spool contains the UNION of all columns required by nonclustered index changes.

    2. Narrow, per-row updates

    Narrow per-row updates have a single update operator which maintains the base table (heap or clustered) and one or more nonclustered indexes. Each row arriving at the update operator updates all indexes associated with the operator before processing the next row. An example:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int UNIQUE, col2 int UNIQUE);
    DECLARE @S AS TABLE (pk int PRIMARY KEY, col1 int, col2 int);
    INSERT @T (pk, col1)
    FROM @S AS s;

    The execution plan viewed using SQL Sentry Plan Explorer shows the nonclustered index maintenance explicitly (hover over the Clustered Index Insert for a tooltip showing the names of the nonclustered indexes involved):

    Plan Explorer Narrow Update Plan

    In SQL Server Management Studio, there is no information about the nonclustered index maintenance in the graphical plan or tooltips, we need to click on the update operator (Clustered Index Insert) and then check the Properties window:

    SSMS narrow index updates

    This shows the clustered index and two nonclustered indexes being maintained. Because all indexes are updated for each row streaming through the update operator, this plan shape is also known as a per-row update plan.

    The query optimizer uses cost-based reasoning to decide whether to update each nonclustered index separately (per-index) or as part of the base table changes (per-row). An example that updates one nonclustered index per-row, and another per-index is shown below:

    CREATE TABLE #T (pk int IDENTITY NOT NULL, col1 int, col2 varchar(8000) DEFAULT '');
    CREATE INDEX nc1 ON #T (col1);
    CREATE INDEX nc2 ON #T (col1) INCLUDE (col2);

    The combination strategy can be seen with Plan Explorer (click to expand in a new tab):

    Combination index update

    The details are also displayed in the SSMS Properties window when the Clustered Index Insert operator is selected as well.

    3. Single-operator updates

    The third form of update is an optimization of the per-row update plan for very simple operations. The cost-based optimizer still emits a per-row update plan, but a rewrite is subsequently applied to collapse the reading and writing operations into a single operator:

    -- Simple operations
    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    INSERT @T (pk, col1) VALUES (1, 1000);
    UPDATE @T SET col1 = 1234 WHERE pk = 1;
    DELETE @T WHERE pk = 1;

    The complete execution plans (and extracts from the XML plans) are shown below:

    Single operator update plans

    The UPDATE and DELETE plans have been collapsed to a `Simple Update` operation, while the INSERT is simplified to a `Scalar Insert`. We can see the original optimizer output tree with trace flags 3604 and 8607:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    DBCC TRACEON (3604, 8607);
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DBCC TRACEOFF (3604, 8607);

    The trace output for the INSERT statement shown above is:

    Optimizer output tree

    Notice the two physical operators: a constant scan (in-memory table) containing the literal values specified in the query; and a stream update operator that performs the database changes per row. All three statements produce a similar optimizer tree output (and all use a stream update). We can prevent the single-operator optimization being applied with undocumented trace flag 8758:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    DBCC TRACEON (8758);
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DBCC TRACEOFF (8758);

    This exposes the expanded per-row update plans produced by the optimizer before the single-operator rewrite:

    Expanded single operator plan

    Single operator plans can be significantly more efficient than the original multiple-operator form in the right circumstances, for example if the plan is executed very frequently.

    Update plan optimizations

    From this point forward, I’m going to use simple AdventureWorks DELETE examples, but the general points apply equally well to INSERT, UPDATE and MERGE as well. The examples will not have a complicated SELECT component, because I want to concentrate on the update side of the plan rather than the reading side.

    Per-row updates

    It’s worth emphasising again that per-row update plans are an optimization that is not available for every update query. The optimizer favours a per-row plan if the expected number of rows at the update operator is low. In the example below, the optimizer expects to delete 25 rows:

    Per-row updates

    This plan updates the base table (a clustered index in this case) and all nonclustered indexes in sequence per row. The row source here is an ordered scan of a nonclustered index, which means rows will not generally be presented in clustered index key order to the update operator.

    An unordered stream like this tends to result in random I/O (assuming physical I/O is required at all, of course). If the plan is expected to process only a small number of rows, the optimizer decides it is not worth adding extra operations to encourage sequential I/O.

    Unordered Prefetch

    If the expected number of rows is a bit larger, the optimizer may decide to build a plan that applies prefetching to one or more update operators. The basic idea is the same as ordinary prefetching (a.k.a read-ahead) for scans and range seeks. The engine looks ahead at the keys of the incoming stream and issues asynchronous IOs for pages that will be needed by the update operator soon. Prefetching helps reduce the impact of the random I/O mentioned a moment ago. An example of prefetching on the update operator is shown below for the same query and an expected cardinality of 50 rows:


    The prefetching is unordered from the perspective of the clustered index. Neither SSMS nor Plan Explorer show the prefetch information prominently. In Plan Explorer, we need to have the With Unordered Prefetch optional column selected. To do this, switch to the Plan Tree tab, open the Column Chooser, and drag the column to the grid. The unordered prefetch property is ticked for the Clustered Index Delete operator in the screenshot below:

    Unordered Prefetch in Plan Explorer

    In SSMS, click on the Clustered Index Delete icon in the graphical plan, and then look in the Properties window:

    SSMS Unordered Prefetch

    If the query plan were reading from the clustered index, the read operation would issue regular read-ahead, so there would be no point prefetching from the Clustered Index Delete as well. The example below shows the expected cardinality increased to 100, where the optimizer switches from scanning the nonclustered index with unordered prefetching to scanning the clustered index. No prefetching occurs on the update operator in this plan:

    No clustered index prefetch

    Where the plan property With Unordered Prefetch is not set, it is simply omitted – it does not appear set to False as you might expect.

    Ordered Prefetch

    Where rows are expected to arrive at an update operator in non-key order, the optimizer might consider adding an explicit sort. The idea is that presenting rows in key order will encourage sequential I/O for the index pages. The optimizer weighs the cost of the sort against the expected benefits of avoiding random I/O. The execution plan below shows an example of a sort being added for this reason:

    Sort to reduce random I/O

    The sort is on Transaction ID, the clustering key for this table. With the incoming stream now sorted, the update operator can use ordered prefetching. The plan is still scanning the nonclustered index on the read side, so read-ahead issued by that operator does not bring in the clustered index pages needed by the update operator. Ordered prefetching tends to result in sequential I/O, compared with the mostly random I/O of unordered prefetching. The With Ordered Prefetch column is also not shown by default by Plan Explorer, so it needs to be added as we did before:

    Plan Explorer Ordered Prefetch

    Not every update operator with ordered prefetching requires a sort. If the optimizer finds a plan that naturally presents keys in nonclustered index order, an update operator may still use ordered prefetching.

    DML Request Sort

    When the optimizer decides to explore a plan alternative that requires ordered input to an update operator, it will set the DML Request Sort property for that operator. Plan Explorer shows this setting in the update operator tooltip:

    Plan Explorer DML Request Sort

    SSMS shows With Ordered Prefetch and DML Request Sort in the Properties window:

    SSMS DMLRequestSort and WithOrderedPrefetch

    Nonclustered index updates

    Sorting and ordered prefetching may also be considered for the nonclustered indexes in a per-index update plan.

    Sort for nonclustered index

    This plan shows all three update operators with DML Request Sort set. The Clustered Index Delete does not require an explicit sort because rows are being read (with internal ordering guarantees) from the Clustered Index Scan. The two non-clustered indexes do require explicit sorts, however. Both nonclustered index update operators also use ordered prefetching; the clustered index update does not because the scan automatically invokes read-ahead for that index.

    The next example shows that each update operator is considered separately for the various optimizations:

    Indexes optimized separately

    That plan features unordered prefetching on the Clustered Index Delete (because the row source is a scan of a nonclustered index), an explicit sort with ordered prefetching on the top branch Index Delete, and unordered prefetching on the bottom branch Index Delete.

    The per-index and per-row update plan

    Is the following a per-index or per-row update plan?

    Per-index and per-row plan

    It appears to be a per-index plan because there are two separate update operators, but there is no blocking operator (eager spool or sort) between the two. This plan is a pipeline: each row returned from the seek will be processed first by the Clustered Index Delete and then by the Index Delete. In that sense, the both updates (clustered and nonclustered) are certainly per-row.

    The important thing is not the terminology, it is being able to interpret the plan. Now we know what the various optimizations are for, we can see why the optimizer chose the plan above over the seemingly equivalent but more efficient-looking narrow plan:

    Narrow plan

    The narrow plan has no prefetching. The Seek provides read-ahead for the clustered index pages, but the single non-clustered index has nothing to prefetch any pages from disk it might need. By contrast, the previous wide update plan has two update operators. The Clustered Index Delete has no prefetching for the same reason as in the narrow plan, but the Index Delete has unordered prefetching enabled.

    So, although the smaller narrow plan looks like it ought to be more efficient, it might perform less well if nonclustered index pages are required from disk and unordered prefetching proves effective. On the other hand, if all required nonclustered index pages are in memory at the time the query is executed, the wide plan might perform less well since it features an extra operator and has to perform all the work associated with prefetching (even though ultimately no physical reads are issued).

    Performance Impacts

    You may have noticed a lack of performance numbers (I/O statistics, elapsed times and so on) in this post. There is a good reason for this. The optimizations presented here are quite dependent on SQL Server configuration and hardware. I don’t want you drawing general conclusions based on the performance of the rather ordinary single SATA drive in my laptop, the 256MB of RAM I have allowed for my SQL Server 2012 instance, and some rather trivial AdventureWorks examples.

    Which Optimizations are Good?

    It depends, naturally.

    If the data structures you are changing are very likely to be in memory, and/or if you have a very fast random I/O system, the optimizer’s goal of minimizing random I/O and enhancing sequential I/O may not make much sense for you. The strategies the optimizer employs may well end up costing you much more than they save.

    On the other hand, if your system has a much smaller buffer pool than database working set size, and your I/O system works very much faster with large sequential I/O than with smaller random I/O requests, you should generally find the optimizer makes reasonable choices for you, subject to the usual caveats about useful up-to-date statistics and accurate cardinality estimation, of course.

    If your system fits the optimizer’s model, at least to a first approximation, you will usually find that narrow update plans are best for low cardinality update operations, unordered prefetching helps a bit when more rows are involved, ordered prefetching helps more, and explicit sorts before nonclustered index updates helps most of all for the largest sets. That is the theory, anyway.

    What are the Problems?

    The problem I see most often is with the optimization that is supposed to help most: the explicit sort before a nonclustered index update. The idea is that sorting the input to the index update in key order promotes sequential I/O, and often it does. The problem occurs if the workspace memory allocated to the sort at execution time proves to be too small to perform the sort entirely in memory. The memory grant is fixed based on cardinality estimation and row size estimates, and may be reduced if the server is under memory pressure at the time.

    In any case, however it occurs, a sort that runs out of memory spills whole sort runs to physical tempdb disk, often repeatedly. Unsurprisingly, this is not at all good for performance, and can result in queries that complete very much slower than if the sort had not been introduced at all. SQL Server reuses the general Sort operator for sorting index keys – it does not have a sort specially optimized for index updates which could make a best effort to sort within the memory allocated, but never spill to disk. Perhaps I should suggest this on Connect :)

    The narrow plan optimization can cause problems where it is selected due to a low cardinality estimate, where several nonclustered indexes need to be maintained, the necessary pages are not in memory, and the I/O system is slow at random reads.

    The ordered and unordered prefetch options cause problems much more rarely. For a system with all data in memory, there is a small but possibly measurable impact due to the prefetch overhead (finding pages to consider read-ahead for, checking to see if they are already in the buffer pool, and so on). Even if no asynchronous I/O is ever issued by the worker, it still spends time on the task that it could spend doing real work.

    What options are there?

    The usual answer to deal with poor execution plan choices due to a system’s configuration or performance characteristics not matching the optimizer’s model is to try different T-SQL syntax, and/or to try query hints. There are times when it is possible to rewrite the query to get a plan that performs better, and other times where rethinking the problem a bit can pay dividends. Various creative uses of partitioning, minimal logging, and bulk loading are possible in some cases, for example.

    There are very few hints that can help with the update side of a plan that does not respond to the usual tricks, however. The two I use most often affect the optimizer’s cardinality estimation - OPTION (FAST n) and a trick involving TOP and the OPTIMIZE FOR query hint. OPTION (FAST n) affects cardinality estimates in a plan by setting a final row goal, but its effects can be difficult to predict, and may not have much of an effect if the plan contains blocking operators. Anyway, the general idea is to vary the value of ‘n’ in the hint until the optimizer chooses the desired plan shape or optimization options as described in this post. Best of luck.

    The idea with TOP is similar, but tends to work better in my experience. The trick is to declare a bigint variable with the number of rows the query should return (or a very large value such as the maximum value of a bigint if all rows are required), use that as the parameter to a TOP, and then use the OPTIMIZE FOR hint to set a value for ‘n’ that the optimizer should use when considering alternative plans. This option is particularly useful for encouraging a narrow plan.

    Most of the examples in this post used the TOP trick in the following general form:

    DECLARE @n bigint = 9223372036854775807;
    DELETE TOP (@n) 
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 50));

    No magic trace flags?

    Sure there are trace flags that affect the optimizations in this post. They are undocumented and unsupported of course, and may only work on some versions, but they can be handy to validate a performance analysis, or to generate a plan guide for a particularly crucial query. They can also be fun to play with on a personal system to explore their effects. The main ones that affect the optimizations described here are:

    2332 : Force DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    8633:  Enable prefetch (CUpdUtil::FPrefetchAllowedForDML and CPhyOp_StreamUpdate::FDoNotPrefetch)

    8744 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)

    8758 : Disable rewrite to a single operator plan (CPhyOp_StreamUpdate::PqteConvert)

    8790 : Force a wide update plan (CUpdUtil::PexprBuildWideUpdatePlan)

    8795 : Disable DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    9115 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)



    These trace flags can all be manipulated using the usual DBCC commands or enabled temporarily for a particular query using the OPTION (QUERYTRACEON xxxx) hint.

    Final Thoughts

    The optimizer has a wide range of choices when building the writing side of an update plan. It may choose to update one or more indexes separately, or as part of the base table update. It may choose to explicitly sort rows as part of a per-index update strategy, and may elect to perform unordered or ordered prefetching as well. As usual, these decisions are made based on costs computed using the optimizer’s model. This model may not always produce plans that are optimal for your hardware (and as usual the optimizer’s choices are only as good as the information you give it).

    If a particular update query is performance critical for you, make sure cardinality estimates are reasonably accurate. Test with alternate syntax and/or trace flags to see if an alternative plan would perform significantly better in practice. It is usually possible to use documented techniques like TOP clauses and OPTIMIZE FOR hints to produce an execution plan that performs well. Where that is not possible, more advanced tricks and techniques like plan guides may be called for.

    Thanks for reading today, I hope this post was interesting and provided some new insights into update query optimization.

    © 2013 Paul White – All rights reserved
    twitter: @SQL_Kiwi

  • MERGE Bug with Filtered Indexes

    A MERGE statement can fail, and incorrectly report a unique key violation when:

    • The target table uses a unique filtered index; and
    • No key column of the filtered index is updated; and
    • A column from the filtering condition is updated; and
    • Transient key violations are possible

    Example Tables

    Say we have two tables, one that is the target of a MERGE statement, and another that contains updates to be applied to the target.  The target table contains three columns, an integer primary key, a single character alternate key, and a status code column.  A filtered unique index exists on the alternate key, but is only enforced where the status code is ‘a’:

    CREATE TABLE #Target 
        pk          integer NOT NULL, 
        ak          character(1) NOT NULL,
        status_code character(1) NOT NULL,
        PRIMARY KEY (pk)
    ON #Target (ak)
    INCLUDE (status_code)
    WHERE status_code = 'a';

    The changes table contains just an integer primary key (to identify the target row to change) and the new status code:

    CREATE TABLE #Changes
        pk          integer NOT NULL,
        status_code character(1) NOT NULL,
        PRIMARY KEY (pk)

    Sample Data

    The sample data for the example is:

    INSERT #Target 
        (pk, ak, status_code)
        (1, 'A', 'a'),
        (2, 'B', 'a'),
        (3, 'C', 'a'),
        (4, 'A', 'd');
    INSERT #Changes
        (pk, status_code)
        (1, 'd'),
        (4, 'a');

             Target                     Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ a           ║    ║  1 ║ d           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ d           ║

    The target table’s alternate key (ak) column is unique, for rows where status_code = ‘a’.  Applying the changes to the target will change row 1 from status ‘a’ to status ‘d’, and row 4 from status ‘d’ to status ‘a’.  The result of applying all the changes will still satisfy the filtered unique index, because the ‘A’ in row 1 will be deleted from the index and the ‘A’ in row 4 will be added.

    Merge Test One

    Let’s now execute a MERGE statement to apply the changes:

    MERGE #Target AS t
    USING #Changes AS c ON =
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    The MERGE changes the two target rows as expected.  The updated target table now contains:

    ║ pk ║ ak ║ status_code ║
    ║  1 ║ A  ║ d           ║ <—changed from ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║ <—changed from ‘d’

    Merge Test Two

    Now let’s repopulate the changes table to reverse the updates we just performed:

    TRUNCATE TABLE #Changes;
    INSERT #Changes
        (pk, status_code)
        (1, 'a'),
        (4, 'd');

    This will change row 1 back to status ‘a’ and row 4 back to status ‘d’.  As a reminder, the current state of the tables is:

             Target                        Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ d           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ a           ║

    We execute the same MERGE statement:

    MERGE #Target AS t
    USING #Changes AS c ON =
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    However this time we receive the following message:

    Msg 2601, Level 14, State 1, Line 1
    Cannot insert duplicate key row in object 'dbo.#Target' with unique index 'uq1'. The duplicate key value is (A).
    The statement has been terminated.

    Applying the changes using UPDATE

    Let’s now rewrite the MERGE to use UPDATE instead:

    UPDATE t
    SET status_code = c.status_code
    FROM #Target AS t
    JOIN #Changes AS c ON =
        c.status_code <> t.status_code;

    This query succeeds where the MERGE failed.  The two rows are updated as expected:

    ║ pk ║ ak ║ status_code ║
    ║  1 ║ A  ║ a           ║ <—changed back to ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ d           ║ <—changed back to ‘d’

    What went wrong with the MERGE?

    In this test, the MERGE query execution happens to apply the changes in the order of the ‘pk’ column.

    In test one, this was not a problem: row 1 is removed from the unique filtered index by changing status_code from ‘a’ to ‘d’ before row 4 is added.  At no point does the table contain two rows where ak = ‘A’ and status_code = ‘a’.

    In test two, however, the first change was to change row 1 from status ‘d’ to status ‘a’.  This change means there would be two rows in the filtered unique index where ak = ‘A’ (both row 1 and row 4 meet the index filtering criteria ‘status_code = a’).

    The storage engine does not allow the query processor to violate a unique key (unless IGNORE_DUP_KEY is ON, but that is a different story, and doesn’t apply to MERGE in any case).  This strict rule applies regardless of the fact that if all changes were applied, there would be no unique key violation (row 4 would eventually be changed from ‘a’ to ‘d’, removing it from the filtered unique index, and resolving the key violation).

    Why it went wrong

    The query optimizer usually detects when this sort of temporary uniqueness violation could occur, and builds a plan that avoids the issue.  I wrote about this a couple of years ago in my post Beware Sneaky Reads with Unique Indexes (you can read more about the details on pages 495-497 of Microsoft SQL Server 2008 Internals or in Craig Freedman’s blog post on maintaining unique indexes).  To summarize though, the optimizer introduces Split, Filter, Sort, and Collapse operators into the query plan to:

    1. Split each row update into delete followed by an inserts
    2. Filter out rows that would not change the index (due to the filter on the index, or a non-updating update)
    3. Sort the resulting stream by index key, with deletes before inserts
    4. Collapse delete/insert pairs on the same index key back into an update

    The effect of all this is that only net changes are applied to an index (as one or more insert, update, and/or delete operations).  In this case, the net effect is a single update of the filtered unique index: changing the row for ak = ‘A’ from pk = 4 to pk = 1.  In case that is less than 100% clear, let’s look at the operation in test two again:

             Target                     Changes                   Result
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ d           ║    ║  1 ║ A  ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║                            ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝                            ╚════╩════╩═════════════╝

    From the filtered index’s point of view (filtered for status_code = ‘a’ and shown in nonclustered index key order) the overall effect of the query is:

      Before           After
    ╔════╦════╗    ╔════╦════╗
    ║ pk ║ ak ║    ║ pk ║ ak ║
    ╠════╬════╣    ╠════╬════╣
    ║  4 ║ A  ║    ║  1 ║ A  ║
    ║  2 ║ B  ║    ║  2 ║ B  ║
    ║  3 ║ C  ║    ║  3 ║ C  ║
    ╚════╩════╝    ╚════╩════╝

    The single net change there is a change of pk from 4 to 1 for the nonclustered index entry ak = ‘A’.  This is the magic performed by the split, sort, and collapse.  Notice in particular how the original changes to the index key (on the ‘ak’ column) have been transformed into an update of a non-key column (pk is included in the nonclustered index).  By not updating any nonclustered index keys, we are guaranteed to avoid transient key violations.

    The Execution Plans

    The estimated MERGE execution plan that produces the incorrect key-violation error looks like this (click to enlarge in a new window):

    MERGE plan

    The successful UPDATE execution plan is (click to enlarge in a new window):

    UPDATE execution plan

    The MERGE execution plan is a narrow (per-row) update.  The single Clustered Index Merge operator maintains both the clustered index and the filtered nonclustered index.  The UPDATE plan is a wide (per-index) update.  The clustered index is maintained first, then the Split, Filter, Sort, Collapse sequence is applied before the nonclustered index is separately maintained.

    There is always a wide update plan for any query that modifies the database. The narrow form is a performance optimization where the number of rows is expected to be relatively small, and is not available for all operations.  One of the operations that should disallow a narrow plan is maintaining a unique index where intermediate key violations could occur.


    The MERGE can be made to work (producing a wide update plan with split, sort, and collapse) by:

    • Adding all columns referenced in the filtered index’s WHERE clause to the index key (INCLUDE is not sufficient); or
    • Executing the query with trace flag 8790 set e.g. OPTION (QUERYTRACEON 8790).

    Undocumented trace flag 8790 forces a wide update plan for any data-changing query (remember that a wide update plan is always possible).  Either change will produce a successfully-executing wide update plan for the MERGE that failed previously.


    The optimizer fails to spot the possibility of transient unique key violations with MERGE under the conditions listed at the start of this post.  It incorrectly chooses a narrow plan for the MERGE, which cannot provide the protection of a split/sort/collapse sequence for the nonclustered index maintenance.

    The MERGE plan may fail at execution time depending on the order in which rows are processed, and the distribution of data in the database.  Worse, a previously solid MERGE query may suddenly start to fail unpredictably if a filtered unique index is added to the merge target table at any point.

    Connect bug filed here

    Bug reproduced on the following SQL Server versions (all x64 Developer Edition):

    2012 SP1 CUI (build 11.0.3321 – November 2012)
    2008 R2 SP2 CU3 (build 10.50.4266 – October 2012)
    2008 SP3 CU8 (build 10.0.5828 – November 2012)

    © 2012 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi

  • Cardinality Estimation Bug with Lookups

  • Why Doesn’t Partition Elimination Work?

  • Compute Scalars, Expressions and Execution Plan Performance

  • Deletes that Split Pages and Forwarded Ghosts

More Posts Next page »
Privacy Statement