
Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

  • Why Doesn’t Partition Elimination Work?

    Given a partitioned table and a simple SELECT query that compares the partitioning column to a single literal value, why does SQL Server read all the partitions when it seems obvious that only one partition needs to be examined?

    Sample Data

    The following script creates a table, partitioned on the char(3) column ‘Div’, and populates it with 100,000 rows of data:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL,
    )
    ON PS(Div);
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;

    Sample data:

    Partition Elimination Sample Data

    The ‘Div’ column is pseudo-random so your results will be slightly different, but this is the distribution across partitions I saw:

    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;

    Sample Data Distribution

    (The partition function defines a total of 36 partitions but only 16 are used by the sample data.)

    The Test Query

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';

    Since ‘Div’ is the partitioning column, the expectation is that only one partition will be scanned to count the rows that match the WHERE clause predicate.  However, the execution plan shows that query execution scans all 36 partitions (a full table scan):

    Execution Plan 1

    Why did SQL Server not just look at the one partition it knows the value ‘ABF’ must lie in?

    Here’s Why:

    The answer lies in the predicate:

    Predicate

    The value ‘ABF’ is nowhere to be seen; it has been replaced by [@1].  This means the query has been considered for simple parameterization by SQL Server to promote query plan reuse.  The presence of the [@1] does not mean that simple parameterization was successful (considered safe for all possible values by the optimizer).  We have to check the plan cache to see if an associated prepared plan exists and is reused for different ‘parameter’ values.  It turns out that this time the simple parameterization attempt is considered safe, so a prepared plan is created and is reused for different values.  The full parameterized text is:

    (@1 varchar(8000))SELECT COUNT_BIG(*) [RowCnt] FROM [dbo].[Test] [t] WHERE [t].[Div]=@1
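
    One way to check is to query the plan cache directly. The following is my own sketch (not part of the original demonstration); it looks for a prepared plan whose text contains the parameterized predicate shown above:

    SELECT
        decp.objtype,
        decp.usecounts,
        dest.[text]
    FROM sys.dm_exec_cached_plans AS decp
    CROSS APPLY sys.dm_exec_sql_text(decp.plan_handle) AS dest
    WHERE
        decp.objtype = N'Prepared'
        AND dest.[text] LIKE N'%[[]Div]=@1%';

    Running the test query with a few different literal values and then re-running this check should show the usecounts column increasing for a single cached entry.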

    Static Partition Elimination

    The reason SQL Server cannot apply static partition elimination (determining the partition number at compile time) is that the plan now contains a parameter.  Subsequent executions that reuse the plan might specify a different value for the parameter, so static elimination would not be safe.  It is not possible to disable simple parameterization, but we can prevent it applying to this query in a number of ways.  One way is to incorporate an inequality predicate of the form expression <> constant:

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;

    The redundant 1 <> 2 predicate is completely removed by the optimizer, and the query still qualifies for TRIVIAL optimization as it did previously:

    Execution Plan 2

    There are a number of interesting things to notice here:

    • The query is no longer parameterized (the literal value ‘ABF’ appears)
    • Only one partition is touched (partition 11)
    • This unindexed heap table has a seek predicate
    • The ‘ordered’ property has changed from False to True

    The problem with this plan is that it cannot be reused for different predicate values – so you could end up with a separate cached plan for each possible literal value.

    Adding OPTION (RECOMPILE) also defeats simple parameterization and therefore achieves static elimination.  In an ad-hoc SQL batch, OPTION (RECOMPILE) also means the query plan will not be cached (stored procedures are not so lucky).  The downside is a fresh trip through the optimizer on each call.  This query does qualify for trivial plan, so optimization time is minimal, but more complex queries with the same desire for partition elimination might well require full optimization.
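
    For reference, the OPTION (RECOMPILE) form (it also appears in the complete script at the end of this post) is:

    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t
    WHERE
        t.Div = 'ABF'
    OPTION (RECOMPILE);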

    Dynamic Partition Elimination

    Ideally, we would like a query plan that is reusable but still only touches one partition for any given string literal value.  What we need is dynamic partition elimination.  In case the concept is new to you, the optimizer can produce plans that determine the correct partition at execution time (rather than at compile-time, as for static elimination).  To achieve this, the optimizer builds a plan that uses an internal function called RangePartitionNew().  One might think the original simple-parameterized plan ought to have included this feature; the reason it didn’t is quite interesting.

    When SQL Server parameterizes a query using simple or forced parameterization, it has to decide on a data type for the parameter.  For string literals, it always chooses varchar(8000) for non-Unicode strings and nvarchar(4000) for Unicode strings.  If it chose a specific type like varchar(3), we could end up with the same plan cache bloating and non-reuse problem as for the non-parameterized case – and there are potentially 8000 different lengths for varchar.  Also, prior to SQL Server 2005 SP2, the parameter-declaration part of the query text was not included in the hash value used to decide in which cache bucket to store the plan.  This could result in hundreds of same-text plans (with different parameters) occupying the same hash bucket (this does not work well).

    Anyway, faced with the parameterized predicate Div = [@1], where Div is char(3) and [@1] is varchar(8000), dynamic partition elimination does not occur because the value of [@1] might be truncated.  One way to address this is to explicitly convert the to-be-parameterized string literal to the same type as the target column (char(3)):

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'CDC');

    Now we have a parameterized query plan that uses dynamic partition elimination:

    Dynamic Elimination 1

    This plan has all the interesting features of the static elimination plan, but the heap seek no longer selects a hard-coded partition ID.  The single partition searched is the one returned at runtime by the call to RangePartitionNew().  The explicit convert prevents data truncation when the parameterized value is compared with the partition boundary values (defined as char(3) in the partition function).  The extra CONVERT_IMPLICIT to char(3) is required for internal reasons, and only appears with non-Unicode partition function types.

    If we explicitly convert to char(4) instead of char(3), the possibility of truncation arises again and we are back to scanning all 36 partitions:

    No Partition Elimination

    Using varchar(3) instead of char(3) also results in dynamic elimination, which surprises some people:
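
    The query (as it appears in the complete script at the end of this post) is:

    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t
    WHERE
        t.Div = CONVERT(varchar(3), 'ABF');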

    Dynamic Elimination 2

    Implicit Conversions and Precedence

    An alternative explanation I have seen for the original behaviour is that the string literal ‘ABF’ is implicitly typed as a varchar(3) by SQL Server and data type precedence means the char(3) ‘Div’ column has to be implicitly converted to varchar(3) to perform the comparison.  This is incorrect, but it is interesting to look at why that is so:

    Literal Types

    We can see the implicit type of the string literal ‘ABF’ using a query:

    SELECT
        BaseType = SQL_VARIANT_PROPERTY(V.v, 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(V.v, 'MaxLength')
    FROM (VALUES(CONVERT(sql_variant, 'ABF'))) AS V (v);

    Literal Data Type

    The implied type of ‘ABF’ does seem to be varchar(3), so that checks out.

    Type Precedence

    The data type precedence table in Books Online shows that varchar does indeed have a higher precedence than char:

    Data Type Precedence

    Also, from the Books Online topic Data Types (Transact-SQL):

    Precedence

    The example below uses a UNION ALL over char(5) and varchar(3) columns:

    DECLARE @T1 AS TABLE (col1 char(5) NOT NULL);
    DECLARE @T2 AS TABLE (col1 varchar(3) NOT NULL);
     
    INSERT @T1 VALUES ('12345');
    INSERT @T2 VALUES ('123');
     
    SELECT
        BaseType = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'MaxLength'),
        t.col1
    FROM
    (
        SELECT col1 FROM  @T1 
        UNION ALL
        SELECT col1 FROM @T2
    ) AS t;

    Result Data Type

    The result column ‘col1’ has to have a well-defined type.  As expected, the result is varchar(5) since varchar has a higher precedence than char and 5 is the maximum length of the result.  All good so far.

    Strings Are Special

    We get used to seeing (and preferably avoiding) implicit conversions in query plans where two mismatched types are compared:

    DECLARE @T1 AS TABLE (col1 integer NULL);
    DECLARE @v tinyint;
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    Mismatched Types Comparison

    The plan shows the tinyint variable being implicitly converted to the higher-precedence integer data type, as expected.  Now let’s do the same thing with varchar and char (perhaps expecting char to be converted to the higher-precedence varchar):

    DECLARE @T1 AS TABLE (col1 varchar(5) NULL);
    DECLARE @v char(3);
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    No Implicit Conversion

    Note the lack of an implicit conversion in the predicate.

    Important

    SQL Server can compare char with varchar without performing a conversion.  They are fundamentally the same type, it’s just that one is fixed-length and the other is variable-length.  The SQL standard requires padding for the character strings used in comparisons so that their lengths match before comparing, thereby removing the only distinction between char and varchar.  The same optimization can apply to a comparison between nchar and nvarchar (though not to Unicode versus non-Unicode comparisons, since they are genuinely different types of course).

    The present case involves a predicate “WHERE Div = ‘ABF’”.  This is a comparison that can use this optimization, so there is no implicit conversion.  This is the reason the explicit conversion to varchar(3) works in the main example – the precedence rules do not need to be invoked, and the data column is not converted from char(3) to varchar(3).
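
    A trivial way to see the padding rule in action (my own example, not from the original demo): the fixed-length value carries trailing spaces, yet the comparison evaluates as true because the shorter value is padded before comparing:

    SELECT Result =
        CASE
            WHEN CONVERT(char(5), 'ABF') = CONVERT(varchar(3), 'ABF')
            THEN 'Equal'
            ELSE 'Not equal'
        END;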

    Why Use a Partitioned Heap?

    In case you were wondering, the only reason I chose a partitioned heap for this post is just because I could (it’s more interesting in many ways).  All the demonstrations work just as well for a partitioned clustered table.  A complete script using a clustered partitioned table is below:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        ID      integer IDENTITY NOT NULL,
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL,
     
        CONSTRAINT PK__dbo_Test_Div_ID
        PRIMARY KEY CLUSTERED (Div, ID)
        ON PS(Div)
    );
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;
    GO
    -- Show the distribution of data
    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;
    GO
    -- Seek on all 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';
    GO
    -- Static elimination = 1 hard-coded partition ID
    -- Cached plan unlikely to be reused
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;
    GO
    -- Static elimination
    -- Compilation overhead
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
    OPTION (RECOMPILE);
    GO
    -- Dynamic elimination with convert to char(3)
    -- Reusable plan
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'ABF');
    GO
    -- Dynamic elimination with convert to varchar(3)
    -- Data type precedence not applied:
    -- varchar(3) convert does not cause the char(3)
    -- column Div to be converted (would prevent a seek)
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(varchar(3), 'ABF');
    GO
    -- char(4) means truncation could occur when
    -- comparing with the char(3) partition function
    -- No partition elimination, seek on 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(4), 'ABF');
    GO
    -- Clean up
    DROP TABLE dbo.Test;
    DROP PARTITION SCHEME PS;
    DROP PARTITION FUNCTION PF;

    This post applies to SQL Server 2008, 2008 R2 and 2012 only (partitioned table query execution was very different in 2005).  As far as I can tell, SQL Server 2005 did not attempt simple parameterization on a partitioned table.  I’m in two minds whether the SQL Server 2008+ behaviour is a bug, an oversight, or an undesirable consequence of fixing something else…so I opened a Connect item for it.  Update: fixing this is under consideration for the next release.

    Sean Broadley raises an important practical point in the comments below – the issue highlighted in this post only occurs when a literal value is used.  By correctly parameterizing your queries (including application code and dynamic SQL) you will avoid simple parameterization, achieve dynamic partition elimination, and cache a more reusable query plan.
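
    As a minimal sketch of that advice (my own example), an explicitly parameterized call with a parameter typed to match the column should get dynamic partition elimination and a reusable prepared plan:

    EXECUTE sys.sp_executesql
        N'SELECT RowCnt = COUNT_BIG(*) FROM dbo.Test AS t WHERE t.Div = @Div;',
        N'@Div char(3)',
        @Div = 'ABF';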

    Acknowledgement

    Thanks to Jonathan Kehayias (blog | twitter) who first alerted me to the MSDN forum question that prompted this post.

    Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    © 2012 Paul White – All rights reserved


  • Compute Scalars, Expressions and Execution Plan Performance

    Compute Scalar Operator

    The humble Compute Scalar is one of the least well-understood of the execution plan operators, and usually the last place people look for query performance problems.  It often appears in execution plans with a very low (or even zero) cost, which goes some way to explaining why people ignore it.

    Some readers will already know that a Compute Scalar can contain a call to a user-defined function, and that any T-SQL function with a BEGIN…END block in its definition can have truly disastrous consequences for performance (see this post by Rob Farley for details).  This post is not about those sorts of concerns.

    Compute Scalars

    The Books Online description of this operator is not very detailed:

    Books Online Description

    This leads people to think that Compute Scalar behaves like the majority of other operators: as rows flow through it,  the results of whatever computations the Compute Scalar contains are added to the stream.  This is not generally true.  Despite the name, Compute Scalar does not always compute anything, and does not always contain a single scalar value (it can be a vector, an alias, or even a Boolean predicate, for example).  More often than not, a Compute Scalar simply defines an expression; the actual computation is deferred until something later in the execution plan needs the result.


    Compute Scalars are not the only operators that can define expressions. You can find expression definitions with labels like [Expr1008] in many query plan operators, including Constant Scans, Stream Aggregates and even Inserts.

    Deferred expression evaluation was introduced in SQL Server 2005 as a performance optimization.  The idea is that the number of rows tends to decrease in later parts of the plan due to the effects of filtering and joining (reducing the number of expression evaluations at runtime).  To be fair to the Books Online entry, it does mention this behaviour in a note:

    Books Online Note

    The take away from this first section is that a Compute Scalar is most often just a placeholder in the execution plan.  It defines an expression and assigns a label to it (e.g. [Expr1000]) but no calculations are performed at that point.  The defined expression may be referenced by multiple operators later in the execution plan.  This makes counting the number of ‘executions’ more complex than for a regular plan operator, and goes some way to explaining why Compute Scalar operators only ever show estimated row counts (even in an ‘actual’ execution plan).
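
    As a small illustration (my own example; expression labels and plan shapes vary by version and statistics, so treat this as a sketch), the plan for the query below typically contains a Compute Scalar whose Defined Values property shows something like [Expr1002] = upper([c].[name]) – a definition only, with the UPPER call itself evaluated later, when each output row is actually produced:

    SELECT UpperName = UPPER(c.name)
    FROM sys.columns AS c
    WHERE c.column_id = 1;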

    Execution Plans and the Expression Service

    The execution plans you see in Management Studio (or Plan Explorer) are a highly simplified representation that is nevertheless useful for many kinds of plan analysis.  The XML form of showplan has more detail than the graphical representation, but even that is a shallow depiction of the code that actually runs inside SQL Server when a query is executed.  It is easy to be seduced into thinking that what we see as an ‘execution plan’ is more complete than it actually is.  As a tool for high-level analysis, it is excellent, but do not make the mistake of taking it too literally.  More detail on this later.

    One more thing to appreciate before we get into a concrete example: the SQL Server engine contains a component known as the Expression Service.  This is used by components such as the query optimizer, query execution, and the storage engine to perform computations, conversions and comparisons.  It is the call to the expression service in the execution plan that can be deferred and performed by a plan operator other than the source Compute Scalar (or other expression-defining operator).

    An Example

    I came across a forum thread recently which demonstrated the concepts I want to explore in this post really well.  The topic of discussion was delimited string-splitting (an old favourite topic, for sure) but I want to emphasise that the particular code is not at all important – it is the query plan execution I want to focus on.  The particular string-splitting method employed was one I first saw written about by Brad Shultz; it replaces the delimiters with XML tags, and then uses XML shredding to return the results.  I should say that I am not a fan of this technique because it is (a) a strange use of XML; and (b) it does not work for all possible inputs.  Anyway, I said the example was not that important in itself, so let’s get on to the first example:

    Test One – XML Split

    DECLARE
        @ItemID bigint,
        @Item varchar(8000),
        @Input varchar(8000);
     
    SET @Input = REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32);
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>', 0
                )
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The code above creates a single comma-delimited string containing the letters of the alphabet, repeated 32 times (a total string length of 1632 characters).  The string-splitting SELECT statement replaces the commas with XML tags, fixes up the start and end of the string, and then uses the nodes() XML method to perform the actual splitting.  There is also a ROW_NUMBER call to number the elements returned in whatever order the nodes() method spits them out.  Finally, two variables are used to ‘eat’ the output so returning results to the client does not affect execution time.  The execution plan is as follows (click to enlarge):

    XML String Splitter Test 1

    The Constant Scan operator defines an expression labelled as [Expr1000] which performs the string replacement and conversion to the XML type.  There are four subsequent references to this expression in the query plan:

    • Two XML Reader table-valued functions (performing the nodes() shredding and value() functions)
    • The Filter operator (a start-up predicate checks that the expression is not NULL)
    • The Stream Aggregate (as part of a CASE expression that determines what the value() XML method should finally return)

    The query takes 3753ms to execute on my laptop.  This seems a rather long time to shred a single string containing just 801 elements.

    Test Two – The Empty String

    The following code makes one small modification to the test one script:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)  -- NEW
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The only difference in the execution plan is in the definition of [Expr1000] (click to enlarge):

    XML String Splitter Test 2

    This plan returns the same correct results as the first test, but it completes in 8ms (quite an improvement over 3753ms).

    Beyond the Execution Plan – Test One

    There is nothing in the query plan to explain why the second form of the query runs 500 times faster than the first.  To see what is going on under the covers of the execution engine when executing the slow query, we can attach a debugger and look closer at the code SQL Server is really running.  Using SQL Server 2012 and setting a breakpoint on the CXMLTypeProvider::ConvertStringToXMLForES method (as the name suggests, it converts a string to XML for the Expression Service) the first debugger break occurs with this stack trace:

    ConvertStringToXMLForES

    I know that not everyone reading this post will routinely analyse stack traces, so for this first example I will describe the interesting calls from the bottom up (the currently executing code is on top, each line below is the method that called that routine):

    1. There is nothing very interesting below CQueryScan::GetRow (it is just start-up code)
    2. CQueryScan::GetRow drives the plan as we are used to seeing it.  You might think of it as approximately the code behind the green SELECT query plan icon in a graphical plan.  It reads a row at a time from the first plan operator to its right
    3. The first visible plan operator to execute is the leftmost Nested Loops join (CQScanNLJoinTrivialNew::GetRow).  (The Compute Scalar to its left in the graphical plan simply defines an alias for an expression computed by the Stream Aggregate – it does not even exist in the code executed by the execution engine).
    4. The nested loops join operator asks the Sequence Project operator for a row (CQScanSeqProjectNew::GetRow)
    5. The sequence project operator asks the Segment operator for a row (CQScanSegmentNew::GetRow)
    6. The segment operator asks the second Nested Loops Join for a row (CQScanNLJoinTrivialNew::GetRow)
    7. The nested loops join has obtained a row from the Constant Scan and is now preparing to fetch a row from the inner side of the join
    8. The start-up Filter operator is executing its Open() method (CQScanStartupFilterNew::Open)
    9. The start-up Filter calls into the general entry point for the Expression Service to evaluate expression [Expr1000] defined by the Constant Scan
    10. CEsExec::GeneralEval4 calls an ES method to convert a string to XML (ConvertFromStringTypesAndXmlToXml)
    11. The XML type provider executes the code that actually converts the string to XML (ConvertStringToXMLForES)

    The stack trace illustrates two things nicely:

    • The iterative nature of plan operators (one row at a time, Open, GetRow, and Close methods)
    • An expression defined by one operator (the Constant Scan) being deferred for evaluation until a later operator (the start-up filter) requires the result

    Continuing execution with the debugger, the next break occurs when the Table-valued Function to the right of the filter is initialised:

    Table-valued Function Stack Trace

    Another break occurs when the Table-valued Function on the lowest branch of the plan initialises:

    Table-valued Function Stack Trace 2

    And finally, a break from the Stream Aggregate, when it needs to evaluate [Expr1016] (which contains a reference to [Expr1000]):

    Stream Aggregate Stack Trace

    The last two breaks (lowest table-valued function and stream aggregate) each occur 801 times.  The reason for the slow performance of this query is now clear: [Expr1000] is being computed 1604 times.  Yes, that means the REPLACE and CONVERT to XML really does run 1604 times on the same value – no wonder performance is so poor!

    If you are inclined to test it, the expression service method that performs the REPLACE is sqlTsEs!BhReplaceBhStrStr.  It turns out that the string REPLACE is very much faster than conversion to XML (remember XML is a LOB type and also has complex validation rules) so the dominant performance effect is the conversion to XML.

    Beyond the Execution Plan – Test Two

    Running test two with the debugger attached reveals that only one call is made to CXMLTypeProvider::ConvertStringToXMLForES:

    FillExprCache

    Reading down the stack trace we see that execution of the plan (as we know it) has not started yet.  The conversion of the string to XML is invoked from InitForExecute() and SetupQueryScanAndExpression.  The conversion to XML is evaluated once and cached in the execution context (CQueryExecContext::FillExprCache).  Once the iterative part of plan execution begins, the cached result can be retrieved directly without calling ConvertStringToXMLForES() 1604 times.

    Test Three – Column Reference

    Before you go off adding LEFT(@@DBTS, 0) to all your expressions (!) look what happens when we use a table to hold the input data rather than a variable:

    DECLARE
        @ItemID bigint,
        @Item varchar(8000);
     
    DECLARE @Table AS TABLE (Input varchar(8000) NOT NULL);
     
    INSERT @Table (Input)
    VALUES (REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32));
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    Instead of the Constant Scan we saw when using a variable, we now have a table scan and a Compute Scalar that defines the expression:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,'<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>'+
    substring(CONVERT_IMPLICIT(varchar(8),@@dbts,0),(1),(0)),0))

    XML String Splitter Test 3

    The other plan details are the same as before; the expression label [Expr1003] is still referenced by the Filter, two Table-valued functions and the Stream Aggregate.  The test three query takes 3758ms to complete.  Debugger analysis shows that CXMLTypeProvider::ConvertStringToXMLForES is again being called directly by the Filter, Table-valued Functions, and Stream Aggregate.

    Test Four – CRYPT_GEN_RANDOM

    This final test replaces @@DBTS with the CRYPT_GEN_RANDOM function:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(CRYPT_GEN_RANDOM(1), 0)    -- CHANGED
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The execution plan shows an extra Compute Scalar next to the Table Scan:

    [Expr1018] = Scalar Operator('<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>')

    The Compute Scalar to the left of that is the same one seen previously, though it now references the new expression in its definition:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,[Expr1018]+
    substring(CONVERT_IMPLICIT(varchar(8000),Crypt_Gen_Random((1)),0),(1),(0)),0))

    XML String Splitter Test 4

    The remainder of the query plan is unaffected; the same operators reference [Expr1003] as before.  As you might have been expecting, this query executes in 8ms again.

    Explanation

    In test one, there is nothing to stop the query processor deferring execution of the expression, which results in the same expensive computation being performed 1604 times (once each for the filter and first table-valued function, and 801 times each for the second function and the stream aggregate).  As an interesting aside, notice that even with OPTION (RECOMPILE) specified, constant-folding only applies to the REPLACE and string concatenation; constant-folding cannot produce a LOB type (including XML).

    In test two, we introduced the non-deterministic function @@DBTS.  This is one of the functions (like RAND and GETDATE) that are implemented to always return the same value per query (despite being non-deterministic, see the links in the Further Reading section below for more details).  These runtime constants are evaluated before the query starts processing rows and the result is cached.  There are some hints (ConstExpr labels) in the execution plan that may indicate a runtime constant but I have not found this to be reliable.
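
    A simple way to see the per-query behaviour of a runtime constant (my own example, using GETDATE and the side-effecting NEWID for contrast):

    -- GETDATE() is a runtime constant: evaluated once, before row processing
    -- starts, so this returns a single row however large the table is
    SELECT DISTINCT ConstantPerQuery = GETDATE()
    FROM sys.all_columns;

    -- NEWID() is side-effecting, so it is evaluated per row
    SELECT TOP (5) PerRow = NEWID()
    FROM sys.all_columns;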

    In test three, we changed the game again by replacing the reference to a variable with a column reference from a table.  The value of this column reference could change for each table row so the engine can no longer treat the expression as a runtime constant.  As a consequence, the result is no longer cached and we see the same behaviour (and slow performance) as in test one.

    In test four, we replaced the @@DBTS function with the side-effecting function CRYPT_GEN_RANDOM (the other such function is NEWID).  One of the effects of this is to disable the deferred evaluation of the expression defined by the Compute Scalar.  The result is an XML LOB and a second optimization occurs where handles to the same LOB (evaluated by the defining operator) can be shared by other operators that reference the LOB.  This combination of effects means the expensive computation is only performed once in test four, even though the runtime constant caching mechanism is not used.  By the way, there was a bug with CRYPT_GEN_RANDOM which meant it was incorrectly assessed by the optimizer in early versions of SQL Server 2008.  If you encounter this issue when running the scripts above, apply a later service pack (or use the NEWID function for SQL Server 2005).

    Final Thoughts

    • There is a lot more going on in query execution than we can see in execution plans
    • In general, there are no guarantees about the order of execution of expressions, or how many times they will be evaluated
    • Please do not start using @@DBTS or CRYPT_GEN_RANDOM for their undocumented effects
    • It might be nice if show plan output did indicate if a computation was deferred or cached, but it is not there today
    • In many cases we can work around the potential for poor performance caused by repeated evaluation by explicitly materializing the result of the expression using a variable, table, or non-inline function (see the sketch after this list)…though not always
    • Please do not suggest improvements to the XML string splitter unless it handles all input characters and can accept a semicolon as a delimiter ;c)
    • I always use SQLCLR for string splitting
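
    As an example of the materialization idea mentioned above (a minimal sketch of mine, based on the Test One code): assigning the expensive expression to an xml variable guarantees it is evaluated exactly once, and the shredding query then references only the variable:

    DECLARE
        @ItemID bigint,
        @Item varchar(8000),
        @Input varchar(8000),
        @X xml;

    SET @Input = REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32);

    -- The REPLACE and conversion to XML happen exactly once, here
    SET @X = CONVERT(xml, '<r>' + REPLACE(@Input, ',', '</r><r>') + '</r>', 0);

    -- The splitting query references only the variable
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        @Item = x.i.value('./text()[1]', 'varchar(8000)')
    FROM @X.nodes('./r') AS x(i);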

    Acknowledgements

    I would like to particularly thank Usman Butt and Chris Morris from SQL Server Central for their contributions and thoughts which helped me write this entry.

    Further Reading

    When a Function is indeed a Constant – Andrew Kelly
    Runtime Constant Functions – Conor Cunningham
    RAND() and other runtime constant functions – Conor again
    Scalar Operator Costing – Guess who

    © 2012 Paul White
    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com


  • Deletes that Split Pages and Forwarded Ghosts

    Can DELETE operations cause pages to split?  Yes.  It sounds counter-intuitive on the face of it; deleting rows frees up space on a page, and page splitting occurs when a page needs additional space.  Nevertheless, there are circumstances when deleting rows causes them to expand before they can be deleted.  The mechanism at work here is row versioning (extract from Books Online below):

    Row versioning space usage

    Isolation Levels

    The relationship between row-versioning isolation levels (the first bullet point) and page splits is reasonably clear.  Any data that existed before either of the isolation levels was enabled will need to have the 14 bytes added by future data modifications, perhaps causing pages to split.  In this scenario, tables will likely contain a mix of records to start with, but over time (particularly as index maintenance is performed) the database will end up with row versions on most records, reducing the chances of a page split for that particular reason.

    There is nevertheless a window of opportunity where adding the 14 bytes to an existing record could cause a page split.  No doubt there’s a recommendation out there somewhere to rebuild all tables and indexes when enabling or disabling a row-versioning isolation level on a database.  This is not all that interesting though, so let’s look at the second bullet point instead:

    Triggers

    The documentation says that versioning information is added if the table has a trigger.  What it doesn’t say is:

    • The extra bytes for row versioning can be added even where both READ_COMMITTED_SNAPSHOT and SNAPSHOT isolation are OFF.
    • This only applies to AFTER triggers, not INSTEAD OF triggers
    • The AFTER trigger also needs to be enabled to generate row versions, the mere existence of a trigger is not enough.
    • There is a very important exception to all the above…

    SQL Server can still avoid adding the row-versioning information even where an enabled AFTER TRIGGER exists (the remainder of this post assumes that both row-versioning isolation levels are OFF, by the way).

    Avoiding Row Versioning with Triggers

    To explore this behaviour in a bit of detail, we’ll need a test rig:

    USE tempdb;
    GO
    IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
        DROP TABLE dbo.Test;
    GO
    -- Test table
    CREATE TABLE dbo.Test
    (
        ID          integer IDENTITY PRIMARY KEY,
        Column01    nvarchar(20) NULL,
        Column02    nvarchar(4000) NULL,
    );
    GO
    -- Add some rows
    INSERT dbo.Test WITH (TABLOCKX)
        (Column01)
    SELECT TOP (100000)
        CONVERT(nvarchar(20), N'X')
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3
    OPTION (MAXDOP 1);
    GO
    -- A trigger that does nothing
    CREATE TRIGGER trg
    ON dbo.Test
    AFTER DELETE
    AS RETURN;
    GO
    -- Write any dirty pages to disk
    CHECKPOINT;
    GO
    -- Physical storage before any changes
    SELECT
        ddips.index_type_desc,
        ddips.alloc_unit_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.index_level = 0;
    GO
    -- Buffer pool pages before any changes
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    ORDER BY
        dobd.page_id;
    GO
    SET STATISTICS IO ON;
     
    -- Delete 1 in 10 rows
    DELETE dbo.Test
    WHERE ID % 10 = 0;
     
    SET STATISTICS IO OFF;
    GO
    SELECT
        [Page Splits] = COUNT_BIG(*)
    FROM sys.fn_dblog(NULL,NULL) AS fd 
    WHERE 
        fd.[Transaction Name] = N'SplitPage';
    GO
    -- Ensure ghosted records are processed so
    --  we see accurate per-page row counts
    DBCC FORCEGHOSTCLEANUP;
    GO
    -- Physical storage after the delete operation
    SELECT
        ddips.index_type_desc,
        ddips.alloc_unit_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.index_level = 0;
    GO
    -- Buffer pool pages after the delete operation
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    ORDER BY
        dobd.page_id;
    GO
     
    --DBCC TRACEON (3604);
    --DBCC PAGE (tempdb, 1, 4720, 3);

    The script performs the following actions:

    1. Creates a clustered table with an ID and two data columns
    2. Adds 100,000 rows each with a single ‘X’ character in the first column, and NULL in the second
    3. Creates an AFTER DELETE trigger that does nothing at all except exist and be enabled
    4. Displays physical storage and buffer pool information using DMVs
    5. Deletes every tenth row in the table
    6. Shows the number of leaf-level page splits that occurred
    7. Displays the physical storage and buffer pool information again

    Test One – Clustered Table

    Typical output:

    Clustered Table Delete Test 1

    There are 235 data pages with a maximum physical record size of 17 bytes before and after the delete.  Before the delete, each data page contains 426 rows with 2 bytes of free space.  After the DELETE:

    • A total of 10,000 records have been deleted
    • The data page count remains at 235
    • The maximum record size is still 17 bytes
    • Each data page has lost 42 or 43 rows
    • The free space on each page has risen to 800 or 819 bytes
    • All data pages are marked as being modified in memory
    • A total of 237 logical reads are reported

    No surprises there.

    Test Two – Clustered Table

    Now, run the script again with the only change being that Column01 is defined as nvarchar(22) instead of nvarchar(20).  The before picture is the same as before, but the situation after the DELETE is very different:

    Clustered Table Delete Test 2

    There have been 234 page splits, increasing the data page count from 235 pages to 469 pages, and halving the number of rows on each data page.  The number of reads reported has also blown out from 237 previously to 2342 logical reads in this run (a factor of ten worse).

    Explanation

    The cause of the page splitting is that the deleted records must be versioned.  SQL Server 2005 and later uses the version store to build the inserted and deleted pseudo-tables used by AFTER triggers.  Where the data has no pre-existing versioning data, adding the 14 bytes will result in a clustered index page split if the page contains insufficient free space to accommodate this expansion.

    Temporarily turning off ghost record clean-up using global trace flag 661 and examining an affected data page using DBCC PAGE shows the following (remember to turn the trace flag off afterward if you try this):

    Row versioning ghost data record

    Slots 8 and 10 on this page hold records that were unaffected by the DELETE; the physical row length is 17 bytes as displayed previously.  The record that was in slot 9 has been deleted.  It is a ghost record with versioning information added.  The record size is 17 + 14 = 31 bytes, and this expansion with only 2 bytes of free space on the page caused it to split.

    This explains why the nvarchar(22) DELETE test caused page splitting, but why didn’t the original nvarchar(20) script behave the same?

    There is a performance optimization that can avoid adding row versioning information, but only if the table cannot generate ROW_OVERFLOW or LOB allocation units.  This means that the definition of the table must not allow for LOBs or for the possibility of variable length columns moving off row.  The actual size of the data stored is immaterial – it is the potential size that matters.

    In our test, the nvarchar(22) column definition caused the maximum possible INROW size to just exceed the 8060 byte limit.  (The exact INROW limit also depends on the table definition; marking one of the data columns SPARSE would reduce the limit to 8015-8019 bytes – whatever the right number is).
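
    As a rough check on that limit (my own calculation, assuming the standard record layout: a 4-byte record header, the fixed 4-byte integer ID, a 2-byte column count, a 1-byte NULL bitmap for three columns, a 2-byte variable-length column count, and a 2-byte offset per variable-length column):

    SELECT
        -- 17 bytes of header, fixed data and overhead, plus maximum variable-length data
        MaxInRow_nvarchar20 = 4 + 4 + 2 + 1 + 2 + (2 * 2) + (2 * 20) + (2 * 4000),  -- 8057 bytes
        MaxInRow_nvarchar22 = 4 + 4 + 2 + 1 + 2 + (2 * 2) + (2 * 22) + (2 * 4000);  -- 8061 bytes

    With nvarchar(20) the maximum possible in-row size stays just under the 8060-byte limit, so the optimization applies; with nvarchar(22) it does not, and the ghosted records must be versioned.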

    Heaps of Forwarded Ghosts

    “Page splitting only occurs in index structures, are heap structured tables affected by this issue too?”

    It is true that heap pages do not split, but when a heap row needs to expand, the engine will move the row to another page if insufficient free space exists on the current page.  When the storage engine does this, it leaves a forward pointer behind to avoid updating all non-clustered indexes to reflect the new physical row locator.

    For a heap table with an active AFTER trigger, and a LOB column (or the possibility of row-overflow) the row has to be versioned and ghosted.  If the page contains insufficient free space to accommodate the versioning, the row moves to another page leaving a forwarding stub behind.  This results in a forwarded ghost record.  Ghost clean-up will normally remove this record pretty quickly, so we will need to disable that process temporarily.  The following script creates the very special circumstances necessary to produce a forwarded ghost record this way (note this script is for test systems only and should not be run in tempdb):

    USE Sandpit;
    GO
    IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
        DROP TABLE dbo.Test;
    GO
    -- Heap test
    CREATE TABLE dbo.Test
    (
        ID          integer IDENTITY PRIMARY KEY NONCLUSTERED,
        Column01    nvarchar(16) NULL,
        Column02    nvarchar(4000) NULL
    );
    GO
    -- Add some all-NULL rows
    INSERT dbo.Test WITH (TABLOCKX)
        (Column01)
    SELECT TOP (100000)
        NULL
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3
    OPTION (MAXDOP 1);
    GO
    -- Ensure rows are tightly packed
    ALTER TABLE dbo.Test REBUILD;
    GO
    -- A trigger that does nothing
    CREATE TRIGGER trg
    ON dbo.Test
    AFTER DELETE
    AS RETURN;
    GO
    -- Write any dirty pages to disk
    CHECKPOINT;
    GO
    -- Physical storage before any changes
    SELECT
        ddips.index_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes,
        ddips.ghost_record_count,
        ddips.forwarded_record_count
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
        AND ddips.index_level = 0;
    GO
    -- Buffer pool pages before any changes
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND dobd.page_type = N'DATA_PAGE'
    ORDER BY
        dobd.page_id;
    GO
    -- Disable ghost clean-up
    DBCC TRACEON (661, -1);
    GO
    SET STATISTICS IO ON;
     
    -- Delete three records on the same page
    DELETE dbo.Test
    WHERE ID BETWEEN 1 AND 3;
     
    SET STATISTICS IO OFF;
    GO
    -- Physical storage after the delete operation
    SELECT
        ddips.index_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes,
        ddips.ghost_record_count,
        ddips.forwarded_record_count
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
        AND ddips.index_level = 0;
    GO
    -- Buffer pool pages after the delete operation
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND dobd.page_type = N'DATA_PAGE'
        AND dobd.is_modified = 1
    ORDER BY
        dobd.page_id;
    GO
    -- View the appropriate page
    --DBCC TRACEON (3604);
    --DBCC PAGE (0, 1, 339408, 3);
     
    -- Enable ghost clean-up
    DBCC TRACEOFF (661, -1);

    Example output, showing page 453648 receiving a versioned forwarded ghost record (click to enlarge in a new tab):

    Versioned Forwarded Ghost Record

    Partial DBCC PAGE output for the highlighted page:

    DBCC PAGE forward ghost record

    Summary

    If you ever wonder why your deletes are so slow, it is worth checking to see if you are suffering from page splitting due to an enabled trigger and a table definition that allows for LOB allocations or ROW_OVERFLOW.  Any table with a LOB column (including the max data types) qualifies, as does one with even a surprisingly small number of variable-length columns, as shown in the examples in this post.  This is a great reason to avoid using ‘max’ or old-style LOB data types unnecessarily, and to be careful about the maximum length of ‘normal’ variable-length data types too.  Remember, it is the potential maximum row size that is important, not the actual row size.

    On a related note, remember that deletes on a heap can only deallocate empty pages if a table lock is acquired?  A table definition that allows for LOB or ROW_OVERFLOW prevents that optimization too.  So, if your heaps are growing despite DELETE WITH (TABLOCKX), check the maximum possible row length!  You could also convert them to clustered tables, of course, but that’s quite a different debate.

    I would like to acknowledge and thank SQL Server MVP Dmitri V. Korotkevitch who first brought the basic issue to my attention with UPDATE queries.  I would strongly encourage you to also read his blog entry showing how this behaviour also affects UPDATE queries, resulting in slow performance and excessive fragmentation.

    Thanks for reading.


    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Temporary Table Caching Explained

    SQL Server 2005 onward caches temporary tables and table variables referenced in stored procedures for reuse, reducing contention on tempdb allocation structures and catalogue tables.  A number of things can prevent this caching (none of which are allowed when working with table variables):

    • Named constraints (bad idea anyway, since concurrent executions can cause a name collision)
    • DDL after creation (though what is considered DDL is interesting)
    • Creation using dynamic SQL
    • Table created in a different scope
    • Procedure executed WITH RECOMPILE

    Temporary objects are often created and destroyed at a high rate in production systems, so caching temporary objects can be an important optimization.  The temporary object cache is just another SQL Server cache (using the general caching framework) though its entries are a bit more visible than most.  The sys.dm_os_performance_counters DMV exposes a number of counters under the ‘Temporary Tables & Table Variables’ instance of the Plan Cache object.  The cache is also visible through the usual cache DMVs, for example as CACHESTORE_TEMPTABLES in sys.dm_os_memory_cache_counters.
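
    For example, a quick way to inspect those counters (my own sketch; some column names in these DMVs differ slightly between versions):

    -- Performance counters for the temporary object cache
    SELECT
        dopc.counter_name,
        dopc.instance_name,
        dopc.cntr_value
    FROM sys.dm_os_performance_counters AS dopc
    WHERE
        dopc.[object_name] LIKE N'%Plan Cache%'
        AND dopc.instance_name = N'Temporary Tables & Table Variables';

    -- Cache store entries
    SELECT
        domcc.name,
        domcc.[type],
        domcc.entries_count
    FROM sys.dm_os_memory_cache_counters AS domcc
    WHERE domcc.[type] = N'CACHESTORE_TEMPTABLES';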

    Cached Object Names

    The cached objects themselves are visible in tempdb.sys.tables, named with a single # character followed by the 8-character hexadecimal representation of the object id.  This is different from the names of ordinary temporary tables, which have the user-supplied name followed by a bunch of underscores and an id.

    The following procedure shows a total of nine cache objects created using CREATE TABLE #xyz syntax, table variables, and SELECT…INTO:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.* 
    FROM tempdb.sys.tables AS t 
    WHERE
        t.name LIKE N'#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]';

    The results show nine separate cached objects:

    Cached temporary objects

    Notice the relationship between object id and the name e.g. –1383692523 = hex AD868715.
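
    A quick way to reproduce that mapping in T-SQL (my own sketch; the binary-to-string CONVERT style 2 used here requires SQL Server 2008 or later):

    SELECT
        t.name,
        t.[object_id],
        ComputedName = N'#' + LOWER(CONVERT(nchar(8), CONVERT(binary(4), t.[object_id]), 2))
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]';

    The ComputedName column should match the cached object name on every row.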

    Caching is per object not per procedure

    If any of the temporary objects in a procedure are not cacheable for any reason, the others may still be cached. So, for example, if we modify the test above to create an index on table #T5, that particular table will not be cached, but the other temporary objects will be:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        -- Prevents caching of #T5
        CREATE INDEX nc1 ON #T5 (dummy);
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;

    There are now only eight cached objects:

    Only some objects cached

    Apparently, DROP TABLE is not DDL

    Dropping a temporary table in a procedure does not count as DDL, and neither does TRUNCATE TABLE, nor UPDATE STATISTICS.  None of these things prevent temporary table caching (so it does not matter whether you explicitly drop a temporary table at the end of a procedure or not).

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        -- Does not prevent caching
        DROP TABLE #T1, #T4, #T6;
        TRUNCATE TABLE #T2;
        TRUNCATE TABLE #T5;
        UPDATE STATISTICS #T3;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;

    There are nine cached objects again:

    DROP TABLE is not DDL

    Concurrent executions

    If a stored procedure is executed concurrently, multiple separate cached objects may be created in tempdb.  There is one cached plan for the procedure but one cached temporary object per execution context derived from that cached plan.  Recall that execution contexts are relatively lightweight instances of a cached plan, populated with execution-specific data such as temporary object ids and parameter values (image reproduced from the plan caching white paper):

    Cached Plans and Execution Contexts

    The runtime contents of a temporary object (table or variable) are obviously specific to a particular execution, so it makes sense for the cached object to be associated with execution contexts rather than the parent plan.  There may also be more than one cached plan for a procedure in cache (for example due to compilations with different SET options) and each parent plan will have its own collection of execution contexts, so there can be one cached tempdb object per execution context per plan.

    There does not appear to be a fixed limit on the number of these cached objects; I was able to quickly create 2,000 of them using the test procedures above and Adam Machanic’s SQL Query Stress tool running 200 threads.  This is the reason for the 1 second delay in the procedure – to make sure the procedure runs for a little while so new execution contexts are generated for each execution rather than reused.  The contents of tempdb after that test were as follows:

    2000 cached temporary tables

    Statistics on cached temporary objects

    Any auto-stats created are linked to the cached temporary table:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
     
        DECLARE @dummy int;
     
        -- Trigger auto-stats
        SELECT @dummy = dummy 
        FROM #T1
        WHERE dummy > 0;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        s.name,
        s.auto_created
    FROM tempdb.sys.tables AS t
    JOIN tempdb.sys.stats AS s ON
        s.[object_id] = t.[object_id];

    Cached temporary object statistics

    In the case of multiple execution contexts, you might be wondering if each of the tempdb objects can have auto-created statistics associated with them.  The answer is no: auto-stats are used to compile the parent plan, and execution contexts are derived from that same plan as needed.  The metadata looks a little odd though; the statistics are explicitly linked to the object id of the cached temporary object that caused the auto-stats to be created.  Other cached tables for the same plan have different ids, and so do not link to the sys.stats entry.

    Statistics created using an explicit CREATE STATISTICS statement are not linked to a cached temporary object, for the simple reason that CREATE STATISTICS is considered DDL, and prevents caching from occurring in the first place.
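
    A minimal sketch of that last point (the procedure name below is invented just for this illustration):

    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    CREATE PROCEDURE dbo.DemoNoCache
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
     
        -- DDL on the temporary table prevents caching
        CREATE STATISTICS s ON #T1 (dummy);
    END;
    GO
    EXECUTE dbo.DemoNoCache;
    GO
    -- Expect no internally-named objects this time
    SELECT t.name
    FROM tempdb.sys.tables AS t
    WHERE t.name LIKE N'#________';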

    Drop and Create in Detail

    The first time a procedure containing a cacheable temporary object is executed, the temporary object is created as normal, then renamed to the hexadecimal internal form described previously when the object is dropped (explicitly or implicitly at the end of the procedure).  On subsequent executions, the cached object is renamed to the normal user-visible name when ‘created’ in the procedure, and renamed back to the internal form when it is ‘dropped’.  The following script demonstrates the creation and renaming of cached temporary objects:

    USE tempdb;
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #Demo (i int);
     
        SELECT 
            t.name,
            t.object_id,
            t.type_desc,
            t.create_date
        FROM sys.tables AS t
        WHERE
            t.name LIKE N'#Demo%';
     
        DROP TABLE #Demo;
     
        SELECT
            t.name,
            t.object_id,
            t.type_desc,
            t.create_date
        FROM sys.tables AS t 
        WHERE
            t.name LIKE N'#________';
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    CHECKPOINT;
    EXECUTE dbo.Demo;
    GO
    SELECT
        fd.[Current LSN],
        fd.Operation,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        -- Decode the object names carried in the log record payload
        CONVERT(sysname, SUBSTRING(fd.[RowLog Contents 0], 3, 256)),
        CONVERT(sysname, SUBSTRING(fd.[RowLog Contents 1], 3, 256))
    FROM sys.fn_dblog(NULL, NULL) AS fd;

    The first time it is run, the first part of the output is:

    Name remapping

    Notice that the object ids are the same, and the object has its familiar external name while in scope, but is renamed after the DROP TABLE statement.  The transaction log entries displayed are (click to enlarge):

    Transaction Log for First Execution

    The highlighted section shows the table being renamed in the internal catalogue tables from the user-visible name to the internal name. On the second run, there is an extra renaming log entry in the CREATE TABLE system transaction, as the object is renamed from the internal form to the user-visible name:

    Transaction Log 2

    The demonstration shows that CREATE TABLE and DROP TABLE for a cached temporary object are replaced by renaming operations.

    Cached Object Scope

    The cached object is scoped to the query plan that references it.  If the plan is evicted from cache for any reason (perhaps by ALTER or DROP PROCEDURE or an explicit DBCC FREEPROCCACHE command) a background thread removes the tempdb object.  This is not synchronous with the command that causes the eviction; the delay seems to be 5 seconds or less on current SQL Server versions, and the clean-up is performed by a system process.  The following code shows the link between the cached temporary table and the cached plans for the stored procedure:

    USE tempdb;
    GO
    IF  OBJECT_ID(N'dbo.Demo', N'P') IS NOT NULL
        DROP PROCEDURE dbo.Demo;
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #Demo (i int);
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO
    SELECT
        decp.cacheobjtype,
        decp.objtype,
        decp.plan_handle,
        dest.[text]
    FROM sys.dm_exec_cached_plans AS decp
    CROSS APPLY sys.dm_exec_plan_attributes(decp.plan_handle) AS depa
    CROSS APPLY sys.dm_exec_sql_text(decp.plan_handle) AS dest
    WHERE
        decp.cacheobjtype = N'Compiled Plan'
        AND decp.objtype = N'Proc'
        AND depa.attribute = N'objectid'
        AND CONVERT(integer, depa.value) = OBJECT_ID(N'dbo.Demo', N'P');

    Example output:

    Link to the cached plan

    Aside from checking for suitably-named objects in tempdb, there are a number of ways to see how many cached temporary objects (tables and variables) exist, and how many are in use by executing code.  One way is to query the sys.dm_os_memory_cache_counters DMV:

    SELECT
        domcc.name,
        domcc.[type],
        domcc.entries_count,
        domcc.entries_in_use_count
    FROM sys.dm_os_memory_cache_counters AS domcc
    WHERE domcc.[type] = N'CACHESTORE_TEMPTABLES';

    Cache Counters DMV

    Another way is to check the performance counters (also accessible via DMV):

    SELECT
        dopc.[object_name], 
        dopc.counter_name, 
        dopc.cntr_value
    FROM sys.dm_os_performance_counters AS dopc
    WHERE
        dopc.[object_name] LIKE N'MSSQL%Plan Cache%'
        AND dopc.instance_name = N'Temporary Tables & Table Variables'
        AND dopc.counter_name IN (N'Cache Object Counts', N'Cache Objects in use');

    Perf Counters 1

    This second example shows 200 cached objects on a different run:

    Perf Counters 2

    On SQL Server 2008 and later, we can evict a particular plan handle from cache to show that this removes the cached temporary table:

    DBCC FREEPROCCACHE(0x05000200CC7BDE0940E16D8C000000000000000000000000);

    As mentioned, there can be a delay of up to 5 seconds before the cached object is removed after the DBCC statement completes (though the DMVs reflect the cache changes immediately).  The more comprehensive test below shows all these things combined:

    -- Cache a temporary object
    EXECUTE dbo.Demo;
    GO
    -- Show the cached table
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
     
    DECLARE @plan_handle varbinary(64);
     
    -- Find the plan handle
    SELECT
        @plan_handle = decp.plan_handle
    FROM sys.dm_exec_cached_plans AS decp
    CROSS APPLY sys.dm_exec_plan_attributes(decp.plan_handle) AS depa
    CROSS APPLY sys.dm_exec_sql_text(decp.plan_handle) AS dest
    WHERE
        decp.cacheobjtype = N'Compiled Plan'
        AND decp.objtype = N'Proc'
        AND depa.attribute = N'objectid'
        AND CONVERT(integer, depa.value) = OBJECT_ID(N'dbo.Demo', N'P');
     
    -- Truncate the log
    CHECKPOINT;
     
    -- Evict the plan
    DBCC FREEPROCCACHE(@plan_handle);
     
    -- Show log entries
    SELECT
        fd.[Current LSN],
        fd.SPID,
        fd.Operation,
        fd.Context,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        fd.[Begin Time],
        fd.[End Time]
    FROM sys.fn_dblog(NULL,NULL) AS fd;
     
    -- Cached object not dropped yet
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
     
    WAITFOR DELAY '00:00:05';
     
    -- Show cleanup
    SELECT
        fd.[Current LSN],
        fd.SPID,
        fd.Operation,
        fd.Context,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        fd.[Begin Time],
        fd.[End Time]
    FROM sys.fn_dblog(NULL,NULL) AS fd;
     
    -- Gone!
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';

    The first result set shows the cached temporary object:

    Cached Temp Object

    The transaction log entries immediately after evicting the plan from cache show no activity aside from the CHECKPOINT we issued to truncate the log:

    Empty Transaction Log

    Then we see that the cached object still exists at this point (though the DMVs now show zero cached temporary objects):

    Cached Object Still Exists

    After an up-to-five-second delay, the transaction log contains:

    Asynchronous Log Activity

    Notice the system transaction named ‘droptemp’ is performed by system SPID 14, and instead of the renaming we saw earlier, all references to the cached object are deleted from the system tables.
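
    To pick out just that clean-up activity without scanning the whole log output, one option is to filter on the transaction name (a sketch; it assumes the log has not been truncated again since the eviction):

    SELECT
        fd.[Current LSN],
        fd.SPID,
        fd.Operation,
        fd.Context,
        fd.AllocUnitName
    FROM sys.fn_dblog(NULL, NULL) AS fd
    WHERE fd.[Transaction ID] IN
    (
        -- Transactions whose begin record is named 'droptemp'
        SELECT fd2.[Transaction ID]
        FROM sys.fn_dblog(NULL, NULL) AS fd2
        WHERE fd2.[Transaction Name] = N'droptemp'
    );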

    More about Table Variables

    You might recognise the internal hexadecimal name for cached temporary objects; the same is used for table variables outside a stored procedure:

    DECLARE @T AS TABLE (dummy int NULL);
     
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO 5

    The batch runs five times and produces output like this:

    Table Variable Output

    Notice that the object id is different each time (and so, therefore, is the name).  As already mentioned, table variables in a stored procedure can be cached just like temporary tables:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        DECLARE @T AS TABLE (dummy int NULL);
     
        SELECT
            t.name,
            t.[object_id],
            t.type_desc,
            t.create_date
        FROM tempdb.sys.tables AS t
        WHERE
            t.name LIKE N'#________';
    END
    GO
    EXECUTE dbo.Demo;
    GO 5

    Now, though, we see the same object id and create date each time:

    Cached Table Variables

    Once a table variable is cached, the transaction log records for a simple procedure are quite interesting:

    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        DECLARE @T AS TABLE (dummy int NULL);
     
        INSERT @T VALUES (1);
    END
    GO
    -- Cache the table variable
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO
    CHECKPOINT;
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        fd.[Current LSN],
        fd.Operation,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID]
    FROM sys.fn_dblog(NULL, NULL) AS fd;

    Table Variable Transaction Log

    Notice the reference to the internal name #628FA481 and the clean-up activity.  The same procedure with a temporary table instead of a table variable generates a bit more work for the server:

    Temporary Table Transaction Log

    Many of the entries are similar to the table variable case, with extra steps to rename the cached object when CREATE TABLE and the implicit DROP TABLE at the end of the procedure are executed.  Clearly, some effort has been made to make table variables more efficient than temporary tables, while sharing many features at quite a low level.

    Another interesting thing, as I mentioned right at the start of this post, is that table variables disallow just about all of the actions that prevent caching of a temporary table.  Table variables do not allow named constraints or DDL that affects caching (e.g. CREATE INDEX, CREATE STATISTICS).  Table variables are also scoped more tightly than temporary tables.  While we can create a temporary table in one procedure and refer to it in another, the same thing cannot be done with table variables.  For the same scoping reasons, table variables cannot be defined using dynamic SQL and referenced outside that context.  One oddity is TRUNCATE TABLE, which is disallowed on table variables even though it does not affect caching.
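
    A quick sketch of two of those restrictions; both batches below are expected to fail with compile-time errors rather than run:

    -- Named constraints are not allowed in a table variable definition
    DECLARE @T1 AS TABLE
    (
        dummy int NULL CONSTRAINT CK_Dummy CHECK (dummy >= 0)
    );
    GO
    -- TRUNCATE TABLE does not accept a table variable
    DECLARE @T2 AS TABLE (dummy int NULL);
    TRUNCATE TABLE @T2;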

    Anyway, the restrictions mean that table variables can always be cached, and don’t allow some of the crazy things that are possible with temporary tables (particularly as regards scoping, but also the cached statistics issue I described in my last post).  They also have the potential to perform better (no hidden renaming) at least in some high-volume circumstances.  If only we were able to create statistics (with intuitive behaviour!) and indexes after creation, table variables might well make the old Sybase ‘non-shareable temporary tables’ finally redundant.  Until then, we are left having to choose one or the other as best we can.

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com


  • Temporary Tables in Stored Procedures

    Ask anyone what the primary advantage of temporary tables over table variables is, and the chances are they will say that temporary tables support statistics and table variables do not.  This is true, of course; even the indexes that enforce PRIMARY KEY and UNIQUE constraints on table variables do not have populated statistics associated with them, and it is not possible to manually create statistics or non-constraint indexes on table variables.  Intuitively, then, any query that has alternative execution plans to choose from ought to benefit from using a temporary table rather than a table variable.  This is also true, up to a point.

    The most common use of temporary tables is in stored procedures, where they can be very useful as a way of simplifying a large query into smaller parts, giving the optimizer a better chance of finding good execution plans, providing statistical information about an intermediate result set, and probably making future maintenance of the procedure easier as well.  In case it is not obvious, breaking a complex query into smaller steps using temporary tables makes life easier for the optimizer in several ways.  Smaller queries tend to have a smaller number of possible execution plans, reducing the chances that the optimizer will miss a good one.  Complex queries are also less likely to have good cardinality (row count) estimates and statistical information, since small errors tend to grow quickly as more and more operators appear in the plan.

    This is a very important point that is not widely appreciated.  The SQL Server query optimizer is only as good as the information it has to work with.  If cardinality or statistical information is badly wrong at any point in the plan, the result will most likely be a poor execution plan selection from that point forward.  It is not just a matter of creating and maintaining appropriate statistics on the base tables, either.  The optimizer does use these as a starting point, but it also derives new statistics at every plan operator, and things can quickly conspire to make these (invisible) derived statistics hopelessly wrong.  The only real sign that something is wrong (aside from poor performance, naturally) is that actual row counts vary widely from the optimizer’s estimate.  Sadly, SQL Server does not make it easy today to routinely collect and analyse differences between cardinality estimates and runtime row counts, though some small (but welcome) steps forward have been made in SQL Server 2012 with new row count information in the sys.dm_exec_query_stats view.
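
    For what it is worth, a quick sketch of that new SQL Server 2012 information (it is per cached statement rather than per plan operator, so it is only a rough tool):

    SELECT TOP (20)
        dest.[text],
        deqs.execution_count,
        deqs.last_rows,
        deqs.min_rows,
        deqs.max_rows,
        deqs.total_rows
    FROM sys.dm_exec_query_stats AS deqs
    CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
    ORDER BY deqs.total_rows DESC;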

    The benefit of using temporary tables to simplify queries where necessary is potentially better execution plans, now and in the future as data distribution changes and new execution plans are compiled.  On the cost side of the ledger we have the extra effort needed to populate the temporary table, and maintain the statistics.  In addition, we expect a higher number of recompilations for optimality reasons due to changes in statistics.  In short, we have a trade-off between potential execution plan quality and maintenance/recompilation cost.

    The problem, though, is that temporary tables do not work quite how (almost) everyone expects them to…

    A World Without Temporary Objects

    Imagine if temporary tables and table variables did not exist, and we had a complex query with cardinality-estimation difficulties that would be best addressed by breaking the query into parts.  The example presented below is necessarily simplified, and uses the AdventureWorks sample database to make it accessible.  It is not a query with insurmountable optimization problems by any means, but bear with me.

    IF  OBJECT_ID(N'dbo.Demo', N'P') IS NOT NULL
        DROP PROCEDURE dbo.Demo;
    GO
    CREATE PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        SELECT
            p.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM Production.Product AS p
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = p.ProductID
        WHERE
            p.Name LIKE @StartsWith + N'%'
        GROUP BY
            p.Name;
    END;

    This procedure returns a count of orders containing a part whose name starts with the specification passed in as a parameter.  For example:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    image

    A Short Digression about Parameter Sniffing

    The only real optimization issue in this simplified example is parameter-sniffing.  The number of rows that match the LIKE predicate varies widely depending on the value of the parameter, which can impact execution plan selection.  When the SELECT query is compiled or recompiled, SQL Server uses the actual value of the parameter at the time to estimate the number of rows qualified from the Product table.  If the SELECT statement happens to compile or recompile with a parameter that matches very few rows, we get an execution plan like this (click to enlarge the graphic):

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'[0-9]';

    Parameter Sniffing [0-9]

    This plan is cached and reused next time the procedure is called, perhaps with a parameter value that matches more rows than before, but not enough to cause a recompilation:

    EXECUTE dbo.Demo @StartsWith = N'M';

    Parameter Sniffing M

    This results in a large number of Key Lookups and a Sort that spills to tempdb physical disk (the warning triangle is new in SQL Server 2012).  The memory grant for the sort was sized for the expected 370 rows, not the 22,527 that actually turned up.  (If you are curious about the Constant Scan and Compute Scalar in these plans, see my previous post on dynamic seeks for details).
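
    As a side note, if you want to watch the undersized memory grant directly, one option (a sketch; run it from a second session while the query is executing) is:

    SELECT
        deqmg.session_id,
        deqmg.requested_memory_kb,
        deqmg.granted_memory_kb,
        deqmg.used_memory_kb
    FROM sys.dm_exec_query_memory_grants AS deqmg;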

    One way (certainly not the only way) to address this parameter-sniffing problem is to ask SQL Server to compile a fresh query plan for the SELECT statement on every execution, using the OPTION (RECOMPILE) query hint:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        SELECT
            p.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM Production.Product AS p
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = p.ProductID
        WHERE
            p.Name LIKE @StartsWith + N'%'
        GROUP BY
            p.Name
        OPTION (RECOMPILE);
    END;

    Now we get plans optimized for the particular parameter value at runtime, and (in SQL Server 2008 SP1 CU5 or later) the added bonus that the runtime value of the parameter is embedded into the query so the dynamic seek is not required:

    EXECUTE dbo.Demo @StartsWith = N'[0-9]';

    Parameter Sniffing with RECOMPILE

    EXECUTE dbo.Demo @StartsWith = N'M';

    Parameter Sniffing with RECOMPILE 2

    Back to the Main Example

    However, we are imagining that the example query is much more complex than shown, and we decide to break it into two parts.  The first part will fetch qualified items from the Product table, and the second will join the results (with statistics) to the History table.  Remember that temporary objects do not exist for this part of the discussion, so we have to use a permanent table:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        -- Real table to hold intermediate result
        CREATE TABLE dbo.Temp
        (
            ProductID   integer NOT NULL, 
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        -- First part of the 'complex' query
        INSERT INTO dbo.Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        -- Second part referencing the 'temp' table
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM dbo.Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name;
     
        -- Show the statistics for the Name column
        DBCC SHOW_STATISTICS (Temp, Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE dbo.Temp;
    END;

    This procedure is not very practical as it stands, because the CREATE TABLE and DROP TABLE statements would make it impossible for more than one user to execute it at the same time.  Nevertheless, it does show how most people expect things to work regarding statistics, temporary table lifetimes, and compilation:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'A';

    There are three matching Product records, shown below together with the SHOW_STATISTICS output:

    Permanent Table Output A

    Nothing too surprising in the execution plans: the table INSERT is a trivial plan (no choices for the optimizer to make), and the SELECT query has cardinality estimates that are very close to the actual numbers encountered.  The warning triangle on the SELECT (again, new for SQL Server 2012) is just suggesting an index for the History table, which we won’t be adding on this occasion.

    Permanent Execution Plan A

    While the procedure was executing, a Profiler trace captured the following compilation and statistics activity:

    image

    The table is created, and since this is the first time the procedure has been run, the INSERT and SELECT statements both incur a deferred compilation.  As part of the SELECT compilation, new statistics are created on the Name and ProductID columns of the new table, the DBCC SHOW_STATISTICS command executes, and finally the table is dropped.  Now we execute the procedure again but for products that start with ‘E’ rather than ‘A’ as above:

    EXECUTE dbo.Demo @StartsWith = N'E';

    Permanent Output E

    This time there are nine matching records, though the statistics-gathering algorithms compress the information into the five histogram steps shown.  The execution plan is below (trivial INSERT plan not repeated):

    image

    Again, the cardinality estimates are extremely good, and the optimizer selected a nested loops plus key-lookup plan based on costs computed for the smaller number of expected rows.  The trace captured the following:

    image

    The only change is that this time recompilation is triggered because the cached INSERT and SELECT plans reference a table that no longer exists, leading to a ‘schema changed’ recompilation.  The crucial observations here are that the table is recreated and new statistics are gathered on every execution.  This is the way most people expect tables, statistics, and compilations to behave inside stored procedures, so don’t worry if there’s nothing too surprising so far.

    Using a Temporary Table

    The previous example is not terribly practical.  It is possible to conjure up schemes where ordinary tables can serve the purpose effectively, but as it stands, the procedure would likely cause runtime errors if more than one user or process attempted to execute it concurrently.  In addition, ordinary tables do not benefit from engine optimizations that only apply to genuine temporary tables, so they tend to be less efficient, even if created in tempdb.  We now return to the real world (where temporary tables do exist) and modify the procedure to use a temporary table instead of a permanent table:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name;
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Executing this procedure with an ‘E’ parameter produces exactly the same output as when the permanent table was used.  Same execution plan, same statistics, same Profiler output, everything:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    Temporary Table Execution Plan E

    Now we will execute the procedure with a parameter value of ‘T’, which has 65 matching records in the Product table.  This is the output (only the first nine of the 65 matching names are shown in the first result set):

    image

    The results are as expected (products that begin with ‘T’) but the SHOW_STATISTICS output displays ‘E’ statistics from the previous execution!

    Temporary Table Execution Plan T

    The execution plan has serious cardinality estimation errors.  It has the same nested loops and key-lookup shape and estimated row counts as the ‘E’ execution, instead of being optimized for the ‘T’ data.  This causes slow performance and a sort that spills data to physical tempdb disk.

    image

    The Profiler output explains why: no recompilations occurred, and no statistics were created.  In fact, with the AdventureWorks database, there is no way to get a different execution plan or updated statistics for this procedure without changing it (more on that later).  Even executing with [A-Z] as the parameter reuses the plan optimized for the ‘E’ value:

    EXECUTE dbo.Demo @StartsWith = N'[A-Z]';

    Temporary Table Execution Plan A-Z

    The key lookup is now executed 113,443 times, and the sort has to perform a multi-pass spill to tempdb (previously only a single pass spill was required).  Executing the procedure with the WITH RECOMPILE option will produce better plans, at the cost of recompiling the whole procedure, and the resulting plans will not be cached for reuse.  In general, recompiling the whole procedure every time could be bad since many procedures have many more statements to recompile than just a single INSERT and SELECT.  Although we have not yet identified a cause for the behaviour we are seeing, perhaps the answer lies in adding a RECOMPILE query hint to the SELECT statement as we did earlier for the parameter-sniffing example?
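
    For reference, the procedure-level option mentioned above is invoked like this (every statement in the procedure is recompiled, and nothing is cached for reuse):

    EXECUTE dbo.Demo @StartsWith = N'T' WITH RECOMPILE;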

    OPTION (RECOMPILE) to the Rescue?

    Thinking that the cause might be a new form of the parameter-sniffing problem, we now modify the procedure to add the RECOMPILE query hint to just the SELECT statement.  This avoids recompiling the whole procedure on each call as EXECUTE … WITH RECOMPILE would do, and will hopefully solve our optimization problems:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name
        OPTION (RECOMPILE);
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Unsurprisingly, clearing the plan cache and executing with a parameter value of ‘E’ produces the exact same nested loops and key lookup plan as previously shown (and the same statistics and Profiler events too).

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    This is fine because it is the optimal plan for this value.  The test for OPTION (RECOMPILE) comes when we try a second execution with the 65-row ‘T’ value:

    EXECUTE dbo.Demo @StartsWith = N'T';

    Temporary Table Execution Plan RECOMPILE

    The execution plan has a different shape now, with the optimizer correctly preferring a hash join over the nested loops plus key lookup.  The estimated cardinality is correct for the temporary table (65 rows versus 9 previously seen for ‘E’), but it is badly wrong for the hash join (811 rows expected, 12,273 actual) and the distinct sort, which has to perform a single-pass spill to tempdb disk.  The query output shows the reason for the inaccuracy:

    Temporary Table RECOMPILE T

    The results are correct of course (65 products starting with ‘T’) but the statistics are again still reflecting the parameter value ‘E’.  The RECOMPILE query hint allows the optimizer to see the true cardinality of the temporary table (65 rows) but the 100% wrong statistics on the ProductID column used to estimate the cardinality of the join to the History table result in a mess of a plan from that point forward.  (Note that statistics on the Name column are also considered interesting by the optimizer for this query due to the GROUP BY clause.  The SHOW_STATISTICS output above shows the Name statistics because it more immediately shows that ‘E’ values are stored; the ProductID statistics are integers, and much harder for the reader to correlate with the parameter).

    image

    The Profiler output above confirms the recompilation due to the query hint, and that no statistics were created or updated.  An interesting thing occurs if we execute the procedure four more times with the same ‘T’ parameter (a total of five times with ‘T’):

    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';

    The first three of those executions are exactly as before, but the fourth one suddenly produces this different execution plan:

    Temporary Table RECOMPILE T5

    The estimates are now well within the boundaries of expected accuracy, and the sort now receives enough workspace memory to avoid spilling rows to tempdb.

    image

    The statistics have been updated, and now show 65 rows that start with ‘T’ sampled (compressed into 49 histogram steps).  This is confirmed by the trace output:

    image

    Notice the StatMan and AutoStats entries showing the ProductID and Name statistics being updated.  Unfortunately, executing the procedure again with a new parameter value takes us back to inaccurate cardinality estimates again.  The execution plans for this procedure aren’t always a disaster, but this is pure luck – poor estimates are poor estimates, whether the query plan happens to perform acceptably or not.  We really need to get to the root cause here.  Perhaps OPTION (RECOMPILE) was not the way to go, and we should try a manual statistics update instead of relying on automatic statistics updates…?

    Manual Statistics Update to the Rescue?

    We know we want to add an UPDATE STATISTICS command to our stored procedure instead of the OPTION (RECOMPILE) … but where to put the new statement exactly?  The statistics we want to update are created during compilation or recompilation of the SELECT statement, suggesting the UPDATE STATISTICS should come after the SELECT; but then how could it possibly affect the preceding SELECT plan?  It seems the only sensible place to put it is before the SELECT, however counter-intuitive it might seem.

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        UPDATE STATISTICS #Temp;
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name;
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Following the same pattern as before, we clear the plan cache (though this isn’t strictly necessary since the ALTER PROCEDURE evicts plans from the cache anyway), then execute with ‘E’ and ‘T’ parameters as before.

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';
    EXECUTE dbo.Demo @StartsWith = N'T';

    The results for the ‘E’ execution are the same as always, so here are just the results for the ‘T’ execution:

    UPDATE STATISTICS T

    The above output looks very promising: correct results and the statistics have been updated to reflect the ‘T’ parameter.

    image

    The trace output shows the statistics updates too…but no recompilation.

    UPDATE STATISTICS Execution Plan

    The execution plan is a disaster.  The estimated number of rows in the temporary table is 9 (the number of rows the ‘E’ execution produced not the 65 actually present), and everything is wrong from that point forward, including the sort spilling to tempdb again.  The UPDATE STATISTICS command did its job, but the cached query plan did not recompile – though we might expect that it should, right?

    As with the first temporary table example (no RECOMPILE or UPDATE STATISTICS) this plan is stuck in cache, given the AdventureWorks database.  No parameter value will cause a recompilation, and every subsequent execution will use the ‘E’ plan.  This is very reminiscent of the parameter-sniffing problem we encountered earlier, though the SELECT query in question has no direct parameter value to be sniffed!  In a final effort to produce reliable plans, our last attempt will be to combine both RECOMPILE and UPDATE STATISTICS…

    RECOMPILE and UPDATE STATISTICS to the Rescue?

    Our stored procedure now includes an UPDATE STATISTICS before the SELECT, and an OPTION (RECOMPILE) query hint on the SELECT itself:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        UPDATE STATISTICS #Temp;
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name
        OPTION (RECOMPILE);
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Our usual test executions:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';
    EXECUTE dbo.Demo @StartsWith = N'T';

    The first ‘E’ execution performs as it always has: correct statistics and an optimal query plan.  The results for ‘T’ are:

    STATS and RECOMPILE T

    The output is correct, and so are the statistics: 65 rows sampled, stored in 49 histogram steps.

    image

    Success!  All estimates are very good, the sort is performed in memory, and the query executes quickly.  Further experimentation with other parameter values shows that new statistics are always generated, and the SELECT query plan always recompiles to use those statistics.  If you are concerned by the small discrepancy after the hash join (estimate 10,449 rows; actual 12,273) don’t be.  Cardinality estimation is a tough problem – try using the statistics histograms yourself to predict how the ProductID values in the temporary table (65 rows in 49 steps, remember) will join to the histogram on the same column for the much larger History table.  You’ll find the data there even more compressed, so the task of matching values to values using RANGE_ROWS, EQ_ROWS and so on is really not an easy one at all.
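
    If you do want to try that exercise, the History side of the estimate comes from something like the following (this assumes the standard AdventureWorks index IX_TransactionHistory_ProductID and its statistics are present):

    DBCC SHOW_STATISTICS (N'Production.TransactionHistory', IX_TransactionHistory_ProductID)
    WITH STAT_HEADER, HISTOGRAM;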

    Enter the Table Variable…

    The final option we are going to look at is using a table variable instead of a temporary table:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        DECLARE @Temp AS TABLE
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) NOT NULL
        );
     
        INSERT INTO @Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM @Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name
        OPTION (RECOMPILE);
    END;

    There are a few small changes to the procedure here.  The COLLATE clause is not necessary in the definition of the Name column, since the default for table variables is the collation of the host database anyway (the default for temporary tables is the collation of tempdb).  There is no need (or option) to drop the table at the end of the procedure; table variables simply go out of scope.  There is no SHOW_STATISTICS statement either, since table variables do not support statistics (even on constraint-supporting indexes).  The first execution with a parameter value of ‘E’ produces this execution plan and trace output:

    image

    image

    The OPTION (RECOMPILE) on the SELECT statement ensures that the cardinality estimate for the table variable is exactly right, but the lack of statistics means the optimizer resorts to guesses after that point.  The guessed estimates turn out to be higher than the actual number of rows, so no performance problems result, although the sort is allocated more memory than it needs.  The trace output is a bit smaller than before, since no statistics are created.

    The second test is with a parameter value of ‘T’:

    image

    Again, the number of rows in the table variable is exactly right due to the RECOMPILE, and the later guesswork also turns out to work quite well.  As you can see, the overall shape of the plan is quite different for this parameter value (hash joins versus nested loops for example).  There is one cardinality estimation problem at the middle hash operator, but whether it causes performance problems or not depends on your version of SQL Server.  To make the issue clearer, we will perform one final test, a parameter value of ‘[A-Z]’, qualifying all 504 products:

    image

    Once more, the table variable cardinality is exactly right (504 rows).  Both inputs to the hash join look exactly right, and the guessed output from that join is also very close to the actual.  The lack of statistical information makes itself felt at the middle hash operator though; the optimizer has no idea how many distinct values there might be of the combination (Name, OrderID), so it guesses at 2,825.  This is very much smaller than the 113,338 actual, and the hash table spills out to tempdb physical disk as a result.

    The table variable approach happens to produce reasonable plans in some cases, based on the sniffed table cardinality provided by OPTION (RECOMPILE) but the lack of statistics leads to wild guesses and potential performance issues deeper in the tree.  It’s interesting that one of the main perceived benefits of table variables – fewer recompilations – has to be abandoned in order to produce semi-reasonable plans.  This follows hard on the heels of our seeing the main benefit of temporary tables – statistics – being 100% wrong earlier!

    The last thing I want to mention about table variables is that for all their limitations, they do something else temporary tables cannot: a join from a table variable is estimated for the average case summed over all rows.  This is similar to hinting OPTIMIZE FOR UNKNOWN for a temporary table query with a parameter, but the average-case estimation occurs for all joined rows, not just one predicate, and of course there is no parameter variable in a join for OPTIMIZE FOR UNKNOWN to work on anyway.
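
    To make the comparison concrete, here is the sort of hint being referred to (a hypothetical fragment, not part of the main example; it applies average-case estimation to the single parameterized predicate only):

    DECLARE @StartsWith nvarchar(50) = N'T';
     
    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE
        p.Name LIKE @StartsWith + N'%'
    OPTION (OPTIMIZE FOR (@StartsWith UNKNOWN));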

    Question Time

    At this point, you should have some questions.  Why does the temporary table behave so differently from the permanent table?  Do you really have to UPDATE STATISTICS before they are created, and add OPTION (RECOMPILE) when you use temporary tables this way?  Why does the RECOMPILE query hint only produce the desired effect with temporary tables every so often?  Part of the answer to these questions lies with the caching of temporary tables, first introduced in SQL Server 2005.

    CREATE and DROP, Don’t

    I will talk about this in much more detail in my next post, but the key point is that CREATE TABLE and DROP TABLE do not create and drop temporary tables in a stored procedure, if the temporary object can be cached.  The temporary object is renamed to an internal form when DROP TABLE is executed, and renamed back to the same user-visible name when CREATE TABLE is encountered on the next execution.  In addition, any statistics that were auto-created on the temporary table are also cached.  This means that statistics from a previous execution remain when the procedure is next called.  The following code demonstrates the object id of a temporary table remaining the same between executions (there will be a more comprehensive demonstration in my next post):

    ALTER PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #Temp (col1 int NULL);
     
        SELECT ObjectID = OBJECT_ID(N'tempdb..#Temp');
     
        DROP TABLE #Temp;
    END;
    GO
    EXECUTE dbo.Demo;
    EXECUTE dbo.Demo;
    EXECUTE dbo.Demo;

    image

    This is very different from the case where we used a permanent table.  With a permanent table, CREATE and DROP definitely do create and drop a table on each execution of the procedure.  The table is new each time, new statistics are created, and new plans are compiled on each execution.  The permanent table behaviour is probably more intuitive than the apparently strange behaviour of cached temporary tables, but it does allow temporary tables an opportunity to cache plans and temporary objects for reuse, which is not possible when using a permanent table in the way we did.

    Recompilation Thresholds

    Most of you will be familiar with the concept of recompilation thresholds, as described in the Plan Caching in SQL Server White Paper.  Briefly though, cached query plans include a Recompilation Threshold (RT) for statistics considered interesting by the optimizer while compiling the plan.  The query plan also records the cardinality of tables and indexed views referenced by the plan, at the time the plan is compiled.  (I demonstrated how to determine which statistics are considered ‘interesting’ by the optimizer during compilation in a previous post).

    Generally, if the column modification counters (colmodctr) for interesting statistics change by RT or more, the query recompiles.  Similarly, if the table cardinality has changed by RT or more compared with the value stored at the time the cached plan was created, the query plan also recompiles.  The value of RT depends on the type of table (permanent or temporary) and the number of rows (n) in the table or indexed view at the time of compilation.  The following summary of RT values is taken from the White Paper:

    Recompilation Thresholds

    There is another condition not listed above.  If the table is empty at compilation time, RT is set to 1.
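
    For reference, this is my reading of the temporary-table RT values from the White Paper summary above, plus the empty-table special case just mentioned, expressed as a quick sketch (n is the table cardinality at compile time; the next recompile is due once the accumulated change reaches n + RT):

    DECLARE @n bigint = 9; -- cardinality at compile time (e.g. the 'E' compilation discussed below)
     
    SELECT RT =
        CASE
            WHEN @n = 0    THEN 1
            WHEN @n < 6    THEN 6
            WHEN @n <= 500 THEN 500
            ELSE 500 + 0.20 * @n
        END;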

    I say “generally” above because although queries referencing permanent tables recompile when interesting statistics modification counters change by RT or more, this is not the case for cached temporary tables in stored procedures, which only recompile when table cardinality changes by RT or more, unless OPTION (RECOMPILE) is used, when statistics counters come back into play.  If this sounds confusing, that’s because it is.  It may be a bug, but on the other hand, it does seem reasonable that a cached temporary table should not cause a recompilation simply because the procedure is called a number of times with the same parameter, with the number of modifications accumulating on each call to eventually pass RT.  In that specific case, we would probably prefer no recompilations.  I will probably file a Connect item to get the situation with statistics modification counters and temporary tables clarified.  Update: the Connect item can be found here.

    In our experiments, this explains why the vanilla temporary table procedure (with no RECOMPILE or UPDATE STATISTICS) stuck with the same plan, whereas adding RECOMPILE meant the plan would occasionally recompile (when accumulated changes passed RT).  The vanilla procedure was initially called with a parameter value of ‘E’, which compiled a SELECT execution plan when 9 rows were in the temporary table.  Following the rules above for n=9, this means RT was set to 500, so the plan would recompile when n reached 9 + 500 = 509.  The AdventureWorks Product table only contains 504 rows, so that target could not be reached, and the plan never recompiled.

    When OPTION (RECOMPILE) was added, the modification counters did accumulate to eventually pass RT.  The first execution was the same, with the ‘E’ parameter value resulting in 9 rows in the temporary table at compilation time, with RT = 509 as before.  We then ran the procedure a total of five times with a parameter value of ‘T’.  There happen to be 65 product names that start with ‘T’, so five executions resulted in 5 * 65 = 325 changes.  This does not reach the target of 509, but there is a second factor in play here.  Although the temporary table is not dropped at the end of the procedure (just renamed), it is implicitly truncated (whether an explicit DROP TABLE is included in the procedure or not).  This implicit truncation removes 65 rows on each execution, and this action adds to the modification counters.

    On each ‘T’ execution then, 65 modifications are counted when the rows are added to the temporary table, and 65 more when the table is silently cleared at the end of the procedure.  In total, the first four ‘T’ executions result in 65 * 2 * 4 = 520 modifications, which is enough to pass the 509 target.  That in itself is not enough to cause a recompilation though, because the RT test is performed as each query is retrieved from the plan cache for execution, and the final 65-row truncation happens at the end of the procedure, after the SELECT has already completed.  On the next execution, whatever parameter is passed in, a recompilation will occur at the SELECT query.  The OPTION (RECOMPILE) version of the procedure is reproduced below:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name
        OPTION (RECOMPILE);
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Now run the tests as described above:

    -- 9 rows, RT = 509
    EXECUTE dbo.Demo @StartsWith = N'E';
     
    -- 65 rows, 4 times = 260
    -- Plus truncation modifications = 520
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
     
    -- No rows, but plan still recompiled
    EXECUTE dbo.Demo @StartsWith = N'5';

    There are no products that start with ‘5’, but the 520 modifications already performed cause a recompilation anyway:

    image

    No results, and NULL statistics created!  Luckily, the new RT value after this compilation is 1, so just one row is needed to produce a new plan.  This is demonstrated by calling the same procedure again with a ‘K’ parameter value (just one product name starts with ‘K’):

    EXECUTE dbo.Demo @StartsWith = N'K';

    image

    After this compilation, the new RT is 1 + 6 = 7, which can be confirmed by executing the procedure with a parameter value of ‘P’ (there are exactly seven Product rows where the name starts with ‘P’).  And so on.

    I should mention though, that it is just the carry-over of accumulated modifications that requires OPTION (RECOMPILE).  Without this query hint, modifications are still accumulated within a single execution of the procedure (and may result in recompilation local to that single execution) but any accumulation is not carried over to the next procedure execution.

    Statistics Updates

    So far we have explained how the vanilla and OPTION (RECOMPILE) procedures produce the observed behaviour.  The next configuration to examine is where OPTION (RECOMPILE) is not specified, but an UPDATE STATISTICS statement appears before the problematic SELECT query.  It should now be apparent that the placing of the UPDATE STATISTICS statement works because auto-created statistics from a previous execution are cached, and available for update by the next execution.  The one remaining question is why the updated statistics do not cause a recompilation in our example – remember we saw correct statistics, but the execution plan was not rebuilt to reflect the new information.  This is either a bug, or another example of different behaviour for cached temporary tables.  In any case, the observed behaviour does not match that described in the White Paper (the same diagram appears in the SQL Server Internals books from Microsoft Press):

    image

    Whether this is a bug or undocumented designed behaviour (and it is the same on all versions of SQL Server from 2005 to 2012 inclusive) the effect is clear: UPDATE STATISTICS on a cached temporary table in a stored procedure is not enough on its own to force recompilation of a query plan.  A recompilation may still occur for other reasons, for example if the cardinality of the table has changed by RT or more compared with the cardinality value stored with the cached plan.  This cannot happen with our UPDATE-STATISTICS-only test procedure as shown earlier, since the first execution with the ‘E’ parameter value adds 9 rows to the temporary table, setting RT to 509, and the AdventureWorks Product table only has 504 rows in total as mentioned earlier.

    For the moment, at least, we may need to specify OPTION (RECOMPILE) as well as UPDATE STATISTICS if we want to be sure of a plan specific to the contents of a cached temporary table on each execution.

    Summary and Final Thoughts

    Temporary tables in stored procedures have a number of characteristics that may seem unexpected:

    • Temporary objects may be cached across executions, despite explicit CREATE and DROP statements
    • Statistics associated with a cached temporary object are also cached
    • Statistics may be 100% wrong compared with the current contents of a temporary table
    • An implicit TRUNCATE TABLE at the end of the procedure doubles the expected impact on column modification counters
    • It is possible to update cached statistics before the statement that caused them to be created
    • OPTION (RECOMPILE) changes the way column modifications are carried over between procedure calls
    • Manual UPDATE STATISTICS is not enough to force a recompilation

    It is possible that one or more of these behaviours is unintentional, but they have been this way for a very long time, and so are very unlikely to be changed.  Temporary tables behave quite similarly (aside from the UPDATE STATISTICS behaviour) to an in-procedure reference to a permanent table that was created outside the procedure.  It may be that some of the features of temporary tables shown in this post have their roots in a design that sought to mimic permanent table behaviours using non-sharable local temporary tables.  On the other hand, it might be simply that the current arrangement provides the best chances for caching and reuse, and we are expected to take action where we suspect reusing the cached plan, temporary object, and/or statistics may not be optimal.  Who knows.

    How likely you are to be affected by the issues illustrated in this post depends on usage.  A temporary table that contains wildly differing numbers of rows on every execution is unlikely to be affected.  If the recompilation threshold is passed, chances are you will get good statistics and any cached plan will recompile without intervention.  For a plan cached when the temporary table contained 10,000 rows the default threshold is 500 + 0.2 * 10,000 = 2,500 rows.  This means the next execution with more than 12,500 rows or fewer than 7,500 rows will pass the threshold test.  The formula to compute RT values when trace flag 2371 is enabled is unknown to me.

    In principle, in later versions of SQL Server, we can look in the plan cache to check the cardinality of a temporary table and deduce the RT from there, but this is not convenient to do routinely.  In practice, RT moves around a fair bit, so it can be hard to know how often a procedure might use stale cached statistics or a sub-optimal cached plan.  Avoiding temporary table caching removes the risk of stale statistics, but a cached plan may still be used unless the RECOMPILE query hint is used.  Avoiding temporary table caching will also tend to increase contention on tempdb, and can be difficult to enforce in practice.

    Procedures that routinely store relatively small result sets and use cached temporary tables are probably most at risk.  This is quite a common scenario: temporary tables are often used for smaller sets (writing away large temporary tables is more difficult to justify).  One scenario that springs to mind is where a delimited string is passed as a parameter, split to a cached temporary table, which is then used to drive a more complex query.  The plan for the complex query may be very sensitive to the cardinality and statistical distribution of values in the temporary table, so 100% wrong cached statistics and/or inappropriate plan reuse may be a problem.  I’m sure there are many other scenarios.

    It is quite ironic that the poor temporary table execution plans shown in this post are caused by the thing that is perceived to be their greatest strength: the availability of distribution statistics.  Some care is needed when choosing between table variables, temporary tables or another technique; the right choice in any particular circumstance will depend on many factors, only a few of which have been discussed today.  Whichever option you choose, be sure to understand the choice you have made and the trade-offs that result.

    There are also times where the best-performing solution will be to carefully craft one huge query, perhaps using hints and/or specific syntax tricks.  This approach is perfectly valid but requires careful thought as future maintenance may be harder, and data changes over time may mean the carefully-designed manual execution plan gradually (or suddenly) becomes sub-optimal.  My general preference is still for simpler relational queries and, where necessary, for temporary tables over table variables as a starting point, but the details do indeed depend on the details.  If you happen to be attending the PASS Summit 2012, I will be speaking more about these sorts of things in my half day and regular sessions.

    I hope you enjoyed this post; the next one will cover details of temporary table caching that I did not have space for today.  I would like to thank Dave Ballantyne (blog | twitter) for his help with today’s post.

    Further Reading

    Working with tempdb in SQL Server 2005 – TechNet Library
    Plan Caching in SQL Server 2008 – MSDN Library

    © 2012 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Parallel Execution Plans Suck

    Summary: A deep dive into SQL Server parallelism, and a potential performance problem with parallel plans that use TOP.

    There was an interesting question asked by Mark Storey-Smith on dba.stackexchange.com back in October 2011.  He was looking at the execution plan for a query that counts a million rows from a virtual auxiliary table of numbers.  The query below is a slightly-modified version of the one in the original post:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L CROSS JOIN N1 AS R), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 1);

    This particular virtual numbers table is capable of producing up to a (short-scale) billion rows, via a number of cross joins, but the final common table expression N6 limits it to one million with the TOP clause.  The query plan is just a sequence of cross joins of the ten in-memory rows defined by the first common table expression, N1 (click to enlarge):

    One million row serial query plan

    As usual, the Constant Scan operator is used to generate rows in memory without accessing a physical table, but this one has an interesting property: the ‘virtual table’ contains ten rows, but no columns.  The query itself only counts rows, it does nothing at all with any column in those rows, and the optimizer contains logic to only project columns that are needed later in the query plan.  If you look at the query plan in SSMS or SQL Sentry Plan Explorer, you will see that the Constant Scans have a blank output columns list; they project no columns at all.

    Execution Plans Suck

    The plan above does illustrate an important concept in plan-reading: execution plans start executing at the left-most node.  People are often told to read execution plans from the top-right, and that is fine if you just want to follow the flow of data – so long as you bear in mind that the flow of program control starts at the root (far left).

    Rob Farley (blog | twitter) sums this up by saying “execution plans suck”; a reference to the fact that rows are pulled up the tree by parent operators requesting rows, one at a time, from their child operator(s).  Query execution follows a demand-based pipeline model (except in batch mode in parallel 2012 column-store plans, but that is a tangent we will not be pursuing today).  I’m not going to labour the point, but if you are interested to understand this better, take a look at my previous post on the topic, or Brad Schulz’s entertaining overview.

    Let’s look at the execution plan (with runtime statistics) changing the TOP specification from one million to one hundred to make it easier to see what’s going on:

    TOP 100 Plan

    I’m just showing part of the plan for clarity.  There are another four Constant Scans off to the right that all produce one row.  If query plans really did start executing at the top right, we would expect one row from the right-most Constant Scan, ten rows from its parent, and one hundred at the next operator up the tree (working right to left).  As it is, the expected pattern (one, then ten, then one hundred) appears closest to the Top operator.  This only makes sense if a row at a time is sucked up the plan from the root.  The pipelined (row-by-row) model also explains why execution stops after 100 rows; the Top operator in the plan simply stops requesting a new row from its immediate child at that point.

    The Question

    Back to the main thrust of today’s post.  The question arose when Mark ran his query with parallelism enabled:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L CROSS JOIN N1 AS R), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 2);

    That produces an actual execution plan like this (click to enlarge):

    Parallel Top One Million

    The question relates to the actual number of rows shown entering and leaving the Gather Streams exchange operator:

    Gather Streams Exchange

    As expected, one million rows leave the exchange operator, but the plan shows 1,004,588 rows entering it.  So the question is, are the row counts wrong, or if not, where did the rows go?

    The Answer

    As you may know, the answer lies in the fact that exchanges contain buffers.  For efficiency reasons, rows are not passed across threads one by one as in the normal model discussed above.  Instead, producer threads (on the right side of the exchange operator) pack rows one at a time into packets, and push completed packets across the exchange.  Consumer thread(s) on the left side of the exchange retrieve rows one at a time from the current packet on demand, re-establishing the demand-based pipeline model.  The internal class name for the packets, by the way, is Class eXchange PACKET – which gives us the familiar CXPACKET moniker.

    Exploring the detail a little more, we can see from the execution plan that the two independent producer threads to the right of the exchange pack a total of 499,225 + 505,363 = 1,004,588 rows into packets, and the single consumer thread (thread zero) retrieves just the one million rows needed by the Top operator:

    Per Thread Row Counts

    So, the actual row counts shown on the execution plan are correct, and the ‘missing rows’ are accounted for by rows that were added to packets but not ultimately needed by the consumer.  After the one millionth row is passed by the Top to the Stream Aggregate (and remember all these things happen one row at a time), the next time the Top gets control, it starts the process of shutting down the pipeline below it, rather than asking the Gather Streams exchange for another row.  Glossing over the finer details a little, instead of the Top calling a GetRow() method on the Gather Streams iterator, it calls a Close() method instead.

    SQL Server 2008+

    On SQL Server 2005 that is the end of the story.  On SQL Server 2008 and later (including 2012), there is more.  Let’s run the query one more time, but this time with a degree of parallelism of three, rather than two.  Sometimes, we will get a plan that contains this sort of row count arrangement at the Gather Streams:

    DOP 3 Gather Streams

    Much the same as before, a few thousand extra rows are processed by the producer threads than are ultimately needed by the consumer.  That’s fine, of course, the threads in a parallel plan execute independently, so there are bound to be small timing differences that lead to this sort of effect.  Every so often, however, executing this query on SQL Server 2008 or above will produce a result like this:

    Monster Row Count

    Whoa.  Nearly 335 million rows entering the exchange – and the query now runs for 50 seconds or so, having run consistently for around 200ms previously.  Looking at the per-thread actual row counts is revealing:

    Monster Per Thread

    If everything were perfectly balanced, we might expect each of the three producer threads to process around 333,000 rows.  Indeed, thread 1 and thread 2 are in this ballpark area, but thread 3 ground through 334 million rows on its own!  I should mention that there is nothing special about thread 3 (you are as likely to find the huge row count on thread 1 or thread 2, the labels are arbitrary).  Indeed, the problem can occur on any or all threads, as a second example run at DOP 3 shows below:

    DOP 3 Second Example

    DOP 3 Per Thread Row Counts 2

    This time two threads went rogue, resulting in over 667 million rows being processed in a total execution time of 67 seconds.

    Parallelism Problems?

    There are other parallelism (exchange) operators in the plan, though we have concentrated only on the final Gather Streams operator so far.  By the way, Gather Streams is also known as a Start Parallelism operator – a name that might surprise you since it seems to mark the end of parallelism in the plan, rather than the start.  Remember that execution plans suck.  The ‘final’ Gather Streams is in fact the first operator to start executing, and it is responsible for starting the parallel workers and attaching them to the threads that were reserved for this plan at the very start of execution.  Anyway, I digress.  Let’s look at the next exchange in the plan – reading left to right of course.  This is a Repartition Streams exchange operating in Round Robin distribution mode:

    Repartition Streams

    This shows 1,000 rows arriving on the producer side of the exchange, and a total of 668 leaving on the consumer side.  Not shown in the diagram above, the producer side has 340 rows on thread 1, and 330 rows on each of thread 2 and thread 3.  Note that these are not the same threads as the ones we saw numbered the same way before.  The numbering scheme is restarted for each independent parallel zone in the plan (and zones are bounded by a parallelism operator of one sort or another).  Anyway, the point is that the rows are pretty evenly distributed on the producer side of the exchange.

    On the consumer side (row counts illustrated above), things are rather different.  Thread 1 (in this zone) processes 334 rows, thread 2 gets 333, and thread 3 gets only one.  Now these are the same threads as shown in the 667 million row diagram.  I mentioned that parallel zones are bordered by parallelism operators; the current zone is bordered by the Repartition Streams on its right side, and by the Gather Streams on its left.  The same three threads are consumers at the Repartition Streams, and producers at the Gather Streams operator:

    Consumer Producer Branch

    There is a clear relationship between the thread row counts at the consumer side of the Repartition Streams (334, 333, and 1 row) and the row counts at the producer side of the Gather Streams (334 million, 333 million, 338,614 rows).  The two problematic threads have multiplied their row counts by a factor of one million – precisely the effect of the six cross joins in this parallelism zone.  The Constant Scan virtual tables contain ten rows each, and multiplying by ten, six times over, gives a factor of one million.

    Rogue Row Goals

    Thread 3 in the example above ends up processing 338,614 rows.  This number has no special significance, except it shows that this thread did not run this portion of the plan to completion.  If it had, the single row it started with would have ended up as one million rows by the time it had been cross-joined six times with the ten-row Constant Scan table.

    This is the row goal in action (if you need the details of exactly what a row goal is, please see my previous post).  Essentially, though, things like a TOP operator or a FAST n hint set a row goal.  Plans normally run to completion, but row goals modify this, producing a plan that tries to optimize for a certain number of rows rather than the full potential set.  The TOP operator is even more special.  As discussed briefly before, TOP can bring execution to an early end – instead of continuing to ask for rows from its child iterator, it calls the Close() method instead.  This call filters down the tree, and execution comes to an early end.  (Strictly, row goals affect optimizer choices and plan costing rather than being associated with the early end of execution, but I like the phrase ‘rogue row goal’).

    So, thread 3 did not run to completion – it responded to the early Close() call and only processed 338 thousand of the one million rows it could have produced if left alone.  Threads 1 and 2 never received the Close() call, or chose to ignore it.  These two rogues went on to process their full potential row set – 667 million rows – despite the fact that the TOP had seen the million rows it needed and was waiting for operations elsewhere in the parallel plan to stop.  We can see this by looking at the sys.dm_os_waiting_tasks DMV while the long-running query is executing:

    e_waitPortClose

    The output is split across two lines for readability, and shows execution context zero blocked on a CXPACKET wait by both execution context 1 and execution context 3.  Execution context zero is always thread zero – the so-called coordinator thread that runs the serial part of the execution plan to the left of the left-most Gather Streams operator.  Put another way, context zero always runs the part of the plan before (reading left to right) the Start Parallelism operator (and therefore always runs a serial plan).
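
    A query along these lines shows the same kind of information (a sketch; the session_id filter is a placeholder for the session running the test query):

    -- Observe the waits for the long-running parallel query from a second session
    SELECT
        wt.session_id,
        wt.exec_context_id,
        wt.wait_type,
        wt.wait_duration_ms,
        wt.blocking_session_id,
        wt.blocking_exec_context_id,
        wt.resource_description
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE
        wt.session_id = 54  -- hypothetical session id
    ORDER BY
        wt.exec_context_id;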

    OK, so CXPACKET means the thread is involved in a parallelism-related wait.  The extra detail in the resource_description column tells us that the wait is occurring at node id 2.  Checking the execution plan, we see that node 2 is indeed the Gather Streams exchange:

    Node 2

    The wait type of e_waitPortClose means the consumer is waiting for the port to close.  If you are wondering what a port is in this context, I will just say the parallelism architecture is more complex than just CXPACKET – the wiring includes a CXPort class, a CXPipe class, a CXTransLocal (local transport) class and a CXTransRemote class (for the Parallel Data Warehouse edition).  There is also a linked map structure that shows how the various pipes, ports, and transports connect together.  Closing the port is one step in shutting down part of a parallel plan which is running on the other side of an exchange (via a transport, pipe, and port).  The stack trace below shows thread zero waiting for the port to close:

    Stack Trace

    The important things are that (a) this problem does not occur in SQL Server 2005; and (b) a number of changes were made to the internal parallelism implementation in SQL Server 2008.  These changes seem to have introduced a bug, where the consumer can wait for the port to close, but one or more producers either ignore the request, or fail to check for it, and go on to process the whole potential result set instead of stopping early.

    Not Just Nested Loops

    Fellow SQLblogger Joe Chang (blog) suggested in the comments that this problem may only occur with parallel nested loops joins.  The script below reproduces the problem with parallel hash joins:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L JOIN N1 AS R ON L.n % 1 = R.n % 1), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L JOIN N2 AS R ON L.n % 1 = R.n % 1), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L JOIN N3 AS R ON L.n % 1 = R.n % 1), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L JOIN N1 AS R ON L.n % 1 = R.n % 1), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 4, QUERYTRACEON 8649);

    The full execution plan is rather large, but the key part is shown below (click to enlarge):

    parallel hash join plan

    Yes, that is 700 million rows entering the Gather Streams exchange (click to enlarge):

    parallel hash row counts

    The bug does not occur in every query plan with Top and parallelism, but the choice of nested loops join is not the cause.

    Final Thoughts

    The potential for poor performance and excessive processor usage here is obvious; and the chance of hitting the race condition gets worse at higher DOP.  With eight threads per parallel zone (DOP 8), I hit this issue almost every time on SQL Server 2008, 2008 R2, and 2012.  Because this behaviour does not occur on SQL Server 2005, but does on 2008 and later, I have filed this as a bug on Connect:

    https://connect.microsoft.com/SQLServer/feedback/details/740234/poor-performance-with-parallelism-and-top

    Further Reading:

    http://sqlblog.com/blogs/paul_white/archive/2010/08/05/iterators-query-plans-and-why-they-run-backwards.aspx
    http://bradsruminations.blogspot.co.nz/2010/11/second-in-life-of-query-operator.html
    http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server
    http://blogs.msdn.com/b/craigfr/archive/2006/10/25/the-parallelism-operator-aka-exchange.aspx
    http://sqlblog.com/blogs/paul_white/archive/2010/08/18/inside-the-optimiser-row-goals-in-depth.aspx

    © 2012 Paul White
    Twitter: @SQL_Kiwi (with an underscore)
    Email: SQLkiwi@gmail.com (no underscore)

  • Query Optimizer Deep Dive - Part 4

    This is the final part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 2 Part 3

    Beating the Optimizer

    Our test query produces an optimized physical execution plan that is quite different from the logical form of the query.  The estimated cost of the execution plan shown below is 0.0295 units.

    Optimized Query Plan

    Since we know the database schema very well, we might wonder why the optimizer did not choose to use the unique nonclustered index on names in the product table to filter rows based on the LIKE predicate.  We could use an index hint to force the name index to be used:
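
    As a sketch, the hinted query looks something like this (assuming the unique nonclustered index on the name column is called AK_Product_Name, as it is in the AdventureWorks samples):

    -- Force the nonclustered name index to be used for the Product table
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p WITH (INDEX(AK_Product_Name))
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name;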

    Forced Index Hint

    That’s great, and no doubt the index seek is cheaper than the scan we had previously, but the optimizer has still chosen to use a merge join, and that means having both inputs sorted on Product ID.  The result of the index seek is ordered by name (the index key) rather than Product ID, so a sort is required.  It looks like the new sort adds a little more cost than the seek saves over the scan, because the estimated cost of the query plan with the index hint is 0.0316 units.

    Naturally, these numbers are rather small since AdventureWorks is not a large database, but these differences can be important in real systems.  Anyway, let’s persist with the index seek idea; why is the optimizer so keen on a merge join, even though it involves an extra sort, and we don’t have an ORDER BY Product ID clause on our query?  Without a top-level ORDER BY, we are giving the optimizer the freedom to return results in any order that is convenient – perhaps we can do better by forcing the index seek and a hash join instead of a merge join?
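
    One way to try that idea (again a sketch, keeping the hypothetical index hint from above and adding a join hint) is:

    -- Keep the name index, but ask for a hash join instead of a merge join
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p WITH (INDEX(AK_Product_Name))
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (HASH JOIN);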

    Forced Hash Join

    Well the sort has gone, so the plan looks visually a little simpler, but the estimated cost has increased again, to 0.0348 units.  Hash join has quite a high start-up cost (and it requires a memory workspace grant).  We could try other things, but it certainly seems that the optimizer had it right to begin with in this case.

    The manual exploration above shows that the optimizer does generally find a good plan quickly (and sometimes it may even find the best possible plan). The terms ‘good’ and ‘best’ here are measured in terms of the optimizer’s own cost model. Whether one particular plan shape actually executes faster on a given system is another question. I might find, for example, that the first merge join plan runs fastest for me, whereas you might find the seek and sort runs fastest for you. We might both find that which is faster depends on whether the indexes and data needed are in memory or need to be retrieved from persistent storage. All these things aside, the important thing is that we are both likely to find that the optimizer’s plans are pretty good most of the time.

    Models and Limitations

    There are three principal models used by the optimizer to provide a basis for its reasoning: cardinality estimation, logical relational equivalences, and physical operator costing.  Good cardinality estimation (row count expectations at each node of the logical tree) is vital; if these numbers are wrong, all later decisions are suspect.  Fortunately, it is relatively easy to check cardinality estimation by comparing actual and estimated row counts in query plans.  There are some subtleties to be aware of – for example when interpreting the inner side of a nested loops join, at least in SSMS.  If you use the free SQL Sentry Plan Explorer tool, many of these common misinterpretations are handled for you.

    The model used in cardinality estimation is complex, and contains all sorts of hairy-looking formulas and calculations.  Nevertheless, it is still a model, and as such will deviate from reality at least to some extent.  I’ll have more to say later on about what we can do to help ensure good cardinality estimates, but for now we will just note that the model has its roots in relational theory and statistical analysis.

    Relational equivalences (such as A inner join B = B inner join A) are the basis for exploration rules in the cost-based optimizer.  Not all possible relational transformations are included in the product (remember the goal of the optimizer is to find a good plan quickly, not to perform an exhaustive search of all possible plans).  As a consequence, the SQL syntax you use will often affect the plan produced by the optimizer, even where different SQL syntaxes express the same logical requirement.  Also, the skilled query tuner will often be able to do better than the optimizer, given enough time and perhaps a better insight to the data.  The downside of such manual tuning is that it will usually require manual intervention again in the future as data volumes and distributions change.

    Physical operator costing is very much the simplest of the three models, using formulas that have been shown to produce good physical plan selections for a wide range of queries on a wide range of hardware.  The numbers used probably do not closely model your hardware or mine, but luckily that turns out not to be too important in most cases.  No doubt the model will have to be updated over time as new hardware trends continue to emerge, and there is some evidence in SQL Server 2012 show plan output that things are heading in that direction.

    Assumptions

    All models make simplifying assumptions, and the cardinality estimation and costing models are no different.  Some things are just too hard to model, and other things just haven’t been incorporated into the model yet.  Still other things could be modelled, but turn out not to add much value in practice and cost too much in terms of complexity or resource consumption.  Some of the major simplifying assumptions that can affect real plan quality are:

    • All queries start executing with a cold cache
      This isn’t as crazy as it sounds.  Fetching data from disk tends to dominate the overall cost of a query, modelling the amount of data that can be expected to be in cache already is hard, and this assumption does at least affect everything equally.  The optimizer does contain some logic to account for pages that might be in cache after the first access.
    • Statistical information is independent
      Correlations do frequently exist between columns in real databases, so this assumption can be problematic.  Multi-column statistics, filtered indexes, and indexed views can sometimes help with this.
    • Distribution is uniform
      This is assumed where the system has no information to the contrary.  One example: costing assumes that seeks (lookups) into an index are randomly distributed throughout the full range of the index.

    Helping the Optimizer

    There are three approaches to working with the optimizer:

    1. Ignore it.  It works well out of the box with default settings and automatic statistics.
    2. Override it using syntax tricks, hints, and plan guides.
    3. Provide it with the best information you can, and override it for just a small number of problematic queries.

    All three are valid approaches in different circumstances, though the third option is probably the one to recommend as the best starting point.  There is much more to helping the optimizer than just making sure statistics are up-to-date, however:

    Use a relational design

    This is probably the biggest single item.  The various models all assume that your database has a reasonably relational design, not necessarily 3NF or higher, but the closer your structures are to relational principles, the better.  Cardinality estimation works best with simpler relational operations like joins, projects, selects, unions and group by.  Avoid complex expressions and non-relational query features that force cardinality estimation to guess.  Also remember that the cost-based optimizer’s exploration rules are based on relational equivalences, so having a relational design and writing relational queries gives you the best chance of leveraging the hard work that has gone into those rules.  Some features added in 2005 (e.g. ranking functions) operate on sequences rather than multi-sets (hence the Sequence Project physical operator); the original rules all work with multi-sets, and the seams between the two approaches often show through.

    Use constraints

    Use check constraints, foreign keys, unique constraints to provide the optimizer with information about your data.  Simplification and exploration rules match patterns in the logical tree, and may also require certain properties to be set on the matched nodes.  Constraints and keys provide some of the most powerful and fundamental logical properties – omitting these can prevent many of the optimizer’s rules from successfully transforming to an efficient physical plan.  If an integer column should only contain certain values, add a check constraint for it.  If a foreign key relationship exists, enforce it.  If a key exists, declare it.  It is not always possible to predict the benefits of providing this type of optimizer information, but my own experience is that it is much greater than most people would ever expect.
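
    As a purely illustrative sketch (the table, column, and constraint names here are all hypothetical):

    -- Declare what you know about the data so the optimizer can use it
    ALTER TABLE dbo.OrderLines
        ADD CONSTRAINT CK_OrderLines_Quantity CHECK (Quantity > 0);

    ALTER TABLE dbo.OrderLines
        ADD CONSTRAINT FK_OrderLines_Orders
        FOREIGN KEY (OrderID) REFERENCES dbo.Orders (OrderID);

    ALTER TABLE dbo.Orders
        ADD CONSTRAINT UQ_Orders_OrderNumber UNIQUE (OrderNumber);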

    Statistics, indexes and computed columns

    Provide more than just default-sampled automatically-created statistics where appropriate (for example where distribution is unusual or correlations exist).  Create indexes that will provide potentially useful access paths for a wide range of queries.  If you have expressions on columns in your WHERE clause, consider adding a computed column that matches the expression exactly.  The computed column does not have to be persisted or indexed to be useful; the system can auto-create statistics on the computed column, avoiding cardinality guessing.
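
    For example (a hypothetical table and expression), a computed column that matches the WHERE clause expression exactly allows statistics to be auto-created on it:

    -- Queries filter on YEAR(OrderDate); without help, the optimizer guesses
    ALTER TABLE dbo.Orders
        ADD OrderYear AS YEAR(OrderDate);

    -- Statistics can now be auto-created on OrderYear, improving the
    -- cardinality estimate for predicates like this one
    SELECT COUNT_BIG(*)
    FROM dbo.Orders
    WHERE YEAR(OrderDate) = 2011;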

    Deep Trees

    Large, complex queries produce large, complex logical trees.  Any errors tend to multiply (perhaps exponentially) as the size of the tree increases, the search space of possible plans expands greatly, and things just become much more difficult in general.  Breaking a complex query into smaller, simpler, more relational, steps will generally get greater value out of the optimizer.  A side benefit is that small intermediate results stored in a temporary table can have statistics created, which will also help in many cases.  There will always be cases where a large complex query is required for ultimate performance, but very often the performance difference is relatively small and may not be worth the future maintenance costs as hand-tuned monolithic queries tend to be fragile.

    Opaque operators and new features

    User-defined functions (other than the in-line variety) may seem convenient, but they are almost completely opaque to the optimizer – it has to guess at the cardinality and distribution of rows produced.  Newer features also tend to have much shallower support in the engine than longer-established features that have been around for multiple versions of the product.  By all means use all the shiny new features, just be aware that combining them may produce plans of variable quality.

    Trace Flags

    To summarize the flags used in this series (all assume 3604 is also active):

    7352 : Final query tree
    8605 : Converted tree
    8606 : Input, simplified, join-collapsed, and normalized trees
    8607 : Output tree
    8608 : Initial memo
    8615 : Final memo
    8675 : Optimization stages and times

    The above were used in the presentation because they all work from 2005 to 2012.  There are a large number of other optimizer-related flags (some of which only work on 2012).  Some are listed below for people that like this sort of thing:

    2373 : Memory before and after deriving properties and rules (verbose)
    7357 : Unique hash optimization used
    8609 : Task and operation type counts
    8619 : Apply rule with description
    8620 : Add memo arguments to 8619
    8621 : Rule with resulting tree

    As usual, these are undocumented, and unsupported (including by me!) and purely for educational purposes.  Use with care at your own risk.

    Final Thoughts

    I hope you have gained some insight and intuition for the workings of the query optimizer: logical trees, simplification, cardinality estimation, logical exploration and physical implementation.  The geeky internals stuff is fun, of course, but I rather hope people came away from these sessions with a better understanding of how query text is transformed to an executable plan, and how relational designs and simpler queries can help the optimizer work well for you.  Taking advantage of the optimizer frees up time for more productive things – new projects, tuning the odd interesting query, whatever it is you would rather be doing than fighting the same query-tuning battles over and over again.

    There are some great resources out there to learn more about SQL Server. Make full use of them to continue to improve your technical skills and just as importantly, experiment with things to build your experience level. Take a deeper look at query plans and the properties they contain; there is a wealth of information there that sometimes requires a bit of thinking and research to understand, but the insights can be invaluable.

    The original slides and demo code are again attached as a zip file below.  Thanks for reading; all comments and feedback are very welcome.

    Finally, I would like to thank Adam Machanic (blog | twitter) for introducing me to the QUERYTRACEON syntax all that time ago, and Fabiano Amorim (blog | twitter) for his email last year that kick-started the process of developing this presentation and blog series.

    Links to other parts of this series: Part 1 Part 2 Part 3

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive – Part 3

    This is the third part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 2 Part 4

    Storage of Alternative Plans

    We saw in part 2 how optimizer rules are used to explore logical alternatives for parts of the query tree, and how implementation rules are used to find physical operations to perform each logical step.  To keep track of all these options, the cost-based part of the SQL Server query optimizer uses a structure called the Memo.  This structure is part of the Cascades general optimization framework developed by Goetz Graefe.  The Memo provides an efficient way to store many plan alternatives via the use of groups.

    Each group in the Memo initially contains just one entry – a node from the input logical tree.  As exploration and implementation phases progress, new groups may be added, and new physical and logical alternatives may be added to existing groups (all alternatives in a group share the same logical properties, but will often have quite different physical properties).  Groups are also shared between plan alternatives where possible, allowing Cascades to search more plans in the same time and space compared with other optimization frameworks.

    Continuing with our example query, the input tree to cost-based optimization is first copied into the Memo structure.  We can see this structure using trace flag 8608:

    USE AdventureWorks2008R2;
    GO
    SET NOCOUNT ON;
    DBCC FREEPROCCACHE;
    DBCC TRACEON(3604);
    GO
    -- Initial memo contents
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (QUERYTRACEON 8608);

    As a reminder, this is the logical input tree to cost-based optimization shown in part 2:

    Cost-Based Optimization Input Tree

    This is copied to the Memo, one group per logical node, illustrated below:

    Optimizer Initial Memo Contents

    The group numbers and linked structure shown are obtained directly from the trace flag output:

    Trace Flag 8608 Output

    Each group has an entry number (all zero above since there is only one logical entry per group at this stage), with a logical operation (e.g. LogOp_Join), and the group numbers of any child groups (e.g. the logical join in group 13 references its two inputs, groups 8 and 9, and a logical comparison defined by the sub-tree starting at group 12).

    Optimization Phases

    Once the initial Memo has been populated from the input tree, the optimizer runs up to four phases of search.  There is a documented DMV, sys.dm_exec_query_optimizer_info, that contains a number of counters specific to query optimization.  Four of these counters are ‘trivial plan’, ‘search 0’, ‘search 1’, and ‘search 2’.  The value for each counter keeps track of the number of times a search phase was entered.  By recording the values before and after a specific optimization, we can see which phases were entered.  The current counter values on one of my test SQL Server instances are shown below, as an example:
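
    A query along these lines returns those values (filtered here to just the four phase counters):

    -- Snapshot the optimizer phase counters (instance-wide values)
    SELECT
        oi.counter,
        oi.occurrence,
        oi.value
    FROM sys.dm_exec_query_optimizer_info AS oi
    WHERE
        oi.counter IN (N'trivial plan', N'search 0', N'search 1', N'search 2');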

    sys.dm_exec_query_optimizer_info

    We look at each of the four phases in a bit more detail next:

    Search 0 – Transaction Processing

    This phase is primarily concerned with finding good plans for OLTP-type queries, which usually join a number of tables using a navigational strategy (looking up a relatively small number of rows using an index).  This phase primarily considers nested-loops joins, though hash match may be used when a loops join implementation is not possible.  Many of the more advanced (and costly to explore) optimizer rules are not enabled during this search phase – for example search 0 never considers indexed view matching or parallel plans.

    Search 1 – Quick Plan (also known as Complex Query I)

    This search phase can use most or all of the available rules, can perform limited join reordering, and may be run a second time (to consider parallel plans only) if the first run produces a plan with a high enough cost.  Most queries find a final plan during one of the Search 1 runs.

    Search 2 – Full Optimization

    This phase uses the most comprehensive search configuration, and may result in significant compilation times in some cases.  The search is either for a serial or parallel plan, depending on which type was cheaper after search 1.

    The scripts accompanying this series show another way to see which phases were run, using trace flag 8675.  The output provides some extra insight into how things work (for example it shows the number of optimization moves (tasks) made in each stage).  The only documented and supported way to see the search phases is via the DMV, however.

    Entry and Termination Conditions

    Each phase has entry conditions, a set of enabled rules, and termination conditions.  Entry conditions mean that a phase may be skipped altogether; for example, search 0 requires at least three joined tables in the input tree.  Termination conditions help to ensure the optimizer does not spend more time optimizing than it saves – if the current lowest plan cost drops below a configured value, the search will terminate early with a ‘Good Enough Plan Found’ result.

    The optimizer also sets a budget at the start of a phase for the number of optimization ‘moves’ it considers sufficient to find a pretty good plan (remember the optimizer’s goal is to find a good enough plan quickly).  If the process of exploring and implementing alternatives exceeds this ‘budget’ during a phase, the phase terminates with a ‘Time Out’ message.  Early termination (for whatever reason) is part of the optimizer’s design, completely normal, and not generally a cause for concern.

    From time to time, we might wish that the optimizer had different goals – perhaps that we could ask it to continue searching for longer – but this is not how it works.  It is all too easy to over-spend on optimization (finding transformations is not hard – finding ones that are robustly useful in a wide range of circumstances is), and full cost-based exploration is a memory and processor-intensive operation.  The optimizer contains many heuristics, checks and limits to avoid exploring unpromising alternatives, to prune the search space as it goes, and ultimately produce a pretty good plan pretty quickly, most of the time, on most hardware, in most databases.

    Costing

    First a quick summary: The optimizer runs up to four phases, each of which performs exploration using rules (finding new logical alternatives to some part of the tree), then a round of implementation rules (finding physical implementations for a logical part of the tree).  New logical or physical alternatives are added to the Memo structure – either as an alternative within an existing group, or as a completely new group.  Note that the Memo allows the optimizer to consider very many alternatives at once in a compact and quick-to-search structure; the optimizer does not just work on one overall plan, performing incremental improvements (this is a common misconception).

    Having found all these alternatives, the optimizer needs a way to choose between them.  Costing runs after each round of implementation rules, producing a cost value for each physical alternative in each group in the memo (only physical operations can be costed, naturally).  Costing considers factors like cardinality, average row size, expected sequential and random I/O operations, processor time, buffer pool memory requirements, and the effect of parallel execution.

    Like many areas of optimization, the costing calculations are based on a complex mathematical model.  This model generally provides a good basis on which to compare alternatives internally, regardless of the workload or particular hardware configuration.  The costing numbers do not mean very much by themselves, so it is generally unwise to compare them between statements or across different queries.  The numbers are perhaps best thought of as a unit-less quantity, useful only when comparing alternative plans for the same statement.

    The model is just that of course: a model, albeit one that happens to produce very good results for most people most of the time.  The slide deck that accompanies this series contains details of some of the simplifying assumptions made in the model.

    Final Memo Contents

    At the end of each search phase, the Memo may have expanded to include new logical and physical alternatives for each group, perhaps some completely new groups.  We can see the contents after each phase using trace flag 8615:

    -- Final memo
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8615);

    The final Memo (after each phase, remember) can be quite large, despite the techniques employed to constrain the search space.  Our simple test query only requires a single run through search 1, and terminates early after 509 moves (tasks) with a ‘Good Enough Plan Found’ message.  Despite that, the Memo contains 38 groups (up from 18) with many groups containing several alternatives.  The extract below highlights just the groups that relate directly to the join implementation:

    Final Memo Trace Flag 8615 Contents

    Looking at the green shaded areas, group 13 item 0 was the original (joining groups 8 and 9 and applying the condition described by the sub-tree starting with group 12).  It has gained a logical alternative (via the Join Commute rule, reversing the join input groups), and two surviving physical implementations: a one-to-many merge join as item 7, with group 8 option 2 as the first input, and group 9 option 2 as the second; and a physical inner Apply driven by group 8 option 6, on group 30 option 2.  Group 20 is completely new – another one-to-many merge join option, but this time with inputs from groups 19.4 and 8.2, with group 12 again supplying the join condition.  Again, the scripts accompanying this series can be used to explore the details further, if you wish.

    The final selected plan consists of physical implementations chosen by starting with the lowest-cost physical option in the root group (18, shaded pink above).  The sub-tree from that point has a total estimated cost of 0.0295655 (this is the same cost shown in the final graphical execution plan).  The plan is produced by following the links in the Memo, from group 18 option 4, to group 20 option 6, and so on to the leaves:

    Good Enough Plan Found

    Optimized Query Plan

    End of Part 3

    The original slides and demo code are attached below as a zip file.

    Links to other parts of this series: Part 1 Part 2 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive – Part 2

    This is the second part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 3 Part 4

    Cost-Based Optimization Overview

    The input to cost-based optimization is a tree of logical operations produced by the previous optimization stages discussed in part one.  Cost-based optimization takes this logical tree, explores logical alternatives (different logical tree shapes that produce the same results), generates physical implementations, assigns an estimated cost to each, and finally chooses the cheapest physical option overall.

    The goal of cost-based optimization is not to find the best possible physical execution plan by exploring every possible alternative; rather, the goal is to find a good plan quickly.  This approach gives us an optimizer that works pretty well for most workloads, most of the time.  If you think about it, that’s quite an achievement – after all, my database is likely very different from yours, and we probably run on quite different hardware as well.  The point about finding a good plan quickly is also important – generally, we would not want the optimizer to spend hours optimizing a query that runs for only a few seconds.

    This design has a number of important consequences.  First, a skilled query tuner will often be able to come up with a better plan than the optimizer does.  In some cases, this will be because the human can reason about the query in a way the optimizer cannot.  Other times, it is simply a question of time – we might be happy spending half a day finding a great execution plan for a crucial query, whereas the optimizer places strict limits on itself to avoid spending more time on optimization than it saves on a single execution.

    Perhaps a future version of SQL Server will allow us to configure the optimizer to spend extra time on a particular query – today, we have to use query hints and ‘manual optimizations’ to encourage a particular execution plan shape.  The down side to circumventing the optimizer’s natural behaviour in this way is that we then have to take responsibility for ensuring that the execution plan remains good in the future, as data volumes and distributions change.  In most systems, it is best to limit the number of manually-tuned queries to a minimum, and thoroughly document any tricks you use to obtain a particular plan.

    Input Tree to Cost-Based Optimization

    Back to the task at hand.  We have a sample query (reproduced below) that has been through all the previous optimization stages, did not qualify for a trivial plan, and so needs to go through the cost-based optimization process.

    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name;

    The tree of logical operations passed to the optimizer looks like this:

    Cost-Based Optimizer Input Tree

    This looks very similar to the original logical tree seen in part one, though cardinality estimates have been added to each primary node, and the tree layout has been expanded a little into a form that is easier for the system to work with.  In addition to simple cardinality estimates, remember that each node also has statistics objects (histograms and frequency information) derived from the familiar statistics associated with the tables in the query.

    The tree above was built from information obtained using trace flags 3604 and 8606 (see part one for more details).  Trace flag 8606 shows the tree at various stages; the diagram above is built from the view after project normalization:

    Trace Flag 8606 Output

    Properties

    There are also a number of other ‘properties’ associated with each node, that have been added by previous optimization stages.  These properties include logical attributes, such as:

    • Output columns and expressions
    • Uniqueness (key) information
    • Type information and nullability
    • Functional dependencies
    • Domain ranges for each column or expression

    There are also some physical properties that may be present on each node, for example:

    • Sort order at each node
    • Partitioning information
    • Halloween protection

    The Halloween protection information is worth explaining in a bit more detail.  A query generally executes as a pipeline, with rows being requested one at a time from the top of the tree – so rows are in effect pulled up the tree one at a time.  This pipeline arrangement has a number of important advantages, but a problem can arise where the query modifies data: changes made by the query can affect rows being read by the same query – the classic example is a query that adds 10% to the salaries of every employee.  When processed as a pipeline, we can get stuck in an infinite loop as rows that have had 10% added re-qualify for reading.  This problem happened to be first discovered on 31 October 1976 and became known as the Halloween Problem.
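
    The classic statement looks something like this (a hypothetical table, with the update driven by reading an index on the salary column):

    -- Without Halloween protection, each 10% raise can move a row ahead of
    -- the current read position in the salary index, so the same row may be
    -- read (and raised) again and again
    UPDATE dbo.Employees
    SET Salary = Salary * 1.10;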

    One solution is to separate operations that read data from those that update data by reading all rows into temporary storage before performing any updates.  In SQL Server, this simple solution typically manifests as an Eager Table Spool in the query plan – all rows are written eagerly to temporary storage before any updates are performed.  This is a bit of a sledgehammer solution though, so the optimizer keeps track of which columns need protection from the Halloween problem (and how much protection they need) at each node by setting properties.  Some logical operations (like sorting) naturally mean that all input rows have to be read before the first output row can be produced.  These operations provide Halloween protection naturally, so an explicit Eager Table Spool would not be necessary.  There are a number of logical operations that can provide some degree of Halloween protection, and properties are used to ensure that just enough protection is provided, at minimum cost.

    Rules

    The optimizer contains rules to transform trees.  There are four classes of rule:

    • Simplification
    • Exploration
    • Implementation
    • Physical property enforcement

    Simplification rules transform some part of the logical tree to a simpler logical form, and are responsible for the simplification transforms we saw in part one.  Exploration rules run only during cost-based optimization, generating new logical equivalents for some part of the existing logical tree.  Implementation rules produce a physical operation (like a hash or merge join) for a logical operation (a logical join).  Property enforcement rules introduce new elements to the logical tree to enforce some desired physical property at that point, for example to enforce a particular sorted order of rows (as might be required by a merge join or stream aggregate).

    There are 395 rules in SQL Server 2012, most of which perform quite simple operations; it is the fact that many rules can be run in various orders that accounts for the apparent complexity of optimizer behaviour.  The way that simple rules can combine to produce complex behaviours reminds me of Conway’s Game of Life – a cellular automaton with only four rules that nevertheless can produce extraordinarily complex and beautiful patterns as the rules are applied over and over.

    Exploration Rule Examples

    One example of a simple exploration rule is SELonJN, a rule that merges a suitable relational SELECT (row filter) into a logical join operation:

    SELonJN Exploration Rule

    Another common exploration rule that generates a new logical alternative is JoinCommute:

    JoinCommute Exploration Rule

    As the name suggests, JoinCommute explores a different logical join order, exploiting the fact that A JOIN B is equivalent to B JOIN A for inner joins.  Though logically equivalent, the different join orders may have different performance characteristics once the logical alternatives are translated to a physical implementation by a later implementation rule.  A third exploration rule that is relevant to our test query is GbAggBeforeJoin, a rule that explores the possibility of pushing an aggregation operation under a join:

    GbAggBeforeJoin Transformation

    Implementation Rule Examples

    As mentioned previously, implementation rules transform part of a logical tree into a physical alternative:

    Optimizer Implementation Rules

    The diagram shows three join implementation rules, JNtoNL (nested loops), JNtoHS (hash join), JNtoSM (sort-merge join); two implementations for Group-By Aggregate, GbAggToStrm (Stream Aggregate) and GbAggToHS (Hash Aggregate); SelectToFilter, GetToScan, and GetIdxToRng.

    Which Rules Were Used?

    If we want to see which rules were used when optimizing a query, one option is to use an undocumented view which shows the number of times a rule has been asked to estimate how valuable it might be in the current context (its ‘promise’ value), the number of times the rule was used to generate an alternative section of the tree (‘built’), and the number of times the output of the rule was successfully incorporated into the search space (‘succeeded’).  A sample of the contents of the sys.dm_exec_query_transformation_stats DMV is shown below:

    sys.dm_exec_query_transformation_stats

    By taking a snapshot view of this information before and after optimizing a particular query, we can see which rules were run.  The DMV is instance-wide however, so you would need to run these tests on a system to which you have exclusive access.  The scripts that accompany this series of posts contain a complete test rig to show the rules used by each query.
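
    The basic idea is sketched below (the accompanying scripts do this more robustly; this version simply assumes an otherwise idle instance):

    -- Snapshot the rule statistics
    SELECT name, promised, built_substitute, succeeded
    INTO #Before
    FROM sys.dm_exec_query_transformation_stats;

    -- ...compile the query under test here, e.g. with OPTION (RECOMPILE)...

    -- Show the rules exercised since the snapshot
    SELECT
        qts.name,
        promised = qts.promised - b.promised,
        built = qts.built_substitute - b.built_substitute,
        succeeded = qts.succeeded - b.succeeded
    FROM sys.dm_exec_query_transformation_stats AS qts
    JOIN #Before AS b ON b.name = qts.name
    WHERE
        qts.succeeded > b.succeeded
    ORDER BY
        qts.name;

    DROP TABLE #Before;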

    End of Part 2

    Part three in this series covers the Memo structure and optimization phases in detail.  The original presentation slides and demo code are contained in the zip file shown below.

    Links to other parts of this series: Part 1 Part 3 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive - Part 1

    This is the first in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 2 Part 3 Part 4

    Introduction

    The motivation behind writing these sessions is that I have found that relatively few people have a good intuition for the way the optimizer works, partly because the official documentation is rather sparse, and partly because what information is available is dispersed across many books and blog posts. The content presented here is very much geared to my preferred way of learning – it shows the concepts in what seems to me to be a reasonably logical sequence, and provides tools to enable the interested reader to explore further, if desired.

    When we write a query, we are not writing a computer program that can be directly executed.  SQL is a declarative language used to logically describe the results we want to see.  SQL Server goes through a process of compilation and optimization to create a physically executable implementation of the stated logical requirement.  To put it another way, a SQL query is the specification for an executable program that SQL Server writes for us.  The SQL Server component responsible for finding an efficient physical execution plan for a given logical query requirement is the Query Optimizer.

    The SQL Server Core Engine

    The diagram below shows conceptually where the Query Optimizer sits in the core SQL Server engine:

    SQL Server Core Engine Diagram

    Language Processing is concerned with things like parsing the text of the query to make sure it is syntactically valid, binding table references to real database objects, deriving types for expressions, semantic analysis (for example checking that the query is not trying to execute a table), and binding GROUP BY references to the appropriate logical scope.  The Query Optimizer aims to find a good physical execution plan that matches the logical semantic of the query, and the Query Executor is responsible for running that plan to completion.  Below that, the Storage Engine provides access to physical storage, together with locking and transaction services.  Finally, SQL-OS is the layer that provides threading, memory management, and scheduling services.

    Using the same colours for the components shown above, a very high-level overview of query processing looks like this:

    Query Processing Overview

    This series of blog posts is mostly concerned with the “Optimize” step above.  It takes as its input a tree of logical operations produced by the previous stages.

    Optimization Pipeline

    The diagram below shows the principal stages within query optimization, starting with the bound tree produced by the parser and algebrizer.  The coloured steps are the ones we will explore in depth – the right hand side of the diagram lists a bit more detail that I talk through when presenting this material live:

    Query Optimization Pipeline

    Input Tree

    We start by looking at the bound logical tree (input tree).  To make the discussion a little less abstract, we will use a test query issued against the AdventureWorks sample database.  This simple query shows products and the quantity in stock, limited to products with names that start with a letter between A and G:

    -- Test query
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name;

    The input tree generated by this query is shown below, slightly simplified:

    Optimizer Logical Input Tree

The operations in this tree are all purely logical, and have their roots in Relational Theory.  The GET operation logically reads an entire table, though you should not yet be thinking in terms of physical operations such as a scan or a seek.  The JOIN is a cartesian product, logically joining every row in one table with every row in the other.  The SELECT is not a SQL statement; it is a relational operation: filtering rows based on some condition (known as a predicate).  Here, we are applying two SELECT operations to filter on matching product IDs and rows where the product name starts with a letter between A and G.  The next logical operation is a Group-By Aggregate, an extension to the basic relational algebra that logically groups rows by some common attribute, and optionally computes an aggregate expression for each group.  Finally, we have a relational Project operation, which filters columns – here we only need the Name and Total columns, and those are the only two returned in the logical result set.

    To see this tree directly, we can use undocumented trace flag 8605.  This flag also requires the common trace flag 3604 to produce output to the SSMS messages window.  We can use another undocumented syntax, the QUERYTRACEON hint, to set the trace flag just for this query:

    -- Input tree (ISO-92)
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8605);

    Now, in the SSMS messages tab, we see a textual representation of the tree of logical operators I drew with coloured boxes before:

    Trace Flag 8605 Output

    There are a few minor differences between this and the previous diagram.  In particular, the expression computed by the aggregate is added by a separate projection (omitted for clarity before), and the SQL-92 join syntax means the join predicate is more tightly bound to the logical join.  Rewriting the query using SQL-89 syntax highlights the second difference, producing a tree closer to the cartesian product followed by a relational SELECT:

    -- Input tree (ISO-89)
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8605);

    Input Tree SQL-89 Syntax

    Simplification

    The first step of the optimization pipeline we will look at is simplification, where the optimizer looks to rewrite parts of the logical tree to remove redundancies and move logical operations around a bit to help later stages.  Major activities that occur around the time simplification occurs (I have taken a little artistic licence here to group a few separate stages together) include:

    • Constant folding
    • Domain simplification
    • Predicate push-down
    • Join simplification
    • Contradiction detection

Constant folding is the process of evaluating an expression during optimization, rather than repeatedly at execution time.  In the following example, the complex expression in the WHERE clause can safely be evaluated early, resulting in “WHERE p.Name LIKE ‘D%’”:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE p.Name LIKE SUBSTRING(LEFT(CHAR(ASCII(CHAR(68))), 1) + '%', 1, 2);

    Domain simplification enables the optimizer to reason about the range of valid values a column or expression can take.  In the next query, the only value for Product ID that logically matches all the predicates in the extended WHERE clause is the single value ‘400’.  The query plan shows that simplification has reduced the three predicates to a single equality predicate, “WHERE p.ProductID = 400”:

    SELECT TOP (10) * 
    FROM Production.Product AS p
    WHERE p.ProductID BETWEEN 300 AND 400
    AND p.ProductID BETWEEN 200 AND 500
    AND p.ProductID BETWEEN 400 AND 600;

    Predicate push-down involves pushing relational SELECTs (logically filtering rows based on a predicate) down the logical tree toward the leaves.  The heuristic here is that this is always a good thing to do (at least as far as expressions that reference a column are concerned) because it eliminates rows early, reducing the number for later logical operations.  Moving SELECTs closer to the logical GETs (reading tables) also helps index and computed-column matching.
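
As a small illustration (not part of the original demos), the LIKE predicate below is written against a derived table, but the optimizer pushes it down to the underlying read of Production.Product, so the rows are filtered as they are read and the predicate can be matched against an index on Name:

-- The predicate is written above the derived table...
SELECT v.Name
FROM
(
    SELECT p.Name, p.ListPrice
    FROM Production.Product AS p
) AS v
-- ...but is pushed down to the GET on Production.Product
WHERE v.Name LIKE N'[A-G]%';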

Join simplification allows the optimizer to remove unnecessary joins, convert logical outer joins to simpler inner joins where the NULLs introduced by the outer join are later rejected by another feature in the query, and remove provably empty subexpressions.  This is an incredibly powerful optimizer facility, with its roots again in relational theory.  Rob Farley has written (SQL Server MVP Deep Dives) and presented (http://bit.ly/SimpleRob) about ways to use this feature effectively.  In particular, he talks about how to write views to benefit from compile-time simplification, and avoid the views-on-views-on-views problem.  The following queries illustrate some of the join simplifications:

    -- Remove unnecessary joins
    SELECT
        th.ProductID,
        SUM(th.ActualCost)
    FROM Production.TransactionHistory AS th
    JOIN Production.Product AS p ON
        p.ProductID = th.ProductID
    GROUP BY
        th.ProductID;
    GO
    -- Outer join to join (null rejection)
    SELECT * 
    FROM Production.Product AS p
    LEFT JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    WHERE th.ProductID < 10;
    GO
    -- Complex example combining multiple simplifications
    WITH Complex AS
    (
        SELECT
            pc.ProductCategoryID, pc.Name AS CatName,
            ps.ProductSubcategoryID, ps.Name AS SubCatName,
            p.ProductID, p.Name AS ProductName,
            p.Color, p.ListPrice, p.ReorderPoint,
            pm.Name AS ModelName, pm.ModifiedDate
        FROM Production.ProductCategory AS pc
        FULL JOIN Production.ProductSubcategory AS ps ON
            ps.ProductCategoryID = pc.ProductCategoryID
        FULL JOIN Production.Product AS p ON
            p.ProductSubcategoryID = ps.ProductSubcategoryID
        FULL JOIN Production.ProductModel AS pm ON
            pm.ProductModelID = p.ProductModelID
    )
    SELECT c.ProductID, c.ProductName
    FROM Complex AS c
    WHERE c.ProductName LIKE N'G%';

    That last example, with three full outer joins in a CTE (in-line view), simplifies down to a simple index seek on one table.

    Cardinality Estimation

    The optimizer only has direct information about table cardinality (total number of rows) for base tables.  Statistics can provide additional histograms and frequency statistics, but again these are only maintained for base tables.  To help the optimizer choose between competing strategies later on, it is essential to know how many rows are expected at each node in the logical tree, not just at the GET leaf nodes.  In addition, cardinality estimates above the leaves can be used to choose an initial join order (assuming the query contains several joins).

    Cardinality estimation computes expected cardinality and distribution statistics for each node, working up from the leaves one node at a time.  The logic used to create these derived estimates and statistics uses a model that makes certain assumptions about your data.  For example, where the distribution of values is unknown, it is assumed to be uniform across the whole range of potential values.

You will generally get more accurate cardinality estimates if your queries fit the model used, and, as with many things in the wider optimizer, simple and relational is best.  Complex expressions, unusual or novel techniques, and using non-relational features often mean that cardinality estimation has to resort to guessing.
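
To illustrate the point (this example is mine, not from the original session), the first query below has a simple relational predicate that can be estimated from the statistics on the Name column; the second wraps the column in an expression, which typically forces the optimizer to fall back on a guess:

-- Estimated from statistics on Product.Name
SELECT COUNT_BIG(*)
FROM Production.Product AS p
WHERE p.Name LIKE N'[A-G]%';
GO
-- The expression on the column generally defeats the statistics,
-- so the selectivity estimate is a guess
SELECT COUNT_BIG(*)
FROM Production.Product AS p
WHERE UPPER(p.Name) LIKE N'[A-G]%';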

    Most of the choices made by the optimizer are driven by cardinality estimation, so if these are wrong, any good execution plans you might see are highly likely to be down to pure luck.

    Trivial Plan

The next stage of the optimizer pipeline is Trivial Plan, which is a fast path to avoid the time and effort involved in full cost-based optimization, and only applies to logical query trees that have a clear and obvious ‘best’ execution plan.  The details of which types of query can benefit from Trivial Plan change frequently, but things like joins, subqueries, and inequality predicates generally prevent this optimization.  The queries below show some examples where Trivial Plan can, and cannot, be applied:

    -- Trivial plan
    SELECT p.ProductID
    FROM Production.Product AS p
    WHERE p.Name = N'Blade';
    GO
    -- Still trivial
    SELECT
        p.Name,
        RowNumber =
            ROW_NUMBER() OVER (
                ORDER BY p.Name)
    FROM Production.Product AS p
    WHERE p.Name LIKE N'[A-G]%';
    GO
    -- 'Subquery' prevents a trivial plan
    SELECT (SELECT p.ProductID)
    FROM Production.Product AS p
    WHERE p.Name = N'Blade';
    GO
    -- Inequality
    SELECT p.ProductID
    FROM Production.Product AS p
    WHERE p.Name <> N'Blade';

    One interesting case where a Trivial Plan is not applied is where the estimated cost of the Trivial Plan query exceeds the configured ‘cost threshold for parallelism’.  The reasoning here is that any trivial plan that would qualify for a parallel plan suddenly presents a choice, and a choice means the query is no longer ‘trivial’, and we have to go through full cost-based optimization.  Taking this to the extreme, a SQL Server instance with the parallelism cost threshold set to zero will never produce a trivial plan (since all queries have a cost greater than zero).  At the risk of stating the slightly obvious, note that a plan produced at this stage will always be serial.  For experimentation only, the trivial plan stage can also be disabled using trace flag 8757.
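
For example, a quick (and purely experimental) way to see the effect is to combine that trace flag with the QUERYTRACEON hint shown earlier – with the flag in place, even the simple ‘Blade’ query goes through cost-based optimization:

-- Experimentation only: undocumented trace flag 8757 skips the trivial plan stage
SELECT p.ProductID
FROM Production.Product AS p
WHERE p.Name = N'Blade'
OPTION (RECOMPILE, QUERYTRACEON 8757);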

You can check whether a query used a trivial plan by clicking on the root node of the graphical query plan and inspecting the Properties window in SSMS.  The ‘optimization level’ property will show ‘Trivial’ or ‘Full’ – the latter shows that a trip through full cost-based optimization was required.  If a query results in a trivial plan, a physical execution plan is produced and the optimization process ends here.
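
If you prefer to check this without the graphical plan, one hedged alternative (assuming VIEW SERVER STATE permission) is to read the StatementOptmLevel attribute directly from the cached showplan XML:

-- Sketch: report the optimization level recorded in cached plans
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    QueryText = st.text,
    OptimizationLevel =
        qp.query_plan.value('(//StmtSimple/@StatementOptmLevel)[1]', 'nvarchar(20)')
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE N'%Blade%'
-- Exclude this diagnostic query itself
AND st.text NOT LIKE N'%dm_exec_cached_plans%';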

    End of Part 1

    The original slides and demo code are contained in the zip file attached below.  Links to other parts of this series: Part 2 Part 3 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Fun with Aggregates

    There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten.

    One possible query to meet that specification is:

    SELECT
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        p.ProductID = th.ProductID
    GROUP BY
        p.ProductID,
        p.Name
    HAVING
        COUNT_BIG(*) < 10;

    That query correctly returns 23 rows (execution plan and data sample shown below):

Execution Plan and Sample Results

    The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten.

    This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering:

    SELECT p.Name
    FROM
    (
        SELECT 
            th.ProductID, 
            cnt = COUNT_BIG(*) 
        FROM Production.TransactionHistory AS th 
        GROUP BY
            th.ProductID
    ) AS q1
    JOIN Production.Product AS p 
        ON p.ProductID = q1.ProductID
    WHERE
        q1.cnt < 10;

    Perhaps a little surprisingly, we get a slightly different execution plan:

Execution Plan

    The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above.
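
For reference, that last variation looks something like this (my transcription, not a script from the original post) – the predicate is stated inside the derived table, so no push-down is needed:

SELECT p.Name
FROM
(
    SELECT 
        th.ProductID, 
        cnt = COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th 
    GROUP BY
        th.ProductID
    HAVING
        COUNT_BIG(*) < 10
) AS q1
JOIN Production.Product AS p 
    ON p.ProductID = q1.ProductID;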

The reason this predicate can be pushed past the join in this query form, but not in the original formulation, is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive.

    Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID 
        HAVING COUNT_BIG(*) < 10
    );

    Unfortunately, this query produces an incorrect result (86 rows):

Query Results

    The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

    -- Scalar aggregate
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = 709;
     
    -- Vector aggregate
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = 709
    GROUP BY th.ProductID;

    The estimated execution plans for these two statements are almost identical:

Estimated Execution Plans

    You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away.

In fact, there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate.  This is something I would like to see exposed in showplan, so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime:

Query Results

    The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID 
        HAVING COUNT_BIG(*) BETWEEN 1 AND 9
    );

    The query now returns the correct 23 rows:

Query Results

    Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY th.ProductID
        HAVING COUNT_BIG(*) < 10
    );

    Not only do we now get correct results (23 rows), this is the execution plan:

Execution Plan

    I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
        HAVING COUNT_BIG(*) < 10
    );

I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that, when a vector aggregate is used, the sub-query returns no rows at all for a correlated value with no matching history – so nothing ‘EXISTS’ – whereas the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless of which ProductID is logically being evaluated.

    The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):

    WHERE Clause

    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE 
    (
        SELECT COUNT_BIG(*) 
        FROM Production.TransactionHistory AS th
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
    ) < 10;

    APPLY

    SELECT p.Name
    FROM Production.Product AS p
    CROSS APPLY
    (
        SELECT NULL
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
        HAVING COUNT_BIG(*) < 10
    ) AS ca (dummy);

    FROM Clause

    SELECT q1.Name 
    FROM
    (
        SELECT
            p.Name, 
            cnt = 
            (
                SELECT COUNT_BIG(*) 
                FROM Production.TransactionHistory AS th 
                WHERE th.ProductID = p.ProductID 
                GROUP BY ()
            )
        FROM Production.Product AS p
    ) AS q1 
    WHERE
        q1.cnt < 10;

    This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)

    SELECT q.Name 
    FROM
    (
        SELECT
            p.Name, 
            cnt = 
            (
                SELECT SUM(1)
                FROM Production.TransactionHistory AS th 
                WHERE th.ProductID = p.ProductID
            )
        FROM Production.Product AS p
    ) AS q
    WHERE q.cnt < 10;

Execution Plan

The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running, and getting it wrong can cause poor performance, wrong results, or both.

    © 2012 Paul White

    Twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Dynamic Seeks and Hidden Implicit Conversions

    Most people know that a LIKE predicate with only a trailing wildcard can usually use an index seek:

    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE N'D%';

    Index Seek on LIKE

    As the execution plan shows, SQL Server determines a covering range (which depends on the collation), seeks the string index using the range as the start and end points of a partial scan, and applies the original LIKE condition as a residual predicate to just the rows that match the initial seek operation.  Specifically, the Storage Engine seeks the index to locate rows in the covering range, and the Query Processor applies the residual predicate (the LIKE) to the rows it receives.

    Dynamic Seeks

    But what if the LIKE search term is in a variable?

    DECLARE @Like NVARCHAR(50) = N'D%'
     
    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE @Like;

    SQL Server can still perform a seek here, but it needs to determine the covering seek range for the search term at execution time, not at compilation time:

    Dynamic Index Seek

The plan now contains an extra Constant Scan, a Compute Scalar, and a Nested Loops Join.  These operators are interesting because they have zero cost estimates: no CPU, no I/O, nothing.  That’s because they are purely architectural: a workaround for the fact that SQL Server cannot currently perform a dynamic seek within the Index Seek operator itself.  To avoid affecting plan choices, this extra machinery is costed at zero.

    The Constant Scan produces a single in-memory row with no columns.  The Compute Scalar defines expressions to describe the covering seek range (using the runtime value of the @Like variable).  Finally, the Nested Loops Join drives the seek using the computed range information as correlated values.

The upper tooltip shows that the Compute Scalar uses three internal functions, LikeRangeStart, LikeRangeEnd, and LikeRangeInfo.  The first two functions describe the range as an open interval.  The third function returns a set of flags, encoded in an integer, that are used internally to define certain seek properties for the Storage Engine.  The lower tooltip shows the seek on the open interval described by the result of LikeRangeStart and LikeRangeEnd, and the application of the residual predicate ‘LIKE @Like’.
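
For intuition only – the LikeRange functions are internal and cannot be called from T-SQL – the dynamic seek behaves roughly like the hand-written range below.  The exact boundaries depend on the collation and on how SQL Server builds the open interval, so treat this purely as an approximation:

DECLARE @Like nvarchar(50) = N'D%';
 
SELECT p.Name
FROM Production.Product AS p
WHERE p.Name >= N'D'        -- approximate start of the covering range
AND p.Name < N'E'           -- approximate end of the covering range
AND p.Name LIKE @Like;      -- residual predicate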

    More Dynamic Seeks

    Something very similar occurs in plans that use IN or OR with variables:

    DECLARE 
        @1 INTEGER = 320,
        @2 INTEGER = 325,
        @3 INTEGER = 330
     
    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE 
        p.ProductID IN (@1,@2,@3);