THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Paul White: Page Free Space

Technical stuff about SQL Server - from the Kāpiti Coast, New Zealand

  • Dynamic Seeks and Hidden Implicit Conversions

    Most people know that a LIKE predicate with only a trailing wildcard can usually use an index seek:

    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE N'D%';

    Index Seek on LIKE

    As the execution plan shows, SQL Server determines a covering range (which depends on the collation), seeks the string index using the range as the start and end points of a partial scan, and applies the original LIKE condition as a residual predicate to just the rows that match the initial seek operation.  Specifically, the Storage Engine seeks the index to locate rows in the covering range, and the Query Processor applies the residual predicate (the LIKE) to the rows it receives.

    Dynamic Seeks

    But what if the LIKE search term is in a variable?

    DECLARE @Like NVARCHAR(50) = N'D%'
     
    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE @Like;

    SQL Server can still perform a seek here, but it needs to determine the covering seek range for the search term at execution time, not at compilation time:

    Dynamic Index Seek

    The plan now contains an extra Constant Scan,  a Compute Scalar and a Nested Loops Join.  These operators are interesting because they have zero cost estimates: no CPU, no I/O, nothing.  That’s because they are purely architectural: a workaround for the fact that SQL Server cannot currently perform a dynamic seek within the Index Seek operator itself.  To avoid affecting plan choices, this extra machinery is costed at zero.

    The Constant Scan produces a single in-memory row with no columns.  The Compute Scalar defines expressions to describe the covering seek range (using the runtime value of the @Like variable).  Finally, the Nested Loops Join drives the seek using the computed range information as correlated values.

    The upper tooltip shows that the Compute Scalar uses three internal functions, LikeRangeStart, LikeRangeEnd, and LikeRangeInfo.  The first two functions describe the range as an open interval.  The third function returns a set of flags encoded in an integer, that are used internally to define certain seek properties for the Storage Engine.  The lower tooltip shows the seek on the open interval described by the result of LikeRangeStart and LikeRangeEnd, and the application of the residual predicate ‘LIKE @Like’.

    More Dynamic Seeks

    Something very similar occurs in plans that use IN or OR with variables:

    DECLARE 
        @1 INTEGER = 320,
        @2 INTEGER = 325,
        @3 INTEGER = 330
     
    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE 
        p.ProductID IN (@1,@2,@3);

    Dynamic Seek With IN and OR

    Now we have three ranges: one for each of the variables in the original query.  The Compute Scalar operators again define three columns containing the start and end of the range, and the associated informational flags (previously seen as a result of the LikeRangeInfo function).  This time, we see the decimal representation of these flags, which happens to be 62 for an equality comparison.  The IN expands to (ProductID = @1 OR ProductID = @2 OR ProductID = @3), so each of the ‘ranges’ here is in fact a single value, so the start and end range values are the same in each Compute Scalar.

    The three dynamic ranges are concatenated, sorted (so any overlapping ranges appear next to each other in the stream) and the Merge Interval collapses these intervals into one or more disjoint (non-overlapping) ranges.  This is important, because the three variables might, for example, all contain the same value, and it would be incorrect to return that value three times.  Anyway, for each disjoint range produced, the Nested Loops Join drives a new seek of the Clustered Index.  The overall effect is that an arbitrary number of possibly overlapping ranges are computed, merged, and then used to drive one or more seek operations. The final result of the query will be the combination of all the seek results, as you would expect.

    Hidden Conversions

    The following example contains a table with DATETIME2 values, and a query with a expression that at first sight seems unlikely to be able to seek on an index (the variable is typed as DATE, and there is a CONVERT function applied to the DATETIME2 column):

    DECLARE @Example TABLE (date_time DATETIME2 PRIMARY KEY);
     
    INSERT @Example (date_time) 
    VALUES ('20110101 12:34:56');
     
    DECLARE @date DATE = '2011-01-01';
     
    SELECT * 
    FROM @Example AS e 
    WHERE 
        @date = CONVERT(DATE, e.date_time);

    Nevertheless, a query plan that uses a seek can be produced:

    GetRangeThroughConvert

    In this case, neither SSMS or Plan Explorer will show the contents of the Compute Scalar (this is probably just an oversight, rather than deliberate concealment!).  We have to open the XML form of the execution plan to see the three familiar expressions, wrapped in a Value Vector (just a fancy container for multiple expressions).

    Another internal function, GetRangeThroughConvert, is responsible for determining the the range of DATETIME2 values covered by the DATE variable @date, and the informational flags needed.  In the same way the engine works out covering ranges for some LIKE predicates, this function determines ranges where certain problematic type conversions are required.  Otherwise, the machinery is the same: a range description is defined by the Compute Scalar, and the Nested Loops Join driving a seek using those values.

    More Hidden Conversions

    There is another related internal function used when the Query Processor needs to determine a range for a comparison between different data types.  This example returns rows based on a greater-than-or-equal comparison between DATE column values and the DATETIME return value of the GETDATE() intrinsic function:

    DECLARE @Example TABLE (col1 date PRIMARY KEY)
    SELECT * FROM @Example AS e 
    WHERE e.col1 >= DATEADD(DAY, -7, GETDATE());

    GetRangeWithMismatchedTypes

    Again, the SSMS graphical plan and Plan Explorer cannot display the contents of the Value Vector, so we have to dig into the XML again.  The function evaluates the DATEADD(GETDATE()) expression, computes the open-interval start point of a DATE range accounting for the conversion from DATETIME to DATE, and specifies NULL as the end of the range (since this is a >= comparison, there is no end value).  The flags value in this case is 22 (the flags for a >= seek operation).

    Everything All At Once

    This last example features all sorts type sloppiness, resulting in an execution plan that uses GetRangeThroughConvert on the string expression and GetRangeThroughConvert on the result of GetRangeWithMismatchedTypes applied to the result of the GETDATE function.  The whole thing is then wrapped in a dynamic seek with the Merge Interval enforcing the (annoying) BETWEEN requirement that the first parameter must be less than or equal to the second.  See if you can work out all the conversions necessary for this query, using the rules of data type precedence.  It is really quite impressive that this example of lazy T-SQL coding results in an index seek, don’t you think?

    DECLARE @Example TABLE (col1 DATETIME PRIMARY KEY)
    SELECT * FROM @Example AS e
    WHERE CONVERT(DATE, e.col1) BETWEEN '20000101' AND GETDATE();

    Merge Interval With Everything

    Conclusion

    SQL Server works quite hard sometimes to produce index seeks where they might seem unlikely.  This is a good thing, and it would be great to see this capability extended further in future.  The downside is that this extra effort means you are less likely to see an Index Scan when you have done something daft with data types.

    Why is this a bad thing if you get a seek anyway?  The problem is that these hidden implicit conversions can result in inaccurate cardinality and distribution estimations at any stage of the plan.  So, even if you get a seek, the plan might be way off overall.  If that isn’t persuasive enough, consider this: will having hidden nested range calculations improve your chances of getting a good query plan?  Probably not, no.  Be very aware of types, and in particular of the types returned by functions and expressions.  If in doubt, use SELECT INTO to materialize the results of an expression or query, and check the types of the columns produced.


    Note: if you have any scripts that trawl the plan cache looking for implicit conversions (CONVERT_IMPLICIT), you might want to look into updating them to check for these conversions too.  Not all conversions are bad ones, of course :)

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    Further Reading:

    Implicit Conversions – Craig Freedman
    More on Implicit Conversions – Craig Freedman
    Join Performance, Implicit Conversions, and Residuals - Me

  • Forcing a Parallel Query Execution Plan

    This post is for SQL Server developers who have experienced the special kind of frustration, which only comes from spending hours trying to convince the query optimizer to generate a parallel execution plan.  This situation often occurs when making an apparently innocuous change to the text of a moderately complex query; a change which somehow manages to turn a parallel plan that executes in ten seconds, into a five-minute serially-executing monster.

    SQL Server provides a number of query and table hints that allow the experienced practitioner to take greater control over the final form of a query plan.  These hints are usually seen as a tool of last resort, because they can make code harder to maintain, introduce extra dependencies, and may prevent the optimizer reacting to future changes in indexing or data distribution.  One such query hint is the (over) popular OPTION (MAXDOP 1), which prevents the optimizer from considering plans that use parallelism.  Sadly, there is currently no corresponding hint to force the optimizer to choose a parallel plan.

    The result of all this is a great deal of wasted time trying increasingly obscure query syntax, until eventually the desired parallel plan is obtained, or the developer gives up in despair.  Even where success is achieved, the price is often fragile code that risks reverting to serial execution any time indexing or statistics change.  In any case, the resulting SQL is usually hard to read, and scary to maintain.

    Why Expensive Queries Produce Serial Plans

    Whenever the query optimizer produces a serial plan instead of the ‘obviously better’ parallel plan, there is always a reason.  Leaving aside the more obvious causes, such as the configuration setting max degree of parallelism being set to one, running under a Resource Governor workload group with MAX_DOP = 1, or having only one logical processor available to SQL Server, the usual causes of a serial plan are parallelism-inhibiting operations, cardinality estimation errors, costing model limitations, and code path issues.

    Parallelism-Inhibiting Components

    There are many things that prevent parallelism, either because they make no sense in a parallel plan, or because the product just does not support them yet.  Some of these force the whole plan to run serially, others require a ‘serial zone’ – a part of the plan that runs serially, even though other parts may use multiple threads concurrently.

    That list changes from version to version, but for example these things make the whole plan serial on SQL Server 2008 R2 SP1:

    • Modifying the contents of a table variable (reading is fine)
    • Any T-SQL scalar function (which are evil anyway)
    • CLR scalar functions marked as performing data access (normal ones are fine)
    • Random intrinsic functions including OBJECT_NAME, ENCYPTBYCERT, and IDENT_CURRENT
    • System table access (e.g. sys.tables)

    Inconveniently, the list of intrinsic functions is quite long and does not seem to follow a pattern.  ERROR_NUMBER and @@TRANCOUNT also force a serial plan, @@ERROR and @@NESTLEVEL do not.  The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table with a computed column that uses such a function will result in a serial plan, even if the problematic column is not referenced in the query.

    These query features are examples that require a serial zone in the plan:

    • TOP
    • Sequence project (e.g. ROW_NUMBER, RANK)
    • Multi-statement T-SQL table-valued functions
    • Backward range scans (forward is fine)
    • Global scalar aggregates
    • Common sub-expression spools
    • Recursive CTEs

    The information presented above is based on the original list published by Craig Freedman, and updated for 2008 R2.

    One way to check that a query does not have any parallelism-inhibiting components is to test the query using a CPU cost multiplier.  This should only be done on a private test system where you are able to flush the whole plan cache after testing.  The idea is to use an undocumented and unsupported DBCC command to temporarily increase the CPU cost of the query plan operators.  It is not a fool-proof test (some rare parallelizable queries will not generate a parallel plan with this technique) but it is quite reliable:

    DBCC FREEPROCCACHE
    DBCC SETCPUWEIGHT(1000)
    GO
    -- Query to test
    SELECT
        COUNT_BIG(*) 
    FROM Production.Product AS p 
    LEFT JOIN Production.TransactionHistory AS th ON 
        p.ProductID = th.ProductID
    GO
    DBCC SETCPUWEIGHT(1)
    DBCC FREEPROCCACHE

    The final commands to reset the CPU weighting factor and flush the plan cache are very important.

    If you get an parallel estimated plan for a particular test query, it shows that a parallel plan is at least possible.  Varying the value passed to the DBCC command adjusts the multiplier applied to normal CPU costs, so you will likely see different plans for different values.  The illustrated factor of a thousand is often enough to produce a parallel estimated plan, but you may need to experiment with higher values.  It is not recommended to use estimated plans obtained using this technique directly in USE PLAN hints or plan guides because these are not plans the optimizer would ever produce naturally.  To be clear, direct use of the plans would likely render a production system unsupported and the person responsible might be fired, shot, or possibly both.

    Cardinality Estimation Errors

    If there is nothing that absolutely prevents parallelism in the target query, the optimizer may still choose a serial alternative if it has a lower estimated cost.  For that reason, there are a couple of things we can do to promote the parallel option here, all based on the very sound notion of giving the optimizer accurate information to base its estimates on.  The considerations here go well beyond just ensuring statistics are up-to-date, or building them with the FULLSCAN option.  For example, depending on the nature of the query, you may need to provide all or some of the following:

    • Multi-column statistics (for correlations)
    • Filtered statistics or indexes (for more histogram steps)
    • Computed columns on filtering expressions in the query (avoid cardinality guesses)
    • Good constraint information (foreign keys and check constraints)
    • Materialize parts of the query in temporary tables (more accurate statistics in deep plans)
    • Regular hints such as OPTIMIZE FOR

    In general, anything you can do to ensure that estimated row counts are close to the runtime values will help the optimizer cost the serial and parallel alternatives more accurately.  Many failures to choose a parallel plan are caused by inaccurate row counts.

    Model Limitations

    SQL Server uses a model to estimate the runtime cost of each operator in a query plan.  The exact calculations vary between operators, but most are based on a minimum cost, with an additional per-row component.  None of these estimates of expected CPU and I/O cost take into account the specific hardware SQL Server finds itself running on.  The advantage of this is that plans from one machine can be readily reproduced and compared on another machine running the same version of the software, without having to worry about hardware differences.

    Not all operators can be costed reasonably, and things like functions are particularly problematic because the optimizer has no clue how many rows might be produced or what the distribution of values might look like.  Even very normal-looking operators can pose problems.  Consider the task of estimating the number of rows and distribution of values resulting from a join or a complex GROUP BY clause.  Even where reasonable estimates can be made, the derived statistics that propagate up the query tree (from the persistent statistics at the leaves) tend to become quickly less reliable.  The optimizer includes many heuristics that aim to prevent these inaccuracies getting out of control, so it might resort to complete guesses after only a few operators as the compounding effect of deriving new statistics takes hold.

    There are many other assumptions and limitations of the model that will not fit into a blog post, the interested reader can find more detailed information in Chapter 8 of the indispensable SQL Server 2008 Internals book.

    Costing Limitations

    When SQL Server costs a parallel plan, it generally reduces the CPU cost for a parallel iterator by the a factor equal to the expected runtime DOP.  For example the previous query can produce the following serial and parallel plans:

    image

    image

    Taking the Merge Join operator as an example, the parallel version has its CPU cost reduced by a factor of 4 when the expected runtime DOP is four (serial plan on the left, parallel on the right):

    image

    On the other hand, the Index Scans show no reduction in I/O cost, though the CPU cost is again reduced by a factor of four:

    image

    As mentioned earlier, different operators cost themselves differently (for example a many-to-many merge join also has an I/O cost component that also happens to be reduced by a factor of four).  These details also vary somewhat between releases, so the presentation here is to give you an appreciation of the general approach rather than to dwell too much on the specifics.

    Looking again at the serial and parallel plans, it is clear that which of the two plans costs cheaper depends on whether the parallel plan saves enough by reducing CPU and I/O costs in the various operators, to pay for the extra operators in the plan.  In this case, the extra operators are three exchange (parallelism) operators – two Repartition Streams to redistribute rows for correct results when joining, and one Gather Streams to merge the threads back to a single final result.

    The way the numbers work means that it is often a tight race between the best parallel and serial plan alternatives.  In many real-world cases, the difference between the two can be extremely small – making it even more frustrating when the serial version turns out to take fifty times as long as the parallel version to execute.  One other point worth mentioning again here is that the DOP estimate is limited to the number of logical processors that SQL Server sees, divided by two.  My test machine has eight cores, all available to SQL Server, but the DOP estimate used for costing calculations is limited to four.  This has obvious consequences for costing, where CPU and I/O costs are typically divided by the estimated DOP of four, rather than eight.

    A Note about Parallel Nested Loops

    Plans with Nested Loops joins can be a particular problem, because the inner side almost always runs multiple threads serially.  The parallelism icons are still present, but they indicate that there are DOP independent serial threads.  The distinction is perhaps a subtle one, but it (a) explains why operators that normally force a serial zone can run ‘in parallel’ on the inner side of a loops join; and (b) the optimizer does not reduce the CPU costs on the inner side by the estimated runtime DOP.  This puts nested loops at an unfair disadvantage when it comes to parallelism costing, compared with Hash and Merge Joins.

    Code Path Issues

    This last category concerns the fact that the optimizer may not get as far as evaluating a parallel plan at all.  One way this can occur is if a final plan is found during the Trivial Plan stage.  If a Trivial Plan is possible, and the resulting cost is less than the configured cost threshold for parallelism, the full optimization stages are skipped and a serial plan is returned immediately.

    Trivial Plan

    The following query has an estimated serial plan cost of around 85 units, but with the parallelism threshold set to 100 a Trivial Plan is produced (as shown in the plan property ‘Optimization Level’ or by checking the changes in sys.dm_exec_query_optimizer_info as shown below:

    SELECT
        deqoi.[counter],
        deqoi.occurrence
    FROM sys.dm_exec_query_optimizer_info AS deqoi 
    WHERE
        [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')
    GO
    SET SHOWPLAN_XML ON
    GO
    SELECT
        COUNT_BIG(*) 
    FROM dbo.bigTransactionHistory AS bth
    OPTION (RECOMPILE)
    GO
    SET SHOWPLAN_XML OFF
    GO
    SELECT
        deqoi.[counter],
        deqoi.occurrence
    FROM sys.dm_exec_query_optimizer_info AS deqoi 
    WHERE
        [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')

    image

    When the cost threshold is reduced to 84 we get a parallel plan…

    image

    A deeper analysis shows that the query still qualified for Trivial Plan (and the stage was run) but the final cost exceeded the parallelism threshold so optimization continued.  This query does not qualify for ‘search 0’ (TP or Transaction Processing) because a minimum of three tables are required there.

    So, optimization moves on to ‘search 1’ (Quick Plan) which runs twice.  It runs once considering only serial plans, and comes out with a best cost of 84.6181.  Since this exceeds the threshold of 84, Quick Plan is re-run with the parallel plan option enabled.  The result is a parallel plan at cost 44.7854.  The plan does not meet the entry conditions for ‘search 2’ (Full Optimization) so the finished plan is copied out.

    Good Enough Plan & Time Out

    Returning to code path reasons that prevent a parallel plan, the last category covers queries that enter the Quick Plan stage, but that stage terminates early, either with a Good Enough Plan Found message, or a Time Out.  Both of these are heuristics to prevent the optimizer spending more time optimizing than it stands to gain by reducing estimated execution time (cost).  Good Enough Plan results when the current lowest cost plan is so cheap that further optimization effort is no longer justified.

    Time Out is a related phenomenon: at the start of a stage, the optimizer sets itself a ‘budget’ of a number of rule applications it estimates it can perform in the time justified by the initial cost of the plan.  This means that query trees that start with a higher cost get a correspondingly larger allowance of rule applications (roughly comparable to the number of moves a chess program thinks ahead).  If the optimizer explores the allowed number of rules before the natural end of the optimization stage, it returns the best complete plan at that point with a Time Out message.  This may well occur during the first run of ‘search 1’, preventing us reaching the second run that adds parallelism.

    One interesting consequence of the rule concerning Trivial Plan and the cost threshold for parallelism is that a system configured with a threshold of zero can never produce a Trivial Plan.  Bearing this in mind, we can generate a surprising Time Out with this query:

    SELECT * FROM Production.Product AS p

    As you would expect, this query is normally optimized using Trivial Plan (there are no cost-based plan choices here):

    image

    …but when the cost threshold is set to zero, we get Full Optimization with a Time Out…the optimizer timed out working out how to do SELECT * from a single table!

    image

    In this particular case, the optimizer ‘timed out’ after 15 tasks (it normally runs through many thousands).  A Time Out result can sometimes also be an indicator that the input query is over-complex, but the interpretation is not at all that straightforward.

    The Solution

    We need a robust query plan hint, analogous to MAXDOP, that we can specify as a last resort when all other techniques still result in a serial plan, and where the parallel alternative is much to be preferred of course.  I really want to emphasise that very many cases of unwanted serial plans are due to designers and developers not giving the optimizer good quality information.  I see very few systems with things like proper multi-column statistics, filtered indexes/statistics, and adequate constraints.  Even less frequently, do I see (perhaps non-persisted) computed columns created on query filter expressions to aid cardinality estimation.  On the other hand, non-relational database designs with poor indexing, and decidedly non-relational queries are extremely common.  (As are database developers complaining about the poor decisions the optimizer makes sometimes!)

    There’s always a Trace Flag

    In the meantime, there is a workaround.  It’s not perfect (and most certainly a choice of very last resort) but there is an undocumented (and unsupported) trace flag that effectively lowers the cost threshold to zero for a particular query.  It actually goes a little further than that; for example, the following query will not generate a parallel plan even with a zero cost threshold:

    SELECT TOP (1)
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    ORDER BY
        p.Name

    image

    This is a completely trivial query plan of course – the first row from the scan is joined to a single row from the seek.  The total estimated cost of the serial plan is 0.0065893.  Returning the cost threshold for parallelism to the default of 5 just for completeness, we can obtain a parallel plan (purely for demonstration purposes) using the trace flag:

    SELECT TOP (1)
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    ORDER BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8649)

    image

    The parallel alternative is returned, despite the fact it costs much higher at 0.0349929 (5.3 times the cost of the serial plan).  In my testing, this trace flag has proved invaluable in certain particularly tricky cases where a parallel plan is essential, but there is no reasonable way to get it from the standard optimizer.

    Conclusion

    Even experts with decades of SQL Server experience and detailed internal knowledge will want to be careful with this trace flag.  I cannot recommend you use it directly in production unless advised by Microsoft, but you might like to use it on a test system as an extreme last resort, perhaps to generate a plan guide or USE PLAN hint for use in production (after careful review).

    This is an arguably lower risk strategy, but bear in mind that the parallel plans produced under this trace flag are not guaranteed to be ones the optimizer would normally consider.  If you can improve the quality of information provided to the optimizer instead to get a parallel plan, go that way :)

    If you would prefer to see a fully supported T-SQL OPTION (MINDOP) or OPTION (PARALLEL_PLAN) hint, please vote here:

    https://connect.microsoft.com/SQLServer/feedback/details/714968/provide-a-hint-to-force-generation-of-a-parallel-plan

    © 2011 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • SQL Server Optimizer Bug with JOIN and GROUP BY

    I came across a SQL Server bug recently that made me wonder how on earth I never noticed it before.  As the title of this post suggests, the bug occurs in common JOIN and GROUP BY queries, and while it does not cause incorrect results to be returned, it will often cause a poor query plan to be selected by the optimizer.  If you are just interested in the bug itself, you will find a description and Connect item toward the end of this post.  As the regular reader will be expecting from me though, I am going to work up to it with a bit of background…

    Example Query

    Using everyone’s favourite Adventure Works database, here’s a simple query that includes a JOIN and a GROUP BY:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 1, RECOMPILE)

    One perfectly reasonable query plan possibility looks like this (estimated row counts shown):

    image

    After the Merge Join, rows go on to be grouped and counted.  This simple plan has an estimated cost of 1.1 units on my system.  The optimizer has chosen a Merge Join because indexes exist on the join column to provide sorted inputs, and the foreign key relationship between these two tables means that the join will be the efficient one-to-many type.  A Hash Match Aggregate is estimated to be cheaper than the alternative Sort and Stream Aggregate combination in this plan.  For reasons that will become apparent shortly, note that the aggregate computes the following expression (a simple count of all the rows):

    [Expr1004] = Scalar Operator(COUNT(*))

    Partial Aggregates

    The query plan shown above is not the one selected by the optimizer, however.  One of the tricks available to it is to move most of the aggregation work before the join.  The idea is that reducing row counts as early as possible in a plan generally pays off.  The plan actually selected by the optimizer is this one (estimated row counts again):

    image

    The new Stream Aggregate below the join computes the count of rows from the history table grouped by product id.  There are 441 unique product id values in the history table, so 441 product ids (and row counts) flow on to be joined with the 504 rows from the product table.  Note that the estimate of 441 groups exactly matches reality.

    With a smaller number of estimated rows coming out of the join, the optimizer chooses a Sort and Stream Aggregate combo instead of the Hash Match Aggregate seen previously.  To get the correct query result, the second aggregate uses SUM to add the per-product row counts together.  The estimated cost of this plan is 0.35 units according to SQL Server’s costing model, so you can see why the optimizer prefers it over the previous plan with a cost of 1.1 units.

    Clicking on the new aggregate below the join and looking at its properties, we see that it groups by product id and computes a partial aggregate expression:

    [partialagg1005] = Scalar Operator(Count(*))

    In this case, the expression label is the only way to identify this as a partial aggregate – there is no other more obvious indication.  The expression computed by the second Stream Aggregate (the one after the Sort) is:

    [Expr1004] = Scalar Operator(SUM([partialagg1005]))

    This aggregate is a Global aggregate: it computes the correct overall result based on partial results calculated earlier in the plan.  To summarize, the single COUNT in the original plan has been replaced with a partial aggregate COUNT, followed by a global aggregate SUM of those counts.

    Local and Global Aggregation

    This idea of computing part of an aggregate early and combining the partial results later on originated in parallel execution plans.  To illustrate the point, take a look at the following query and execution plan:

    SELECT 
        COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    [You will need to set the instance’s cost threshold for parallelism configuration value to zero in order to get a parallel plan for this query.]

    image

    The parallel aggregate computes the following partial aggregate expression per thread:

    [partialagg1003] = Scalar Operator(Count(*))

    The serial aggregate combines the partial aggregates using this expression:

    [Expr1002] = Scalar Operator(SUM([partialagg1003]))

    In parallel plans, the partial aggregate is more often referred to as a local aggregate, since it computes an aggregate that is local to the thread it is running on.  With four threads, there will be four local aggregations, which are combined to form the final result by the global aggregate, running on a single thread in this example.

    One other fact that will be important later on: notice that the plan above estimates that four rows will be produced by the parallel Stream Aggregate operator – one for each thread.  This plan was generated on a computer with eight processors available to SQL Sever.  The optimizer always estimates that half the processors will be available for parallel execution – this is a heuristic, and not one that makes an enormous amount of sense to me intuitively, but there you go, it is how it is.

    Partial or Local?

    To illustrate how inconsistent SQL Server is with the terms ‘partial’ and ‘local’:

    SELECT
        COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.Quantity
    OPTION (RECOMPILE)

    image

    The idea is the same, but the labels are different: notice that the hash aggregate now has an explicit Partial Aggregate label in the query plan.  You will only see this helpful change in parallel plans that use a Hash Partial (= Local) Aggregate (there is a sort of reason for this, explained in the next section on memory grants).

    The parallel hash aggregate is computing COUNT(*) again, but now it is grouping by Quantity.  The query plan needs to do a bit more work to get a correct result, because as it stands rows from the index scan are distributed unpredictably between threads, so each thread can see rows from any or all Quantity groups.  The aggregate is still ‘partial’ but this time it is because each thread only sees part of the full row set.

    To get the right answer (according to the semantic of the SQL query) SQL Server goes on to repartition the rows among new threads using Hash Partitioning on the Quantity column.  This step ensures that rows with the same Quantity value always end up on the same thread.  To be clear, it says nothing about how many Quantity groups go to each thread, just that the repartitioning step guarantees that rows associated with a particular Quantity value will all end up on the same thread.  For efficiency, we certainly hope that the partitioning results in a good distribution of groups among threads, but that is a separate issue completely.

    Each Stream Aggregate (one per thread) can now safely compute a SUM of the partial aggregate results it receives on its thread, though it still has to group by Quantity since one thread may receive more than one Quantity group.  The final Stream Aggregate is still a Global Aggregate, since it uses local (= partial) aggregate results, even though it executes in parallel (if you prefer, the result each thread’s Global Aggregate produces is globally valid for output).

    Memory Grants

    Another interesting thing about the Hash Match Partial Aggregate is that it never acquires more than the very small amount of memory necessary to create a minimally-size hash table.  If the aggregate runs out of memory while processing, it simply stops aggregating, and passes individual rows instead.  The global aggregate will still produce the correct result, but the query won’t be as efficient (for example, more rows have to move across the Repartition Streams exchange).  The partial aggregate is a performance optimization, it is not required for correctness (and note that a partial Stream Aggregate never needs a memory grant, so the same issue does not arise).

    Optimizer Rules

    The query optimizer rule that explores splitting ‘ordinary’ aggregates into local and global parts is called GenLGAgg for “Generate Local and Global Aggregate”.  In parallel plans, whether this transformation comes out cheaper depends on whether the plan saves enough cost (for example by reducing the number of rows that flow across the repartitioning exchange) to pay for the overhead of the extra aggregate operation.

    In serial plans, there are no exchanges (by definition) so this transformation needs to find another way to pay for itself.  The way we saw earlier, is to push the partial aggregate below a join such that it reduces the cost of that join (and later operations) enough to at least pay for itself.  There is a second optimizer rule to explore the option of pushing a local aggregate below a join; its name is LocalAggBelowJoin.  Note how both rules refer to the term Local, as in ‘local to a thread’.

    The Bug Revealed

    Here’s our original query and the execution plan the optimizer selected again:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 1, RECOMPILE)

    image

    Now here’s the plan produced for exactly the same query, with a MAXDOP 2 hint instead of MAXDOP 1:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 2, RECOMPILE)

    image

    The estimate of the number of rows emitted by the partial aggregate has doubled, from 441 rows (exactly right, remember) to 882.  If we specify MAXDOP 3, the row estimate trebles to 1323.  At MAXDOP 4, the estimate is 1764.  After MAXDOP 4, no increase occurs on my test machine because it has eight processors, and the optimizer estimates a maximum runtime DOP of half the number of processors available for parallel queries, as noted earlier.  If you have 64 processors, you could get the cost to multiply by 32…and so on.  To be absolutely clear about this, all these MAXDOP options still result in serial execution plans.

    Yes, but I don’t use MAXDOP hints

    Neither do I, much – at least not other than specifying one or zero.  Anyway, let’s see what happens on the same machine without any hints at all:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class

    image

    Yikes!  The correct estimate of 441 rows has been multiplied by 4 – the maximum possible on this 8-core machine, exactly as if we had specified (MAXDOP 4) to (MAXDOP 8) or even (MAXDOP 0).  Setting the server-wide max degree of parallelism configuration value to 1 (or using Resource Governor to do the same thing) will result in a correct estimate of 441 rows again (unless MAXDOP is specified).  You will also get correct estimates from the partial aggregate if SQL Server is running on a machine with only one processing unit, or if the affinity mask is set such that SQL Server can only see one processor.  See the pattern?

    Cause & Effects

    In short, the cardinality estimation problem comes down to the partial aggregate (and only the partial aggregate) costing itself as if it were running in a parallel plan.  In a parallel plan, each instance of the partial aggregate would produce the estimated number of rows per thread – so for a plan estimated to execute at DOP 4, we would get a (correct) multiplier effect on the estimate.  For a serial plan, this is just plain wrong.

    In the plan above, we are lucky that cardinality estimation for the merge join recognises that it cannot produce more rows than the 504 contained in the product table because of the foreign key relationship.  Nevertheless, the estimates for every operator after the problematic aggregate are still incorrect.

    In other plans, this effect can completely change the plan choice above a partial aggregate.  As an example, here is the SQL Server 2012 query from my last post, with the MAXDOP 1 hint:

    SELECT
        p.ProductModelID,
        COUNT_BIG(*),
        COUNT_BIG(DISTINCT th.TransactionDate)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON 
        th.ProductID = p.ProductID
    GROUP BY p.ProductModelID
    OPTION (RECOMPILE, MAXDOP 1)

    image

    This gives us the efficient new query transformation present in SQL Server 2012 (the middle hash aggregate is a partial one).  The exact same query, at MAXDOP 3 or above (or without a MAXDOP hint at all on a machine with 6 cores or more):

    image

    Now, we get the horribly inefficient old Eager Spool plan!

    The optimizer still considers both alternatives, but the estimated plan costs explain the choice it makes: the good plan has an estimated cost of 2.8 at MAXDOP 1 (that is, without the partial aggregate costing error).  At MAXDOP 2, the good plan’s cost increases to 3.45 units.  The estimated cost of the Eager Spool plan is 3.5 units, regardless of the MAXDOP setting (exactly as it should be for a serial plan), so you can see that the spool plan becomes the apparently cheaper option at MAXDOP 3.  (I’m sure it doesn’t need saying again, but the MAXDOP hinted queries here all still produce non-parallel plans.)

    The estimated I/O and CPU costs of the partial aggregate operator are also costed as if it were running in parallel at the estimated available DOP, further adding to the costing errors of the plan alternative.  Moreover, the cardinality estimation error will tend to propagate up the plan tree – increasing the costs of upstream operators, and causing larger memory grants to be requested than should be estimated to be necessary.

    Overall, the effect is that you are very likely to receive the wrong plan for your query.  The consequences, for non-trivial Adventure Works queries at any rate, could be severe.  Look out for these estimation errors in any serial plan that features a partial aggregate.  Queries that include a JOIN and a GROUP BY are likely candidates for the GenLGAgg and LocalAggBelowJoin rules to introduce a partial aggregate – and remember that in serial plans there is no way to know that an aggregate is partial without checking the details of the expressions it computes…

    The Connect item for this bug can be found here:
    https://connect.microsoft.com/SQLServer/feedback/details/711685/costing-cardinality-errors-with-partial-aggregates-in-serial-execution-plans

    This issue reproduces on all versions of SQL Server from 2005 to 2012 RC0 with TF4199 enabled for optimizer fixes (the behaviour on 2005 is even more erratic than described above).


    Update 8 Dec 2011: Thanks everyone for the votes, Microsoft have responded by confirming the bug. They plan to fix it sometime next year.

    © 2011 Paul White

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Is Distinct Aggregation Still Considered Harmful?

    Back in 2008, Marc Friedman of the SQL Server Query Processor Team wrote a blog entry entitled “Distinct Aggregation Considered Harmful”, in which he shows a way to work around the poor performance that often results simply from adding the keyword DISTINCT to an otherwise perfectly reasonable aggregate function in a query.  This post is an update to that, presenting a query optimizer enhancement in SQL Server 2012 RC0 that reduces the need to perform the suggested rewrite manually.  First, though, it makes sense for me to cover some broader aggregation topics in general.

    Query Plan Options for DISTINCT

    There are two main strategies to extract distinct values from a stream of rows.  The first option is to keep track of unique values in a hash table.  The second is to sort the incoming rows into groups and return one value from each group.  SQL Server uses the Hash Match operator to implement the hash table option, and Stream Aggregate or Sort Distinct for the sorting option.

    Hash Match Aggregate

    The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s).  Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism.  Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group).  Hash matching does not require or preserve the order of the incoming row stream.  The query and plan below show a Hash Match Aggregate building a hash table on values in the Quantity column:

    SELECT DISTINCT 
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    The standard Hash Match Aggregate is blocking: it produces no output until it has processed its entire input stream.  If we restrict the number of rows using TOP or SET ROWCOUNT the aggregate can operate in a streaming fashion, producing new unique values as soon as they are encountered.  This streaming mode is known as Flow Distinct, and is activated depending on the estimated number of unique values in the stream.  In the example above, the estimated number of output rows from the Hash Match Aggregate is 455 (based on the statistics on my test machine).  Limiting the query to 455 or fewer rows using TOP or ROWCOUNT produces a Hash Match Aggregate running in Flow Distinct mode:

    SET ROWCOUNT 10
     
    SELECT DISTINCT 
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)
     
    SET ROWCOUNT 0

    image

    This plan is interesting because it limits the output to 10 rows without including a specific TOP operator for that purpose.  TOP is generally preferred to ROWCOUNT for the reasons set out in the Books Online topic “Limiting Result Sets by Using TOP and PERCENT”.  Ignore the part that says “because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query” – we have just seen how it can affect a query plan.

    For completeness, this is the TOP query and execution plan:

    SELECT DISTINCT TOP (10)
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    Stream Aggregate

    Performing a DISTINCT is the same as applying a GROUP BY on every expression in the SELECT list.  A stream of rows sorted on these GROUP BY expressions has the useful property that all rows with the same GROUP BY values appear together, so all we need to do is output a row of GROUP BY keys each time a change occurs.  If the required order can be obtained without an explicit sort, the query plan can use a Stream Aggregate directly:

    SELECT DISTINCT 
        th.ProductID 
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    This query plan shows an ordered scan of an index on the ProductID column, followed by a Stream Aggregate with a GROUP BY on the same column.  The Stream Aggregate emits a new row each time a new group is encountered, similar to the Hash Match Aggregate running in Flow Distinct mode.  This operator does not require a memory grant because it only needs to keep track of one set of GROUP BY expression values at a time.

    Sort Distinct

    The third alternative is to perform an explicit Sort followed by a Stream Aggregate.  The optimizer can often collapse this combination into a single Sort operating in Sort Distinct mode:

    SELECT DISTINCT
        p.Color
    FROM Production.Product AS p 
    OPTION (RECOMPILE)

    image

    To see the original Sort and Stream Aggregate combination, we can temporarily disable the GbAggToSort optimizer rule that performs this transformation:

    DBCC RULEOFF('GbAggToSort')
     
    SELECT DISTINCT
        p.Color
    FROM Production.Product AS p 
    OPTION (RECOMPILE)
     
    DBCC RULEON('GbAggToSort')

    The query plan now shows a regular (non-distinct) Sort followed by the Stream Aggregate:

    image

    Distinct Aggregates

    The DISTINCT keyword is most commonly used with the COUNT and COUNT_BIG aggregate functions, though it can be specified with a variety of built-in and CLR aggregates.  The interesting thing is that SQL Server always processes distinct aggregates by first performing a DISTINCT (using any of the three methods shown previously) and then applying the aggregate (for example COUNT) as a second step:

    SELECT 
        COUNT_BIG(DISTINCT p.Color) 
    FROM Production.Product AS p 
    OPTION (RECOMPILE)

    image

    One thing worth highlighting is that COUNT DISTINCT does not count NULLs.  The previous query that listed the distinct colours from the Product table produced 10 rows (9 colours and one NULL), but the the COUNT DISTINCT query here returns the value 9.  The query plan uses a Distinct Sort to perform the DISTINCT (which includes the NULL group) and then counts the groups using a Stream Aggregate.  The aggregate expression uses COUNT(expression) rather than COUNT(*), which eliminates the NULL group produced by the Distinct Sort.  This second example shows a Hash Match Aggregate performing the DISTINCT, followed by a Stream Aggregate to count the non-NULL groups:

    SELECT
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    So far, the counting step has always been performed by a Stream Aggregate.  This is because it has been a scalar aggregate – an aggregate without a GROUP BY clause – and SQL Server always implements scalar aggregates using Stream Aggregate (there would be no point using hashing since there will only be one group by definition).  If we add a GROUP BY clause, the final COUNT is no longer a single (scalar) result, so we can get a plan with two Hash Match Aggregates:

    SELECT
        th.ProductID,
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.ProductID
    OPTION (RECOMPILE)

    image

    In this plan, the rightmost aggregate is performing the DISTINCT, and the leftmost one implements the COUNT (with GROUP BY).

    Combining Aggregates

    The decision to always implement distinct aggregates as separate DISTINCT and aggregate steps makes life easier for the optimizer.  Separating the two operations makes it easier to plan and cost parallelism, partial aggregation, combining similar aggregates, moving aggregates around (for example pushing an aggregate below a join) and so on.  On the downside, it creates problems for queries that include multiple distinct aggregates, or combine a single distinct aggregate with an ‘ordinary’ aggregate.

    Multiple Aggregates

    As you probably know, there is no problem combining multiple ordinary aggregations into a single operator:

    SELECT
        COUNT_BIG(*),
        MAX(th.ActualCost),
        STDEV(th.Quantity)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.TransactionType

    image

    The single Hash Match Aggregate operator computes the three separate aggregates concurrently on the incoming stream.

    Multiple Distinct Aggregates

    This next query does a similar thing, but with three DISTINCT aggregations on the stream:

    SELECT 
        COUNT_BIG(DISTINCT th.ProductID), 
        COUNT_BIG(DISTINCT th.TransactionDate), 
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    It produces a more complex query plan:

    image

    The issue here is that once a DISTINCT has been performed on a stream, the stream no longer contains the columns necessary to perform the other DISTINCT operations.  To work around this, the optimizer builds a plan that reads the source stream once per DISTINCT, computes the aggregates separately, and then combines the results using a cross join (which is safe because these are all scalar aggregates, guaranteed to produce one row each).  The same basic pattern is employed if the query contains an outer GROUP BY clause, but instead of cross joins there will be inner joins on the GROUP BY columns.

    More often than not, the source of rows will not be an unrestricted table scan.  Where the source is complex (and therefore expensive to re-run for every distinct aggregate) or where a filter significantly reduces the number of rows, the query optimizer may choose to Eager Spool the source rows and replay them once per distinct aggregate:

    SELECT 
        COUNT_BIG(DISTINCT th.ProductID), 
        COUNT_BIG(DISTINCT th.TransactionDate), 
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    WHERE
        th.ActualCost < $5
    GROUP BY
        th.TransactionType
    OPTION (RECOMPILE)

    image

    This is a plan shape that you are likely to encounter in the real world, since most queries will likely have a filtering condition or have a row source that is more complex than a simple scan of an index or table.  For queries over larger tables than Adventure Works provides, this plan is likely to perform poorly, as you might expect.  Aside from the obvious concerns, inserting rows into a spool has to be performed on a single thread (like all data modification operations in today’s SQL Server).  Another limitation is that the spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams).  In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.

    Performing Multiple Distinct Aggregates Concurrently

    If SQL Server were able to perform the whole COUNT DISTINCT aggregate in a single operator (instead of splitting the DISTINCT and COUNT into two steps) there would be no reason to split plans with spools as seen above.  This could not be done with a Stream Aggregate since that operator requires the stream to be sorted on the DISTINCT expression, and it is not possible to sort a single stream in more than one way at the same time.

    On the other hand, the Hash Match Aggregate does not require sorted input (it keeps the distinct values in a hash table, remember), so it should be possible to design a Hash Match Aggregate that computes COUNT DISTINCT in a single operation.  We can test this idea with a User-Defined Aggregate (UDA) – SQL Server 2008 required:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using System.IO;
    using Microsoft.SqlServer.Server;
     
    [SqlUserDefinedAggregate
        (
        Format.UserDefined,
        IsInvariantToDuplicates = true,
        IsInvariantToNulls = true,
        IsInvariantToOrder = true,
        IsNullIfEmpty = false,
        MaxByteSize = -1
        )
    ]
    public struct CountDistinctInt : IBinarySerialize
    {
        // The hash table
        private Dictionary<int, object> dict;
     
        // Recreate the hash table for each new group
        public void Init()
        {
            dict = new Dictionary<int, object>();
        }
     
        // Ignore NULLs, store key values in the hash table
        public void Accumulate(SqlInt32 Data)
        {
            if (!Data.IsNull)
            {
                dict[Data.Value] = null;
            }
        }
     
        // Merge partial aggregates
        public void Merge(CountDistinctInt Group)
        {
            foreach (var item in Group.dict.Keys)
            {
                dict[item] = null;
            }
        }
     
        // Return the DISTINCT COUNT result
        public int Terminate()
        {
            return dict.Count;
        }
     
        // Required by SQL Server to serialize this object
        void IBinarySerialize.Write(BinaryWriter w)
        {
            w.Write(dict.Count);
     
            foreach (var item in dict.Keys)
            {
                w.Write(item);
            }
        }
     
        // Required by SQL Server to deserialize this object
        void IBinarySerialize.Read(BinaryReader r)
        {
            var recordCount = r.ReadInt32();
            dict = new Dictionary<int, object>(recordCount);
     
            for (var i = 0; i < recordCount; i++)
            {
                dict[r.ReadInt32()] = null;
            }
        }
    }

    This UDA does nothing more than create a hash table, add (non-NULL) values to it, and return the count of values when aggregation is complete (the hash table takes care of duplicates).  There is a little extra infrastructure code in there to allow SQL Server to serialize and reconstruct the hash table when needed (and merge partial aggregates) but the core of the function is just four lines of code.  The example above only aggregates integer values, but it is easy to extend the idea to include other types.  Anyway, armed with the integer and date time versions of the UDA, I now return to the multiple-distinct-count query that caused all the spooling, with COUNT DISTINCT replaced by UDA references:

    SELECT 
        dbo.CountDistinctInt(th.ProductID), 
        dbo.CountDistinctDateTime(th.TransactionDate), 
        dbo.CountDistinctInt(th.Quantity)
    FROM Production.TransactionHistory AS th
    WHERE
        th.ActualCost < $5
    GROUP BY
        th.TransactionType
    OPTION (RECOMPILE)

    Instead of all the Eager Spools, we now get this query plan:

    image

    You may be surprised to see that the three distinct count aggregates are being performed by a Stream Aggregate; after all I just finished explaining why a Stream Aggregate could not possibly do what we want here.  The thing is that all CLR UDAs are interfaced to query plans using the Stream Aggregate model.  The fact that this UDA uses a hash table internally does not change that.

    The Sort in this plan is there to ensure that groups of rows arrive at the ‘Stream Aggregate’ interface in the required GROUP BY order.  This is so SQL Server knows when to call the Init() and Terminate() methods on our UDA.  The COUNT DISTINCT aggregation that is happening inside the UDA for each group could not care less about ordering, of course.  (In case you were wondering, yes, the UDA produces the same results as the original T-SQL code).

    The point here is to demonstrate that multiple DISCOUNT COUNT operations can be performed within a single operator, not that UDAs are necessarily always a great way to do that in general.  As far as performance is concerned, the original spool query runs in around 220ms, and the UDA settles down around 160ms which isn’t bad, all things considered.

    We can also improve the performance of the T-SQL query by rewriting it to avoid the spools by scanning the source table three times (this executes in around 75ms).  Part of the problem here (to go with the spool plan issues mentioned earlier) is that the optimizer assumes that all queries start executing with an empty data cache, and it does not account for the fact that the three scans complete sequentially, so the pages are extremely likely to be available from memory for the second and third scans.  The rewrite and query plan are below.

    WITH 
        Stream AS 
        (
            SELECT 
                th.TransactionType, 
                th.ProductID, 
                th.TransactionDate, 
                th.Quantity 
            FROM Production.TransactionHistory AS th 
            WHERE 
                th.ActualCost < $5
        ),
        CountDistinctProduct AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT ProductID) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        ),
        CountDistinctTransactionDate AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT TransactionDate) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        ),
        CountDistinctQuantity AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT Quantity) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        )
    SELECT
        p.c,
        d.c,
        q.c
    FROM CountDistinctProduct AS p
    JOIN CountDistinctTransactionDate AS d ON
        d.TransactionType = p.TransactionType
    JOIN CountDistinctQuantity AS q ON
        q.TransactionType = d.TransactionType

    image

    Combining a Single Distinct Aggregate with other Aggregates

    Marc Friedman’s blog post presented a way to rewrite T-SQL queries that contain a single distinct aggregate and one or more non-distinct aggregates so as to avoid spools or reading the source of the rows more than once.  The essence of the method is to aggregate first by the GROUP BY expressions in the query and the DISTINCT expressions in the aggregate, and then to apply some relational math to aggregate those partial aggregates to produce the final result.  I encourage you to read the full post to see all the detail, but I will quickly work through an example here too:

    SELECT
        dp.EnglishProductName,
        SUM(frs.SalesAmount),
        COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
    FROM dbo.FactResellerSales AS frs
    JOIN.dbo.DimProduct AS dp ON
        frs.ProductKey = dp.ProductKey
    GROUP BY
        dp.EnglishProductName
    OPTION (MAXDOP 1, RECOMPILE)

    This query contains a regular SUM aggregate and a COUNT DISTINCT, so as expected the query optimizer produces a plan with an Eager Spool (click to enlarge):

    image

    To the left of the spool, the top branch performs the DISTINCT followed by the COUNT per group, and the spool replay on the lower branch of the plan computes the SUM per group.  Finally, the two branches join on the common GROUP BY expression (EnglishProductName).

    The rewrite starts by grouping on EnglishProductName (the GROUP BY expression) and SalesTerritoryKey (the DISTINCT expression) to produce a partial aggregate:

    SELECT 
        dp.EnglishProductName, 
        frs.SalesTerritoryKey, 
        SUM(frs.SalesAmount) AS ssa
    FROM dbo.DimProduct AS dp
    JOIN dbo.FactResellerSales AS frs ON 
        frs.ProductKey = dp.ProductKey
    GROUP BY 
        frs.SalesTerritoryKey, 
        dp.EnglishProductName

    This query contains no distinct aggregates, so we get a plan with an ordinary join and and ordinary SUM aggregate:

    image

    To produce the results specified by the original query, we now need to SUM the partial SalesAmount sums, and COUNT (ignoring NULLs) the SalesTerritoryKey values.  The final rewrite looks like this:

    WITH PartialAggregate AS
    (
        SELECT 
            dp.EnglishProductName, 
            frs.SalesTerritoryKey, 
            SUM(frs.SalesAmount) AS ssa
        FROM dbo.DimProduct AS dp
        JOIN dbo.FactResellerSales AS frs ON 
            frs.ProductKey = dp.ProductKey
        GROUP BY 
            frs.SalesTerritoryKey, 
            dp.EnglishProductName
    )
    SELECT
        pa.EnglishProductName, 
        SUM(pa.ssa),
        COUNT_BIG(pa.SalesTerritoryKey)
    FROM PartialAggregate AS pa
    GROUP BY pa.EnglishProductName
    OPTION (RECOMPILE)

    The query plan adds another layer of aggregation on top of the partial aggregate plan above:

    image

    This plan avoids the Eager Spools seen earlier and improves execution time from 320ms to 95ms in this example.  On larger sets, where parallelism becomes important and the spools might need to use physical tempdb storage, the gains are likely to be much larger.

    New for SQL Server 2012 RC0

    The good news is that SQL Server 2012 RC0 adds a new optimizer rule (ReduceForDistinctAggs) to perform this rewrite automatically on the original form of the query.  This is particularly good because the rewrite, while ingenious, can be somewhat inconvenient to do in practice, and all too easy to get wrong (particularly ensuring that NULL partially-aggregated groups are handled correctly).  The new optimizer rule was not available in CTP3, so you will need RC0 to see SQL Server turn this T-SQL:

    SELECT
        dp.EnglishProductName,
        SUM(frs.SalesAmount),
        COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
    FROM dbo.FactResellerSales AS frs
    JOIN.dbo.DimProduct AS dp ON
        frs.ProductKey = dp.ProductKey
    GROUP BY
        dp.EnglishProductName
    OPTION (RECOMPILE)

    …directly into this query plan:

    image

    This transformation is only valid for queries with a single distinct aggregate (and at least one non-distinct aggregate of course).  If your query contains multiple distinct aggregates, it may not help you directly, though you may be able to refactor the T-SQL to take advantage.

    If you want to see SQL Server 2012 RC0 produce the Eager Spool plan instead (created by the long-standing ExpandDistinctGbAgg rule), disable the new rule with:

    DBCC RULEOFF ('ReduceForDistinctAggs')

    …and then recompile.  Don’t forget to enable it again afterward using RULEON or by reconnecting to the server.

    Even with the new rule enabled, you may still see the spool or multiple-scan plan from time to time.  As always, the optimizer may explore many alternative plan forms and choose the one that looks cheapest.  In some cases, the optimizer may still choose the spool plan, though it probably won’t be right to do so…

    Thanks for reading; please consider voting for the following Connect item (to allow CLR UDA hash aggregates).  Thank you.

    http://connect.microsoft.com/SQLServer/feedback/details/629920/allow-option-hash-group-with-sqlclr-udas

    © 2011 Paul White

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • How to Find the Statistics Used to Compile an Execution Plan

    In this post, I show you how to determine exactly which statistics objects were used by the query optimizer to produce an execution plan.

    Trace Flags

    We will need three undocumented trace flags.  The first one (3604) is well-known – it redirects trace output to the client so it appears in the SSMS messages tab.

    The second trace flag is 9292.  With this enabled, we get a report of statistics objects which are considered ‘interesting’ by the query optimizer when compiling, or recompiling the query in question.  For potentially useful statistics, just the header is loaded.

    The third trace flag is 9204.  With this enabled, we see the ‘interesting’ statistics which end up being fully loaded and used to produce cardinality and distribution estimates for some plan alternative or other.  Again, this only happens when a plan is compiled or recompiled – not when a plan is retrieved from cache.

    You can enable and disable these flags with the usual DBCC TRACEON and TRACEOFF commands, but it is also possible to enable them just for a particular statement using the undocumented QUERYTRACEON query hint (demonstrated below).

    Sample Query

    DBCC FREEPROCCACHE
     
    SELECT 
        p.Name,
        total_quantity = SUM(th.Quantity)
    FROM AdventureWorks.Production.Product AS p
    JOIN AdventureWorks.Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    WHERE
        th.ActualCost >= $5.00
        AND p.Color = N'Red'
    GROUP BY
        p.Name
    ORDER BY
        p.Name
    OPTION
    (
        QUERYTRACEON 3604,
        QUERYTRACEON 9292,
        QUERYTRACEON 9204
    )

    The DBCC FREEPROCCACHE is just there to empty the plan cache so we get a compilation.  You can also evict the current plan from cache if you know its handle (SQL Server 2008) or use a RECOMPILE query hint.  Using RECOMPILE is often convenient, but you may get a different plan compared to that obtained without the hint.  Note that compiling the query is enough – we do not need to execute the query; simply requesting an ‘estimated plan’ will do.  It doesn’t hurt to run it either though, just to be clear.

    Sample Output

    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 1, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 1, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 3, 
        ColumnName: Name, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 3, 
        ColumnName: Name, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 11, 
        ColumnName: Color, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 11, 
        ColumnName: Color, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 2, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 2, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 5, 
        ColumnName: ActualCost, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 5, 
        ColumnName: ActualCost, 
        EmptyTable: FALSE

    There’s no sign of an official way to get this very useful information in Denali, despite it being requested many times over the years.  Trace flag 9204 works at least as far back as SQL Server 2005.  Both 92xx flags work in 2008, R2, and Denali CTP 3.

    Enjoy!

    © 2011 Paul White

    SQLkiwi@gmail.com
    @SQL_Kiwi

  • Can a SELECT query cause page splits?

    Books Online has this to say about page splits:

    When a new row is added to a full index page, the Database Engine moves approximately half the rows to a new page to make room for the new row.  This reorganization is known as a page split.  A page split makes room for new records, but can take time to perform and is a resource intensive operation. Also, it can cause fragmentation that causes increased I/O operations.

    Given that information, how can a SELECT statement be responsible for page splits?  Well, I suppose we could SELECT from a function that adds rows to a table variable as part of its internal implementation, but that would clearly be cheating, and no fun at all from a blogging point of view.

    Tables & Sample Data

    To demonstrate the effects I want to talk about today, I’m going to use a couple of tables with data from the sales data in the Adventure Works sample database.  There’s a full script at the end of this post for those of you that like to run these things for yourselves, but the data generally looks like this: (header table on the left; detail table on the right):

    image

    Both tables have a clustered index on the Sales Order ID column.  The query I’m going to run is this:

    image

    I have expressed the natural join predicate as (X >= Y) AND (X <= Y) instead of (X = Y) to prevent the optimizer choosing a hash or merge join (both of which require at least one equality predicate to perform an inner join).  The “+ 0” is there to prevent the optimizer from using an indexed nested loop join (that is, using a seek on the inner side of the join).  These tricks are necessary to generate a plan with an eager index spool on this small sample data set.

    Checking for Splits

    There are several ways to check for page splits while this query is running.  I’m going to use the new extended events UI in SQL Server code name “Denali”, which, by the way, I’m quite impressed with: it’s a lot like using Profiler inside SSMS and does away with all that tedious mucking about with CREATE EVENT SESSION statements and XQuery (there are also no improbability factors to compute).  Ok, so running the query above and looking for page splits via extended events, I get output like this:

    image

    These page splits are all caused by the Index Spool in the query plan.  This iterator builds a temporary b-tree index on the data produced by its sub-tree, and like any index insert operation, page splits will occur.  In fact this is a bit of a cheat, because these page splits aren’t really page splits at all – notice the sequential nature of the pages being ‘split’ (the page_id column), and its relationship with the new page.

    Even in Denali, we don’t yet have a nice easy way to distinguish between a traditional page split (where roughly half the existing rows are moved to a new page) and the case where a whole new page is allocated at one end of the index.  This strikes me as an amazing oversight.  I don’t know about you, but I generally don’t fret much over tables or indexes allocating a whole new page at one end to store new rows – I’m much more interested in existing index pages that need to split in half!  I certainly hope this improves before RTM.

    Index Spool Internals

    Having shown page splits that aren’t really page splits, let’s get to demonstrating an index spool that ‘really does’ split pages.  To do that, we need to know something about the way index spools build their temporary index.  You have probably noticed that most query plans that build permanent indexes involve a sort, which in turn requires a memory grant.

    Index spools do not get a memory grant, and are not preceded by an explicit sort operator in query plans.  Given that indexes require sorting, how does an index spool manage to build a b-tree index structure at all?  The answer is that index spools just insert rows into an empty b-tree in the order they are received.  In general, the rows used for the index will not be received in sorted order, resulting in lots and lots of page splits in the temporary index b-tree.

    The first run of the query did not suffer from this problem due to the clustered index on the header table, which tends to produce rows ordered by Sales Order ID – exactly the order of the index built by the index spool.  Let’s rebuild the header table clustered index to use the Customer ID column instead, and re-run the test.  The query plan looks exactly the same – the only difference is that the index spool is fed by a clustered index scan that is now extremely unlikely to produce rows in Sales Order ID sorted order:

    image

    The page splits reported by the extended events ‘trace’ are now very different:

    image

    Notice the same source page (page_id) being split over and over again as rows arrive in unsorted order.  There are now very many more page splits than before, and they are a lot more expensive because we are now not just allocating new empty pages at the end of the index – we have to shift half the rows and fix up all the page pointers.  In practice, this is the sort of splitting you are likely to see in tempdb with an index spool – rows are very unlikely to be in index order (else why would we need the spool at all?).  On that last point, consider the hoops I had to jump through to demonstrate a plan where the rows were in temporary index order – all that mess with “>= AND <=” and “+ 0”…

    Parallel Index Spools

    The last thing I want to cover about index spools is how they behave in parallel plans.  You might have noticed the MAXOP 1 hint in the previous examples.  Let’s remove that now, add some wait type monitoring, and see how we go:

    image

    Don’t be fooled by the parallelism indicator on the index spool – the index building portion of its execution happens on a single thread.  One way to see this is to capture thread and task information using the page splitting event session, where the same task and thread ids are reported for every one of the hundreds of events:

    image

    The wait statistics provide additional confirmation:

    image

    Ignore the CXPACKET waits (they are perfectly normal) and look at the EXECSYNC waits.  This query ran at DOP 8 on my machine, so 7 parallel threads had to wait while a single thread built the temporary index.  The index build took 114ms, and 7 threads waiting simultaneously for that time gives 7 x 114 = 798ms.  That’s within one quantum (4ms) of the total recorded EXECSYNC wait time.  EXECSYNC is used for lots of different things – whenever synchronization of parallel workers needs to occur outside a Parallelism operator, so please don’t take away the idea that EXECSYNC = index spool.

    © Paul White 2011

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    Script:

    -- Any AdventureWorks version should be fine
    USE AdventureWorks2008R2
    GO
    -- Tables
    CREATE TABLE #OrdHeader
    (
        SalesOrderID        INTEGER NOT NULL,
        OrderDate           DATETIME NOT NULL,
        SalesOrderNumber    NVARCHAR(25) NOT NULL,
        CustomerID          INTEGER NOT NULL
    )
    GO
    CREATE TABLE #OrdDetail
    (
        SalesOrderID        INTEGER NOT NULL,
        OrderQty            SMALLINT NOT NULL,
        LineTotal           NUMERIC(38,6) NOT NULL
    );
    GO
    -- Data
    INSERT #OrdHeader
    SELECT
        h.SalesOrderID,
        h.OrderDate,
        h.SalesOrderNumber,
        h.CustomerID
    FROM Sales.SalesOrderHeader AS h
    GO
    INSERT #OrdDetail
    SELECT
        d.SalesOrderID,
        d.OrderQty,
        d.LineTotal
    FROM Sales.SalesOrderDetail AS d
    GO
    -- Clustered indexes
    CREATE UNIQUE CLUSTERED INDEX cuq ON #OrdHeader (SalesOrderID) WITH (MAXDOP = 1)
    CREATE CLUSTERED INDEX cuq ON #OrdDetail (SalesOrderID) WITH (MAXDOP = 1)
    GO
    -- Test query
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    OPTION  (MAXDOP 1)
    GO
    -- Lose the ordering
    CREATE CLUSTERED INDEX cuq ON #OrdHeader (CustomerID) 
    WITH (DROP_EXISTING = ON, MAXDOP = 1)
    GO
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    OPTION  (MAXDOP 1)
    GO
    -- Parallelism
    DBCC SQLPERF('sys.dm_os_wait_stats', 'CLEAR')
    GO
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    GO
    SELECT
        ows.wait_type,
        ows.waiting_tasks_count,
        ows.wait_time_ms,
        ows.max_wait_time_ms,
        ows.signal_wait_time_ms
    FROM sys.dm_os_wait_stats AS ows
    WHERE
        ows.wait_type IN (N'CXPACKET', 'EXECSYNC')
    GO
    DROP TABLE 
        #OrdHeader, 
        #OrdDetail
    GO
    /*
    -- Denali Extended Events Session Script (change the session_id filter!)
    -- Try the UI :)
     
    CREATE EVENT SESSION [Page Splits] ON SERVER 
    ADD EVENT sqlserver.page_split
        (
        ACTION(sqlos.system_thread_id,sqlos.task_address,sqlserver.sql_text)
    WHERE 
        ([package0].[equal_uint64]([sqlserver].[session_id],(60)))) 
    ADD TARGET 
        package0.ring_buffer(SET max_memory=(65536))
    WITH 
    (
        MAX_MEMORY=4096 KB,
        EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
        MAX_DISPATCH_LATENCY=30 SECONDS,
        MAX_EVENT_SIZE=0 KB,
        MEMORY_PARTITION_MODE=NONE,
        TRACK_CAUSALITY=OFF,
        STARTUP_STATE=OFF
    )
    */
  • SQL Server, Seeks, and Binary Search

    The following table summarizes the results from my last two blog entries, showing the CPU time used when performing 5 million clustered index seeks:

    Clustered Index Seek Test Results

    In test 1, making the clustered index unique improved performance by around 40%. In test 2, making the same change reduced performance by around 70% (on 64-bit systems – more on that later).  As a reminder, both tests use nested loops to join a single-column BIGINT table of numbers to itself:

    Query and Plans

    In test 1, the table contains numbers from 1 to 5 million inclusive; for test 2, the table contains the even numbers from 2 to 10 million (still 5 million rows).  In case you were wondering, the test 2 results are the same if we populate the table with odd numbers from 1 to 9,999,999 instead of even numbers.  How can we explain these results?

    Performing an Index Seek

    Perhaps we need to look a bit deeper into how SQL Server performs an index seek.  Most people know that SQL Server indexes are a B+ tree, so let’s work through an example of locating the row containing the value 1 million, when the table contains values from 1 to 5 million, with a unique clustered index.

    Index seeks start at the index root page, which we can find in a number of ways (e.g. sys.system_internals_allocation_units).  I happened to use DBCC IND to determine that the root of the index was page 117482 in file 1 (in Denali, we can use a new DMF: sys.dm_db_database_page_allocations).  Anyway, whichever way you find it, we can use DBCC PAGE to look at the contents of the root page:

    Index Root Page Contents

    The col1 (key) column shows the lowest key value that can possibly be present in the lower-level page specified by Child Page Id.  We can see that the lowest possible key value on page 117485 is 906305, and the next lowest key value possible is 1132881 on page 117486.  So, if a row containing the value 1 million exists, the next page to look at is 117485:

    Level 1 Index Contents

    We are now at level 1 of the index, and have to scroll down the output a bit further to see that the next page to look at is 119580.  This page is at the leaf level of the index, and we find our target value at position 399:

    Leaf level index record

    Binary Search

    When performing an index seek, SQL Server obviously doesn’t run DBCC PAGE and scroll down the output window to find the next page to look at like we just did.  You may know that SQL Server can use binary search to locate index entries, but it’s worth taking a closer look at the index pages to see how this works in practice:

    SQL Server Page Structure Diagram

    The slot array contains a series of two-byte offsets pointing to the index records.  The index records are not necessarily stored in any particular order on the page; only the slot array entries are guaranteed to be in sorted order.  Note that the sorted slot array entries do not contain the index keys themselves, just offsets.

    Performance

    When SQL Server uses binary search to locate an index entry, it starts by checking the middle element of the slot array (shown in red in the example above).  It then follows that offset pointer to the index record, and compares the value it finds there with the sought value.  Assuming the value is not the one we are looking for, the sorted nature of the slot array allows SQL Server to narrow the search range in half after this first comparison.

    If the red index record happens to contain a value higher than the sought value, we know the value we are looking for is in that half of the slot array that sorts lower than the red entry.  This process of cutting the search space in half continues until we find the record we are looking for, or until it becomes clear that the value does not exist.

    The average and worst case performance of binary search is very close to log2N comparisons to find the item of interest in a sorted list of N items.  For the 476 index entries found in our test example index structure at the leaf level, that means at most 9 comparison operations.  That compares well to a linear search of the sorted slot array, which might require as many as 476 comparisons if we are unlucky, 238 on average.

    None of this offers any particular insight into why a binary search into a unique indexes containing only even (or odd!) numbers should be so slow.  One of the advantages of binary search is that it expects to perform around the same amount of work, regardless of the distribution of the values in the index.  Luckily, binary search is not the only game in town.

    Linear Interpolation

    As efficient as binary search is in the general case, in some cases we can do better.

    When searching for the value 1 million, we know from level 1 of the index that the lowest key value that can appear on the leaf level page we are interested in is 999,601.  We also know that the lowest key value on the next page is 1,000,077.  From the header information on our target page, we also know there are 476 entries in the slot array on that page.  Given these facts, we can make an immediate guess at which slot the value 1,000,000 ought to appear in:

    expected slot = (1000077 - 999601) / 476 * (1000000 - 999601) = 399

    This simple calculation takes us immediately to slot 399.  As we saw earlier, slot 399 does indeed contain the value 1,000,000:

    Slot 399

    Naturally, the quality of the linear interpolation guess depends on how evenly distributed the values are within the page.  To a lesser extent, it also depends on how closely the minimum key value information at level 1 of the index matches reality.  In both tests shown in earlier blog entries, the values are precisely evenly distributed and the level 1 index key information is spot on.

    Performance

    Linear interpolation has the potential to be more efficient even than binary search, finding the target value in one comparison in this example, compared to nine comparisons for binary search.  Even where the data is not perfectly distributed, there is scope for linear interpolation to be superior, simply by applying the technique repeatedly on successively narrower ranges.

    A reasonable practical compromise might be to try linear interpolation two or three times, before falling back to linear or binary search on the remaining range.  The advantage of linear interpolation may not seem much, but consider that SQL Server makes these comparisons for every index seek operation – when doing many millions of seeks, the difference can soon add up.  It is probably not coincidence that the OLTP TPC benchmark tests perform a very great number of singleton index seeks.

    Linear Regression

    A natural extension to the idea of linear interpolation is to apply a linear regression (line of best fit) instead:

    Linear Regression Diagram

    In the diagram, the blue dotted line represents a linear interpolation based on the values of the smallest and highest values only.  The black line is obtained by a linear regression of all the data values, and results in a much better fit.

    Since every straight line can be represented by a formula of the form y = mx + b, we can completely specify it by recording the values of the ‘m’ (slope) and ‘b’ (y-axis intercept) parameters.  The R2 value gives an idea of how good a fit the linear regression line is to the data.  In the present context, the ‘y’ value represents the indexed value, and the ‘x’ value is the slot position where that value can be found.  It’s easy to see that once we have a linear regression line, we can estimate the slot position for a particular sought index value by solving x = (y – b) / m.

    To use linear regression in a database, we might imagine storing the ‘b’ and ‘m’ parameter values somewhere (perhaps in a compact format in the page header), and deciding whether to apply the technique based on the R2 value.

    What Does SQL Server Do?

    All this is fine in theory, but does SQL Server really ever use anything except binary search?  To see, I ran tests with a debugger attached to the SQL Server process, and stepped through the calls made while the seek test queries were running.  The call stack below was obtained when running 64-bit SQL Server:

    64-bit Call Stack

    By contrast, the following call stack comes from 32-bit SQL Server:

    32-bit Call Stack

    Interestingly, SQL Server follows the same code paths (at least at the call stack level, which is the best we can do without source code) regardless of whether the index is defined as unique or not, and for all the numeric data types I tested (INT, BIGINT, REAL, FLOAT, DECIMAL with a zero scale).

    The performance problem on x64 servers appears to be most pronounced when the key values have a small fixed gap between them.  As we have seen, performance is very much worse on x64 servers compared with x86 versions when the table contains just even or odd numbers (i.e. with a gap of 1 between index entries).  If we change the test to multiply the original entries by a factor of ten (to produce a sequence of 10, 20, 30, 40…) the performance penalty all but disappears, and the x64 unique index test is around 30% faster than the non-unique test.

    Conclusions and Further Reading

    From the evidence available, my best guess is that 64-bit SQL Server does use linear interpolation and/or linear regression instead of binary search under suitable conditions, but that the implementation has edge-case poor performance where the index keys are separated by a small fixed offset.  It seems that we can achieve best performance by using a unique index where possible, and ensuring that keys are, as far as practicable, sequential and contiguous.

    B-tree Indexes, Interpolation Search, and Skew – Microsoft Research
    Interpolation Search – a log log N Search (PDF)

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Avoiding Uniqueness for Performance

    In my last post, I showed how using a index unique could speed up equality seeks by around 40%.

    For today’s entry, I’m going to use the same tables as last time (single BIGINT column, one table with a non-unique clustered index, and one table with a unique clustered index):

    CREATE TABLE dbo.SeekTest
    (
        col1    BIGINT NOT NULL,
    );
    GO
    -- Non-unique clustered index
    CREATE CLUSTERED INDEX cx
        ON dbo.SeekTest (col1)
    GO
    CREATE TABLE dbo.SeekTestUnique
    (
        col1    BIGINT NOT NULL
    );
    GO
    -- Unique clustered index
    CREATE UNIQUE CLUSTERED INDEX cuq
        ON dbo.SeekTestUnique (col1)

    Test Data

    This time, instead of filling the tables with all the numbers from 1 to 5 million, we’ll add just the even numbers from 1 to 10 million.  At the risk of stating the slightly obvious, this results in tables with the same number of rows as previously, just no odd numbers:

    INSERT dbo.SeekTest WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        2 * ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;
    GO
    INSERT dbo.SeekTestUnique WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        2 * ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    Test Run: Non-Unique Index

    The test is the same as before: join the test table to itself using a nested loops join, and count the rows returned.

    Don’t be fooled by the simplistic nature of the test; I realize you rarely use loops join to self join all the rows in a table.  The point here is to check how long it takes to perform 5 million row joins – something that probably happens quite often in most production systems, either as 5 million lookups into a single very much larger table, or perhaps by running a query that does 50,000 row-joins 100 times.  Anyway, here’s the query:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTest AS st WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTest AS st2 WITH (TABLOCK)
        ON st.col1 = st2.col1
    OPTION (MAXDOP 1);

    And the ‘actual’ query plan:

    SNAGHTML10ed6a4

    I get these results:

    image

    Notice the 5,000,001 scan count showing that we are performing 5 million range scans (not singleton seeks).  As far as performance goes, running this query with all the data in cache uses 9,251ms of CPU.

    Test Run: Unique Index

    Cool.  Now we know from last time that we can improve on this result by making the index unique, and performing singleton seeks instead of range scans.  Let’s do that then:

    SELECT
        COUNT_BIG(*)
    FROM dbo.SeekTestUnique AS stu WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTestUnique AS stu2 WITH (TABLOCK)
        ON stu.col1 = stu2.col1
    OPTION (MAXDOP 1);

    Actual plan:

    SNAGHTML1102d2d

    And the results are:

    image

    As expected, the scan count confirms we are now doing singleton seeks, and the CPU time has improved to 16,005ms.

    Er, Hang On…

    If you think that going from 9,251ms of CPU to 16,005ms of CPU is not exactly an improvement, you’d be right.  There’s no typo there, no inadvertent switching of the unique and non-unique examples, and no tricks.

    Making the index unique really has slowed down this query by around 70%.

    The explanation is so interesting, it deserves a full post of its own…so stay tuned for that.  Feel free to speculate about the cause in the meantime… in the comments below, by email, or on Twitter.

    There is some evidence coming through in the comments that this behaviour is specific to x64 installations. I’d love to get some more validation of this if you have time to test on x86 and/or x64. Thanks.

    Disclaimer:

    In general, you will want to specify an index as UNIQUE wherever you can.  Many queries will benefit from a unique index rather than a non-unique one.  This post is very much to show that “It Always Depends” and to set the stage for the next post in this series.  Don’t run with scissors.  Warning filling may be hot.  May contain nuts… Sarcastic smile

    © 2011 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Enforcing Uniqueness for Performance

    A little while back, I posted a short series on seeks and scans, and one of the things I highlighted was the difference between a singleton seek and a range scan.  You can find that post here, if you want a refresher.

    Anyway, the broad point is that a singleton equality seek always retrieves exactly one row, and is guaranteed to do so because a unique index exists to enforce it.  A range scan, on the other hand, seeks down the B-tree to a starting (or ending) point, and scans forward (or backward) from that point using the next or previous page pointers.  The point of today’s short post is to show how much faster a singleton seek is, compared with a range scan, even when both return exactly the same number of records.

    Pretty simple test rig today; a 5 million row table with a single BIGINT NOT NULL column with all the values from 1 to 5,000,000:

    CREATE TABLE dbo.SeekTest
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE CLUSTERED INDEX cx ON dbo.SeekTest (col1);
    GO
    -- 5 million rows numbered 1 to 5,000,000
    INSERT dbo.SeekTest WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    Notice that the clustered index is not defined as unique.  There will not be any uniquifiers (in case you think I am cheating) because SQL Server only adds them when necessary, and there are no duplicates in this example.  The test query simply joins the table to itself, forcing a nested loops join so we get 5 million seek operations:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTest AS st WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTest AS st2 WITH (TABLOCK) ON
        st2.col1 = st.col1
    OPTION (MAXDOP 1);

    And here’s the query plan:

    image

    On my machine, typical results are:

    image

    Notice the Scan Count.  We get one scan for the Clustered Index Scan iterator in the plan, and one scan for every seek operation.  This is one way to see that we are getting range scans here – the lack of a unique index means that SQL Server cannot guarantee that only one row will be returned from each seek, so a range scan is performed to pick up any extra rows.

    Now let’s create a second table, where the only difference is that the index is declared as UNIQUE:

    CREATE TABLE dbo.SeekTestUnique
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.SeekTestUnique (col1);
    GO
    -- 5 million rows numbered 1 to 5,000,000
    INSERT dbo.SeekTestUnique WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    As before, we join this table to itself using loops join (the TABLOCKs are just to reduce locking overheads):

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTestUnique AS stu WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTestUnique AS stu2 WITH (TABLOCK) ON
        stu2.col1 = stu.col1
    OPTION (MAXDOP 1);

    And we get the same plan:

    image

    But very different results:

    image

    Now there is only one scan, because the seeks are all singleton seeks, and execution time has improved from 9472ms to 6096ms.

    There are lots of good reasons to enforce uniqueness where you can (and a few not to).  For further information see:

    Is it a Seek or a Scan?
    Sneaky Reads with Unique Indexes
    Unique Indexes with GROUP BY (Rob Farley)
    Scan Count Meaning in STATISTICS IO Output – Part 1 (Amit Banerjee)
    Scan Count Meaning in STATISTICS IO Output – Part 2 (Amit Banerjee)

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Join Performance, Implicit Conversions, and Residuals

    You probably already know that it’s important to be aware of data types when writing queries, and that implicit conversions between types can lead to poor query performance.  Some people have gone so far as to write scripts to search the plan cache for CONVERT_IMPLICIT elements, and others routinely inspect plans for that type of thing when tuning.

    Now, that’s all good, as far as it goes.  It may surprise you to learn that not all implicit conversions are visible in query plans, and there are other important factors to consider too.  To demonstrate, I’m going to use two single-column heap tables, containing a selection of pseudo-random numbers, as shown in the diagram below (there’s a script to create the test tables and data at the end of this post):

    image

    The first table contains 5,000 rows and the second table contains 5,000,000 rows.  The basic test query is to count the number of rows returned by a simple inner join between the two tables.  We’ll run a number of tests to explore the performance issues, using a number of tables using different data types to store the data.

    Test One – INTEGER NOT NULL

    The first test features two tables where both columns are typed as INTEGER, and constrained to be NOT NULL:

    image

    Since one table is significantly smaller than the other, the query optimizer chooses a hash join.  To avoid the (very useful) bitmap filter optimization featured in my last post, each query features a MAXDOP 1 query hint to guarantee a serial plan.  The test query gives this query plan:

    image

    This returns its result in 900ms.

    Test Two – INTEGER to REAL

    Now let’s try the same query, but where the larger table’s column uses the REAL data type (the smaller table is INTEGER as before):

    image

    image

    There are a couple of new things here.  First, SQL Server needs to convert the data type of one of the columns to match the other before it can perform the join.  The rules for data type precedence show that REAL has a higher precedence than INTEGER, so the integer data is converted to REAL using the CONVERT_IMPLICIT operation shown in the Compute Scalar iterator.

    The join then probes the hash table (built on the converted REAL values) for matches.  The possibility of hash collisions means that a residual predicate is applied to check that the values indicated as a possible match actually do match.  Nothing much new or surprising here then.  The query performs slightly less well than before, completing in 1000ms.

    Test Three – INTEGER to DECIMAL

    Same query again, but this time the larger table’s column is defined as a DECIMAL(7, 0):

    image

    image

    The first surprise is that this query takes 1600ms.

    The second surprise is that the query plan does not contain a Compute Scalar or a CONVERT_IMPLICIT anywhere, despite the fact that we are joining columns with different data types.  In fact there is no indication anywhere that an implicit conversion is occurring – not in the graphical plan, not even in the more detailed XML show plan representation.

    This is very bad news.  Whether you are tuning the query by hand, or using a script to scan the plan cache for implicit conversions, there really is no way to see the invisible type conversion here.  Even the new implicit type conversion warnings built into Denali CTP3 do not detect the problem.

    Test Four – DECIMAL to INTEGER

    Now let’s reverse the situation, so the smaller table’s column is DECIMAL(7,0) and the larger table is INTEGER:

    image

    image

    Now the query runs for 1900ms.

    The data precedence rules mean each INTEGER value from the large table has to be converted to DECIMAL to compute a hash value, and converted again to check the residual.  Again, there is no Compute Scalar, no CONVERT_IMPLICIT…absolutely no visible indication anywhere why this query should run over twice as long as the original.

    Adding an Explicit Conversion

    Let’s see what happens if we introduce an explicit conversion here, choosing to convert the DECIMAL values to INTEGER in order to perform the conversion on the smaller table:

    image

    image

    Ok, so now we have a Compute Scalar, an explicit CONVERT, and a Probe Residual that references the converted expression.  This query completes in 1000ms.  Compare this query plan with the one shown immediately above it – would you pick that the plan with the extra Compute Scalar would be nearly twice as fast?

    Conversion Families

    So what’s going on here?  Why is SQL Server hiding conversion details from us that have such a profound effect on execution speed?  It turns out that certain implicit conversions can be performed inside the join, and so do not require a separate Compute Scalar.  This is the case if the join columns have different types, but both types come from the same ‘family’.  The five families are:

    image

    In our example, DECIMAL and INTEGER are from family [1] so the implicit conversion can be performed inside the join, without a CONVERT_IMPLICIT or Compute Scalar.  Stop for a moment and think about how many joins you have in production code where the columns involved are INTEGER to BIGINT, or SMALLINT to INTEGER, to take just two possible examples.  Are those joins running at half speed?

    The NULL Issue

    There is one more subtlety to consider.  Let’s run our test query one last time, using tables with INTEGER columns, but this time the columns are declared as allowing NULLs (there are no NULLs in the test data, by the way).

    image

    image

    This query runs in around 1000ms.  Remember that the first query we ran (where both columns were defined as INTEGER NOT NULL) completed in 900ms.  Let’s look at the first query plan again:

    image

    Notice the lack of a Probe Residual.  If the join is on a single column typed as TINYINT, SMALLINT or INTEGER and if both columns are constrained to be NOT NULL, the hash function is ‘perfect’ – meaning there is no chance of a hash collision, and the query processor does not have to check the values again to ensure they really match.

    This optimization is the reason that the NOT NULL query performs 11% faster than when either or both join columns allow NULLs.  Also notice that this optimization does not apply to BIGINT columns.

    Summary

    Here’s my general advice for best join performance:

    • Be aware that conversions may occur without a CONVERT_IMPLCIT or Compute Scalar
    • Join on a single column
    • Ensure join column types match exactly
    • Constrain potential join columns to be NOT NULL
    • Use TINYINT, SMALLINT, or INTEGER types to avoid residual predicates
    • Keep reading my blog for further reasons to prefer INTEGER NOT NULL :)
    • Introduce explicit conversions where necessary, as a workaround

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    Further Reading

    Hash Match Probe Residual – Rob Farley

    Test Script:
    USE tempdb;
    GO
    DROP TABLE 
        #BuildInt, #BuildIntNullable, #BuildBigInt, #BuildDec, #BuildReal,
        #ProbeInt, #ProbeIntNullable, #ProbeBigInt, #ProbeDec, #ProbeReal;
    GO
    -- Test tables
    CREATE TABLE #BuildInt
    (
        col1    INTEGER NOT NULL
    );
    GO
    CREATE TABLE #BuildIntNullable
    (
        col1    INTEGER NULL
    );
    GO
    CREATE TABLE #BuildBigInt
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE TABLE #BuildDec
    (
        col1    DECIMAL(7,0) NOT NULL
    );
    GO
    CREATE TABLE #BuildReal
    (
        col1    REAL NOT NULL
    );
     
    GO
    CREATE TABLE #ProbeInt
    (
        col1    INTEGER NOT NULL
    );
    GO
    CREATE TABLE #ProbeIntNullable
    (
        col1    INTEGER NULL
    );
    GO
    CREATE TABLE #ProbeBigInt
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE TABLE #ProbeDec
    (
        col1    DECIMAL(7,0) NOT NULL
    );
    GO
    CREATE TABLE #ProbeReal
    (
        col1    REAL NOT NULL
    );
    GO
    ------------------------------------------------
    -- Load 5,000 INTEGER rows into the build tables
    ------------------------------------------------
    SET NOCOUNT ON;
    SET STATISTICS IO, TIME, XML OFF;
     
    DECLARE @I INTEGER = 1;
     
    INSERT #BuildInt
        (col1) 
    VALUES 
        (CONVERT(INTEGER, RAND(1) * 9999999));
     
    WHILE @I < 5000
    BEGIN
        INSERT #BuildInt
            (col1)
        VALUES 
            (RAND() * 9999999);
        SET @I += 1;
    END;
     
    -- Copy to the INT nullable build table
    INSERT #BuildIntNullable
        (col1)
    SELECT
        bi.col1
    FROM #BuildInt AS bi;
     
    -- Copy to the BIGINT build table
    INSERT #BuildBigInt
        (col1)
    SELECT
        CONVERT(BIGINT, bi.col1)
    FROM #BuildInt AS bi;
     
    -- Copy to the DECIMAL build table
    INSERT #BuildDec
        (col1)
    SELECT
        CONVERT(DECIMAL(7,0), bi.col1)
    FROM #BuildInt AS bi;
    GO
    -- Copy to the REAL build table
    INSERT #BuildReal
        (col1)
    SELECT
        CONVERT(REAL, bi.col1)
    FROM #BuildInt AS bi;
    GO
    ----------------------------------------------------
    -- Load 5,000,000 INTEGER rows into the probe tables
    ----------------------------------------------------
    SET NOCOUNT ON;
    SET STATISTICS IO, TIME, XML OFF;
     
    DECLARE @I INTEGER = 1;
     
    INSERT #ProbeInt
            (col1)
    VALUES 
        (CONVERT(INTEGER, RAND(2) * 9999999));
     
    BEGIN TRANSACTION;
     
    WHILE @I < 5000000
    BEGIN
        INSERT #ProbeInt
            (col1) 
        VALUES 
            (CONVERT(INTEGER, RAND() * 9999999));
            
        SET @I += 1;
        
        IF @I % 100 = 0
        BEGIN
            COMMIT TRANSACTION;
            BEGIN TRANSACTION;
        END;
    END;
     
    COMMIT TRANSACTION;
    GO
    -- Copy to the INT nullable probe table
    INSERT #ProbeIntNullable
        (col1)
    SELECT
        pri.col1
    FROM #ProbeInt AS pri
    GO
    -- Copy to the BIGINT probe table
    INSERT #ProbeBigInt
        (col1)
    SELECT
        CONVERT(BIGINT, pri.col1)
    FROM #ProbeInt AS pri
    GO
    -- Copy to the DECIMAL probe table
    INSERT #ProbeDec
        (col1)
    SELECT
        CONVERT(DECIMAL(7,0), pri.col1)
    FROM #ProbeInt AS pri
    GO
    -- Copy to the REAL probe table
    INSERT #ProbeReal
        (col1)
    SELECT
        CONVERT(DECIMAL(7,0), pri.col1)
    FROM #ProbeInt AS pri
    GO
    -----------
    -- TESTS --
    -----------
    SET STATISTICS IO, TIME ON;
     
    SELECT
        COUNT_BIG(*) 
    FROM #BuildInt AS bi
    JOIN #ProbeInt AS pri ON
        pri.col1 = bi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*) 
    FROM #BuildInt AS bi
    JOIN #ProbeIntNullable AS pin ON
        pin.col1 = bi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*) 
    FROM #BuildIntNullable AS bin
    JOIN #ProbeIntNullable AS pin ON
        pin.col1 = bin.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*) 
    FROM #BuildIntNullable AS bin
    JOIN #ProbeInt AS pri ON
        pri.col1 = bin.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildBigInt AS bbi
    JOIN #ProbeBigInt AS pbi ON
        pbi.col1 = bbi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildInt AS bi
    JOIN #ProbeDec AS pd ON
        pd.col1 = bi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildDec AS bd
    JOIN #ProbeInt AS pri ON
        pri.col1 = bd.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildDec AS bd
    JOIN #ProbeInt AS pri ON
        pri.col1 = CONVERT(INTEGER, bd.col1)
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildDec AS bd
    JOIN #ProbeInt AS pri ON
        CONVERT(DECIMAL(7,0), pri.col1) = bd.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildInt AS bi
    JOIN #ProbeDec AS pd ON
        pd.col1 = bi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildInt AS bi
    JOIN #ProbeReal AS pr ON
        pr.col1 = bi.col1
    OPTION (MAXDOP 1);
     
    SELECT
        COUNT_BIG(*)
    FROM #BuildReal AS br
    JOIN #ProbeInt AS pri ON
        pri.col1 = br.col1
    OPTION (MAXDOP 1);
     
    SET STATISTICS IO, TIME OFF;
  • Bitmap Magic (or… how SQL Server uses bitmap filters)

    Question: Can a parallel query use less CPU than the same serial query, while executing faster?

    The answer is yes; and to demonstrate, I'll use the following two (heap) tables, each containing a single column typed as INTEGER:

    image

    Let’s load the first table with 5,000 random integer values (I’m using RAND with a seed and a WHILE loop so you will get the same random numbers if you run this script yourself):

    image

    Now load the second table with 5,000,000 random integers:

    image

    Now let’s write a query to count the number of matches, using a MAXDOP 1 query hint to ensure that the execution plan uses only a single processor.  The illustration below shows the query, execution plan, and runtime statistics:

    image

    Turns out there are 13 matches, and this query uses 890ms of CPU time and runs for 891ms.  Now lets run the same query, but with a MAXDOP 2 hint:

    image

    The query now completes in 221ms using 436ms of CPU time.  That’s a four-times speed-up, while using half the CPU!

    Bitmap Magic

    The reason the parallel query is so much more efficient is down to the Bitmap operator.  To more clearly see the effect it has, take a look at the parallel query plan with runtime statistics included (a so-called actual execution plan):

    image

    Compare that to the serial plan:

    image

    The way these bitmap filters work is reasonably well documented, so I’ll just give an outline here and provide links to some existing documentation at the end of this post.

    Hash Join

    A hash join proceeds in two phases.  First, it reads all the rows from its build input and constructs a hash table containing the join keys.  Second, it reads a row at a time from the probe input, uses the same hash function as before to compute a hash value for the probe row’s join keys, and uses this hash value to check a single hash table bucket for matches.  Naturally, the possibility of hash collisions means that the join generally still has to compare the real join key values to ensure a true match.

    Bitmaps in Serial Plans

    Most people don’t know that a hash match join always creates a bitmap – even in a serial plan.  You can’t see the bitmap in the serial plan, because it is part of the internal implementation of the Hash Match iterator.  While processing build rows and creating its hash table, the hash join also sets one (or more) bits in a compact bitmap structure.  When the build phase completes, this bitmap provides an efficient way to check for potential hash matches without the cost of probing the hash table.

    In the case of a serial-plan hash match, incoming probe rows to the hash join are hashed on the join keys and the value is used to check the bitmap.  If the corresponding bits in the bitmap are all set, there might be a match in the hash table, so the process goes on to check the hash table.  Conversely, if even one of the bits corresponding to the hashed join key value is not set, we can be certain that there is no match in the hash table, and we can discard the current probe row immediately.  The small cost of building the bitmap is offset by the time saved not checking rows that cannot match in the hash table.  Because checking a bitmap is very much faster than probing a hash table, this optimization is often effective.

    Bitmaps in Parallel Plans

    In a parallel plan, a bitmap is exposed as a separate plan operator.  When the hash join transitions from its build phase to the probe phase, the bitmap is passed to an iterator on the probe side of the hash join.  At a minimum, the bitmap is pushed down the probe side as far as the exchange (Parallelism) operator immediately below the join.  At this location, the bitmap is able to eliminate rows that cannot join before the rows are passed between threads inside the exchange.  There are no exchange operators in serial plans, of course, so moving the bitmap just outside the hash join in this way would confer no extra advantage compared to the ‘built in’ bitmap inside the hash match iterator.

    In certain circumstances (though only in a parallel plan!) the query processor can push the bitmap even further down the plan on the probe side of the join.  The idea here is that eliminating rows earlier saves the cost of moving rows between iterators unnecessarily, and perhaps even eliminates some operations completely.  As an aside, the optimizer generally tries to push ordinary filters toward the leaves of a plan for similar reasons – eliminating rows as early as possible is usually worthwhile.  I should mention though, that the type of bitmap we are dealing with here is added after optimization has completed.  Whether this type of bitmap is added to a post-optimizer plan or not is a decision made based on the expected selectivity of the filter (so accurate statistics are essential).

    Pushing the Bitmap Filter

    Anyway, back to the concept of pushing the bitmap filter further down the probe side of the join than the exchange iterator immediately below it.  In many cases, the bitmap filter can be pushed all the way down to a scan or seek.  When this happens, the bitmap filter check appears as a residual predicate like this:

    image

    As a residual, it is applied to all rows that pass any seek predicates (for an index seek), or to all rows in the case of an index or table scan.  The diagram above shows a bitmap filter being applied to a heap table scan, for example.

    Deeper Still…

    If the bitmap filter is built on a single column or expression of the INTEGER or BIGINT types, and if the bitmap is to be applied to a single column of INTEGER or BIGINT type, it might be pushed down the plan even further than the seek or scan operator.  The predicate is still shown in the seek or scan as above, but it is annotated with the INROW attribute – meaning that the filter is pushed into the Storage Engine, and applied to rows as they are being read.

    When this optimization occurs, rows are eliminated before the Query Processor sees the row at all – only rows that might match the hash match join are passed up from the Storage Engine.  The exact circumstances in which this optimization is applied varies a little between SQL Server releases – for example, in SQL Server 2005, the probed column has to be defined as NOT NULL, in addition to the conditions noted previously.  This restriction was relaxed in SQL Server 2008 onward.

    You might be wondering how much difference the INROW optimization makes.  Surely pushing the filter as far down as the seek or scan must be nearly as good as pushing the filter into the Storage Engine?  I’ll cover that interesting question in a future post.  For now, I want to round of this entry by showing how merge join and nested loops join compare for this query.

    Other Join Options

    With no indexes, a query using the nested loops physical join type is an complete non-starter: we would have to scan one table fully for each row in the other table – a total of 5 billion comparisons.  That query would likely run for a very long time.

    Merge Join

    This type of physical join requires sorted inputs, so forcing a merge join results in a plan that fully sorts both inputs before joining.  The serial plan looks like this:

    image

    The query now uses 3105ms of CPU and overall execution time is 5632ms.  The extra non-CPU time there is due to one of the sort operations spilling to tempdb (despite SQL Server having more than sufficient memory available for the sort).  The spill occurs because the default memory grant algorithm happens to not reserve enough memory in advance.  Leaving that to one side, it is clear that the query would never complete in less than 3105ms anyway, so we will move on.

    Continuing to force a merge join plan, but allowing parallelism (MAXDOP 2) as before:

    image

    As in the parallel hash join plan seen earlier, the Bitmap filter is pushed down the other side of the merge join, all the way to the Table Scan, and applied using the INROW optimization.  At 468ms of CPU and 240ms elapsed time, the merge plan with the extra sorts is very nearly as fast as the parallel hash (436 ms/221 ms).  There is one downside to the parallel merge join plan: it reserves 330KB of workspace memory, based on the expected number of rows to sort.  Since this type of bitmap is considered after cost-based optimization is complete, no adjustment is made to the estimate even though only 2488 rows flow through the lower sort.

    A bitmap can only appear in a merge join plan if the bitmap is followed by a blocking operator (like a sort).  A blocking operator has to consume its entire input before producing its first row, guaranteeing that the bitmap is fully populated before rows from the inner input start being read from the inner-side table and checked against the bitmap.  It is not necessary to have a blocking operator on the other side of the merge join, and neither does it matter which side the bitmap appears on.  (Another topic for a future post).

    With Indexes

    The situation is different if suitable indexes are available.  The distribution of the ‘random’ data is such that we can create a unique index on the build table, but the probe table contains duplicates so we have to make do with a non-unique index:

    image

    The hash join (both serial and parallel versions) is largely unaffected by the change.  It cannot take advantage of the indexes, so the plans and performance are the same as seen previously.

    Merge Join

    The merge join no longer has to perform a many-to-many join operation, and also no longer requires a sort on either input.  The lack of a blocking sort operator means that the bitmap can no longer be used (and remember the optimizer has no say in the decision, so it rejects a sorting plan early on as being a silly idea).  The effect is that a serial plan is produced whatever setting for MAXDOP is specified, and performance is worse than the parallel plan before the indexes were added: 702ms CPU and 704ms elapsed time:

    image

    This does represent a marked improvement over the original serial merge join plan, however (3105ms/5632ms).  This is due to the elimination of the sorts and the better performance of the one-to-many join.

    Nested Loops Join

    The nested loops join benefits enormously, as you would expect.  As for merge join, the optimizer does not consider a parallel plan:

    image

    This is by far the best-performing query plan so far – just 16ms of CPU and 16ms elapsed time.  Of course, this assumes that the data required to satisfy the query is already in memory.  Each seek into the probe table would otherwise generate essentially random I/O, so if you still store your data on a spinning magnetic film, cold-cache performance might be considerably worse.

    On my laptop, a cold-cache nested loops run resulted in 78ms of CPU and 2152ms elapsed time.  Under the same circumstances, the merge join plan used 686ms CPU and 1471ms elapsed; the hash join plan used 391ms CPU and 905ms elapsed.  Merge join and hash join both benefit from the larger, possibly sequential I/O issued by the read-ahead mechanism.

    © 2011 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    Further reading:

    Parallel Hash Join – Craig Freedman
    Query Execution Bitmaps – The SQL Server Query Processing Team
    Bitmaps in Microsoft SQL Server 2000 – MSDN
    Interpreting Execution Plans Containing Bitmap Filters – Books Online
    Understanding Hash Joins – Books Online

    Test script:

    USE tempdb;
    GO
    CREATE TABLE #BuildInt
    (
        col1    INTEGER NOT NULL
    );
    GO
    CREATE TABLE #Probe
    (
        col1    INTEGER NOT NULL
    );
    GO
    CREATE TABLE #ProbeDec
    (
        col1    DECIMAL(10) NOT NULL
    );
    GO
    -- Load 5,000 rows into the build table
    SET NOCOUNT ON;
    SET STATISTICS XML OFF;
     
    DECLARE @I INTEGER = 1;
    INSERT #BuildInt
        (col1) 
    VALUES 
        (CONVERT(INTEGER, RAND(1) * 2147483647));
     
    WHILE @I < 5000
    BEGIN
        INSERT #BuildInt
            (col1)
        VALUES 
            (RAND() * 2147483647);
        SET @I += 1;
    END;
     
    INSERT #BuildDec
        (col1)
    SELECT 
        CONVERT(DECIMAL(10), bi.col1)
    FROM #BuildInt AS bi;
    GO
    -- Load 5,000,000 rows into the probe table
    SET NOCOUNT ON;
    SET STATISTICS XML OFF;
     
    DECLARE @I INTEGER = 1;
     
    INSERT #Probe
        (col1) 
    VALUES 
        (CONVERT(INTEGER, RAND(2) * 2147483647));
     
    BEGIN TRANSACTION;
     
    WHILE @I < 5000000
    BEGIN
        INSERT #Probe
            (col1) 
        VALUES 
            (CONVERT(INTEGER, RAND() * 2147483647));
            
        SET @I += 1;
        
        IF @I % 25 = 0
        BEGIN
            COMMIT TRANSACTION;
            BEGIN TRANSACTION;
        END;
    END;
     
    COMMIT TRANSACTION;
    GO
    -- Demos
    SET STATISTICS XML OFF;
    SET STATISTICS IO, TIME ON;
     
    -- Serial
    SELECT 
        COUNT_BIG(*) 
    FROM #BuildInt AS bi 
    JOIN #Probe AS p ON 
        p.col1 = bi.col1 
    OPTION (MAXDOP 1);
     
    -- Parallel
    SELECT 
        COUNT_BIG(*) 
    FROM #BuildInt AS bi 
    JOIN #Probe AS p ON 
        p.col1 = bi.col1 
    OPTION (MAXDOP 2);
     
    SET STATISTICS IO, TIME OFF;
     
    -- Indexes
    CREATE UNIQUE CLUSTERED INDEX cuq ON #BuildInt (col1);
    CREATE CLUSTERED INDEX cx ON #Probe (col1);
     
    -- Vary the query hints to explore plan shapes
    SELECT 
        COUNT_BIG(*) 
    FROM #BuildInt AS bi 
    JOIN #Probe AS p ON 
        p.col1 = bi.col1 
    OPTION (MAXDOP 1, MERGE JOIN);
    GO
    DROP TABLE #BuildInt, #Probe;
  • Undocumented Query Plans: The ANY Aggregate

    As usual, here’s a sample table:

    image

    With some sample data:

    image

    And an index that will be useful shortly:

    image

    There’s a complete script to create the table and add the data at the end of this post.  There’s nothing special about the table or the data (except that I wanted to have some fun with values and data types).

    The Task

    We are asked to return distinct values of col1 and col2, together with any one value from the thing column (it doesn’t matter which) per group.  One possible result set is shown below:

    image

    As is so often the case, there are many ways to write a query to do this in SQL.  In fact, there are a surprising number of alternatives, a point I might return to in a future post.  For now, let’s take a look at a natural SQL query solution:

    Using the MAX aggregate

    imageimage

    There’s absolutely nothing wrong with this query or execution plan.  The query is concise, returns the correct results, and executes quickly.  So why am I blogging about it?  It’s the aggregate function.  It bothers me that I have to use MAX (or MIN) here, when what I really want to write is something like:

    image

    The ANY aggregate

    That’s not valid syntax of course – there’s no ANY aggregate in T-SQL.  But, just because we can’t use an ANY aggregate doesn’t mean the Query Processor can’t!

    There’s an alternative way to think about the query we want to write: if we were to partition the input set on the grouping columns, and arbitrarily number the rows within each partition, we could just choose row #1 from each partition.  In T-SQL, we can use the ROW_NUMBER ranking function to do this:

    image

    If you are reasonably familiar with execution plans, you might expect that query to produce something like this:

    image

    …where the Index Scan produces rows ordered by col1 and col2, the Segment detects the start of each new group and sets a flag in the stream, the Sequence Project uses that flag to restart the row numbering for each group, and the Filter just restricts the output to rows where the row number is 1.  In fact, we don’t get a plan like that at all, we get:

    image

    This is the same plan produced for the query with the MAX aggregate!  Well, actually it isn’t quite the same.  If you click on the Stream Aggregate and take a look at its properties, you’ll see it isn’t doing a MAX.  The shots below show the values defined by the aggregate in the new query (shown on the left) and the original MAX query (shown on the right):

    SNAGHTML2a9c2de2SNAGHTML2aaad7a9

    Our new query form is using the ANY aggregate!

    What Magic Is This?

    SQL Server spotted that we didn’t really want to number the rows at all – we were just expressing the idea that we want one row per group, and we don’t care which one.

    The optimizer can simplify a logical query tree like the Sequence Project plan shown above to use the ANY aggregate instead.  The simplification rule used to perform this is called SelSeqPrjToAnyAgg.  As the name suggests, it matches on a selection (filter) and a Sequence Project (specifically one that uses ROW_NUMBER) and replaces it with the ANY aggregate.  Also, as a simplification rule, it runs before full cost-based optimization even gets started, making it available even in trivial plans.

    This particular optimization only matches very specific plan shapes, so you have to be careful with it.  The filter has to restrict the expression produced by the Sequence Project to be equal to one.  In our query, that correlates to the WHERE rn = 1 expression.  You also have to PARTITION BY and ORDER BY the grouping columns (though the order does not matter), and specifying a constant for the ORDER BY clause of the ranking function (to indicate that you don’t care about ordering) does not work:

    image

    That query (and ones like it that use (ORDER BY … SELECT 0, NULL, or NEWID() etc.) fail to match with the rule, resulting in the Segment, Sequence Project, and Filter plan.  (If you use SELECT <constant> you will see an extra Compute Scalar here):

    image

    If you want to take advantage of the ANY aggregate rewrite, you have to be careful to match the conditions for the rewrite exactly.  Even including the ROW_NUMBER column (rn) in the outer SELECT is enough to break the rule matching.  It really is quite sensitive, bless it.

    Invisible ANY

    Let’s drop the index we created earlier:

    image

    Now if we run the rule-matching ROW_NUMBER form of the query, we get this plan instead:

    image

    As I mentioned in my previous post on Row Goals and Grouping, a Sort followed by a Stream Aggregate can be transformed to a Sort running in Distinct Sort mode.  The Stream Aggregate (with its ANY aggregate) is subsumed by the Sort, and in the process, the ANY aggregation is lost to us – it’s logically still there, but not exposed in the query plan, not even in the XML:

    image

    The Sort produces the three columns needed (col1, col2, and thing), and performs Distinct ordered by col1 and col2, but there’s no longer any explicit reference to the ANY aggregate performed on the thing column.  To make it reappear, we need to temporarily disable the optimizer rule responsible for the transformation to Distinct Sort:

    image

    Now we see a plan with the separate Sort and a Stream Aggregate containing the ANY aggregate:

    image

    The final thing I want to show today is the ANY aggregate working in a Hash Match.  If we use an OPTION (HASH GROUP) query hint, we get this plan:

    image

    As promised, you can find the script for today’s entry below.  For more information on optimizer rules and internals, see my previous mini-series:

    Inside the Optimizer: Constructing a Plan

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    IF    OBJECT_ID(N'tempdb.bdpmet.#Example', N'U')
        IS NOT NULL
        DROP TABLE tempdb.penguin.#Example;
    GO
        CREATE TABLE #Example
        (
            pk        NUMERIC IDENTITY PRIMARY KEY NONCLUSTERED,
            col1    SQL_VARIANT NULL,
            col2        SQL_VARIANT NULL,
            thing    SQL_VARIANT NULL,
        );
    GO
    INSERT #Example
        (col1, col2, thing)
    VALUES
        ('A1', CONVERT(SQL_VARIANT, $100), CONVERT(SQL_VARIANT, PI())),
        ('A1', $100, {guid '1D008813-8E80-4821-A481-1A0DE5C4F4DC'}),
        ('A1', $100, 7.297352569824),
        ('A1', N'-U-', 1.3E8),
        ('A1', N'-U-', 9.10938291),
        ('A1', N'-U-', @@SERVICENAME),
        ('A2', {d '2011-07-11'}, 'aotearoa'),
        ('A2', {d '2011-07-11'}, 0xDEADBEEF),
        ('A2', {d '2011-07-11'}, N'संहिता'),
        ('A3', 1.054571726, {fn CURRENT_TIME}),
        ('A3', 1.054571726, RADIANS(RAND())),
        ('A3', 1.054571726, {fn DAYNAME (0)});
    GO
    CREATE INDEX nc1 ON #Example (col1, col2, thing);
    GO
    -- A natural query:
    SELECT
        e.col1,
        e.col2,
        MAX(e.thing)
    FROM #Example AS e
    GROUP BY
        e.col1,
        e.col2;
    GO
    -- Would prefer:
    SELECT
        e.col1,
        e.col2,
        ANY (e.thing)
    FROM #Example AS e
    GROUP BY
        e.col1,
        e.col2;
    GO
    -- Transformed to use ANY
    SELECT
        e.col1,
        e.col2,
        e.thing
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY e2.col1, e2.col2)
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1;
    GO
    -- Not matched, no ANY aggregate
    SELECT
        e.col1,
        e.col2,
        e.thing
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY (SELECT 0))
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1;
    GO
    -- Not matched - we use the rn column
    SELECT
        e.col1,
        e.col2,
        e.thing,
        e.rn
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY e2.col1, e2.col2)
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1;
    GO
    DROP INDEX nc1 ON #Example;
    GO
    -- Distinct Sort with invisible ANY
    SELECT
        e.col1,
        e.col2,
        e.thing
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY e2.col1, e2.col2)
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1;
    GO
    -- Disable rule
    DBCC RULEOFF('GbAggToSort');
     
    -- Same query, recompile to force re-optimization
    SELECT
        e.col1,
        e.col2,
        e.thing
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY e2.col1, e2.col2)
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1
    OPTION (RECOMPILE);
     
    -- Enable rule again (*** IMPORTANT! ***)
    DBCC RULEOFF('GbAggToSort');
    GO
    -- ANY aggregate with Hash Match
    SELECT
        e.col1,
        e.col2,
        e.thing
    FROM
    (
        SELECT
            *,
            rn = ROW_NUMBER() OVER (
                PARTITION BY e2.col1, e2.col2
                ORDER BY e2.col1, e2.col2)
        FROM #Example AS e2
    ) AS e
    WHERE
        e.rn = 1
    OPTION (HASH GROUP);
  • Undocumented Query Plans: Equality Comparisons

    The diagram below shows two data sets, with differences highlighted:

    image

    To find changed rows using TSQL, we might write a query like this:

    image

    The logic is clear: join rows from the two sets together on the primary key column, and return rows where a change has occurred in one or more data columns.  Unfortunately, this query only finds one of the expected four rows:

    image

    The problem, of course, is that our query does not correctly handle NULLs.  The ‘not equal to’ operators <> and != do not evaluate to TRUE if one or both of the expressions concerned is NULL.  In this example, that statement is always true because we are comparing column references.  In other circumstances, the behaviour might depend on the ANSI_NULLS setting.  I am not going to go into that here, partly because new code should be written to assume ANSI_NULLS ON:

    image

    One obvious way to handle the NULLs in our working example is to write all the conditions out in full:

    image

    That produces the correct result (only the row with PK = 4 is identical in both sets):

    image

    Even with just three columns to check, the query is already getting quite long.  It is also quite easy to miss one of the combinations or to misplace a bracket.  In an attempt to reduce the size of the query, we might be tempted to use ISNULL or COALESCE:

    image

    The idea there is to replace any NULLs in the data with a particular value that allows the comparison to work as we expect.  This produces the correct result (with the test data) but there are a number of reasons to dislike this approach.

    For one thing, we have to be careful to choose a special replacement value that can never appear in the real data, now or at any time in the future.  Taking this approach is no different, in principle, from choosing not to store NULLs in the first place, and using the chosen special value as a default instead.

    Leaving the ‘to NULL or not to NULL’ debate to one side, the other issue is that COALESCE returns a value typed according to the input expression that has the highest data type precedence.  Many times, this will not matter, but it is possible to introduce subtle bugs this way.  Using ISNULL instead avoids this issue by ensuring that the data type of the first expression is used, but the problem of choosing appropriate special values remains.

    The final objection I have to this method is a bit more subjective: although the query looks simpler than before, COALESCE is just shorthand for a CASE expression.  Let’s compare the query plans for the two queries:

    image

    That’s the plan for the query with all the conditions written out in full.  The Filter contains this mess:

    SNAGHTML7ce46f7

    The COALESCE form query plan looks like this:

    image

    Now, the complexity is split between two operators.  The Compute Scalar contains the definitions shown below on the left, and the Clustered Index Seek contains the residual predicate shown on the right (click the image to enlarge it):

    SNAGHTML7d383a9

    Using ISNULL instead makes some difference – the graphical plan is visually identical to that obtained by using COALESCE, but the defined values and residual predicate are somewhat more concise:

    SNAGHTML7d89056

    An Alternative

    We are looking for rows that join based on the value of the PK column, but which contain a difference in at least one column.  Another way to state that is to say: for each row that joins, the intersection of the two rows should be empty.  If the two rows are identical the intersection of the two rows will be a single row; conversely if the two rows are different, the intersection will be empty.  Writing that logic in TSQL results in this query form:

    image

    The query accounts for NULLs correctly, and produces the correct result.  The query plan looks like this:

    image

    There are no surprises in the Clustered Index Scan, Clustered Index Seek, or Inner Join.  In particular, none of these operators define any expressions or apply any residual predicates.  The seek is the simple one we would expect: it seeks to the row with the matching PK column value.

    Looking at the Constant Scan reveals nothing, literally.  This operator produces a single row with no columns, using an in-memory query processor construct.  There really is nothing to see there, so we will move on to the last remaining operator: the Left Anti Semi Join.  If you were expecting to see complex logic similar to the CASE expressions seen earlier, prepare for a disappointment.  The Anti Semi Join contains the following:

    SNAGHTML7f1af99

    Aside from a (redundant) check that the two PK columns match, this predicate just checks that all three data column values match, using a simple equality (=) comparison.  (As a side note, we can avoid the redundant check on PK values by specifying just the data columns in the INTERSECT sub-query, rather than using the star syntax).

    To see how this works, consider that a Left Anti Semi Join passes rows through where no row is found on the inner input.  In this case, a row (with no columns) is always provided by the Constant Scan, but the predicate shown above is applied to it before the Anti Semi Join decides whether to pass the source row on or not.  If all the conditions evaluate to TRUE, the no-column row from the Constant Scan is retained, the Anti Semi Join finds that row on its inner input, and the source row does not pass on to the query output.

    The net effect is that if the two rows match in all columns, no row appears on the output.  In all other cases (where at least one difference exists) the current row is returned by the query plan.  This is the correct semantic, so the query returns the correct result. 

    NULL handling trickery

    At this point, you should be wondering how this query plan manages to handle NULLs correctly.  Consider the rows in the source tables with PK = 4.  Both rows are identical, but only if the NULLs present in the ival column compare equal.  The relevant part of the predicate shown above is:

    image

    In the case where both columns contain NULL, we would expect this simple equality comparison to return UNKNOWN, not the TRUE needed to ensure that the Anti Semi Join does not pass the source row to the output.  In other words, unless this equality comparison is doing something unusual, we would expect the query to incorrectly return a row for PK = 4 because the NULLs in the ival column should not compare equal.

    The answer lies in the way INTERSECT handles NULLs.  According to the INTERSECT documentation:

    image

    That explains why the INTERSECT query form produces correct results, but it does not say how the query plan achieves this.  Before we see the details of that, let’s look at what happens if we try to write the query using the same logic as the INTERSECT query plan:

    image

    That query produces the exact same graphical query plan as the INTERSECT form:

    image

    Even the predicate on the Anti Semi Join is identical:

    SNAGHTML811b17d

    But, despite the identical plan, this new query produces the wrong results!  It includes the row with PK = 4 in the output, due to the problem comparing the NULLs in those rows:

    image

    The Answer

    Although the graphical query plan (and even the extra detail available in the Properties window) shows no difference between the INTERSECT and NOT EXISTS query forms, there is a difference – one that implements the different comparison semantics involved.  In the INTERSECT form, the equality comparison must compare two NULLs as equal.  In the NOT EXISTS form, we are using a regular = comparison, one that should return UNKNOWN when two NULLs are compared.

    To see this difference, we have to look deeper into the query plan than the graphical form or properties window can take us.  Inspecting the XML behind the graphical plan, we see the following logic in both cases for the test on the PK column values:

    image

    Notice the compare operation is EQ – a test for equality between the two column references.  The EQ test does not return true for NULLs.  In the NOT EXISTS form of the query, the other columns are compared in exactly the same way, using EQ comparisons.  For example, this is the test on the ival column:

    image

    Now look at the XML for the ival column comparison in the INTERSECT query:

    image

    Now the compare operation is shown as IS instead of EQ.  This is the reason that NULLs compare equal in the INTERSECT test – it is using the comparison semantic familiar to TSQL users from expressions like WHERE x IS NULL.  This is the SQL language IS DISTINCT FROM feature – implemented by the query processor, but not yet available in the TSQL language.  If you agree that IS DISTINCT FROM would be a useful addition to TSQL, you can vote for Steve Kass’ Connect item here.

    For my part, I would rather see the language of the query processor exposed as an alternative to TSQL.  The query processor’s query specification language is much richer and more expressive than TSQL, and would not be bound by some of the bizarre behaviours maintained in TSQL for backward compatibility.  I live in hope, but I’m not holding my breath…

    © Paul White 2011

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    Test script:

    DECLARE @Set1 TABLE
    (
        pk        BIGINT PRIMARY KEY,
        ival    INTEGER NULL,
        cval    CHAR(1) NULL,
        mval        MONEY NULL
    );
     
    DECLARE @Set2 TABLE
    (
        pk        BIGINT PRIMARY KEY,
        ival    INTEGER NULL,
        cval    CHAR(1) NULL,
        mval        MONEY NULL
    );
     
    INSERT @Set1
        (pk, ival, cval, mval)
    VALUES 
        (1, 1000, 'a', $1),
        (2, NULL, 'b', $2),
        (3, 3000, 'c', NULL),
        (4, NULL, 'd', $4),
        (5, 5000, 'e', $5);
        
    INSERT @Set2
        (pk, ival, cval, mval)
    VALUES
        (1, 1000, 'a', NULL),
        (2, 2000, 'b', $2),
        (3, NULL, 'c', $3),
        (4, NULL, 'd', $4),
        (5, 5999, 'z', $5);
     
    -- Incorrect results, doesn't account for NULLs
    SELECT
        *
    FROM @Set1 AS t
    JOIN @Set2 AS s ON
        s.pk = t.pk
    WHERE
        s.ival <> t.ival
        OR s.cval <> t.cval
        OR s.mval <> t.mval;
     
    -- Correct, but verbose and error-prone    
    SELECT
        *
    FROM @Set1 AS t
    JOIN @Set2 AS s ON
        s.pk = t.pk
    WHERE
        s.ival <> t.ival
        OR (s.ival IS NULL AND t.ival IS NOT NULL)
        OR (s.ival IS NOT NULL AND t.ival IS NULL)
        OR s.cval <> t.cval
        OR (s.cval IS NULL AND t.cval IS NOT NULL)
        OR (s.cval IS NOT NULL AND t.cval IS NULL)
        OR s.mval <> t.mval
        OR (s.mval IS NULL AND t.mval IS NOT NULL)
        OR (s.mval IS NOT NULL AND t.mval IS NULL);
     
    -- COALESCE: Correct results, but problematic
    SELECT
        *
    FROM @Set1 AS t
    JOIN @Set2 AS s ON
        s.pk = t.pk
    WHERE
        COALESCE(s.ival, -2147483648) <> COALESCE(t.ival, -2147483648)
        OR COALESCE(s.cval, '¥') <> COALESCE(t.cval, '¥')
        OR COALESCE(s.mval, $-922337203685477.5808 ) <> 
            COALESCE(t.mval, $-922337203685477.5808)
     
    -- ISNULL: Correct results, but problematic
    SELECT
        *
    FROM @Set1 AS t
    JOIN @Set2 AS s ON
        s.pk = t.pk
    WHERE
        ISNULL(s.ival, -2147483648) <> ISNULL(t.ival, -2147483648)
        OR ISNULL(s.cval, '¥') <> ISNULL(t.cval, '¥')
        OR ISNULL(s.mval, $-922337203685477.5808 ) <> 
            ISNULL(t.mval, $-922337203685477.5808)
     
    -- INTERSECT:
    -- Correct results in a compact form
    SELECT
        *
    FROM @Set1 AS t
    JOIN @Set2 AS s ON
        s.pk = t.pk
    WHERE
        NOT EXISTS (SELECT s.* INTERSECT SELECT t.*)
     
    -- NOT EXISTS:
    -- Same query plan, but different results!
    SELECT 
        *
    FROM @Set2 AS s
    JOIN @Set1 AS t ON 
        t.pk = s.pk
    WHERE 
        NOT EXISTS 
        (
            SELECT 1
            WHERE 
                t.pk = s.pk
                AND t.ival = s.ival
                AND t.cval = s.cval
                AND t.mval = s.mval
        )
  • How Parallelism Works in SQL Server

    You might have noticed that January was a quiet blogging month for me.  Part of the reason was that I was working on a series of articles for Simple Talk, examining how parallel query execution really works.  The first part is published today at:

    http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/.

    This introductory piece is not quite as deeply technical as my SQLblog posts tend to be, but I hope there be enough interesting material to make it worth a read.

    Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • SQL Server Bug: Slow T-SQL Sums and Averages

    It’s a curious thing about SQL that the SUM or AVG of no items is not zero, it’s NULL.  In this post, you’ll see how this means your SUM and AVG calculations might run at half speed, or worse.  As with most of my blog entries though, today’s instalment is not so much about the result, but the journey we take to get there.

    Before we get started on that, I just want to mention that there’s a problem with the Google Reader feed for this blog, so those of you that use that will have missed two recent entries: Seeking Without Indexes and Advanced TSQL Tuning: Why Internals Knowledge Matters.  Accessing the site directly always works of course :)

    Ok, on to today’s story.  Take a look at this query:

    SUM Of None Is NULL

    Notice the explicit NULL.  The optimizer can see that this query contains a contradiction (1=0) so there’s no need to access the table at all.  The query will always compute an average over an empty set of rows, and the answer will always be NULL.  The Constant Scan operator constructs a row in memory without accessing any base tables, or needing to acquire any locks.  Let’s look at a less trivial example now.

    SUM Of None Is NULL 2

    Following the flow of data from right to left, the first thing to notice is that the optimizer has decided that the cheapest way to find all the ProductID values from the table is to scan the non-clustered index AK_Product_rowguid.  As the index name suggests, this is a non-clustered index over the row_guid column. Because ProductID forms the unique clustered index for the table, the non-clustered index also includes the ProductID keys at the leaf level of the index.  The non-clustered index is much narrower than the clustered index (because the cluster contains all in-row table columns at the leaf level) so all ProductID values in the table are stored more densely in the non-clustered index.  The higher density (and smaller size) of the non-clustered index means it is the lowest cost access method if all we need are product ids.

    The output from the Index Scan operator is a stream of rows containing a single column – ProductID.  The Stream Aggregate consumes all the rows from the Index Scan to compute two values: the number of rows encountered (Count(*)); and the SUM of the same values (SUM(ProductID)).  The output from the Stream Aggregate is a single row with two columns: the result of the count is labelled [Expr1011] and the result of the sum is [Expr1012], though these labels may be different for you.

    The final step to compute the average of ProductID values is performed by the Compute Scalar.  It computes a single expression ([Expr1003]) using a CASE statement, which first checks to see if the count of rows ([Expr1011]) is zero.  If it is zero, the result is defined as NULL, otherwise the average is computed as the SUM divided by the COUNT – [Expr1012] / [Expr1011].  Notice that the left-to-right evaluation guarantee of the CASE statement ensures that a divide-by-zero error never occurs.

    The plan for the same query using SUM instead of AVG is very similar:

    SUM Of None Is NULL 3

    The only difference is in the Compute Scalar: the ELSE clause now returns the SUM (if any rows were counted), rather than the average.  The interesting thing about both cases (SUM and AVG) is that the optimizer produces a plan which counts the rows as well as summing them.  In both cases, the count is needed so the Compute Scalar can produce a NULL if no rows are aggregated.

    One might think that it would be more efficient for the Stream Aggregate to just set a bit flag indicating whether zero or more rows were seen, rather than counting all of them, but that is not the way the product works today.  The question is, does this extra counting operation add any significant overhead to the query execution?  The answer will surprise you – it isn’t as simple as it seems.  To demonstrate the effect, we will need a larger table than the AdventureWorks database can provide.

    USE     master;
    GO
    IF      DB_ID(N'SumTest')
            IS NOT NULL
    BEGIN
            ALTER DATABASE SumTest SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
            DROP DATABASE SumTest;
    END;
    GO
    CREATE  DATABASE SumTest
            COLLATE LATIN1_GENERAL_CS_AS;
    GO
    ALTER   DATABASE SumTest
            MODIFY FILE
            (
            NAME = N'SumTest',
            SIZE = 1GB,
            MAXSIZE = 1GB
            );
    GO
    ALTER   DATABASE SumTest
            MODIFY FILE
            (
            NAME = N'SumTest_log',
            SIZE = 256MB,
            MAXSIZE = 256MB
            );
    GO
    USE     SumTest;
    GO
    ALTER   DATABASE SumTest SET ALLOW_SNAPSHOT_ISOLATION OFF;
    ALTER   DATABASE SumTest SET AUTO_CLOSE OFF;
    ALTER   DATABASE SumTest SET AUTO_SHRINK OFF;
    ALTER   DATABASE SumTest SET AUTO_CREATE_STATISTICS ON;
    ALTER   DATABASE SumTest SET AUTO_UPDATE_STATISTICS ON;
    ALTER   DATABASE SumTest SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
    ALTER   DATABASE SumTest SET PARAMETERIZATION SIMPLE;
    ALTER   DATABASE SumTest SET READ_COMMITTED_SNAPSHOT OFF;
    ALTER   DATABASE SumTest SET RECOVERY SIMPLE;
    ALTER   DATABASE SumTest SET RESTRICTED_USER;
    GO
    CREATE  TABLE dbo.Test 
            (
            id          BIGINT IDENTITY (1,1) NOT NULL,
            padding     CHAR(7999) NOT NULL,
            
            CONSTRAINT [PK dbo.Test (id)]
                PRIMARY KEY CLUSTERED (id),
            )
    ;
    GO
    -- Minimally-logged in 2008 onward
    INSERT  INTO dbo.Test
            WITH (TABLOCKX)
            (
            padding
            )
    SELECT  TOP (100000)
            padding = REPLICATE(CHAR(65 + (Data.n % 26)), 7999)
    FROM    (
            SELECT  TOP (100000)
                    n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
            FROM    master.sys.columns C1,
                    master.sys.columns C2,
                    master.sys.columns C3
            ORDER   BY 
                    n ASC
            ) AS Data
    ORDER   BY
            Data.n ASC
    ;

    That script creates a new database containing a single table with 100,000 very wide rows.  The id column is BIGINT IDENTITY, and the padding column is defined as CHAR(7999):

    Sample Data

    To make the test CPU-intensive, we’ll calculate CHECKSUM values for the REVERSE of the strings in the padding column, and SUM the result.  The natural way to write that query follows.  The convert to BIGINT is just to avoid an arithmetic overflow.

    SUM Test 1

    The plan is essentially the same as the ones seen previously, with the addition of an extra Compute Scalar to calculate the new processing-intensive expression.  This query runs for around 20 seconds, even with the table data entirely in memory.  If we run a similar query that uses a MIN or MAX aggregate instead of SUM, the query returns results in 10 seconds – half the time:

    SUM Test 2

    This is a very similar plan, but without the final Compute Scalar (the one that decides whether to return NULL or not) seen in the SUM and AVG plans.  Some people might compare the MAX and SUM queries and conclude that the difference in execution time is down to the missing Compute Scalar.  That would be wrong because the 10 seconds of extra CPU time cannot possibly be caused by a simple scalar computation operating on the single row present at that stage of the plan.

    In fact, the problem is in the Stream Aggregate, which in the SUM plan is computing two expressions:

    [Expr1009] = COUNT_BIG([Expr1003])
    [Expr1010] = SUM([Expr1003])

    [Expr1003] is our compute-intensive calculation, contained in the preceding Compute Scalar.  You would think that by the time the Stream Aggregate processes a row, the preceding Compute Scalar has already assigned a result value to [Expr1003] for that row, right?  Wrong!  Compute Scalars are implemented a little differently from other plan operators.  In general, a Compute Scalar operator does not perform the calculation as rows flow through it: the work is deferred until a later plan operator needs the value.  In most cases, this is a useful optimization because rows may be eliminated by a later join or filter before the result value is really needed.

    In this case, the Stream Aggregate is the first downstream operator to need the value of [Expr1003] – and it references it twice: once for the count, and once for the sum.  The shocking truth is that the plan evaluates the whole CHECKSUM(REVERSE(padding)) expression twice – and one of those times it is just counting rows so it knows whether to return NULL later on or not.

    We can verify this is the case by writing the query a little differently, to ensure that the expensive expression is calculated once before the Stream Aggregate receives the row:

    SUM Test 3

    Now, the expensive expression is calculated in a Constant Scan operator rather than a Compute Scalar.  The Stream Aggregate still references the expression twice, once for the COUNT and once for the SUM, but it is counting and summing the result of the calculation, not causing the calculation to be performed twice.  This query form produces the same result as the 20-second query, but in 10 seconds.  Both tests push a single CPU to 100% utilization, in case you were wondering about that.  To really make the point, consider this query:

    SELECT  SUM(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))),
            SUM(CONVERT(BIGINT, 0 + CHECKSUM(REVERSE(T.padding))))
    FROM    dbo.Test AS T
    OPTION  (MAXDOP 1)
    ;

    In the resulting plan, the Stream Aggregate references the CHECKSUM(REVERSE(padding)) expression four times, and the query runs for 40 seconds at 100% CPU!

    You might also wonder why the rewrite uses an OUTER APPLY instead of CROSS APPLY.  The reason is that with a CROSS APPLY, the optimizer sees through the trick and transforms the query plan to exactly the same form as before – and the query runs for 20 seconds as a result.  The OUTER APPLY is enough the defeat an optimizer transformation rule, resulting in a plan featuring a Left Outer Join.  Before you go off to rewrite all your SUMs and AVGs using OUTER APPLY, consider:

    1. The OUTER APPLY trick is a hack that depends on implementation details – it might not work tomorrow.
    2. This problem only occurs if the first operator to need an expression result references it more than once.
    3. This bug is only noticeable if the expression is expensive.

    So this is a confirmed bug in SQL Server, and the good news is that it has been fixed – but only in a private build of Denali (SQL11).  The current public Denali build (CTP 1) does not contain the fix, and there is no word on whether it will be made available for users of earlier SQL Server versions.

    Finally for today, I want to show you another good reason not to use the OUTER APPLY pattern as a fix.  Let’s assume we don’t know about this problem, and need to optimize the original CHECKSM(REVERSE(padding)) query.  The obvious solution is to add a computed column, and index it:

    ALTER   TABLE 
            dbo.Test 
    ADD     [Expr1003] 
            AS CHECKSUM(REVERSE(padding))
    ;
    -- Index it
    CREATE  INDEX 
            [IX dbo.Test Expr1003]
    ON      dbo.Test
            (
            [Expr1003]
            )
    WITH    (
            FILLFACTOR = 100,
            MAXDOP = 1,
            SORT_IN_TEMPDB = ON
            )
    ;

    Notice in passing that the new computed column is not persisted, so it doesn’t require any extra storage in the table itself, and does not fragment the clustered index.  Re-running the natural form of the query:

    SUM Test 4

    It now completes in 30ms – quite an improvement over 20,000ms!  The Stream Aggregate still performs the two SUM and COUNT operations, but now the expression result is available directly from the index – it isn’t calculated at all, let alone twice.

    If we re-run the OUTER APPLY query, we find that it does not use the index and uses exactly the same plan as it did previously, again taking 10 seconds to run.  Ironically, it fails to take advantage of the index for the same reason it previously performed better: the trick prevents the optimizer from matching the computed expression to the index.  Attempting to force the use of the index results in a very unhappy plan:

    SUM Test 5

    The optimizer still can’t make the transformation to use the pre-computed value from our index – but it honours the hint in the only way it can, by using the index to retrieve id column values (the unique clustering key included in the non-clustered index).  It then performs a bookmark lookup for every row into the clustered index to retrieve the padding column, and finally performs the expensive calculation once per row in the Constant Scan.

    The Sort operator is there to ‘optimize’ the bookmark lookups: by pre-sorting on the clustering key (id) it hopes to turn the normally random I/O associated with a bookmark lookup into sequential I/O.  With the data all in memory, this plan still executes in ten seconds, but if the data needed to be fetched in from disk…well you try it.

    So, how important is this bug?  Well, it depends.  You would have to take quite a deep look into each one of your production query plans to see how often a plan operator references a Compute Scalar expression more than once, where the expression is expensive to compute.  That’s probably not practical (though you could write an XQuery to search the plan cache for examples I suppose) but it is something to be aware of, just in case you come across a query that performs much slower than you might expect.

    Anyway, the point of this post, again, is that deep internals knowledge can come in very useful!

    © 2011 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement