THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

  • Parallel Execution Plans Suck

    Summary: A deep dive into SQL Server parallelism, and a potential performance problem with parallel plans that use TOP.

    There was an interesting question asked by Mark Storey-Smith on dba.stackexchange.com back in October 2011.  He was looking at the execution plan for a query that counts a million rows from a virtual auxiliary table of numbers.  The query below is a slightly-modified version of the one in the original post:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L CROSS JOIN N1 AS R), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 1);

    This particular virtual numbers table is capable of producing up to a (short-scale) billion rows, via a number of cross joins, but the final common table expression N6 limits it to one million with the TOP clause.  The query plan is just a sequence of cross joins of the ten in-memory rows defined by the first common table expression, N1 (click to enlarge):

    One million row serial quey plan

    As usual, the Constant Scan operator is used to generate rows in memory without accessing a physical table, but this one has an interesting property: the ‘virtual table’ contains ten rows, but no columns.  The query itself only counts rows, it does nothing at all with any column in those rows, and the optimizer contains logic to only project columns that are needed later in the query plan.  If you look at the query plan in SSMS or SQL Sentry Plan Explorer, you will see that the Constant Scans have a blank output columns list; they project no columns at all.

    Execution Plans Suck

    The plan above does illustrate an important concept in plan-reading: execution plans start executing at the left-most node.  People are often told to read execution plans from the top-right, and that is fine if you just want to follow the flow of data – so long as you bear in mind that the flow of program control starts at the root (far left).

    Rob Farley (blog | twitter) sums this up by saying “execution plans suck”; a reference to the fact that rows are pulled up the tree by parent operators requesting rows, one at a time, from the its child operator(s).  Query execution follows a demand-based pipeline model (except in batch mode in parallel 2012 column-store plans, but that is a tangent we will not be pursuing today).  I’m not going to labour the point, but if you are interested to understand this better, take a look at my previous post on the topic, or Brad Schulz’s entertaining overview.

    Let’s look at the execution plan (with runtime statistics) changing the TOP specification from one million to one hundred to make it easier to see what’s going on:

    TOP 100 Plan

    I’m just showing part of the plan for clarity.  There are another four Constant Scans off to the right that all produce one row.  If query plans really did start executing at the top right, we would expect one row from the right-most Constant Scan, ten rows from its parent, and one hundred at the next operator up the tree (working right to left).  As it is, the expected pattern (one, then ten, then one hundred) appears closest to the Top operator.  This only makes sense if a row at a time is sucked up the plan from the root.  The pipelined (row-by-row) model also explains why execution stops after 100 rows; the Top operator in the plan simply stops requesting a new row from its immediate child at that point.

    The Question

    Back to the main thrust of today’s post.  The question arose when Mark ran his query with parallelism enabled:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L CROSS JOIN N1 AS R), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 2);

    That produces an actual execution plan like this (click to enlarge):

    Parallel Top One Million

    The question relates to the actual number of rows shown entering and leaving the Gather Streams exchange operator:

    Gather Streams Exchange

    As expected, one million rows leave the exchange operator, but the plan shows 1,004,588 rows entering it.  So the question is, are the row counts wrong, or if not, where did the rows go?

    The Answer

    As you may know, the answer lies in the fact that exchanges contain buffers.  For efficiency reasons, rows are not passed across threads one by one as in the normal model discussed above.  Instead, producer threads (on the right side of the exchange operator) pack rows one at a time into packets, and push completed packets across the exchange.  Consumer thread(s) on the left side of the exchange retrieve rows one at a time from the current packet on demand, re-establishing the demand-based pipeline model.  The internal class name for the packets, by the way, is Class eXchange PACKET – which gives us the familiar CXPACKET moniker.

    Exploring the detail at little more, we can see from the execution plan that the two independent producer threads to the right of the exchange pack a total of 499,225 + 505,363 = 1,004,588 rows into packets, and the single consumer thread (thread zero) retrieves just the one million rows needed by the Top operator:

    Per Thread Row Counts

    So, the actual row counts shown on the execution plan are correct, and the ‘missing rows’ are accounted for by rows that were added to packets but not ultimately needed by the consumer.  After the one millionth row is passed by the Top to the Stream Aggregate (and remember all these things happen one row at a time), the next time the Top gets control, it starts the process of shutting down the pipeline below it, rather than asking the Gather Streams exchange for another row.  Glossing over the finer details a little, instead of the Top calling a GetRow() method on the Gather Streams iterator, it calls a Close() method instead.

    SQL Server 2008+

    On SQL Server 2005 that is the end of the story.  On SQL Server 2008 and later (including 2012), there is more.  Let’s run the query one more time, but this time with a degree of parallelism of three, rather than two.  Sometimes, we will get a plan that contains this sort of row count arrangement at the Gather Streams:

    DOP 3 Gather Streams

    Much the same as before, a few thousand extra rows are processed by the producer threads than are ultimately needed by the consumer.  That’s fine, of course, the threads in a parallel plan execute independently, so there are bound to be small timing differences that lead to this sort of effect.  Every so often, however, executing this query on SQL Server 2008 or above will produce a result like this:

    Monster Row Count

    Whoa.  Nearly 335 million rows entering the exchange – and the query now runs for 50 seconds or so, having run consistently for around 200ms previously.  Looking at the per-thread actual row counts is revealing:

    Monster Per Thread

    If everything were perfectly balanced, we might expect each of the three producer threads to process around 333,000 rows.  Indeed, thread 1 and thread 2 are in this ballpark area, but thread 3 ground through 334 million rows on its own!  I should mention that there is nothing special about thread 3 (you are as likely to find the huge row count on thread 1 or thread 2, the labels are arbitrary).  Indeed, the problem can occur on any or all threads, as a second example run at DOP 3 shows below:

    DOP 3 Second Example

    DOP 3 Per Thread Row Counts 2

    This time two threads went rogue, resulting in over 667 million rows being processed in a total execution time of 67 seconds.

    Parallelism Problems?

    There are other parallelism (exchange) operators in the plan, though we have concentrated only on the final Gather Streams operator so far.  By the way, Gather Streams is also known as a Start Parallelism operator – a name that might surprise you since it seems to mark the end of parallelism in the plan, rather than the start.  Remember that execution plans suck.  The ‘final’ Gather Streams is in fact the first operator to start executing, and it is responsible for starting the parallel workers and attaching them to the threads that were reserved for this plan at the very start of execution.  Anyway, I digress.  Let’s look at the next exchange in the plan – reading left to right of course.  This is a Repartition Streams exchange operating in Round Robin distribution mode:

    Repartition Streams

    This shows that 1,000 rows arriving on the producer side of the exchange, and a total of 668 on the consumer side.  Not shown in the diagram above, the producer side has 340 rows on thread 1, and 330 rows on each of thread 2 and thread 3.  Note that these are not the same threads as the ones we saw numbered the same way before.  The numbering scheme is restarted for each independent parallel zone in the plan (and zones are bounded by a parallelism operator of one sort or another).  Anyway, the point is that the rows are pretty evenly distributed on the producer side of the exchange.

    On the consumer side (row counts illustrated above), things are rather different.  Thread 1 (in this zone) processes 334 rows, thread 2 gets 333, and thread 3 gets only one.  Now these are the same threads as shown in the 667 million row diagram.  I mentioned that parallel zones are bordered by parallelism operators; the current zone is bordered by the Repartition Streams on its right side, and by the Gather Streams on its left.  The same three threads are consumers at the Repartition Streams, and producers at the Gather Streams operator:

    Consumer Producer Branch

    There is a clear relationship between the thread row counts at the consumer side of the Repartition Streams (334, 333, and 1 row) and the row counts at the producer side of the Gather Streams (334 million, 333 million, 338,614 rows).  The two problematic threads have multiplied their row counts by a factor of one million – precisely the effect of the six cross joins in this parallelism zone.  The Constant Scan virtual tables contain ten rows each, and multiplying by ten, six times over, gives a factor of one million.

    Rogue Row Goals

    Thread 3 in the example above ends up processing 338,614 rows.  This number has no special significance, except it shows that this thread did not run this portion of the plan to completion.  If it had, the single row it started with would have ended up as one million rows by the time it had been cross-joined six times with the ten-row Constant Scan table.

    This is the row goal in action (if you need a details on exactly what a row goal is, please see my previous post).  Essentially, though, things like a TOP operator or a FAST n hint set a row goal.  Plans normally run to completion, but row goals modify this, producing a plan that tries to optimize for a certain number of rows rather than the full potential set.  The TOP operator is even more special.  As discussed briefly before, TOP can bring execution to an early end – instead of continuing to ask for rows from its child iterator, it calls the Close() method instead.  This call filters down the tree, and execution comes to an early end.  (Strictly, row goals affect optimizer choices and plan costing rather than being associated with the early end of execution, but I like the phrase ‘rogue row goal’).

    So, thread 3 did not run to completion – it responded to the early Close() call and only processed 338 thousand of the one million rows it could have produced if left alone.  Threads 1 and 2 never received the Close() call, or chose to ignore it.  These two rogues went on to process their full potential row set – 667 million rows – despite the fact that the TOP had seen the million rows it needed and was waiting for operations elsewhere in the parallel plan to stop.  We can see this by looking at the sys.dm_os_waiting_tasks DMV while the long-running query is executing:

    e_waitPortClose

    The output is split across two lines for readability, and shows execution context zero blocked on a CXPACKET wait by both execution context 1 and execution context 3.  Execution context zero is always thread zero – the so-called coordinator thread that runs the serial part of the execution plan to the left of the left-most Gather Streams operator.  Put another way, context zero always runs the part of the plan before (reading left to right) the Start Parallelism operator (and therefore always runs a serial plan).

    OK, so CXPACKET means the thread is involved in a parallelism-related wait.  The extra detail in the resource_description column tells us that the wait is occurring at the node id 2.  Checking the execution plan, we see that node 2 is indeed the Gather Streams exchange:

    Node 2

    The wait type of e_waitPortClose means the consumer is waiting for the port to close.  If you are wondering what a port is in this context, I will just say the parallelism architecture is more complex than just CXPACKET – the wiring includes a CXPort class, a CXPipe class, a CXTransLocal (local transport) class and a CXTransRemote class (for the Parallel Data Warehouse edition).  There is also a linked map structure that shows how the various pipes, ports, and transports connect together.  Closing the port is one step in shutting down part of a parallel plan which is running on the other side of an exchange (via a transport, pipe, and port).  The stack trace below shows thread zero waiting for the port to close:

    Stack Trace

    The important things are that (a) this problem does not occur in SQL Server 2005; and (b) a number of changes were made to the internal parallelism implementation in SQL Server 2008.  These changes seem to have introduced a bug, where the consumer can wait for the port to close, but one or more producers either ignore the request, or fail to check for it, and go on to process the whole potential result set instead of stopping early.

    Not Just Nested Loops

    Fellow SQLblogger Joe Chang (blog) suggested in the comments that this problem may only occur with parallel nested loops joins.  The script below reproduces the problem with parallel hash joins:

    WITH
        N1 (n) AS
        (
            -- 10 rows
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1
        ),
        N2 AS (SELECT L.n FROM N1 AS L JOIN N1 AS R ON L.n % 1 = R.n % 1), -- 100 rows
        N3 AS (SELECT L.n FROM N2 AS L JOIN N2 AS R ON L.n % 1 = R.n % 1), -- 10,000 rows
        N4 AS (SELECT L.n FROM N3 AS L JOIN N3 AS R ON L.n % 1 = R.n % 1), -- 100,000,000 rows
        N5 AS (SELECT L.n FROM N4 AS L JOIN N1 AS R ON L.n % 1 = R.n % 1), -- 1,000,000,000 rows
        N6 AS 
        (
            SELECT TOP (1000000) n
            FROM N5
        )
    SELECT 
        COUNT_BIG(*)
    FROM N6
    OPTION (RECOMPILE, MAXDOP 4, QUERYTRACEON 8649);

    The full execution plan is rather large, but the key part is shown below (click to enlarge):

    parallel hash join plan

    Yes, that is 700 million rows entering the Gather Streams exchange (click to enlarge):

    parallel hash row counts

    The bug does not occur in every query plan with Top and parallelism, but the choice of nested loops join is not the cause.

    Final Thoughts

    The potential for poor performance and excessive processor usage here is obvious; and the chance of hitting the race condition gets worse at higher DOP.  With eight threads per parallel zone (DOP 8), I hit this issue almost every time on SQL Server 2008, 2008 R2, and 2012.  Because this behaviour does not occur on SQL Server 2005, but does on 2008 and later, I have filed this as a bug on Connect:

    https://connect.microsoft.com/SQLServer/feedback/details/740234/poor-performance-with-parallelism-and-top

    Further Reading:

    http://sqlblog.com/blogs/paul_white/archive/2010/08/05/iterators-query-plans-and-why-they-run-backwards.aspx
    http://bradsruminations.blogspot.co.nz/2010/11/second-in-life-of-query-operator.html
    http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server
    http://blogs.msdn.com/b/craigfr/archive/2006/10/25/the-parallelism-operator-aka-exchange.aspx
    http://sqlblog.com/blogs/paul_white/archive/2010/08/18/inside-the-optimiser-row-goals-in-depth.aspx

    © 2012 Paul White
    Twitter: @SQL_Kiwi (with an underscore)
    Email: SQLkiwi@gmal.com (no underscore)

  • Query Optimizer Deep Dive - Part 4

    This is the final part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 2 Part 3

    Beating the Optimizer

    Our test query produces an optimized physical execution plan that is quite different from the logical form of the query.  The estimated cost of the execution plan shown below is 0.0295 units.

    Optimized Query Plan

    Since we know the database schema very well, we might wonder why the optimizer did not choose to use the unique nonclustered index on names in the product table to filter rows based on the LIKE predicate.  We could use an index hint to force the name index to be used:

    Forced Index Hint

    That’s great, and no doubt the index seek is cheaper than the scan we had previously, but the optimizer has still chosen to use a merge join, and that means having both inputs sorted on Product ID.  The result of the index seek is ordered by name (the index key) rather than Product ID, so a sort is required.  It looks like the new sort adds a little more cost than the seek saves over the scan, because the estimated cost of the query plan with the index hint is 0.0316 units.

    Naturally, these numbers are rather small since AdventureWorks is not a large database, but these differences can be important in real systems.  Anyway, let’s persist with the index seek idea; why is the optimizer so keen on a merge join, even though it involves an extra sort, and we don’t have an ORDER BY Product ID clause on our query?  Without a top-level ORDER BY, we are giving the optimizer the freedom to return results in any order that is convenient – perhaps we can do better by forcing the index seek and a hash join instead of a merge join?

    Forced Hash Join

    Well the sort has gone, so the plan looks visually a little simpler, but the estimated cost has increased again, to 0.0348 units.  Hash join has quite a high start-up cost (and it requires a memory workspace grant).  We could try other things, but it certainly seems that the optimizer had it right to begin with in this case.

    The manual exploration above shows that the optimizer does generally find a good plan quickly (and sometimes it may even find the best possible plan). The terms ‘good’ and ‘best’ here are measured in terms of the optimizer’s own cost model. Whether one particular plan shape actually executes faster on a given system is another question. I might find, for example, that the first merge join plan runs fastest for me, whereas you might find the seek and sort runs fastest for you. We might both find that which is faster depends on whether the indexes and data needed are in memory or need to be retrieved from persistent storage. All these things aside, the important thing is that we are both likely to find that the optimizer’s plans are pretty good most of the time.

    Models and Limitations

    There are three principal models used by the optimizer to provide a basis for its reasoning: cardinality estimation, logical relational equivalences, and physical operator costing.  Good cardinality estimation (row count expectations at each node of the logical tree) is vital; if these numbers are wrong, all later decisions are suspect.  Fortunately, it is relatively easy to check cardinality estimation by comparing actual and estimated row counts in query plans.  There are some subtleties to be aware of – for example when interpreting the inner side of a nested loops join, at least in SSMS.  If you use the free SQL Sentry Plan Explorer tool, many of these common misinterpretations are handled for you.

    The model used in cardinality estimation is complex, and contains all sorts of hairy-looking formulas and calculations.  Nevertheless, it is still a model, and as such will deviate from reality at least to some extent.  I’ll have more to say later on about what we can do to help ensure good cardinality estimates, but for now we will just note that the model has its roots in relational theory and statistical analysis.

    Relational equivalences (such as A inner join B = B inner join A) are the basis for exploration rules in the cost-based optimizer.  Not all possible relational transformations are included in the product (remember the goal of the optimizer is to find a good plan quickly, not to perform an exhaustive search of all possible plans).  As a consequence, the SQL syntax you use will often affect the plan produced by the optimizer, even where different SQL syntaxes express the same logical requirement.  Also, the skilled query tuner will often be able to do better than the optimizer, given enough time and perhaps a better insight to the data.  The downside of such manual tuning is that it will usually require manual intervention again in the future as data volumes and distributions change.

    Physical operator costing is very much the simplest of the three models, using formulas that have been shown to produce good physical plan selections for a wide range of queries on a wide range of hardware.  The numbers used probably do not closely model your hardware or mine, but luckily that turns out not to be too important in most cases.  No doubt the model will have to be updated over time as new hardware trends continue to emerge, and there is some evidence in SQL Server 2012 show plan output that things are heading in that direction.

    Assumptions

    All models make simplifying assumptions, and the cardinality estimation and costing models are no different.  Some things are just too hard to model, and other things just haven’t been incorporated into the model yet.  Still other things could be modelled, but turn out not to add much value in practice and cost too much in terms of complexity or resource consumption.  Some of the major simplifying assumptions that can affect real plan quality are:

    • All queries start executing with a cold cache
      This isn’t as crazy as it sounds.  Fetching data from disk tends to dominate the overall cost of a query, modelling the amount of data that can be expected to be in cache already is hard, and this assumption does at least affect everything equally.  The optimizer does contain some logic to account for pages that might be in cache after the first access.
    • Statistical information is independent
      Correlations do frequently exist between columns in real databases, so this assumption can be problematic.  Multi-column statistics, filtered indexes, and indexed views can sometimes help with this.
    • Distribution is uniform
      This is assumed where the system has no information to the contrary.  One example: costing assumes that seeks (lookups) into an index are randomly distributed throughout the full range of the index.

    Helping the Optimizer

    There are three approaches to working with the optimizer:

    1. Ignore it.  It works well out of the box with default settings and automatic statistics.
    2. Override it using syntax tricks, hints, and plan guides.
    3. Provide it with the best information you can, and override it for just a small number of problematic queries.

    All three are valid approaches in different circumstances, though the third option is probably the one to recommend as the best starting point.  There is much more to helping the optimizer than just making sure statistics are up-to-date, however:

    Use a relational design

    This is probably the biggest single item.  The various models all assume that your database has a reasonably relational design, not necessarily 3NF or higher, but the closer your structures are to relational principles, the better.  Cardinality estimation works best with simpler relational operations like joins, projects, selects, unions and group by.  Avoid complex expressions and non-relational query features that force cardinality estimation to guess.  Also remember that the cost-based optimizer’s exploration rules are based on relational equivalences, so having a relational design and writing relational queries gives you the best chance of leveraging the hard work that has gone into those rules.  Some features added in 2005 (e.g. ranking functions) operate on sequences rather than multi-sets (hence the Sequence Project physical operator); the original rules all work with multi-sets, and the seams between the two approaches often show though.

    Use constraints

    Use check constraints, foreign keys, unique constraints to provide the optimizer with information about your data.  Simplification and exploration rules match patterns in the logical tree, and may also require certain properties to be set on the matched nodes.  Constraints and keys provide some of the most powerful and fundamental logical properties – omitting these can prevent many of the optimizer’s rules from successfully transforming to an efficient physical plan.  If an integer column should only contain certain values, add a check constraint for it.  If a foreign key relationship exists, enforce it.  If a key exists, declare it.  It is not always possible to predict the benefits of providing this type of optimizer information, but my own experience is that it is much greater than most people would ever expect.

    Statistics, indexes and computed columns

    Provide more than just default-sampled automatically-created statistics where appropriate (for example where distribution is unusual or correlations exist).  Create indexes that will provide potentially useful access paths for a wide range of queries.  If you have expressions on columns in your WHERE clause, consider adding a computed column that matches the expression exactly.  The computed column does not have to be persisted or indexed to be useful; the system can auto-create statistics on the computed column, avoiding cardinality guessing.

    Deep Trees

    Large, complex queries produce large, complex logical trees.  Any errors tend to multiply (perhaps exponentially) as the size of the tree increases, the search space of possible plans expands greatly, and things just become much more difficult in general.  Breaking a complex query into smaller, simpler, more relational, steps will generally get greater value out of the optimizer.  A side benefit is that small intermediate results stored in a temporary table can have statistics created, which will also help in many cases.  There will always be cases where a large complex query is required for ultimate performance, but very often the performance difference is relatively small and may not be worth the future maintenance costs as hand-tuned monolithic queries tend to be fragile.

    Opaque operators and new features

    User-defined functions (other than the in-line variety) may seem convenient, but they are almost completely opaque to the optimizer – it has to guess at the cardinality and distribution of rows produced.  Newer features also tend to have much shallower support in the engine than longer-established features that have been around for multiple versions of the product.  By all means use all the shiny new features, just be aware that combining them may produce plans of variable quality.

    Trace Flags

    To summarize the flags used in this series (all assume 3604 is also active):

    7352 : Final query tree
    8605 : Converted tree
    8606 : Input, simplified, join-collapsed, and normalized trees
    8607 : Output tree
    8608 : Initial memo
    8615 : Final memo
    8675 : Optimization stages and times

    The above were used in the presentation because they all work from 2005 to 2012.  There are a large number of other optimizer-related flags (some of which only work on 2012).  Some are listed below for people that like this sort of thing:

    2373 : Memory before and after deriving properties and rules (verbose)
    7357 : Unique hash optimization used
    8609 : Task and operation type counts
    8619 : Apply rule with description
    8620 : Add memo arguments to 8619
    8621 : Rule with resulting tree

    As usual, these are undocumented, and unsupported (including by me!) and purely for educational purposes.  Use with care at your own risk.

    Final Thoughts

    I hope you have gained some insight and intuition for the workings of the query optimizer: logical trees, simplification, cardinality estimation, logical exploration and physical implementation.  The geeky internals stuff is fun, of course, but I rather hope people came away from these sessions with a better understanding of how query text is transformed to an executable plan, and how relational designs and simpler queries can help the optimizer work well for you.  Taking advantage of the optimizer frees up time for more productive things – new projects, tuning the odd interesting query, whatever it is you would rather be doing than fighting the same query-tuning battles over and over again.

    There are some great resources out there to learn more about SQL Server. Make full use of them to continue to improve your technical skills and just as importantly, experiment with things to build your experience level. Take a deeper look at query plans and the properties they contain; there is a wealth of information there that sometimes requires a bit of thinking and research to understand, but the insights can be invaluable.

    The original slides and demo code is again attached as a zip file below.  Thanks for reading, all comments and feedback are very welcome.

    Finally, I would like to thank Adam Machanic (blog | twitter) for introducing me to the QUERYTRACEON syntax all that time ago, and Fabiano Amorim (blog | twitter) for his email last year that kick-started the process of developing this presentation and blog series.

    Links to other parts of this series: Part 1 Part 2 Part 3

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive – Part 3

    This is the third part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 2 Part 4

    Storage of Alternative Plans

    We saw in part 2 how optimizer rules are used to explore logical alternatives for parts of the query tree, and how implementation rules are used to find physical operations to perform each logical steps.  To keep track of all these options, the cost-based part of the SQL Server query optimizer uses a structure called the Memo.  This structure is part of the Cascades general optimization framework developed by Goetz Graefe.  The Memo provides an efficient way to store many plan alternatives via the use of groups.

    Each group in the Memo initially contains just one entry – a node from the input logical tree.  As exploration and implementation phases progress, new groups may be added, and new physical and logical alternatives may be added to existing groups (all alternatives in a group share the same logical properties, but will often have quite different physical properties).  Groups are also shared between plan alternatives where possible, allowing Cascades to search more plans in the same time and space compared with other optimization frameworks.

    Continuing with our example query, the input tree to cost-based optimization is first copied into the Memo structure.  We can see this structure using trace flag 8608:

    USE AdventureWorks2008R2;
    GO
    SET NOCOUNT ON;
    DBCC FREEPROCCACHE;
    DBCC TRACEON(3604);
    GO
    -- Initial memo contents
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (QUERYTRACEON 8608);

    As a reminder, this is the logical input tree to cost-based optimization shown in part 2:

    Cost-Based Optimization Input Tree

    This is copied to the Memo, one group per logical node, illustrated below:

    Optimizer Initial Memo Contents

    The group numbers and linked structure shown are obtained directly from the trace flag output:

    Trace Flag 8608 Output

    Each group has an entry number (all zero above since there is only one logical entry per group at this stage), with a logical operation (e.g. LogOp_Join), and the group numbers of any child groups (e.g. the logical join in group 13 references its two inputs, groups 8 and 9, and a logical comparison defined by the sub-tree starting at group 12).

    Optimization Phases

    One the initial Memo has been populated from the input tree, the optimizer runs up to four phases of search.  There is a documented DMV, sys.dm_exec_query_optimizer_info, that contains a number of counters specific to query optimization.  Four of these counters are ‘trivial plan’, search 0’, ‘search 1’, and ‘search 2’.  The value for each counter keeps track of the number of times a search phase was entered.  By recording the values before and after a specific optimization, we can see which phases were entered.  The current counter values on one of my test SQL Server instances is shown below, as an example:

    sys.dm_exec_query_optimizer_info

    We look at each of the four phases in a bit more detail next:

    Search 0 – Transaction Processing

    This phase is primarily concerned with finding good plans for OLTP-type queries, which usually join a number of tables using a navigational strategy (looking up a relatively small number of rows using an index).  This phase primarily considers nested-loops joins, though hash match may be used when a loops join implementation is not possible.  Many of the more advanced (and costly to explore) optimizer rules are not enabled during this search phase – for example search 0 never considers indexed view matching or parallel plans.

    Search 1 – Quick Plan (also known as Complex Query I)

    This search phase can use most or all of the available rules, can perform limited join reordering, and may be run a second time (to consider parallel plans only) if the first run produces a plan with a high enough cost.  Most queries find a final plan during one of the Search 1 runs.

    Search 2 – Full Optimization

    This phase uses the most comprehensive search configuration, and may result in significant compilation times in some cases.  The search is either for a serial or parallel plan, depending on which type was cheaper after search 1.

    The scripts accompanying this series show another way to see which phases were run, using trace flag 8675.  The output provides some extra insight into how things work (for example it shows the number of optimization moves (tasks) made in each stage).  The only documented and supported way to see the search phases is via the DMV, however.

    Entry and Termination Conditions

    Each phase has entry conditions, a set of enabled rules, and termination conditions.  Entry conditions mean that a phase may be skipped altogether; for example, search 0 requires at least three joined tables in the input tree.  Termination conditions help to ensure the optimizer does not spend more time optimizing than it saves – if the current lowest plan cost drops below a configured value, the search will terminate early with a ‘Good Enough Plan Found’ result.

    The optimizer also sets a budget at the start of a phase for the number of optimization ‘moves’ it considers sufficient to find a pretty good plan (remember the optimizer’s goal is to find a good enough plan quickly).  If the process of exploring and implementing alternatives exceeds this ‘budget’ during a phase, the phase terminates with a ‘Time Out’ message.  Early termination (for whatever reason) is part of the optimizer’s design, completely normal, and not generally a cause for concern.

    From time to time, we might wish that the optimizer had different goals – perhaps that we could ask it to continue searching for longer – but this is not how it works.  It is all too easy to over-spend on optimization (finding transformations is not hard – finding ones that are robustly useful in a wide range of circumstances is), and full cost-based exploration is a memory and processor-intensive operation.  The optimizer contains many heuristics, checks and limits to avoid exploring unpromising alternatives, to prune the search space as it goes, and ultimately produce a pretty good plan pretty quickly, most of the time, on most hardware, in most databases.

    Costing

    First a quick summary: The optimizer runs up to four phases, each of which performs exploration using rules (finding new logical alternatives to some part of the tree), then a round of implementation rules (finding physical implementations for a logical part of the tree).  New logical or physical alternatives are added to the Memo structure – either as an alternative within an existing group, or as a completely new group.  Note that the Memo allows the optimizer to consider very many alternatives at once in a compact and quick-to-search structure; the optimizer does not just work on one overall plan, performing incremental improvements (this is a common misconception).

    Having found all these alternatives, the optimizer needs a way to choose between them.  Costing runs after each round of implementation rules, producing a cost value for each physical alternative in each group in the memo (only physical operations can be costed, naturally).  Costing considers factors like cardinality, average row size, expected sequential and random I/O operations, processor time, buffer pool memory requirements, and the effect of parallel execution.

    Like many areas of optimization, the costing calculations are based on a complex mathematical model.  This model generally provides a good basis on which to compare alternatives internally, regardless of the workload or particular hardware configuration.  The costing numbers do not mean very much by themselves, so it is generally unwise to compare them between statements or across different queries.  The numbers are perhaps best thought of as a unit-less quantity, useful only when comparing alternative plans for the same statement.

    The model is just that of course: a model, albeit one that happens to produce very good results for most people most of the time.  The slide deck that accompanies this series contains details of some of the simplifying assumptions made in the model.

    Final Memo Contents

    At the end of each search phase, the Memo may have expanded to include new logical and physical alternatives for each group, perhaps some completely new groups.  We can see the contents after each phase using trace flag 8615:

    -- Final memo
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8615);

    The final Memo (after each phase, remember) can be quite large, despite the techniques employed to constrain the search space.  Our simple test query only requires a single run through search 1, and terminates early after 509 moves (tasks) with a ‘Good Enough Plan Found’ message.  Despite that, the Memo contains 38 groups (up from 18) with many groups containing several alternatives.  The extract below highlights just the groups that relate directly to the join implementation:

    Final Memo Trace Flag 8615 Contents

    Looking at the green shaded areas, group 13 item 0 was the original (joining groups 8 and 9 and applying the condition described by the sub-tree starting with group 12).  It has gained an logical alternative (via the Join Commute rule, reversing the join input groups), and two surviving physical implementations: a one-to-many merge join as item 7, with group 8 option 2 as the first input, and group 9 option 2 as the second; and a physical inner Apply driven by group 8 option 6, on group 30 option 2.  Group 20 is completely new – another one-to-many merge join option, but this time with inputs from groups 19.4 and 8.2 with 12 familiar as the join condition.  Again, the scripts accompanying this series can be used to explore the details further, if you wish.

    The final selected plan consists of physical implementations chosen by starting with the lowest-cost physical option in the root group (18, shaded pink above).  The sub-tree from that point has a total estimated cost of 0.0295655 (this is the same cost shown in the final graphical execution plan).  The plan is produced by following the links in the Memo, from group 18 option 4, to group 20 option 6, and so on to the leaves:

    Good Enough Plan Found

    Optimized Query Plan

    End of Part 3

    The original slides and demo code are attached below as a zip file.

    Links to other parts of this series: Part 1 Part 2 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive – Part 2

    This is the second part in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 1 Part 3 Part 4

    Cost-Based Optimization Overview

    The input to cost-based optimization is a tree of logical operations produced by the previous optimization stages discussed in part one.  Cost-based optimization takes this logical tree, explores logical alternatives (different logical tree shapes that produce the same results), generates physical implementations, assigns an estimated cost to each, and finally chooses the cheapest physical option overall.

    The goal of cost-based optimization is not to find the best possible physical execution plan by exploring every possible alternative; rather, the goal is to find a good plan quickly.  This approach gives us an optimizer that works pretty well for most workloads, most of the time.  If you think about it, that’s quite an achievement – after all, my database is likely very different from yours, and we probably run on quite different hardware as well.  The point about finding a good plan quickly is also important – generally, we would not want the optimizer to spend hours optimizing a query that runs for only a few seconds.

    This design has a number of important consequences.  First, a skilled query tuner will often be able to come up with a better plan than the optimizer does.  In some cases, this will because the human can reason about the query in a way the optimizer cannot.  Other times, it is simply a question of time – we might be happy spending half a day finding a great execution plan for a crucial query, whereas the optimizer places strict limits on itself to avoid spending more time on optimization than it saves on a single execution.

    Perhaps a future version of SQL Server will allow us to configure the optimizer to spend extra time on a particular query – today, we have to use query hints and ‘manual optimizations’ to encourage a particular execution plan shape.  The down side to circumventing the optimizer’s natural behaviour in this way is that we then have to take responsibility for ensuring that the execution plan remains good in the future, as data volumes and distributions change.  In most systems, it is best to limit the number of manually-tuned queries to a minimum, and thoroughly document any tricks you use to obtain a particular plan.

    Input Tree to Cost-Based Optimization

    Back to the task at hand.  We have a sample query (reproduced below) that has been through all the previous optimization stages, did not qualify for a trivial plan, and so needs to go through the cost-based optimization process.

    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name;

    The tree of logical operations passed to the optimizer looks like this:

    Cost-Based Optimizer Input Tree

    This looks very similar to the original logical tree seen in part one, though cardinality estimates have been added to each primary node, and the tree layout has been expanded a little into a form that is easier for the system to work with.  In addition to simple cardinality estimates, remember that each node also has statistics objects (histograms and frequency information) derived from the familiar statistics associated with the tables in the query.

    The tree above was built from information obtained using trace flag 3604 and 8606 (see part one for more details).  Trace flag 8606 shows the tree at various stages, the diagram above is built from the view after project normalization:

    Trace Flag 8606 Output

    Properties

    There are also a number of other ‘properties’ associated with each node, that have been added by previous optimization stages.  These properties include logical attributes, such as:

    • Output columns and expressions
    • Uniqueness (key) information
    • Type information and nullability
    • Functional dependencies
    • Domain ranges for each column or expression

    There are also some physical properties that may be present on each node, for example:

    • Sort order at each node
    • Partitioning information
    • Halloween protection

    The Halloween protection information is worth explaining in a bit more detail.  A query generally executes as a pipeline, with rows being requested one at a time from the top of the tree – so rows are in effect pulled up the tree one at a time.  This pipeline arrangement has a number of important advantages, but a problem can arise where the query modifies data: changes made by the query can affect rows being read by the same query – the classic example is a query that adds 10% to the salaries of every employee.  When processed as a pipeline, we can get stuck in an infinite loop as rows that have had 10% added re-qualify for reading.  This problem happened to be first discovered on 31 October 1976 and became known as the Halloween Problem.

    One solution is to separate operations that read data from those that update data by reading all rows into temporary storage before performing any updates.  In SQL Server, this simple solution typically manifests as an Eager Table Spool in the query plan – all rows are written eagerly to temporary storage before any updates are performed.  This is a bit of a sledgehammer solution though, so the optimizer keeps track of which columns need protection from the Halloween problem (and how much protection they need) at each node by setting properties.  Some logical operations (like sorting) naturally mean that all input rows have to be read before the first output row can be produced.  These operations provide Halloween protection naturally, so an explicit Eager Table Spool would not be necessary.  There are a number of logical operations that can provide some degree of Halloween protection, and properties are used to ensure that just enough protection is provided, at minimum cost.

    Rules

    The optimizer contains rules to transform trees.  There are four classes of rule:

    • Simplification
    • Exploration
    • Implementation
    • Physical property enforcement

    Simplification rules transform some part of the logical tree to a simpler logical form, and are responsible for the simplification transforms we saw in part one.  Exploration rules run only during cost-based optimization, generating new logical equivalents for some part of the existing logical tree.  Implementation rules produce a physical operation (like a hash or merge join) for a logical operation (a logical join).  Property enforcement rules introduce new elements to the logical tree to enforce some desired physical property at that point, for example to enforce a particular sorted order of rows (as might be required by a merge join or stream aggregate).

    There are 395 rules in SQL Serer 2012, most of which perform quite simple operations; it is the fact that many rules can be run in various orders that accounts for the apparent complexity of optimizer behaviour.  The way that simple rules can combine to produce complex behaviours reminds me of Conway’s Game of Life – a cellular automation program with only four rules that nevertheless can produce extraordinarily complex and beautiful patterns as the rules are applied over and over.

    Exploration Rule Examples

    One example of a simple exploration rule is SELonJN, a rule that merges a suitable relational SELECT (row filter) into a logical join operation:

    SELonJN Exploration Rule

    Another common exploration rule that generates a new logical alternative is JoinCommute:

    JoinCommute Exploration Rule

    As the name suggests, JoinCommute explores a different logical join order, exploiting the fact that A JOIN B is equivalent to B JOIN A for inner joins.  Though logically equivalent, the different join orders may have different performance characteristics once the logical alternatives are translated to a physical implementation by a later implementation rule.  A third exploration rule that is relevant to our test query is GbAggBeforeJoin, a rule that explores the possibility of pushing an aggregation operation under a join:

    GbAggBeforeJoin Transformation

    Implementation Rule Examples

    As mentioned previously, implementation rules transform part of a logical tree into a physical alternative:

    Optimizer Implementation Rules

    The diagram shows three join implementation rules, JNtoNL (nested loops), JNtoHS (hash join), JNtoSM (sort-merge join); two implementations for Group-By Aggregate, GbAggToStrm (Stream Aggregate) and GbAggToHS (Hash Aggregate); SelectToFilter, GetToScan, and GetIdxToRng.

    Which Rules Were Used?

    If we want to see which rules were used when optimizing a query, one option is to use an undocumented view which shows the number of times a rule has been asked to estimate how valuable it might be in the current context (its ‘promise’ value), the number of times the rule was used to generate an alternative section of the tree (‘built’), and the number of times the output of the rule was successfully incorporated into the search space (‘succeeded’).  A sample of the contents of the sys.dm_exec_query_transformation_stats DMV is shown below:

    sys.dm_exec_query_transformation_stats

    By taking a snapshot view of this information before and after optimizing a particular query, we can see which rules were run.  The DMV is instance-wide however, so you would need to run these tests on a system to which you have exclusive access.  The scripts that accompany this series of posts contain a complete test rig to show the rules used by each query.

    End of Part 2

    Part three in this series covers the Memo structure and optimization phases in detail.  The original presentation slides and demo code are contained in the zip file shown below.

    Links to other parts of this series: Part 1 Part 3 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Query Optimizer Deep Dive - Part 1

    This is the first in a series of posts based on the content of the Query Optimizer Deep Dive presentations I have given over the last month or so at the Auckland SQL Users’ Group and the SQL Saturday events in Wellington, New Zealand and Adelaide, Australia.

    Links to other parts of this series: Part 2 Part 3 Part 4

    Introduction

    The motivation behind writing these sessions is that I have found that relatively few people have a good intuition for the way the optimizer works, partly because the official documentation is rather sparse, and partly because what information is available is dispersed across many books and blog posts. The content presented here is very much geared to my preferred way of learning – it shows the concepts in what seems to me to be a reasonably logical sequence, and provides tools to enable the interested reader to explore further, if desired.

    When we write a query, we are not writing a computer program that can be directly executed.  SQL is a declarative language used to logically describe the results we want to see.  SQL Server goes through a process of compilation and optimization to create a physically executable implementation of the stated logical requirement.  To put it another way, a SQL query is the specification for an executable program that SQL Server writes for us.  The SQL Server component responsible for finding an efficient physical execution plan for a given logical query requirement is the Query Optimizer.

    The SQL Server Core Engine

    The diagram below shows conceptually where the Query Optimizer sits in the core SQL Server engine:

    SQL Server Core Engine Diagram

    Language Processing is concerned with things like parsing the text of the query to make sure it is syntactically valid, binding table references to real database objects, deriving types for expressions, semantic analysis (for example checking that the query is not trying to execute a table), and binding GROUP BY references to the appropriate logical scope.  The Query Optimizer aims to find a good physical execution plan that matches the logical semantic of the query, and the Query Executor is responsible for running that plan to completion.  Below that, the Storage Engine provides access to physical storage, together with locking and transaction services.  Finally, SQL-OS is the layer that provides threading, memory management, and scheduling services.

    Using the same colours for the components shown above, a very high-level overview of query processing looks like this:

    Query Processing Overview

    This series of blog posts is mostly concerned with the “Optimize” step above.  It takes as its input a tree of logical operations produced by the previous stages.

    Optimization Pipeline

    The diagram below shows the principal stages within query optimization, starting with the bound tree produced by the parser and algebrizer.  The coloured steps are the ones we will explore in depth – the right hand side of the diagram lists a bit more detail that I talk through when presenting this material live:

    Query Optimization Pipeline

    Input Tree

    We start by looking at the bound logical tree (input tree).  To make the discussion a little less abstract, we will use a test query issued against the AdventureWorks sample database.  This simple query shows products and the quantity in stock, limited to products with names that start with a letter between A and G:

    -- Test query
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name;

    The input tree generated by this query is shown below, slightly simplified:

    Optimizer Logical Input Tree

    The operations in this tree are all purely logical, and have their roots in Relational Theory.  The GET operation logically reads an entire table, though you should not yet be thinking in terms of physical operations such as a scan or a seek.  The JOIN is a cartesian product, logically joining every row in one table with every row in the other.  The SELECT is not a SQL statement, it is a relational operation: filtering rows based on some condition (known as a predicate).  Here, we are applying two SELECT operations to filter on matching product IDs and rows where the product name starts with a letter between A and G.  The next logical operation is a Group-By Aggregate, an extension to the basic relational algebra that logically groups rows by some common attribute, and optionally computes an aggregate expression for each group.  Finally, we have a relational Project operation, which filters columns – here we only need the Name and Total columns, and those are the only two returned in the logical result set.

    To see this tree directly, we can use undocumented trace flag 8605.  This flag also requires the common trace flag 3604 to produce output to the SSMS messages window.  We can use another undocumented syntax, the QUERYTRACEON hint, to set the trace flag just for this query:

    -- Input tree (ISO-92)
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM Production.Product AS p
    JOIN Production.ProductInventory AS inv ON
        inv.ProductID = p.ProductID
    WHERE
        p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8605);

    Now, in the SSMS messages tab, we see a textual representation of the tree of logical operators I drew with coloured boxes before:

    Trace Flag 8605 Output

    There are a few minor differences between this and the previous diagram.  In particular, the expression computed by the aggregate is added by a separate projection (omitted for clarity before), and the SQL-92 join syntax means the join predicate is more tightly bound to the logical join.  Rewriting the query using SQL-89 syntax highlights the second difference, producing a tree closer to the cartesian product followed by a relational SELECT:

    -- Input tree (ISO-89)
    SELECT
        p.Name,
        Total = SUM(inv.Quantity)
    FROM 
        Production.Product AS p,
        Production.ProductInventory AS inv
    WHERE
        inv.ProductID = p.ProductID
        AND p.Name LIKE N'[A-G]%'
    GROUP BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8605);

    Input Tree SQL-89 Syntax

    Simplification

    The first step of the optimization pipeline we will look at is simplification, where the optimizer looks to rewrite parts of the logical tree to remove redundancies and move logical operations around a bit to help later stages.  Major activities that occur around the time simplification occurs (I have taken a little artistic licence here to group a few separate stages together) include:

    • Constant folding
    • Domain simplification
    • Predicate push-down
    • Join simplification
    • Contradiction detection

    Constant folding is the process of evaluating an expression during optimization, rather than repeatedly at execution time.  In the following example, the complex expression in the WHERE clause can safely be evaluated early, resulting in “WHERE p.Name LIKE ‘D%’:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE p.Name LIKE SUBSTRING(LEFT(CHAR(ASCII(CHAR(68))), 1) + '%', 1, 2);

    Domain simplification enables the optimizer to reason about the range of valid values a column or expression can take.  In the next query, the only value for Product ID that logically matches all the predicates in the extended WHERE clause is the single value ‘400’.  The query plan shows that simplification has reduced the three predicates to a single equality predicate, “WHERE p.ProductID = 400”:

    SELECT TOP (10) * 
    FROM Production.Product AS p
    WHERE p.ProductID BETWEEN 300 AND 400
    AND p.ProductID BETWEEN 200 AND 500
    AND p.ProductID BETWEEN 400 AND 600;

    Predicate push-down involves pushing relational SELECTs (logically filtering rows based on a predicate) down the logical tree toward the leaves.  The heuristic here is that this is always a good thing to do (at least as far as expressions that reference a column are concerned) because it eliminates rows early, reducing the number for later logical operations.  Moving SELECTs closer to the logical GETs (reading tables) also helps index and computed-column matching.

    Join simplification allows the optimizer to remove unnecessary joins, convert logical outer joins to simpler inner joins where the NULLs introduces by the outer join are later rejected by another feature in the query, and remove provably empty subexpressions.  This is an incredibly powerful optimizer facility, with its roots again in relational theory.  Rob Farley has written (SQL Server MVP Deep Dives) and presented (http://bit.ly/SimpleRob) about ways to use this feature effectively.  In particular, he talks about how to write views to benefit from compile-time simplification, and avoid the views-on-views-on-views problem.  The following queries illustrate some of the join simplifications:

    -- Remove unnecessary joins
    SELECT
        th.ProductID,
        SUM(th.ActualCost)
    FROM Production.TransactionHistory AS th
    JOIN Production.Product AS p ON
        p.ProductID = th.ProductID
    GROUP BY
        th.ProductID;
    GO
    -- Outer join to join (null rejection)
    SELECT * 
    FROM Production.Product AS p
    LEFT JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    WHERE th.ProductID < 10;
    GO
    -- Complex example combining multiple simplifications
    WITH Complex AS
    (
        SELECT
            pc.ProductCategoryID, pc.Name AS CatName,
            ps.ProductSubcategoryID, ps.Name AS SubCatName,
            p.ProductID, p.Name AS ProductName,
            p.Color, p.ListPrice, p.ReorderPoint,
            pm.Name AS ModelName, pm.ModifiedDate
        FROM Production.ProductCategory AS pc
        FULL JOIN Production.ProductSubcategory AS ps ON
            ps.ProductCategoryID = pc.ProductCategoryID
        FULL JOIN Production.Product AS p ON
            p.ProductSubcategoryID = ps.ProductSubcategoryID
        FULL JOIN Production.ProductModel AS pm ON
            pm.ProductModelID = p.ProductModelID
    )
    SELECT c.ProductID, c.ProductName
    FROM Complex AS c
    WHERE c.ProductName LIKE N'G%';

    That last example, with three full outer joins in a CTE (in-line view), simplifies down to a simple index seek on one table.

    Cardinality Estimation

    The optimizer only has direct information about table cardinality (total number of rows) for base tables.  Statistics can provide additional histograms and frequency statistics, but again these are only maintained for base tables.  To help the optimizer choose between competing strategies later on, it is essential to know how many rows are expected at each node in the logical tree, not just at the GET leaf nodes.  In addition, cardinality estimates above the leaves can be used to choose an initial join order (assuming the query contains several joins).

    Cardinality estimation computes expected cardinality and distribution statistics for each node, working up from the leaves one node at a time.  The logic used to create these derived estimates and statistics uses a model that makes certain assumptions about your data.  For example, where the distribution of values is unknown, it is assumed to be uniform across the whole range of potential values.

    You will generally get more accurate cardinality estimates if your queries fit the model used, and, as with many things in the wider optimizer, simple and relational is best.  Complex expressions, unusual or novel techniques, and using non-relational features often means cardinality estimation has to resort to guessing.

    Most of the choices made by the optimizer are driven by cardinality estimation, so if these are wrong, any good execution plans you might see are highly likely to be down to pure luck.

    Trivial Plan

    The next stage of the optimizer pipeline is Trivial Plan, which is a fast path to avoid the time and effort involved in full cost-based optimization, and only applies to logical query trees that have a clear and obvious ‘best’ execution plan.  The details of which types of query can benefit from Trivial Plan change frequently, but things like joins, subqueries, and inequality predicates generally prevent this optimization.  The queries below show some examples where Trivial Plan can, and cannot be applied:

    -- Trivial plan
    SELECT p.ProductID
    FROM Production.Product AS p
    WHERE p.Name = N'Blade';
    GO
    -- Still trivial
    SELECT
        p.Name,
        RowNumber =
            ROW_NUMBER() OVER (
                ORDER BY p.Name)
    FROM Production.Product AS p
    WHERE p.Name LIKE N'[A-G]%';
    GO
    -- 'Subquery' prevents a trivial plan
    SELECT (SELECT p.ProductID)
    FROM Production.Product AS p
    WHERE p.Name = N'Blade';
    GO
    -- Inequality
    SELECT p.ProductID
    FROM Production.Product AS p
    WHERE p.Name <> N'Blade';

    One interesting case where a Trivial Plan is not applied is where the estimated cost of the Trivial Plan query exceeds the configured ‘cost threshold for parallelism’.  The reasoning here is that any trivial plan that would qualify for a parallel plan suddenly presents a choice, and a choice means the query is no longer ‘trivial’, and we have to go through full cost-based optimization.  Taking this to the extreme, a SQL Server instance with the parallelism cost threshold set to zero will never produce a trivial plan (since all queries have a cost greater than zero).  At the risk of stating the slightly obvious, note that a plan produced at this stage will always be serial.  For experimentation only, the trivial plan stage can also be disabled using trace flag 8757.

    You can check whether a query used a trivial plan or not by inspecting the Properties window in SSMS, and clicking on the root node of the graphical query plan.  The ‘optimization level’ property will show ‘Trivial’ or ‘Full’ – the latter shows that a trip through full cost-based optimization was required.  If a query results in a trivial plan, a physical execution plan is produced and the optimization process ends here.

    End of Part 1

    The original slides and demo code are contained in the zip file attached below.  Links to other parts of this series: Part 2 Part 3 Part 4

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Fun with Aggregates

    There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten.

    One possible query to meet that specification is:

    SELECT
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        p.ProductID = th.ProductID
    GROUP BY
        p.ProductID,
        p.Name
    HAVING
        COUNT_BIG(*) < 10;

    That query correctly returns 23 rows (execution plan and data sample shown below):

    image

    The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten.

    This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering:

    SELECT p.Name
    FROM
    (
        SELECT 
            th.ProductID, 
            cnt = COUNT_BIG(*) 
        FROM Production.TransactionHistory AS th 
        GROUP BY
            th.ProductID
    ) AS q1
    JOIN Production.Product AS p 
        ON p.ProductID = q1.ProductID
    WHERE
        q1.cnt < 10;

    Perhaps a little surprisingly, we get a slightly different execution plan:

    image

    The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above.

    The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive.

    Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID 
        HAVING COUNT_BIG(*) < 10
    );

    Unfortunately, this query produces an incorrect result (86 rows):

    image

    The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

    -- Scalar aggregate
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = 709;
     
    -- Vector aggregate
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = 709
    GROUP BY th.ProductID;

    The estimated execution plans for these two statements are almost identical:

    image

    You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away.

    In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate.  This is something I would like to see exposed in show plan so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime:

    image

    The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID 
        HAVING COUNT_BIG(*) BETWEEN 1 AND 9
    );

    The query now returns the correct 23 rows:

    image

    Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY th.ProductID
        HAVING COUNT_BIG(*) < 10
    );

    Not only do we now get correct results (23 rows), this is the execution plan:

    image

    I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set:

    SELECT p.Name
    FROM Production.Product AS p
    WHERE EXISTS 
    (
        SELECT *
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
        HAVING COUNT_BIG(*) < 10
    );

    I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated).

    The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):

    WHERE Clause

    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE 
    (
        SELECT COUNT_BIG(*) 
        FROM Production.TransactionHistory AS th
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
    ) < 10;

    APPLY

    SELECT p.Name
    FROM Production.Product AS p
    CROSS APPLY
    (
        SELECT NULL
        FROM Production.TransactionHistory AS th 
        WHERE th.ProductID = p.ProductID
        GROUP BY ()
        HAVING COUNT_BIG(*) < 10
    ) AS ca (dummy);

    FROM Clause

    SELECT q1.Name 
    FROM
    (
        SELECT
            p.Name, 
            cnt = 
            (
                SELECT COUNT_BIG(*) 
                FROM Production.TransactionHistory AS th 
                WHERE th.ProductID = p.ProductID 
                GROUP BY ()
            )
        FROM Production.Product AS p
    ) AS q1 
    WHERE
        q1.cnt < 10;

    This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)

    SELECT q.Name 
    FROM
    (
        SELECT
            p.Name, 
            cnt = 
            (
                SELECT SUM(1)
                FROM Production.TransactionHistory AS th 
                WHERE th.ProductID = p.ProductID
            )
        FROM Production.Product AS p
    ) AS q
    WHERE q.cnt < 10;

    image

    The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both.

    © 2012 Paul White

    Twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Dynamic Seeks and Hidden Implicit Conversions

    Most people know that a LIKE predicate with only a trailing wildcard can usually use an index seek:

    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE N'D%';

    Index Seek on LIKE

    As the execution plan shows, SQL Server determines a covering range (which depends on the collation), seeks the string index using the range as the start and end points of a partial scan, and applies the original LIKE condition as a residual predicate to just the rows that match the initial seek operation.  Specifically, the Storage Engine seeks the index to locate rows in the covering range, and the Query Processor applies the residual predicate (the LIKE) to the rows it receives.

    Dynamic Seeks

    But what if the LIKE search term is in a variable?

    DECLARE @Like NVARCHAR(50) = N'D%'
     
    SELECT 
        p.Name 
    FROM Production.Product AS p
    WHERE 
        p.Name LIKE @Like;

    SQL Server can still perform a seek here, but it needs to determine the covering seek range for the search term at execution time, not at compilation time:

    Dynamic Index Seek

    The plan now contains an extra Constant Scan,  a Compute Scalar and a Nested Loops Join.  These operators are interesting because they have zero cost estimates: no CPU, no I/O, nothing.  That’s because they are purely architectural: a workaround for the fact that SQL Server cannot currently perform a dynamic seek within the Index Seek operator itself.  To avoid affecting plan choices, this extra machinery is costed at zero.

    The Constant Scan produces a single in-memory row with no columns.  The Compute Scalar defines expressions to describe the covering seek range (using the runtime value of the @Like variable).  Finally, the Nested Loops Join drives the seek using the computed range information as correlated values.

    The upper tooltip shows that the Compute Scalar uses three internal functions, LikeRangeStart, LikeRangeEnd, and LikeRangeInfo.  The first two functions describe the range as an open interval.  The third function returns a set of flags encoded in an integer, that are used internally to define certain seek properties for the Storage Engine.  The lower tooltip shows the seek on the open interval described by the result of LikeRangeStart and LikeRangeEnd, and the application of the residual predicate ‘LIKE @Like’.

    More Dynamic Seeks

    Something very similar occurs in plans that use IN or OR with variables:

    DECLARE 
        @1 INTEGER = 320,
        @2 INTEGER = 325,
        @3 INTEGER = 330
     
    SELECT
        p.Name
    FROM Production.Product AS p
    WHERE 
        p.ProductID IN (@1,@2,@3);

    Dynamic Seek With IN and OR

    Now we have three ranges: one for each of the variables in the original query.  The Compute Scalar operators again define three columns containing the start and end of the range, and the associated informational flags (previously seen as a result of the LikeRangeInfo function).  This time, we see the decimal representation of these flags, which happens to be 62 for an equality comparison.  The IN expands to (ProductID = @1 OR ProductID = @2 OR ProductID = @3), so each of the ‘ranges’ here is in fact a single value, so the start and end range values are the same in each Compute Scalar.

    The three dynamic ranges are concatenated, sorted (so any overlapping ranges appear next to each other in the stream) and the Merge Interval collapses these intervals into one or more disjoint (non-overlapping) ranges.  This is important, because the three variables might, for example, all contain the same value, and it would be incorrect to return that value three times.  Anyway, for each disjoint range produced, the Nested Loops Join drives a new seek of the Clustered Index.  The overall effect is that an arbitrary number of possibly overlapping ranges are computed, merged, and then used to drive one or more seek operations. The final result of the query will be the combination of all the seek results, as you would expect.

    Hidden Conversions

    The following example contains a table with DATETIME2 values, and a query with a expression that at first sight seems unlikely to be able to seek on an index (the variable is typed as DATE, and there is a CONVERT function applied to the DATETIME2 column):

    DECLARE @Example TABLE (date_time DATETIME2 PRIMARY KEY);
     
    INSERT @Example (date_time) 
    VALUES ('20110101 12:34:56');
     
    DECLARE @date DATE = '2011-01-01';
     
    SELECT * 
    FROM @Example AS e 
    WHERE 
        @date = CONVERT(DATE, e.date_time);

    Nevertheless, a query plan that uses a seek can be produced:

    GetRangeThroughConvert

    In this case, neither SSMS or Plan Explorer will show the contents of the Compute Scalar (this is probably just an oversight, rather than deliberate concealment!).  We have to open the XML form of the execution plan to see the three familiar expressions, wrapped in a Value Vector (just a fancy container for multiple expressions).

    Another internal function, GetRangeThroughConvert, is responsible for determining the the range of DATETIME2 values covered by the DATE variable @date, and the informational flags needed.  In the same way the engine works out covering ranges for some LIKE predicates, this function determines ranges where certain problematic type conversions are required.  Otherwise, the machinery is the same: a range description is defined by the Compute Scalar, and the Nested Loops Join driving a seek using those values.

    More Hidden Conversions

    There is another related internal function used when the Query Processor needs to determine a range for a comparison between different data types.  This example returns rows based on a greater-than-or-equal comparison between DATE column values and the DATETIME return value of the GETDATE() intrinsic function:

    DECLARE @Example TABLE (col1 date PRIMARY KEY)
    SELECT * FROM @Example AS e 
    WHERE e.col1 >= DATEADD(DAY, -7, GETDATE());

    GetRangeWithMismatchedTypes

    Again, the SSMS graphical plan and Plan Explorer cannot display the contents of the Value Vector, so we have to dig into the XML again.  The function evaluates the DATEADD(GETDATE()) expression, computes the open-interval start point of a DATE range accounting for the conversion from DATETIME to DATE, and specifies NULL as the end of the range (since this is a >= comparison, there is no end value).  The flags value in this case is 22 (the flags for a >= seek operation).

    Everything All At Once

    This last example features all sorts type sloppiness, resulting in an execution plan that uses GetRangeThroughConvert on the string expression and GetRangeThroughConvert on the result of GetRangeWithMismatchedTypes applied to the result of the GETDATE function.  The whole thing is then wrapped in a dynamic seek with the Merge Interval enforcing the (annoying) BETWEEN requirement that the first parameter must be less than or equal to the second.  See if you can work out all the conversions necessary for this query, using the rules of data type precedence.  It is really quite impressive that this example of lazy T-SQL coding results in an index seek, don’t you think?

    DECLARE @Example TABLE (col1 DATETIME PRIMARY KEY)
    SELECT * FROM @Example AS e
    WHERE CONVERT(DATE, e.col1) BETWEEN '20000101' AND GETDATE();

    Merge Interval With Everything

    Conclusion

    SQL Server works quite hard sometimes to produce index seeks where they might seem unlikely.  This is a good thing, and it would be great to see this capability extended further in future.  The downside is that this extra effort means you are less likely to see an Index Scan when you have done something daft with data types.

    Why is this a bad thing if you get a seek anyway?  The problem is that these hidden implicit conversions can result in inaccurate cardinality and distribution estimations at any stage of the plan.  So, even if you get a seek, the plan might be way off overall.  If that isn’t persuasive enough, consider this: will having hidden nested range calculations improve your chances of getting a good query plan?  Probably not, no.  Be very aware of types, and in particular of the types returned by functions and expressions.  If in doubt, use SELECT INTO to materialize the results of an expression or query, and check the types of the columns produced.


    Note: if you have any scripts that trawl the plan cache looking for implicit conversions (CONVERT_IMPLICIT), you might want to look into updating them to check for these conversions too.  Not all conversions are bad ones, of course :)

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    Further Reading:

    Implicit Conversions – Craig Freedman
    More on Implicit Conversions – Craig Freedman
    Join Performance, Implicit Conversions, and Residuals - Me

  • Forcing a Parallel Query Execution Plan

    This post is for SQL Server developers who have experienced the special kind of frustration, which only comes from spending hours trying to convince the query optimizer to generate a parallel execution plan.  This situation often occurs when making an apparently innocuous change to the text of a moderately complex query; a change which somehow manages to turn a parallel plan that executes in ten seconds, into a five-minute serially-executing monster.

    SQL Server provides a number of query and table hints that allow the experienced practitioner to take greater control over the final form of a query plan.  These hints are usually seen as a tool of last resort, because they can make code harder to maintain, introduce extra dependencies, and may prevent the optimizer reacting to future changes in indexing or data distribution.  One such query hint is the (over) popular OPTION (MAXDOP 1), which prevents the optimizer from considering plans that use parallelism.  Sadly, there is currently no corresponding hint to force the optimizer to choose a parallel plan.

    The result of all this is a great deal of wasted time trying increasingly obscure query syntax, until eventually the desired parallel plan is obtained, or the developer gives up in despair.  Even where success is achieved, the price is often fragile code that risks reverting to serial execution any time indexing or statistics change.  In any case, the resulting SQL is usually hard to read, and scary to maintain.

    Why Expensive Queries Produce Serial Plans

    Whenever the query optimizer produces a serial plan instead of the ‘obviously better’ parallel plan, there is always a reason.  Leaving aside the more obvious causes, such as the configuration setting max degree of parallelism being set to one, running under a Resource Governor workload group with MAX_DOP = 1, or having only one logical processor available to SQL Server, the usual causes of a serial plan are parallelism-inhibiting operations, cardinality estimation errors, costing model limitations, and code path issues.

    Parallelism-Inhibiting Components

    There are many things that prevent parallelism, either because they make no sense in a parallel plan, or because the product just does not support them yet.  Some of these force the whole plan to run serially, others require a ‘serial zone’ – a part of the plan that runs serially, even though other parts may use multiple threads concurrently.

    That list changes from version to version, but for example these things make the whole plan serial on SQL Server 2008 R2 SP1:

    • Modifying the contents of a table variable (reading is fine)
    • Any T-SQL scalar function (which are evil anyway)
    • CLR scalar functions marked as performing data access (normal ones are fine)
    • Random intrinsic functions including OBJECT_NAME, ENCYPTBYCERT, and IDENT_CURRENT
    • System table access (e.g. sys.tables)

    Inconveniently, the list of intrinsic functions is quite long and does not seem to follow a pattern.  ERROR_NUMBER and @@TRANCOUNT also force a serial plan, @@ERROR and @@NESTLEVEL do not.  The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table with a computed column that uses such a function will result in a serial plan, even if the problematic column is not referenced in the query.

    These query features are examples that require a serial zone in the plan:

    • TOP
    • Sequence project (e.g. ROW_NUMBER, RANK)
    • Multi-statement T-SQL table-valued functions
    • Backward range scans (forward is fine)
    • Global scalar aggregates
    • Common sub-expression spools
    • Recursive CTEs

    The information presented above is based on the original list published by Craig Freedman, and updated for 2008 R2.

    One way to check that a query does not have any parallelism-inhibiting components is to test the query using a CPU cost multiplier.  This should only be done on a private test system where you are able to flush the whole plan cache after testing.  The idea is to use an undocumented and unsupported DBCC command to temporarily increase the CPU cost of the query plan operators.  It is not a fool-proof test (some rare parallelizable queries will not generate a parallel plan with this technique) but it is quite reliable:

    DBCC FREEPROCCACHE
    DBCC SETCPUWEIGHT(1000)
    GO
    -- Query to test
    SELECT
        COUNT_BIG(*) 
    FROM Production.Product AS p 
    LEFT JOIN Production.TransactionHistory AS th ON 
        p.ProductID = th.ProductID
    GO
    DBCC SETCPUWEIGHT(1)
    DBCC FREEPROCCACHE

    The final commands to reset the CPU weighting factor and flush the plan cache are very important.

    If you get an parallel estimated plan for a particular test query, it shows that a parallel plan is at least possible.  Varying the value passed to the DBCC command adjusts the multiplier applied to normal CPU costs, so you will likely see different plans for different values.  The illustrated factor of a thousand is often enough to produce a parallel estimated plan, but you may need to experiment with higher values.  It is not recommended to use estimated plans obtained using this technique directly in USE PLAN hints or plan guides because these are not plans the optimizer would ever produce naturally.  To be clear, direct use of the plans would likely render a production system unsupported and the person responsible might be fired, shot, or possibly both.

    Cardinality Estimation Errors

    If there is nothing that absolutely prevents parallelism in the target query, the optimizer may still choose a serial alternative if it has a lower estimated cost.  For that reason, there are a couple of things we can do to promote the parallel option here, all based on the very sound notion of giving the optimizer accurate information to base its estimates on.  The considerations here go well beyond just ensuring statistics are up-to-date, or building them with the FULLSCAN option.  For example, depending on the nature of the query, you may need to provide all or some of the following:

    • Multi-column statistics (for correlations)
    • Filtered statistics or indexes (for more histogram steps)
    • Computed columns on filtering expressions in the query (avoid cardinality guesses)
    • Good constraint information (foreign keys and check constraints)
    • Materialize parts of the query in temporary tables (more accurate statistics in deep plans)
    • Regular hints such as OPTIMIZE FOR

    In general, anything you can do to ensure that estimated row counts are close to the runtime values will help the optimizer cost the serial and parallel alternatives more accurately.  Many failures to choose a parallel plan are caused by inaccurate row counts.

    Model Limitations

    SQL Server uses a model to estimate the runtime cost of each operator in a query plan.  The exact calculations vary between operators, but most are based on a minimum cost, with an additional per-row component.  None of these estimates of expected CPU and I/O cost take into account the specific hardware SQL Server finds itself running on.  The advantage of this is that plans from one machine can be readily reproduced and compared on another machine running the same version of the software, without having to worry about hardware differences.

    Not all operators can be costed reasonably, and things like functions are particularly problematic because the optimizer has no clue how many rows might be produced or what the distribution of values might look like.  Even very normal-looking operators can pose problems.  Consider the task of estimating the number of rows and distribution of values resulting from a join or a complex GROUP BY clause.  Even where reasonable estimates can be made, the derived statistics that propagate up the query tree (from the persistent statistics at the leaves) tend to become quickly less reliable.  The optimizer includes many heuristics that aim to prevent these inaccuracies getting out of control, so it might resort to complete guesses after only a few operators as the compounding effect of deriving new statistics takes hold.

    There are many other assumptions and limitations of the model that will not fit into a blog post, the interested reader can find more detailed information in Chapter 8 of the indispensable SQL Server 2008 Internals book.

    Costing Limitations

    When SQL Server costs a parallel plan, it generally reduces the CPU cost for a parallel iterator by the a factor equal to the expected runtime DOP.  For example the previous query can produce the following serial and parallel plans:

    image

    image

    Taking the Merge Join operator as an example, the parallel version has its CPU cost reduced by a factor of 4 when the expected runtime DOP is four (serial plan on the left, parallel on the right):

    image

    On the other hand, the Index Scans show no reduction in I/O cost, though the CPU cost is again reduced by a factor of four:

    image

    As mentioned earlier, different operators cost themselves differently (for example a many-to-many merge join also has an I/O cost component that also happens to be reduced by a factor of four).  These details also vary somewhat between releases, so the presentation here is to give you an appreciation of the general approach rather than to dwell too much on the specifics.

    Looking again at the serial and parallel plans, it is clear that which of the two plans costs cheaper depends on whether the parallel plan saves enough by reducing CPU and I/O costs in the various operators, to pay for the extra operators in the plan.  In this case, the extra operators are three exchange (parallelism) operators – two Repartition Streams to redistribute rows for correct results when joining, and one Gather Streams to merge the threads back to a single final result.

    The way the numbers work means that it is often a tight race between the best parallel and serial plan alternatives.  In many real-world cases, the difference between the two can be extremely small – making it even more frustrating when the serial version turns out to take fifty times as long as the parallel version to execute.  One other point worth mentioning again here is that the DOP estimate is limited to the number of logical processors that SQL Server sees, divided by two.  My test machine has eight cores, all available to SQL Server, but the DOP estimate used for costing calculations is limited to four.  This has obvious consequences for costing, where CPU and I/O costs are typically divided by the estimated DOP of four, rather than eight.

    A Note about Parallel Nested Loops

    Plans with Nested Loops joins can be a particular problem, because the inner side almost always runs multiple threads serially.  The parallelism icons are still present, but they indicate that there are DOP independent serial threads.  The distinction is perhaps a subtle one, but it (a) explains why operators that normally force a serial zone can run ‘in parallel’ on the inner side of a loops join; and (b) the optimizer does not reduce the CPU costs on the inner side by the estimated runtime DOP.  This puts nested loops at an unfair disadvantage when it comes to parallelism costing, compared with Hash and Merge Joins.

    Code Path Issues

    This last category concerns the fact that the optimizer may not get as far as evaluating a parallel plan at all.  One way this can occur is if a final plan is found during the Trivial Plan stage.  If a Trivial Plan is possible, and the resulting cost is less than the configured cost threshold for parallelism, the full optimization stages are skipped and a serial plan is returned immediately.

    Trivial Plan

    The following query has an estimated serial plan cost of around 85 units, but with the parallelism threshold set to 100 a Trivial Plan is produced (as shown in the plan property ‘Optimization Level’ or by checking the changes in sys.dm_exec_query_optimizer_info as shown below:

    SELECT
        deqoi.[counter],
        deqoi.occurrence
    FROM sys.dm_exec_query_optimizer_info AS deqoi 
    WHERE
        [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')
    GO
    SET SHOWPLAN_XML ON
    GO
    SELECT
        COUNT_BIG(*) 
    FROM dbo.bigTransactionHistory AS bth
    OPTION (RECOMPILE)
    GO
    SET SHOWPLAN_XML OFF
    GO
    SELECT
        deqoi.[counter],
        deqoi.occurrence
    FROM sys.dm_exec_query_optimizer_info AS deqoi 
    WHERE
        [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')

    image

    When the cost threshold is reduced to 84 we get a parallel plan…

    image

    A deeper analysis shows that the query still qualified for Trivial Plan (and the stage was run) but the final cost exceeded the parallelism threshold so optimization continued.  This query does not qualify for ‘search 0’ (TP or Transaction Processing) because a minimum of three tables are required there.

    So, optimization moves on to ‘search 1’ (Quick Plan) which runs twice.  It runs once considering only serial plans, and comes out with a best cost of 84.6181.  Since this exceeds the threshold of 84, Quick Plan is re-run with the parallel plan option enabled.  The result is a parallel plan at cost 44.7854.  The plan does not meet the entry conditions for ‘search 2’ (Full Optimization) so the finished plan is copied out.

    Good Enough Plan & Time Out

    Returning to code path reasons that prevent a parallel plan, the last category covers queries that enter the Quick Plan stage, but that stage terminates early, either with a Good Enough Plan Found message, or a Time Out.  Both of these are heuristics to prevent the optimizer spending more time optimizing than it stands to gain by reducing estimated execution time (cost).  Good Enough Plan results when the current lowest cost plan is so cheap that further optimization effort is no longer justified.

    Time Out is a related phenomenon: at the start of a stage, the optimizer sets itself a ‘budget’ of a number of rule applications it estimates it can perform in the time justified by the initial cost of the plan.  This means that query trees that start with a higher cost get a correspondingly larger allowance of rule applications (roughly comparable to the number of moves a chess program thinks ahead).  If the optimizer explores the allowed number of rules before the natural end of the optimization stage, it returns the best complete plan at that point with a Time Out message.  This may well occur during the first run of ‘search 1’, preventing us reaching the second run that adds parallelism.

    One interesting consequence of the rule concerning Trivial Plan and the cost threshold for parallelism is that a system configured with a threshold of zero can never produce a Trivial Plan.  Bearing this in mind, we can generate a surprising Time Out with this query:

    SELECT * FROM Production.Product AS p

    As you would expect, this query is normally optimized using Trivial Plan (there are no cost-based plan choices here):

    image

    …but when the cost threshold is set to zero, we get Full Optimization with a Time Out…the optimizer timed out working out how to do SELECT * from a single table!

    image

    In this particular case, the optimizer ‘timed out’ after 15 tasks (it normally runs through many thousands).  A Time Out result can sometimes also be an indicator that the input query is over-complex, but the interpretation is not at all that straightforward.

    The Solution

    We need a robust query plan hint, analogous to MAXDOP, that we can specify as a last resort when all other techniques still result in a serial plan, and where the parallel alternative is much to be preferred of course.  I really want to emphasise that very many cases of unwanted serial plans are due to designers and developers not giving the optimizer good quality information.  I see very few systems with things like proper multi-column statistics, filtered indexes/statistics, and adequate constraints.  Even less frequently, do I see (perhaps non-persisted) computed columns created on query filter expressions to aid cardinality estimation.  On the other hand, non-relational database designs with poor indexing, and decidedly non-relational queries are extremely common.  (As are database developers complaining about the poor decisions the optimizer makes sometimes!)

    There’s always a Trace Flag

    In the meantime, there is a workaround.  It’s not perfect (and most certainly a choice of very last resort) but there is an undocumented (and unsupported) trace flag that effectively lowers the cost threshold to zero for a particular query.  It actually goes a little further than that; for example, the following query will not generate a parallel plan even with a zero cost threshold:

    SELECT TOP (1)
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    ORDER BY
        p.Name

    image

    This is a completely trivial query plan of course – the first row from the scan is joined to a single row from the seek.  The total estimated cost of the serial plan is 0.0065893.  Returning the cost threshold for parallelism to the default of 5 just for completeness, we can obtain a parallel plan (purely for demonstration purposes) using the trace flag:

    SELECT TOP (1)
        p.Name
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    ORDER BY
        p.Name
    OPTION (RECOMPILE, QUERYTRACEON 8649)

    image

    The parallel alternative is returned, despite the fact it costs much higher at 0.0349929 (5.3 times the cost of the serial plan).  In my testing, this trace flag has proved invaluable in certain particularly tricky cases where a parallel plan is essential, but there is no reasonable way to get it from the standard optimizer.

    Conclusion

    Even experts with decades of SQL Server experience and detailed internal knowledge will want to be careful with this trace flag.  I cannot recommend you use it directly in production unless advised by Microsoft, but you might like to use it on a test system as an extreme last resort, perhaps to generate a plan guide or USE PLAN hint for use in production (after careful review).

    This is an arguably lower risk strategy, but bear in mind that the parallel plans produced under this trace flag are not guaranteed to be ones the optimizer would normally consider.  If you can improve the quality of information provided to the optimizer instead to get a parallel plan, go that way :)

    If you would prefer to see a fully supported T-SQL OPTION (MINDOP) or OPTION (PARALLEL_PLAN) hint, please vote here:

    https://connect.microsoft.com/SQLServer/feedback/details/714968/provide-a-hint-to-force-generation-of-a-parallel-plan

    © 2011 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • SQL Server Optimizer Bug with JOIN and GROUP BY

    I came across a SQL Server bug recently that made me wonder how on earth I never noticed it before.  As the title of this post suggests, the bug occurs in common JOIN and GROUP BY queries, and while it does not cause incorrect results to be returned, it will often cause a poor query plan to be selected by the optimizer.  If you are just interested in the bug itself, you will find a description and Connect item toward the end of this post.  As the regular reader will be expecting from me though, I am going to work up to it with a bit of background…

    Example Query

    Using everyone’s favourite Adventure Works database, here’s a simple query that includes a JOIN and a GROUP BY:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 1, RECOMPILE)

    One perfectly reasonable query plan possibility looks like this (estimated row counts shown):

    image

    After the Merge Join, rows go on to be grouped and counted.  This simple plan has an estimated cost of 1.1 units on my system.  The optimizer has chosen a Merge Join because indexes exist on the join column to provide sorted inputs, and the foreign key relationship between these two tables means that the join will be the efficient one-to-many type.  A Hash Match Aggregate is estimated to be cheaper than the alternative Sort and Stream Aggregate combination in this plan.  For reasons that will become apparent shortly, note that the aggregate computes the following expression (a simple count of all the rows):

    [Expr1004] = Scalar Operator(COUNT(*))

    Partial Aggregates

    The query plan shown above is not the one selected by the optimizer, however.  One of the tricks available to it is to move most of the aggregation work before the join.  The idea is that reducing row counts as early as possible in a plan generally pays off.  The plan actually selected by the optimizer is this one (estimated row counts again):

    image

    The new Stream Aggregate below the join computes the count of rows from the history table grouped by product id.  There are 441 unique product id values in the history table, so 441 product ids (and row counts) flow on to be joined with the 504 rows from the product table.  Note that the estimate of 441 groups exactly matches reality.

    With a smaller number of estimated rows coming out of the join, the optimizer chooses a Sort and Stream Aggregate combo instead of the Hash Match Aggregate seen previously.  To get the correct query result, the second aggregate uses SUM to add the per-product row counts together.  The estimated cost of this plan is 0.35 units according to SQL Server’s costing model, so you can see why the optimizer prefers it over the previous plan with a cost of 1.1 units.

    Clicking on the new aggregate below the join and looking at its properties, we see that it groups by product id and computes a partial aggregate expression:

    [partialagg1005] = Scalar Operator(Count(*))

    In this case, the expression label is the only way to identify this as a partial aggregate – there is no other more obvious indication.  The expression computed by the second Stream Aggregate (the one after the Sort) is:

    [Expr1004] = Scalar Operator(SUM([partialagg1005]))

    This aggregate is a Global aggregate: it computes the correct overall result based on partial results calculated earlier in the plan.  To summarize, the single COUNT in the original plan has been replaced with a partial aggregate COUNT, followed by a global aggregate SUM of those counts.

    Local and Global Aggregation

    This idea of computing part of an aggregate early and combining the partial results later on originated in parallel execution plans.  To illustrate the point, take a look at the following query and execution plan:

    SELECT 
        COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    [You will need to set the instance’s cost threshold for parallelism configuration value to zero in order to get a parallel plan for this query.]

    image

    The parallel aggregate computes the following partial aggregate expression per thread:

    [partialagg1003] = Scalar Operator(Count(*))

    The serial aggregate combines the partial aggregates using this expression:

    [Expr1002] = Scalar Operator(SUM([partialagg1003]))

    In parallel plans, the partial aggregate is more often referred to as a local aggregate, since it computes an aggregate that is local to the thread it is running on.  With four threads, there will be four local aggregations, which are combined to form the final result by the global aggregate, running on a single thread in this example.

    One other fact that will be important later on: notice that the plan above estimates that four rows will be produced by the parallel Stream Aggregate operator – one for each thread.  This plan was generated on a computer with eight processors available to SQL Sever.  The optimizer always estimates that half the processors will be available for parallel execution – this is a heuristic, and not one that makes an enormous amount of sense to me intuitively, but there you go, it is how it is.

    Partial or Local?

    To illustrate how inconsistent SQL Server is with the terms ‘partial’ and ‘local’:

    SELECT
        COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.Quantity
    OPTION (RECOMPILE)

    image

    The idea is the same, but the labels are different: notice that the hash aggregate now has an explicit Partial Aggregate label in the query plan.  You will only see this helpful change in parallel plans that use a Hash Partial (= Local) Aggregate (there is a sort of reason for this, explained in the next section on memory grants).

    The parallel hash aggregate is computing COUNT(*) again, but now it is grouping by Quantity.  The query plan needs to do a bit more work to get a correct result, because as it stands rows from the index scan are distributed unpredictably between threads, so each thread can see rows from any or all Quantity groups.  The aggregate is still ‘partial’ but this time it is because each thread only sees part of the full row set.

    To get the right answer (according to the semantic of the SQL query) SQL Server goes on to repartition the rows among new threads using Hash Partitioning on the Quantity column.  This step ensures that rows with the same Quantity value always end up on the same thread.  To be clear, it says nothing about how many Quantity groups go to each thread, just that the repartitioning step guarantees that rows associated with a particular Quantity value will all end up on the same thread.  For efficiency, we certainly hope that the partitioning results in a good distribution of groups among threads, but that is a separate issue completely.

    Each Stream Aggregate (one per thread) can now safely compute a SUM of the partial aggregate results it receives on its thread, though it still has to group by Quantity since one thread may receive more than one Quantity group.  The final Stream Aggregate is still a Global Aggregate, since it uses local (= partial) aggregate results, even though it executes in parallel (if you prefer, the result each thread’s Global Aggregate produces is globally valid for output).

    Memory Grants

    Another interesting thing about the Hash Match Partial Aggregate is that it never acquires more than the very small amount of memory necessary to create a minimally-size hash table.  If the aggregate runs out of memory while processing, it simply stops aggregating, and passes individual rows instead.  The global aggregate will still produce the correct result, but the query won’t be as efficient (for example, more rows have to move across the Repartition Streams exchange).  The partial aggregate is a performance optimization, it is not required for correctness (and note that a partial Stream Aggregate never needs a memory grant, so the same issue does not arise).

    Optimizer Rules

    The query optimizer rule that explores splitting ‘ordinary’ aggregates into local and global parts is called GenLGAgg for “Generate Local and Global Aggregate”.  In parallel plans, whether this transformation comes out cheaper depends on whether the plan saves enough cost (for example by reducing the number of rows that flow across the repartitioning exchange) to pay for the overhead of the extra aggregate operation.

    In serial plans, there are no exchanges (by definition) so this transformation needs to find another way to pay for itself.  The way we saw earlier, is to push the partial aggregate below a join such that it reduces the cost of that join (and later operations) enough to at least pay for itself.  There is a second optimizer rule to explore the option of pushing a local aggregate below a join; its name is LocalAggBelowJoin.  Note how both rules refer to the term Local, as in ‘local to a thread’.

    The Bug Revealed

    Here’s our original query and the execution plan the optimizer selected again:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 1, RECOMPILE)

    image

    Now here’s the plan produced for exactly the same query, with a MAXDOP 2 hint instead of MAXDOP 1:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class
    OPTION (MAXDOP 2, RECOMPILE)

    image

    The estimate of the number of rows emitted by the partial aggregate has doubled, from 441 rows (exactly right, remember) to 882.  If we specify MAXDOP 3, the row estimate trebles to 1323.  At MAXDOP 4, the estimate is 1764.  After MAXDOP 4, no increase occurs on my test machine because it has eight processors, and the optimizer estimates a maximum runtime DOP of half the number of processors available for parallel queries, as noted earlier.  If you have 64 processors, you could get the cost to multiply by 32…and so on.  To be absolutely clear about this, all these MAXDOP options still result in serial execution plans.

    Yes, but I don’t use MAXDOP hints

    Neither do I, much – at least not other than specifying one or zero.  Anyway, let’s see what happens on the same machine without any hints at all:

    SELECT
        COUNT_BIG(*)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    GROUP BY
        p.Class

    image

    Yikes!  The correct estimate of 441 rows has been multiplied by 4 – the maximum possible on this 8-core machine, exactly as if we had specified (MAXDOP 4) to (MAXDOP 8) or even (MAXDOP 0).  Setting the server-wide max degree of parallelism configuration value to 1 (or using Resource Governor to do the same thing) will result in a correct estimate of 441 rows again (unless MAXDOP is specified).  You will also get correct estimates from the partial aggregate if SQL Server is running on a machine with only one processing unit, or if the affinity mask is set such that SQL Server can only see one processor.  See the pattern?

    Cause & Effects

    In short, the cardinality estimation problem comes down to the partial aggregate (and only the partial aggregate) costing itself as if it were running in a parallel plan.  In a parallel plan, each instance of the partial aggregate would produce the estimated number of rows per thread – so for a plan estimated to execute at DOP 4, we would get a (correct) multiplier effect on the estimate.  For a serial plan, this is just plain wrong.

    In the plan above, we are lucky that cardinality estimation for the merge join recognises that it cannot produce more rows than the 504 contained in the product table because of the foreign key relationship.  Nevertheless, the estimates for every operator after the problematic aggregate are still incorrect.

    In other plans, this effect can completely change the plan choice above a partial aggregate.  As an example, here is the SQL Server 2012 query from my last post, with the MAXDOP 1 hint:

    SELECT
        p.ProductModelID,
        COUNT_BIG(*),
        COUNT_BIG(DISTINCT th.TransactionDate)
    FROM Production.Product AS p
    JOIN Production.TransactionHistory AS th ON 
        th.ProductID = p.ProductID
    GROUP BY p.ProductModelID
    OPTION (RECOMPILE, MAXDOP 1)

    image

    This gives us the efficient new query transformation present in SQL Server 2012 (the middle hash aggregate is a partial one).  The exact same query, at MAXDOP 3 or above (or without a MAXDOP hint at all on a machine with 6 cores or more):

    image

    Now, we get the horribly inefficient old Eager Spool plan!

    The optimizer still considers both alternatives, but the estimated plan costs explain the choice it makes: the good plan has an estimated cost of 2.8 at MAXDOP 1 (that is, without the partial aggregate costing error).  At MAXDOP 2, the good plan’s cost increases to 3.45 units.  The estimated cost of the Eager Spool plan is 3.5 units, regardless of the MAXDOP setting (exactly as it should be for a serial plan), so you can see that the spool plan becomes the apparently cheaper option at MAXDOP 3.  (I’m sure it doesn’t need saying again, but the MAXDOP hinted queries here all still produce non-parallel plans.)

    The estimated I/O and CPU costs of the partial aggregate operator are also costed as if it were running in parallel at the estimated available DOP, further adding to the costing errors of the plan alternative.  Moreover, the cardinality estimation error will tend to propagate up the plan tree – increasing the costs of upstream operators, and causing larger memory grants to be requested than should be estimated to be necessary.

    Overall, the effect is that you are very likely to receive the wrong plan for your query.  The consequences, for non-trivial Adventure Works queries at any rate, could be severe.  Look out for these estimation errors in any serial plan that features a partial aggregate.  Queries that include a JOIN and a GROUP BY are likely candidates for the GenLGAgg and LocalAggBelowJoin rules to introduce a partial aggregate – and remember that in serial plans there is no way to know that an aggregate is partial without checking the details of the expressions it computes…

    The Connect item for this bug can be found here:
    https://connect.microsoft.com/SQLServer/feedback/details/711685/costing-cardinality-errors-with-partial-aggregates-in-serial-execution-plans

    This issue reproduces on all versions of SQL Server from 2005 to 2012 RC0 with TF4199 enabled for optimizer fixes (the behaviour on 2005 is even more erratic than described above).


    Update 8 Dec 2011: Thanks everyone for the votes, Microsoft have responded by confirming the bug. They plan to fix it sometime next year.

    © 2011 Paul White

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Is Distinct Aggregation Still Considered Harmful?

    Back in 2008, Marc Friedman of the SQL Server Query Processor Team wrote a blog entry entitled “Distinct Aggregation Considered Harmful”, in which he shows a way to work around the poor performance that often results simply from adding the keyword DISTINCT to an otherwise perfectly reasonable aggregate function in a query.  This post is an update to that, presenting a query optimizer enhancement in SQL Server 2012 RC0 that reduces the need to perform the suggested rewrite manually.  First, though, it makes sense for me to cover some broader aggregation topics in general.

    Query Plan Options for DISTINCT

    There are two main strategies to extract distinct values from a stream of rows.  The first option is to keep track of unique values in a hash table.  The second is to sort the incoming rows into groups and return one value from each group.  SQL Server uses the Hash Match operator to implement the hash table option, and Stream Aggregate or Sort Distinct for the sorting option.

    Hash Match Aggregate

    The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s).  Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism.  Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group).  Hash matching does not require or preserve the order of the incoming row stream.  The query and plan below show a Hash Match Aggregate building a hash table on values in the Quantity column:

    SELECT DISTINCT 
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    The standard Hash Match Aggregate is blocking: it produces no output until it has processed its entire input stream.  If we restrict the number of rows using TOP or SET ROWCOUNT the aggregate can operate in a streaming fashion, producing new unique values as soon as they are encountered.  This streaming mode is known as Flow Distinct, and is activated depending on the estimated number of unique values in the stream.  In the example above, the estimated number of output rows from the Hash Match Aggregate is 455 (based on the statistics on my test machine).  Limiting the query to 455 or fewer rows using TOP or ROWCOUNT produces a Hash Match Aggregate running in Flow Distinct mode:

    SET ROWCOUNT 10
     
    SELECT DISTINCT 
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)
     
    SET ROWCOUNT 0

    image

    This plan is interesting because it limits the output to 10 rows without including a specific TOP operator for that purpose.  TOP is generally preferred to ROWCOUNT for the reasons set out in the Books Online topic “Limiting Result Sets by Using TOP and PERCENT”.  Ignore the part that says “because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query” – we have just seen how it can affect a query plan.

    For completeness, this is the TOP query and execution plan:

    SELECT DISTINCT TOP (10)
        th.Quantity
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    Stream Aggregate

    Performing a DISTINCT is the same as applying a GROUP BY on every expression in the SELECT list.  A stream of rows sorted on these GROUP BY expressions has the useful property that all rows with the same GROUP BY values appear together, so all we need to do is output a row of GROUP BY keys each time a change occurs.  If the required order can be obtained without an explicit sort, the query plan can use a Stream Aggregate directly:

    SELECT DISTINCT 
        th.ProductID 
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    This query plan shows an ordered scan of an index on the ProductID column, followed by a Stream Aggregate with a GROUP BY on the same column.  The Stream Aggregate emits a new row each time a new group is encountered, similar to the Hash Match Aggregate running in Flow Distinct mode.  This operator does not require a memory grant because it only needs to keep track of one set of GROUP BY expression values at a time.

    Sort Distinct

    The third alternative is to perform an explicit Sort followed by a Stream Aggregate.  The optimizer can often collapse this combination into a single Sort operating in Sort Distinct mode:

    SELECT DISTINCT
        p.Color
    FROM Production.Product AS p 
    OPTION (RECOMPILE)

    image

    To see the original Sort and Stream Aggregate combination, we can temporarily disable the GbAggToSort optimizer rule that performs this transformation:

    DBCC RULEOFF('GbAggToSort')
     
    SELECT DISTINCT
        p.Color
    FROM Production.Product AS p 
    OPTION (RECOMPILE)
     
    DBCC RULEON('GbAggToSort')

    The query plan now shows a regular (non-distinct) Sort followed by the Stream Aggregate:

    image

    Distinct Aggregates

    The DISTINCT keyword is most commonly used with the COUNT and COUNT_BIG aggregate functions, though it can be specified with a variety of built-in and CLR aggregates.  The interesting thing is that SQL Server always processes distinct aggregates by first performing a DISTINCT (using any of the three methods shown previously) and then applying the aggregate (for example COUNT) as a second step:

    SELECT 
        COUNT_BIG(DISTINCT p.Color) 
    FROM Production.Product AS p 
    OPTION (RECOMPILE)

    image

    One thing worth highlighting is that COUNT DISTINCT does not count NULLs.  The previous query that listed the distinct colours from the Product table produced 10 rows (9 colours and one NULL), but the the COUNT DISTINCT query here returns the value 9.  The query plan uses a Distinct Sort to perform the DISTINCT (which includes the NULL group) and then counts the groups using a Stream Aggregate.  The aggregate expression uses COUNT(expression) rather than COUNT(*), which eliminates the NULL group produced by the Distinct Sort.  This second example shows a Hash Match Aggregate performing the DISTINCT, followed by a Stream Aggregate to count the non-NULL groups:

    SELECT
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    image

    So far, the counting step has always been performed by a Stream Aggregate.  This is because it has been a scalar aggregate – an aggregate without a GROUP BY clause – and SQL Server always implements scalar aggregates using Stream Aggregate (there would be no point using hashing since there will only be one group by definition).  If we add a GROUP BY clause, the final COUNT is no longer a single (scalar) result, so we can get a plan with two Hash Match Aggregates:

    SELECT
        th.ProductID,
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.ProductID
    OPTION (RECOMPILE)

    image

    In this plan, the rightmost aggregate is performing the DISTINCT, and the leftmost one implements the COUNT (with GROUP BY).

    Combining Aggregates

    The decision to always implement distinct aggregates as separate DISTINCT and aggregate steps makes life easier for the optimizer.  Separating the two operations makes it easier to plan and cost parallelism, partial aggregation, combining similar aggregates, moving aggregates around (for example pushing an aggregate below a join) and so on.  On the downside, it creates problems for queries that include multiple distinct aggregates, or combine a single distinct aggregate with an ‘ordinary’ aggregate.

    Multiple Aggregates

    As you probably know, there is no problem combining multiple ordinary aggregations into a single operator:

    SELECT
        COUNT_BIG(*),
        MAX(th.ActualCost),
        STDEV(th.Quantity)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.TransactionType

    image

    The single Hash Match Aggregate operator computes the three separate aggregates concurrently on the incoming stream.

    Multiple Distinct Aggregates

    This next query does a similar thing, but with three DISTINCT aggregations on the stream:

    SELECT 
        COUNT_BIG(DISTINCT th.ProductID), 
        COUNT_BIG(DISTINCT th.TransactionDate), 
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    OPTION (RECOMPILE)

    It produces a more complex query plan:

    image

    The issue here is that once a DISTINCT has been performed on a stream, the stream no longer contains the columns necessary to perform the other DISTINCT operations.  To work around this, the optimizer builds a plan that reads the source stream once per DISTINCT, computes the aggregates separately, and then combines the results using a cross join (which is safe because these are all scalar aggregates, guaranteed to produce one row each).  The same basic pattern is employed if the query contains an outer GROUP BY clause, but instead of cross joins there will be inner joins on the GROUP BY columns.

    More often than not, the source of rows will not be an unrestricted table scan.  Where the source is complex (and therefore expensive to re-run for every distinct aggregate) or where a filter significantly reduces the number of rows, the query optimizer may choose to Eager Spool the source rows and replay them once per distinct aggregate:

    SELECT 
        COUNT_BIG(DISTINCT th.ProductID), 
        COUNT_BIG(DISTINCT th.TransactionDate), 
        COUNT_BIG(DISTINCT th.Quantity)
    FROM Production.TransactionHistory AS th
    WHERE
        th.ActualCost < $5
    GROUP BY
        th.TransactionType
    OPTION (RECOMPILE)

    image

    This is a plan shape that you are likely to encounter in the real world, since most queries will likely have a filtering condition or have a row source that is more complex than a simple scan of an index or table.  For queries over larger tables than Adventure Works provides, this plan is likely to perform poorly, as you might expect.  Aside from the obvious concerns, inserting rows into a spool has to be performed on a single thread (like all data modification operations in today’s SQL Server).  Another limitation is that the spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams).  In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.

    Performing Multiple Distinct Aggregates Concurrently

    If SQL Server were able to perform the whole COUNT DISTINCT aggregate in a single operator (instead of splitting the DISTINCT and COUNT into two steps) there would be no reason to split plans with spools as seen above.  This could not be done with a Stream Aggregate since that operator requires the stream to be sorted on the DISTINCT expression, and it is not possible to sort a single stream in more than one way at the same time.

    On the other hand, the Hash Match Aggregate does not require sorted input (it keeps the distinct values in a hash table, remember), so it should be possible to design a Hash Match Aggregate that computes COUNT DISTINCT in a single operation.  We can test this idea with a User-Defined Aggregate (UDA) – SQL Server 2008 required:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using System.IO;
    using Microsoft.SqlServer.Server;
     
    [SqlUserDefinedAggregate
        (
        Format.UserDefined,
        IsInvariantToDuplicates = true,
        IsInvariantToNulls = true,
        IsInvariantToOrder = true,
        IsNullIfEmpty = false,
        MaxByteSize = -1
        )
    ]
    public struct CountDistinctInt : IBinarySerialize
    {
        // The hash table
        private Dictionary<int, object> dict;
     
        // Recreate the hash table for each new group
        public void Init()
        {
            dict = new Dictionary<int, object>();
        }
     
        // Ignore NULLs, store key values in the hash table
        public void Accumulate(SqlInt32 Data)
        {
            if (!Data.IsNull)
            {
                dict[Data.Value] = null;
            }
        }
     
        // Merge partial aggregates
        public void Merge(CountDistinctInt Group)
        {
            foreach (var item in Group.dict.Keys)
            {
                dict[item] = null;
            }
        }
     
        // Return the DISTINCT COUNT result
        public int Terminate()
        {
            return dict.Count;
        }
     
        // Required by SQL Server to serialize this object
        void IBinarySerialize.Write(BinaryWriter w)
        {
            w.Write(dict.Count);
     
            foreach (var item in dict.Keys)
            {
                w.Write(item);
            }
        }
     
        // Required by SQL Server to deserialize this object
        void IBinarySerialize.Read(BinaryReader r)
        {
            var recordCount = r.ReadInt32();
            dict = new Dictionary<int, object>(recordCount);
     
            for (var i = 0; i < recordCount; i++)
            {
                dict[r.ReadInt32()] = null;
            }
        }
    }

    This UDA does nothing more than create a hash table, add (non-NULL) values to it, and return the count of values when aggregation is complete (the hash table takes care of duplicates).  There is a little extra infrastructure code in there to allow SQL Server to serialize and reconstruct the hash table when needed (and merge partial aggregates) but the core of the function is just four lines of code.  The example above only aggregates integer values, but it is easy to extend the idea to include other types.  Anyway, armed with the integer and date time versions of the UDA, I now return to the multiple-distinct-count query that caused all the spooling, with COUNT DISTINCT replaced by UDA references:

    SELECT 
        dbo.CountDistinctInt(th.ProductID), 
        dbo.CountDistinctDateTime(th.TransactionDate), 
        dbo.CountDistinctInt(th.Quantity)
    FROM Production.TransactionHistory AS th
    WHERE
        th.ActualCost < $5
    GROUP BY
        th.TransactionType
    OPTION (RECOMPILE)

    Instead of all the Eager Spools, we now get this query plan:

    image

    You may be surprised to see that the three distinct count aggregates are being performed by a Stream Aggregate; after all I just finished explaining why a Stream Aggregate could not possibly do what we want here.  The thing is that all CLR UDAs are interfaced to query plans using the Stream Aggregate model.  The fact that this UDA uses a hash table internally does not change that.

    The Sort in this plan is there to ensure that groups of rows arrive at the ‘Stream Aggregate’ interface in the required GROUP BY order.  This is so SQL Server knows when to call the Init() and Terminate() methods on our UDA.  The COUNT DISTINCT aggregation that is happening inside the UDA for each group could not care less about ordering, of course.  (In case you were wondering, yes, the UDA produces the same results as the original T-SQL code).

    The point here is to demonstrate that multiple DISCOUNT COUNT operations can be performed within a single operator, not that UDAs are necessarily always a great way to do that in general.  As far as performance is concerned, the original spool query runs in around 220ms, and the UDA settles down around 160ms which isn’t bad, all things considered.

    We can also improve the performance of the T-SQL query by rewriting it to avoid the spools by scanning the source table three times (this executes in around 75ms).  Part of the problem here (to go with the spool plan issues mentioned earlier) is that the optimizer assumes that all queries start executing with an empty data cache, and it does not account for the fact that the three scans complete sequentially, so the pages are extremely likely to be available from memory for the second and third scans.  The rewrite and query plan are below.

    WITH 
        Stream AS 
        (
            SELECT 
                th.TransactionType, 
                th.ProductID, 
                th.TransactionDate, 
                th.Quantity 
            FROM Production.TransactionHistory AS th 
            WHERE 
                th.ActualCost < $5
        ),
        CountDistinctProduct AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT ProductID) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        ),
        CountDistinctTransactionDate AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT TransactionDate) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        ),
        CountDistinctQuantity AS
        (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT Quantity) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
        )
    SELECT
        p.c,
        d.c,
        q.c
    FROM CountDistinctProduct AS p
    JOIN CountDistinctTransactionDate AS d ON
        d.TransactionType = p.TransactionType
    JOIN CountDistinctQuantity AS q ON
        q.TransactionType = d.TransactionType

    image

    Combining a Single Distinct Aggregate with other Aggregates

    Marc Friedman’s blog post presented a way to rewrite T-SQL queries that contain a single distinct aggregate and one or more non-distinct aggregates so as to avoid spools or reading the source of the rows more than once.  The essence of the method is to aggregate first by the GROUP BY expressions in the query and the DISTINCT expressions in the aggregate, and then to apply some relational math to aggregate those partial aggregates to produce the final result.  I encourage you to read the full post to see all the detail, but I will quickly work through an example here too:

    SELECT
        dp.EnglishProductName,
        SUM(frs.SalesAmount),
        COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
    FROM dbo.FactResellerSales AS frs
    JOIN.dbo.DimProduct AS dp ON
        frs.ProductKey = dp.ProductKey
    GROUP BY
        dp.EnglishProductName
    OPTION (MAXDOP 1, RECOMPILE)

    This query contains a regular SUM aggregate and a COUNT DISTINCT, so as expected the query optimizer produces a plan with an Eager Spool (click to enlarge):

    image

    To the left of the spool, the top branch performs the DISTINCT followed by the COUNT per group, and the spool replay on the lower branch of the plan computes the SUM per group.  Finally, the two branches join on the common GROUP BY expression (EnglishProductName).

    The rewrite starts by grouping on EnglishProductName (the GROUP BY expression) and SalesTerritoryKey (the DISTINCT expression) to produce a partial aggregate:

    SELECT 
        dp.EnglishProductName, 
        frs.SalesTerritoryKey, 
        SUM(frs.SalesAmount) AS ssa
    FROM dbo.DimProduct AS dp
    JOIN dbo.FactResellerSales AS frs ON 
        frs.ProductKey = dp.ProductKey
    GROUP BY 
        frs.SalesTerritoryKey, 
        dp.EnglishProductName

    This query contains no distinct aggregates, so we get a plan with an ordinary join and and ordinary SUM aggregate:

    image

    To produce the results specified by the original query, we now need to SUM the partial SalesAmount sums, and COUNT (ignoring NULLs) the SalesTerritoryKey values.  The final rewrite looks like this:

    WITH PartialAggregate AS
    (
        SELECT 
            dp.EnglishProductName, 
            frs.SalesTerritoryKey, 
            SUM(frs.SalesAmount) AS ssa
        FROM dbo.DimProduct AS dp
        JOIN dbo.FactResellerSales AS frs ON 
            frs.ProductKey = dp.ProductKey
        GROUP BY 
            frs.SalesTerritoryKey, 
            dp.EnglishProductName
    )
    SELECT
        pa.EnglishProductName, 
        SUM(pa.ssa),
        COUNT_BIG(pa.SalesTerritoryKey)
    FROM PartialAggregate AS pa
    GROUP BY pa.EnglishProductName
    OPTION (RECOMPILE)

    The query plan adds another layer of aggregation on top of the partial aggregate plan above:

    image

    This plan avoids the Eager Spools seen earlier and improves execution time from 320ms to 95ms in this example.  On larger sets, where parallelism becomes important and the spools might need to use physical tempdb storage, the gains are likely to be much larger.

    New for SQL Server 2012 RC0

    The good news is that SQL Server 2012 RC0 adds a new optimizer rule (ReduceForDistinctAggs) to perform this rewrite automatically on the original form of the query.  This is particularly good because the rewrite, while ingenious, can be somewhat inconvenient to do in practice, and all too easy to get wrong (particularly ensuring that NULL partially-aggregated groups are handled correctly).  The new optimizer rule was not available in CTP3, so you will need RC0 to see SQL Server turn this T-SQL:

    SELECT
        dp.EnglishProductName,
        SUM(frs.SalesAmount),
        COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
    FROM dbo.FactResellerSales AS frs
    JOIN.dbo.DimProduct AS dp ON
        frs.ProductKey = dp.ProductKey
    GROUP BY
        dp.EnglishProductName
    OPTION (RECOMPILE)

    …directly into this query plan:

    image

    This transformation is only valid for queries with a single distinct aggregate (and at least one non-distinct aggregate of course).  If your query contains multiple distinct aggregates, it may not help you directly, though you may be able to refactor the T-SQL to take advantage.

    If you want to see SQL Server 2012 RC0 produce the Eager Spool plan instead (created by the long-standing ExpandDistinctGbAgg rule), disable the new rule with:

    DBCC RULEOFF ('ReduceForDistinctAggs')

    …and then recompile.  Don’t forget to enable it again afterward using RULEON or by reconnecting to the server.

    Even with the new rule enabled, you may still see the spool or multiple-scan plan from time to time.  As always, the optimizer may explore many alternative plan forms and choose the one that looks cheapest.  In some cases, the optimizer may still choose the spool plan, though it probably won’t be right to do so…

    Thanks for reading; please consider voting for the following Connect item (to allow CLR UDA hash aggregates).  Thank you.

    http://connect.microsoft.com/SQLServer/feedback/details/629920/allow-option-hash-group-with-sqlclr-udas

    © 2011 Paul White

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • How to Find the Statistics Used to Compile an Execution Plan

    In this post, I show you how to determine exactly which statistics objects were used by the query optimizer to produce an execution plan.

    Trace Flags

    We will need three undocumented trace flags.  The first one (3604) is well-known – it redirects trace output to the client so it appears in the SSMS messages tab.

    The second trace flag is 9292.  With this enabled, we get a report of statistics objects which are considered ‘interesting’ by the query optimizer when compiling, or recompiling the query in question.  For potentially useful statistics, just the header is loaded.

    The third trace flag is 9204.  With this enabled, we see the ‘interesting’ statistics which end up being fully loaded and used to produce cardinality and distribution estimates for some plan alternative or other.  Again, this only happens when a plan is compiled or recompiled – not when a plan is retrieved from cache.

    You can enable and disable these flags with the usual DBCC TRACEON and TRACEOFF commands, but it is also possible to enable them just for a particular statement using the undocumented QUERYTRACEON query hint (demonstrated below).

    Sample Query

    DBCC FREEPROCCACHE
     
    SELECT 
        p.Name,
        total_quantity = SUM(th.Quantity)
    FROM AdventureWorks.Production.Product AS p
    JOIN AdventureWorks.Production.TransactionHistory AS th ON
        th.ProductID = p.ProductID
    WHERE
        th.ActualCost >= $5.00
        AND p.Color = N'Red'
    GROUP BY
        p.Name
    ORDER BY
        p.Name
    OPTION
    (
        QUERYTRACEON 3604,
        QUERYTRACEON 9292,
        QUERYTRACEON 9204
    )

    The DBCC FREEPROCCACHE is just there to empty the plan cache so we get a compilation.  You can also evict the current plan from cache if you know its handle (SQL Server 2008) or use a RECOMPILE query hint.  Using RECOMPILE is often convenient, but you may get a different plan compared to that obtained without the hint.  Note that compiling the query is enough – we do not need to execute the query; simply requesting an ‘estimated plan’ will do.  It doesn’t hurt to run it either though, just to be clear.

    Sample Output

    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 1, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 1, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 3, 
        ColumnName: Name, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 3, 
        ColumnName: Name, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 11, 
        ColumnName: Color, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.Product, 
        IndexId: 11, 
        ColumnName: Color, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 2, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 2, 
        ColumnName: ProductID, 
        EmptyTable: FALSE
     
    Stats header loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 5, 
        ColumnName: ActualCost, 
        EmptyTable: FALSE
     
    Stats loaded: 
        DbName: AdventureWorks, 
        ObjName: AdventureWorks.Production.TransactionHistory, 
        IndexId: 5, 
        ColumnName: ActualCost, 
        EmptyTable: FALSE

    There’s no sign of an official way to get this very useful information in Denali, despite it being requested many times over the years.  Trace flag 9204 works at least as far back as SQL Server 2005.  Both 92xx flags work in 2008, R2, and Denali CTP 3.

    Enjoy!

    © 2011 Paul White

    SQLkiwi@gmail.com
    @SQL_Kiwi

  • Can a SELECT query cause page splits?

    Books Online has this to say about page splits:

    When a new row is added to a full index page, the Database Engine moves approximately half the rows to a new page to make room for the new row.  This reorganization is known as a page split.  A page split makes room for new records, but can take time to perform and is a resource intensive operation. Also, it can cause fragmentation that causes increased I/O operations.

    Given that information, how can a SELECT statement be responsible for page splits?  Well, I suppose we could SELECT from a function that adds rows to a table variable as part of its internal implementation, but that would clearly be cheating, and no fun at all from a blogging point of view.

    Tables & Sample Data

    To demonstrate the effects I want to talk about today, I’m going to use a couple of tables with data from the sales data in the Adventure Works sample database.  There’s a full script at the end of this post for those of you that like to run these things for yourselves, but the data generally looks like this: (header table on the left; detail table on the right):

    image

    Both tables have a clustered index on the Sales Order ID column.  The query I’m going to run is this:

    image

    I have expressed the natural join predicate as (X >= Y) AND (X <= Y) instead of (X = Y) to prevent the optimizer choosing a hash or merge join (both of which require at least one equality predicate to perform an inner join).  The “+ 0” is there to prevent the optimizer from using an indexed nested loop join (that is, using a seek on the inner side of the join).  These tricks are necessary to generate a plan with an eager index spool on this small sample data set.

    Checking for Splits

    There are several ways to check for page splits while this query is running.  I’m going to use the new extended events UI in SQL Server code name “Denali”, which, by the way, I’m quite impressed with: it’s a lot like using Profiler inside SSMS and does away with all that tedious mucking about with CREATE EVENT SESSION statements and XQuery (there are also no improbability factors to compute).  Ok, so running the query above and looking for page splits via extended events, I get output like this:

    image

    These page splits are all caused by the Index Spool in the query plan.  This iterator builds a temporary b-tree index on the data produced by its sub-tree, and like any index insert operation, page splits will occur.  In fact this is a bit of a cheat, because these page splits aren’t really page splits at all – notice the sequential nature of the pages being ‘split’ (the page_id column), and its relationship with the new page.

    Even in Denali, we don’t yet have a nice easy way to distinguish between a traditional page split (where roughly half the existing rows are moved to a new page) and the case where a whole new page is allocated at one end of the index.  This strikes me as an amazing oversight.  I don’t know about you, but I generally don’t fret much over tables or indexes allocating a whole new page at one end to store new rows – I’m much more interested in existing index pages that need to split in half!  I certainly hope this improves before RTM.

    Index Spool Internals

    Having shown page splits that aren’t really page splits, let’s get to demonstrating an index spool that ‘really does’ split pages.  To do that, we need to know something about the way index spools build their temporary index.  You have probably noticed that most query plans that build permanent indexes involve a sort, which in turn requires a memory grant.

    Index spools do not get a memory grant, and are not preceded by an explicit sort operator in query plans.  Given that indexes require sorting, how does an index spool manage to build a b-tree index structure at all?  The answer is that index spools just insert rows into an empty b-tree in the order they are received.  In general, the rows used for the index will not be received in sorted order, resulting in lots and lots of page splits in the temporary index b-tree.

    The first run of the query did not suffer from this problem due to the clustered index on the header table, which tends to produce rows ordered by Sales Order ID – exactly the order of the index built by the index spool.  Let’s rebuild the header table clustered index to use the Customer ID column instead, and re-run the test.  The query plan looks exactly the same – the only difference is that the index spool is fed by a clustered index scan that is now extremely unlikely to produce rows in Sales Order ID sorted order:

    image

    The page splits reported by the extended events ‘trace’ are now very different:

    image

    Notice the same source page (page_id) being split over and over again as rows arrive in unsorted order.  There are now very many more page splits than before, and they are a lot more expensive because we are now not just allocating new empty pages at the end of the index – we have to shift half the rows and fix up all the page pointers.  In practice, this is the sort of splitting you are likely to see in tempdb with an index spool – rows are very unlikely to be in index order (else why would we need the spool at all?).  On that last point, consider the hoops I had to jump through to demonstrate a plan where the rows were in temporary index order – all that mess with “>= AND <=” and “+ 0”…

    Parallel Index Spools

    The last thing I want to cover about index spools is how they behave in parallel plans.  You might have noticed the MAXOP 1 hint in the previous examples.  Let’s remove that now, add some wait type monitoring, and see how we go:

    image

    Don’t be fooled by the parallelism indicator on the index spool – the index building portion of its execution happens on a single thread.  One way to see this is to capture thread and task information using the page splitting event session, where the same task and thread ids are reported for every one of the hundreds of events:

    image

    The wait statistics provide additional confirmation:

    image

    Ignore the CXPACKET waits (they are perfectly normal) and look at the EXECSYNC waits.  This query ran at DOP 8 on my machine, so 7 parallel threads had to wait while a single thread built the temporary index.  The index build took 114ms, and 7 threads waiting simultaneously for that time gives 7 x 114 = 798ms.  That’s within one quantum (4ms) of the total recorded EXECSYNC wait time.  EXECSYNC is used for lots of different things – whenever synchronization of parallel workers needs to occur outside a Parallelism operator, so please don’t take away the idea that EXECSYNC = index spool.

    © Paul White 2011

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    Script:

    -- Any AdventureWorks version should be fine
    USE AdventureWorks2008R2
    GO
    -- Tables
    CREATE TABLE #OrdHeader
    (
        SalesOrderID        INTEGER NOT NULL,
        OrderDate           DATETIME NOT NULL,
        SalesOrderNumber    NVARCHAR(25) NOT NULL,
        CustomerID          INTEGER NOT NULL
    )
    GO
    CREATE TABLE #OrdDetail
    (
        SalesOrderID        INTEGER NOT NULL,
        OrderQty            SMALLINT NOT NULL,
        LineTotal           NUMERIC(38,6) NOT NULL
    );
    GO
    -- Data
    INSERT #OrdHeader
    SELECT
        h.SalesOrderID,
        h.OrderDate,
        h.SalesOrderNumber,
        h.CustomerID
    FROM Sales.SalesOrderHeader AS h
    GO
    INSERT #OrdDetail
    SELECT
        d.SalesOrderID,
        d.OrderQty,
        d.LineTotal
    FROM Sales.SalesOrderDetail AS d
    GO
    -- Clustered indexes
    CREATE UNIQUE CLUSTERED INDEX cuq ON #OrdHeader (SalesOrderID) WITH (MAXDOP = 1)
    CREATE CLUSTERED INDEX cuq ON #OrdDetail (SalesOrderID) WITH (MAXDOP = 1)
    GO
    -- Test query
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    OPTION  (MAXDOP 1)
    GO
    -- Lose the ordering
    CREATE CLUSTERED INDEX cuq ON #OrdHeader (CustomerID) 
    WITH (DROP_EXISTING = ON, MAXDOP = 1)
    GO
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    OPTION  (MAXDOP 1)
    GO
    -- Parallelism
    DBCC SQLPERF('sys.dm_os_wait_stats', 'CLEAR')
    GO
    SELECT
        d.OrderQty,
        h.SalesOrderNumber,
        h.OrderDate
    FROM #OrdDetail AS d
    JOIN #OrdHeader AS h ON
        d.SalesOrderID >= h.SalesOrderID + 0
        AND d.SalesOrderID <= h.SalesOrderID + 0
    GO
    SELECT
        ows.wait_type,
        ows.waiting_tasks_count,
        ows.wait_time_ms,
        ows.max_wait_time_ms,
        ows.signal_wait_time_ms
    FROM sys.dm_os_wait_stats AS ows
    WHERE
        ows.wait_type IN (N'CXPACKET', 'EXECSYNC')
    GO
    DROP TABLE 
        #OrdHeader, 
        #OrdDetail
    GO
    /*
    -- Denali Extended Events Session Script (change the session_id filter!)
    -- Try the UI :)
     
    CREATE EVENT SESSION [Page Splits] ON SERVER 
    ADD EVENT sqlserver.page_split
        (
        ACTION(sqlos.system_thread_id,sqlos.task_address,sqlserver.sql_text)
    WHERE 
        ([package0].[equal_uint64]([sqlserver].[session_id],(60)))) 
    ADD TARGET 
        package0.ring_buffer(SET max_memory=(65536))
    WITH 
    (
        MAX_MEMORY=4096 KB,
        EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
        MAX_DISPATCH_LATENCY=30 SECONDS,
        MAX_EVENT_SIZE=0 KB,
        MEMORY_PARTITION_MODE=NONE,
        TRACK_CAUSALITY=OFF,
        STARTUP_STATE=OFF
    )
    */
  • SQL Server, Seeks, and Binary Search

    The following table summarizes the results from my last two blog entries, showing the CPU time used when performing 5 million clustered index seeks:

    Clustered Index Seek Test Results

    In test 1, making the clustered index unique improved performance by around 40%. In test 2, making the same change reduced performance by around 70% (on 64-bit systems – more on that later).  As a reminder, both tests use nested loops to join a single-column BIGINT table of numbers to itself:

    Query and Plans

    In test 1, the table contains numbers from 1 to 5 million inclusive; for test 2, the table contains the even numbers from 2 to 10 million (still 5 million rows).  In case you were wondering, the test 2 results are the same if we populate the table with odd numbers from 1 to 9,999,999 instead of even numbers.  How can we explain these results?

    Performing an Index Seek

    Perhaps we need to look a bit deeper into how SQL Server performs an index seek.  Most people know that SQL Server indexes are a B+ tree, so let’s work through an example of locating the row containing the value 1 million, when the table contains values from 1 to 5 million, with a unique clustered index.

    Index seeks start at the index root page, which we can find in a number of ways (e.g. sys.system_internals_allocation_units).  I happened to use DBCC IND to determine that the root of the index was page 117482 in file 1 (in Denali, we can use a new DMF: sys.dm_db_database_page_allocations).  Anyway, whichever way you find it, we can use DBCC PAGE to look at the contents of the root page:

    Index Root Page Contents

    The col1 (key) column shows the lowest key value that can possibly be present in the lower-level page specified by Child Page Id.  We can see that the lowest possible key value on page 117485 is 906305, and the next lowest key value possible is 1132881 on page 117486.  So, if a row containing the value 1 million exists, the next page to look at is 117485:

    Level 1 Index Contents

    We are now at level 1 of the index, and have to scroll down the output a bit further to see that the next page to look at is 119580.  This page is at the leaf level of the index, and we find our target value at position 399:

    Leaf level index record

    Binary Search

    When performing an index seek, SQL Server obviously doesn’t run DBCC PAGE and scroll down the output window to find the next page to look at like we just did.  You may know that SQL Server can use binary search to locate index entries, but it’s worth taking a closer look at the index pages to see how this works in practice:

    SQL Server Page Structure Diagram

    The slot array contains a series of two-byte offsets pointing to the index records.  The index records are not necessarily stored in any particular order on the page; only the slot array entries are guaranteed to be in sorted order.  Note that the sorted slot array entries do not contain the index keys themselves, just offsets.

    Performance

    When SQL Server uses binary search to locate an index entry, it starts by checking the middle element of the slot array (shown in red in the example above).  It then follows that offset pointer to the index record, and compares the value it finds there with the sought value.  Assuming the value is not the one we are looking for, the sorted nature of the slot array allows SQL Server to narrow the search range in half after this first comparison.

    If the red index record happens to contain a value higher than the sought value, we know the value we are looking for is in that half of the slot array that sorts lower than the red entry.  This process of cutting the search space in half continues until we find the record we are looking for, or until it becomes clear that the value does not exist.

    The average and worst case performance of binary search is very close to log2N comparisons to find the item of interest in a sorted list of N items.  For the 476 index entries found in our test example index structure at the leaf level, that means at most 9 comparison operations.  That compares well to a linear search of the sorted slot array, which might require as many as 476 comparisons if we are unlucky, 238 on average.

    None of this offers any particular insight into why a binary search into a unique indexes containing only even (or odd!) numbers should be so slow.  One of the advantages of binary search is that it expects to perform around the same amount of work, regardless of the distribution of the values in the index.  Luckily, binary search is not the only game in town.

    Linear Interpolation

    As efficient as binary search is in the general case, in some cases we can do better.

    When searching for the value 1 million, we know from level 1 of the index that the lowest key value that can appear on the leaf level page we are interested in is 999,601.  We also know that the lowest key value on the next page is 1,000,077.  From the header information on our target page, we also know there are 476 entries in the slot array on that page.  Given these facts, we can make an immediate guess at which slot the value 1,000,000 ought to appear in:

    expected slot = (1000077 - 999601) / 476 * (1000000 - 999601) = 399

    This simple calculation takes us immediately to slot 399.  As we saw earlier, slot 399 does indeed contain the value 1,000,000:

    Slot 399

    Naturally, the quality of the linear interpolation guess depends on how evenly distributed the values are within the page.  To a lesser extent, it also depends on how closely the minimum key value information at level 1 of the index matches reality.  In both tests shown in earlier blog entries, the values are precisely evenly distributed and the level 1 index key information is spot on.

    Performance

    Linear interpolation has the potential to be more efficient even than binary search, finding the target value in one comparison in this example, compared to nine comparisons for binary search.  Even where the data is not perfectly distributed, there is scope for linear interpolation to be superior, simply by applying the technique repeatedly on successively narrower ranges.

    A reasonable practical compromise might be to try linear interpolation two or three times, before falling back to linear or binary search on the remaining range.  The advantage of linear interpolation may not seem much, but consider that SQL Server makes these comparisons for every index seek operation – when doing many millions of seeks, the difference can soon add up.  It is probably not coincidence that the OLTP TPC benchmark tests perform a very great number of singleton index seeks.

    Linear Regression

    A natural extension to the idea of linear interpolation is to apply a linear regression (line of best fit) instead:

    Linear Regression Diagram

    In the diagram, the blue dotted line represents a linear interpolation based on the values of the smallest and highest values only.  The black line is obtained by a linear regression of all the data values, and results in a much better fit.

    Since every straight line can be represented by a formula of the form y = mx + b, we can completely specify it by recording the values of the ‘m’ (slope) and ‘b’ (y-axis intercept) parameters.  The R2 value gives an idea of how good a fit the linear regression line is to the data.  In the present context, the ‘y’ value represents the indexed value, and the ‘x’ value is the slot position where that value can be found.  It’s easy to see that once we have a linear regression line, we can estimate the slot position for a particular sought index value by solving x = (y – b) / m.

    To use linear regression in a database, we might imagine storing the ‘b’ and ‘m’ parameter values somewhere (perhaps in a compact format in the page header), and deciding whether to apply the technique based on the R2 value.

    What Does SQL Server Do?

    All this is fine in theory, but does SQL Server really ever use anything except binary search?  To see, I ran tests with a debugger attached to the SQL Server process, and stepped through the calls made while the seek test queries were running.  The call stack below was obtained when running 64-bit SQL Server:

    64-bit Call Stack

    By contrast, the following call stack comes from 32-bit SQL Server:

    32-bit Call Stack

    Interestingly, SQL Server follows the same code paths (at least at the call stack level, which is the best we can do without source code) regardless of whether the index is defined as unique or not, and for all the numeric data types I tested (INT, BIGINT, REAL, FLOAT, DECIMAL with a zero scale).

    The performance problem on x64 servers appears to be most pronounced when the key values have a small fixed gap between them.  As we have seen, performance is very much worse on x64 servers compared with x86 versions when the table contains just even or odd numbers (i.e. with a gap of 1 between index entries).  If we change the test to multiply the original entries by a factor of ten (to produce a sequence of 10, 20, 30, 40…) the performance penalty all but disappears, and the x64 unique index test is around 30% faster than the non-unique test.

    Conclusions and Further Reading

    From the evidence available, my best guess is that 64-bit SQL Server does use linear interpolation and/or linear regression instead of binary search under suitable conditions, but that the implementation has edge-case poor performance where the index keys are separated by a small fixed offset.  It seems that we can achieve best performance by using a unique index where possible, and ensuring that keys are, as far as practicable, sequential and contiguous.

    B-tree Indexes, Interpolation Search, and Skew – Microsoft Research
    Interpolation Search – a log log N Search (PDF)

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Avoiding Uniqueness for Performance

    In my last post, I showed how using a index unique could speed up equality seeks by around 40%.

    For today’s entry, I’m going to use the same tables as last time (single BIGINT column, one table with a non-unique clustered index, and one table with a unique clustered index):

    CREATE TABLE dbo.SeekTest
    (
        col1    BIGINT NOT NULL,
    );
    GO
    -- Non-unique clustered index
    CREATE CLUSTERED INDEX cx
        ON dbo.SeekTest (col1)
    GO
    CREATE TABLE dbo.SeekTestUnique
    (
        col1    BIGINT NOT NULL
    );
    GO
    -- Unique clustered index
    CREATE UNIQUE CLUSTERED INDEX cuq
        ON dbo.SeekTestUnique (col1)

    Test Data

    This time, instead of filling the tables with all the numbers from 1 to 5 million, we’ll add just the even numbers from 1 to 10 million.  At the risk of stating the slightly obvious, this results in tables with the same number of rows as previously, just no odd numbers:

    INSERT dbo.SeekTest WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        2 * ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;
    GO
    INSERT dbo.SeekTestUnique WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        2 * ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    Test Run: Non-Unique Index

    The test is the same as before: join the test table to itself using a nested loops join, and count the rows returned.

    Don’t be fooled by the simplistic nature of the test; I realize you rarely use loops join to self join all the rows in a table.  The point here is to check how long it takes to perform 5 million row joins – something that probably happens quite often in most production systems, either as 5 million lookups into a single very much larger table, or perhaps by running a query that does 50,000 row-joins 100 times.  Anyway, here’s the query:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTest AS st WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTest AS st2 WITH (TABLOCK)
        ON st.col1 = st2.col1
    OPTION (MAXDOP 1);

    And the ‘actual’ query plan:

    SNAGHTML10ed6a4

    I get these results:

    image

    Notice the 5,000,001 scan count showing that we are performing 5 million range scans (not singleton seeks).  As far as performance goes, running this query with all the data in cache uses 9,251ms of CPU.

    Test Run: Unique Index

    Cool.  Now we know from last time that we can improve on this result by making the index unique, and performing singleton seeks instead of range scans.  Let’s do that then:

    SELECT
        COUNT_BIG(*)
    FROM dbo.SeekTestUnique AS stu WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTestUnique AS stu2 WITH (TABLOCK)
        ON stu.col1 = stu2.col1
    OPTION (MAXDOP 1);

    Actual plan:

    SNAGHTML1102d2d

    And the results are:

    image

    As expected, the scan count confirms we are now doing singleton seeks, and the CPU time has improved to 16,005ms.

    Er, Hang On…

    If you think that going from 9,251ms of CPU to 16,005ms of CPU is not exactly an improvement, you’d be right.  There’s no typo there, no inadvertent switching of the unique and non-unique examples, and no tricks.

    Making the index unique really has slowed down this query by around 70%.

    The explanation is so interesting, it deserves a full post of its own…so stay tuned for that.  Feel free to speculate about the cause in the meantime… in the comments below, by email, or on Twitter.

    There is some evidence coming through in the comments that this behaviour is specific to x64 installations. I’d love to get some more validation of this if you have time to test on x86 and/or x64. Thanks.

    Disclaimer:

    In general, you will want to specify an index as UNIQUE wherever you can.  Many queries will benefit from a unique index rather than a non-unique one.  This post is very much to show that “It Always Depends” and to set the stage for the next post in this series.  Don’t run with scissors.  Warning filling may be hot.  May contain nuts… Sarcastic smile

    © 2011 Paul White
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • Enforcing Uniqueness for Performance

    A little while back, I posted a short series on seeks and scans, and one of the things I highlighted was the difference between a singleton seek and a range scan.  You can find that post here, if you want a refresher.

    Anyway, the broad point is that a singleton equality seek always retrieves exactly one row, and is guaranteed to do so because a unique index exists to enforce it.  A range scan, on the other hand, seeks down the B-tree to a starting (or ending) point, and scans forward (or backward) from that point using the next or previous page pointers.  The point of today’s short post is to show how much faster a singleton seek is, compared with a range scan, even when both return exactly the same number of records.

    Pretty simple test rig today; a 5 million row table with a single BIGINT NOT NULL column with all the values from 1 to 5,000,000:

    CREATE TABLE dbo.SeekTest
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE CLUSTERED INDEX cx ON dbo.SeekTest (col1);
    GO
    -- 5 million rows numbered 1 to 5,000,000
    INSERT dbo.SeekTest WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    Notice that the clustered index is not defined as unique.  There will not be any uniquifiers (in case you think I am cheating) because SQL Server only adds them when necessary, and there are no duplicates in this example.  The test query simply joins the table to itself, forcing a nested loops join so we get 5 million seek operations:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTest AS st WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTest AS st2 WITH (TABLOCK) ON
        st2.col1 = st.col1
    OPTION (MAXDOP 1);

    And here’s the query plan:

    image

    On my machine, typical results are:

    image

    Notice the Scan Count.  We get one scan for the Clustered Index Scan iterator in the plan, and one scan for every seek operation.  This is one way to see that we are getting range scans here – the lack of a unique index means that SQL Server cannot guarantee that only one row will be returned from each seek, so a range scan is performed to pick up any extra rows.

    Now let’s create a second table, where the only difference is that the index is declared as UNIQUE:

    CREATE TABLE dbo.SeekTestUnique
    (
        col1    BIGINT NOT NULL
    );
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.SeekTestUnique (col1);
    GO
    -- 5 million rows numbered 1 to 5,000,000
    INSERT dbo.SeekTestUnique WITH (TABLOCKX)
        (col1)
    SELECT TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM 
        master.sys.columns AS c,
        master.sys.columns AS c2,
        master.sys.columns AS c3;

    As before, we join this table to itself using loops join (the TABLOCKs are just to reduce locking overheads):

    SELECT 
        COUNT_BIG(*)
    FROM dbo.SeekTestUnique AS stu WITH (TABLOCK)
    INNER LOOP JOIN dbo.SeekTestUnique AS stu2 WITH (TABLOCK) ON
        stu2.col1 = stu.col1
    OPTION (MAXDOP 1);

    And we get the same plan:

    image

    But very different results:

    image

    Now there is only one scan, because the seeks are all singleton seeks, and execution time has improved from 9472ms to 6096ms.

    There are lots of good reasons to enforce uniqueness where you can (and a few not to).  For further information see:

    Is it a Seek or a Scan?
    Sneaky Reads with Unique Indexes
    Unique Indexes with GROUP BY (Rob Farley)
    Scan Count Meaning in STATISTICS IO Output – Part 1 (Amit Banerjee)
    Scan Count Meaning in STATISTICS IO Output – Part 2 (Amit Banerjee)

    © 2011 Paul White

    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement