
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

SQL Server Optimizer Bug with JOIN and GROUP BY

I came across a SQL Server bug recently that made me wonder how on earth I never noticed it before.  As the title of this post suggests, the bug occurs in common JOIN and GROUP BY queries, and while it does not cause incorrect results to be returned, it will often cause the optimizer to select a poor query plan.  If you are just interested in the bug itself, you will find a description and Connect item toward the end of this post.  As regular readers will expect from me, though, I am going to work up to it with a bit of background…

Example Query

Using everyone’s favourite Adventure Works database, here’s a simple query that includes a JOIN and a GROUP BY:

SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 1, RECOMPILE)

One perfectly reasonable query plan possibility looks like this (estimated row counts shown):

[Query plan image: Merge Join of Product and TransactionHistory followed by a Hash Match Aggregate, with estimated row counts]

After the Merge Join, rows go on to be grouped and counted.  This simple plan has an estimated cost of 1.1 units on my system.  The optimizer has chosen a Merge Join because indexes exist on the join column to provide sorted inputs, and the foreign key relationship between these two tables means that the join will be the efficient one-to-many type.  A Hash Match Aggregate is estimated to be cheaper than the alternative Sort and Stream Aggregate combination in this plan.  For reasons that will become apparent shortly, note that the aggregate computes the following expression (a simple count of all the rows):

[Expr1004] = Scalar Operator(COUNT(*))
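If you want to examine plan shapes, costs, and aggregate expressions on your own system without executing anything, one option (among several) is to request the estimated plan as XML.  Note that what comes back is the plan the optimizer actually selects which, as the next section shows, is not the plan above:

SET SHOWPLAN_XML ON;
GO
SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 1, RECOMPILE);
GO
SET SHOWPLAN_XML OFF;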

Partial Aggregates

The query plan shown above is not the one selected by the optimizer, however.  One of the tricks available to it is to move most of the aggregation work before the join.  The idea is that reducing row counts as early as possible in a plan generally pays off.  The plan actually selected by the optimizer is this one (estimated row counts again):

[Query plan image: a partial Stream Aggregate below the Merge Join, followed by a Sort and a second Stream Aggregate, with estimated row counts]

The new Stream Aggregate below the join computes the count of rows from the history table grouped by product id.  There are 441 unique product id values in the history table, so 441 product ids (and row counts) flow on to be joined with the 504 rows from the product table.  Note that the estimate of 441 groups exactly matches reality.

With a smaller number of estimated rows coming out of the join, the optimizer chooses a Sort and Stream Aggregate combo instead of the Hash Match Aggregate seen previously.  To get the correct query result, the second aggregate uses SUM to add the per-product row counts together.  The estimated cost of this plan is 0.35 units according to SQL Server’s costing model, so you can see why the optimizer prefers it over the previous plan with a cost of 1.1 units.

Clicking on the new aggregate below the join and looking at its properties, we see that it groups by product id and computes a partial aggregate expression:

[partialagg1005] = Scalar Operator(Count(*))

In this case, the expression label is the only way to identify this as a partial aggregate – there is no other more obvious indication.  The expression computed by the second Stream Aggregate (the one after the Sort) is:

[Expr1004] = Scalar Operator(SUM([partialagg1005]))

This aggregate is a Global aggregate: it computes the correct overall result based on partial results calculated earlier in the plan.  To summarize, the single COUNT in the original plan has been replaced with a partial aggregate COUNT, followed by a global aggregate SUM of those counts.
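To make the transformation concrete, here is a hand-written query that expresses roughly the same logical idea (just a sketch for illustration – the optimizer performs the rewrite internally, and this hand-written form is not guaranteed to produce an identical plan):

SELECT
    SUM(pa.cnt)
FROM Production.Product AS p
JOIN
(
    -- Partial (local) aggregate: history rows counted per product
    SELECT
        th.ProductID,
        COUNT_BIG(*) AS cnt
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.ProductID
) AS pa ON
    pa.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 1, RECOMPILE)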

Local and Global Aggregation

This idea of computing part of an aggregate early and combining the partial results later on originated in parallel execution plans.  To illustrate the point, take a look at the following query and execution plan:

SELECT 
    COUNT_BIG(*) 
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

[You will need to set the instance’s cost threshold for parallelism configuration value to zero in order to get a parallel plan for this query.]
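[For reference, something along these lines will do it on a test system – remember to restore your original value afterwards:]

-- Allow parallel plans even for very cheap queries (test systems only)
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'cost threshold for parallelism', 0;
RECONFIGURE;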

[Query plan image: parallel plan with a per-thread (local) Stream Aggregate feeding a serial global Stream Aggregate]

The parallel aggregate computes the following partial aggregate expression per thread:

[partialagg1003] = Scalar Operator(Count(*))

The serial aggregate combines the partial aggregates using this expression:

[Expr1002] = Scalar Operator(SUM([partialagg1003]))

In parallel plans, the partial aggregate is more often referred to as a local aggregate, since it computes an aggregate that is local to the thread it is running on.  With four threads, there will be four local aggregations, which are combined to form the final result by the global aggregate, running on a single thread in this example.

One other fact that will be important later on: notice that the plan above estimates that four rows will be produced by the parallel Stream Aggregate operator – one for each thread.  This plan was generated on a computer with eight processors available to SQL Server.  The optimizer always estimates that half the processors will be available for parallel execution – this is a heuristic, and not one that makes an enormous amount of sense to me intuitively, but there you go, it is how it is.
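If you want to check how many processors SQL Server can see on your own machine, one quick way is to query the system information DMV:

-- Logical processors visible to SQL Server, and the schedulers created for them
SELECT cpu_count, scheduler_count
FROM sys.dm_os_sys_info;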

Partial or Local?

To illustrate how inconsistent SQL Server is with the terms ‘partial’ and ‘local’, consider this query:

SELECT
    COUNT_BIG(*)
FROM Production.TransactionHistory AS th
GROUP BY
    th.Quantity
OPTION (RECOMPILE)

[Query plan image: parallel plan with a Hash Match (Partial Aggregate), a Repartition Streams exchange hash-partitioned on Quantity, and per-thread Stream Aggregates]

The idea is the same, but the labels are different: notice that the hash aggregate now has an explicit Partial Aggregate label in the query plan.  You will only see this helpful change in parallel plans that use a Hash Partial (= Local) Aggregate (there is a sort of reason for this, explained in the next section on memory grants).

The parallel hash aggregate is computing COUNT(*) again, but now it is grouping by Quantity.  The query plan needs to do a bit more work to get a correct result, because as it stands rows from the index scan are distributed unpredictably between threads, so each thread can see rows from any or all Quantity groups.  The aggregate is still ‘partial’ but this time it is because each thread only sees part of the full row set.

To get the right answer (according to the semantics of the SQL query) SQL Server goes on to repartition the rows among new threads using Hash Partitioning on the Quantity column.  This step ensures that rows with the same Quantity value always end up on the same thread.  To be clear, the repartitioning says nothing about how many Quantity groups go to each thread – only that all the rows for a particular Quantity value end up on the same one.  For efficiency, we certainly hope that the partitioning results in a good distribution of groups among threads, but that is a completely separate issue.

Each Stream Aggregate (one per thread) can now safely compute a SUM of the partial aggregate results it receives on its thread, though it still has to group by Quantity since one thread may receive more than one Quantity group.  The final Stream Aggregate is still a Global Aggregate, since it uses local (= partial) aggregate results, even though it executes in parallel (if you prefer, the result each thread’s Global Aggregate produces is globally valid for output).

Memory Grants

Another interesting thing about the Hash Match Partial Aggregate is that it never acquires more than the very small amount of memory necessary to create a minimally-sized hash table.  If the aggregate runs out of memory while processing, it simply stops aggregating, and passes individual rows instead.  The global aggregate will still produce the correct result, but the query won’t be as efficient (for example, more rows have to move across the Repartition Streams exchange).  The partial aggregate is a performance optimization; it is not required for correctness (and note that a partial Stream Aggregate never needs a memory grant, so the same issue does not arise).
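As an aside, if you want to watch memory grants while a query executes, the memory grants DMV shows the requested and granted amounts per session (a quick sketch – run it from a second connection while the query of interest is executing):

-- Memory requested and granted for currently executing queries
SELECT session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
FROM sys.dm_exec_query_memory_grants;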

Optimizer Rules

The query optimizer rule that explores splitting ‘ordinary’ aggregates into local and global parts is called GenLGAgg for “Generate Local and Global Aggregate”.  In parallel plans, whether this transformation comes out cheaper depends on whether the plan saves enough cost (for example by reducing the number of rows that flow across the repartitioning exchange) to pay for the overhead of the extra aggregate operation.

In serial plans, there are no exchanges (by definition), so this transformation needs to find another way to pay for itself.  The way we saw earlier is to push the partial aggregate below a join, such that it reduces the cost of that join (and later operations) enough to at least pay for itself.  There is a second optimizer rule that explores the option of pushing a local aggregate below a join; its name is LocalAggBelowJoin.  Note how both rules refer to the term Local, as in ‘local to a thread’.
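If you want to see the effect of a particular rule on a test system, the undocumented QUERYRULEOFF hint can be used to disable it for a single compilation and compare the resulting plans (completely unsupported, so strictly for experimentation):

-- Undocumented and unsupported: disable the rule that pushes a local
-- aggregate below the join, then compare the resulting plan
SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 1, RECOMPILE, QUERYRULEOFF LocalAggBelowJoin)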

The Bug Revealed

Here’s our original query and the execution plan the optimizer selected again:

SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 1, RECOMPILE)

[Query plan image: the MAXDOP 1 plan again – the partial Stream Aggregate below the join estimates 441 rows]

Now here’s the plan produced for exactly the same query, with a MAXDOP 2 hint instead of MAXDOP 1:

SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class
OPTION (MAXDOP 2, RECOMPILE)

[Query plan image: the same serial plan compiled with MAXDOP 2 – the partial aggregate now estimates 882 rows]

The estimate of the number of rows emitted by the partial aggregate has doubled, from 441 rows (exactly right, remember) to 882.  If we specify MAXDOP 3, the row estimate trebles to 1323.  At MAXDOP 4, the estimate is 1764.  After MAXDOP 4, no increase occurs on my test machine because it has eight processors, and the optimizer estimates a maximum runtime DOP of half the number of processors available for parallel queries, as noted earlier.  If you have 64 processors, you could get the cost to multiply by 32…and so on.  To be absolutely clear about this, all these MAXDOP options still result in serial execution plans.

Yes, but I don’t use MAXDOP hints

Neither do I, much – at least not with values other than one or zero.  Anyway, let’s see what happens on the same machine without any hints at all:

SELECT
    COUNT_BIG(*)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
GROUP BY
    p.Class

[Query plan image: serial plan compiled with no hints – the partial aggregate now estimates 1,764 rows]

Yikes!  The correct estimate of 441 rows has been multiplied by 4 – the maximum possible on this 8-core machine, exactly as if we had specified (MAXDOP 4) to (MAXDOP 8) or even (MAXDOP 0).  Setting the server-wide max degree of parallelism configuration value to 1 (or using Resource Governor to do the same thing) will result in a correct estimate of 441 rows again (unless MAXDOP is specified).  You will also get correct estimates from the partial aggregate if SQL Server is running on a machine with only one processing unit, or if the affinity mask is set such that SQL Server can only see one processor.  See the pattern?
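[For completeness, the server-wide setting can be changed like this – again on a test system only, and remember to put your original value back:]

-- Server-wide max degree of parallelism of 1 (test systems only)
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max degree of parallelism', 1;
RECONFIGURE;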

Cause & Effects

In short, the cardinality estimation problem comes down to the partial aggregate (and only the partial aggregate) costing itself as if it were running in a parallel plan.  In a parallel plan, each instance of the partial aggregate would produce the estimated number of rows per thread – so for a plan estimated to execute at DOP 4, we would get a (correct) multiplier effect on the estimate.  For a serial plan, this is just plain wrong.

In the plan above, we are lucky that cardinality estimation for the merge join recognises that it cannot produce more rows than the 504 contained in the product table because of the foreign key relationship.  Nevertheless, the estimates for every operator after the problematic aggregate are still incorrect.

In other plans, this effect can completely change the plan choice above a partial aggregate.  As an example, here is the SQL Server 2012 query from my last post, with the MAXDOP 1 hint:

SELECT
    p.ProductModelID,
    COUNT_BIG(*),
    COUNT_BIG(DISTINCT th.TransactionDate)
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON 
    th.ProductID = p.ProductID
GROUP BY p.ProductModelID
OPTION (RECOMPILE, MAXDOP 1)

[Query plan image: the efficient SQL Server 2012 plan – the middle hash aggregate is a partial aggregate]

This gives us the efficient new query transformation present in SQL Server 2012 (the middle hash aggregate is a partial one).  The exact same query, at MAXDOP 3 or above (or without a MAXDOP hint at all on a machine with 6 cores or more):

[Query plan image: the old Eager Spool plan]

Now, we get the horribly inefficient old Eager Spool plan!

The optimizer still considers both alternatives, but the estimated plan costs explain the choice it makes: the good plan has an estimated cost of 2.8 at MAXDOP 1 (that is, without the partial aggregate costing error).  At MAXDOP 2, the good plan’s cost increases to 3.45 units.  The estimated cost of the Eager Spool plan is 3.5 units, regardless of the MAXDOP setting (exactly as it should be for a serial plan), so you can see that the spool plan becomes the apparently cheaper option at MAXDOP 3.  (I’m sure it doesn’t need saying again, but the MAXDOP hinted queries here all still produce non-parallel plans.)

The estimated I/O and CPU costs of the partial aggregate operator are also costed as if it were running in parallel at the estimated available DOP, further adding to the costing errors of the plan alternative.  Moreover, the cardinality estimation error will tend to propagate up the plan tree – increasing the costs of upstream operators, and causing larger memory grants to be requested than should be estimated to be necessary.

Overall, the effect is that you are very likely to receive the wrong plan for your query.  The consequences, for non-trivial Adventure Works queries at any rate, could be severe.  Look out for these estimation errors in any serial plan that features a partial aggregate.  Queries that include a JOIN and a GROUP BY are likely candidates for the GenLGAgg and LocalAggBelowJoin rules to introduce a partial aggregate – and remember that in serial plans there is no way to know that an aggregate is partial without checking the details of the expressions it computes…
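One rough way to look for plans already in cache that contain a partial aggregate is to search the plan XML for the tell-tale expression names (a crude sketch only – casting the plan to text and using LIKE is expensive on a busy server, and this does not distinguish serial plans from parallel ones):

-- Crude search of the plan cache for plans containing a partial aggregate
SELECT
    st.text,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%partialagg%';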

The Connect item for this bug can be found here:
https://connect.microsoft.com/SQLServer/feedback/details/711685/costing-cardinality-errors-with-partial-aggregates-in-serial-execution-plans

This issue reproduces on all versions of SQL Server from 2005 to 2012 RC0 with TF4199 enabled for optimizer fixes (the behaviour on 2005 is even more erratic than described above).


Update 8 Dec 2011: Thanks everyone for the votes – Microsoft have responded by confirming the bug, and plan to fix it sometime next year.

© 2011 Paul White

Twitter: @SQL_Kiwi
Email: SQLkiwi@gmail.com

Published Tuesday, December 06, 2011 12:17 PM by Paul White


Comments

 

Jeff said:

Very interesting indeed

December 5, 2011 10:47 PM
 

David Hay said:

Just when you thought you could trust the optimizer...  Be interested to see what Microsoft says about this one.

December 6, 2011 8:27 AM
 

Paul White said:

Hi David,

The cynical side of me expects an eventual response like "thanks for reporting this issue, we will consider fixing it in a future version of SQL Server" :)

My more positive side is hoping for something a bit better than that, though they are often reluctant to fix things that affect existing query plans in unpredictable ways.  Maybe it will get a CU fix and become incorporated in TF4199.

Paul

December 6, 2011 8:42 AM
 

Gianluca Sartori said:

Thanks Paul, now I understand why I kept getting wrong row count estimations on some complex queries that match the pattern you described.

It takes great internals knowledge to understand exactly what causes the bug. Mere mortals like me just rewrite the query and keep going. :-)

I hope MS will fix this one shortly.

Gianluca

December 7, 2011 4:45 AM
 

Audrey Hammonds said:

Paul,

Brilliant post, and great catch! I ran into an issue just last week where I couldn't get my estimated row count to look right, and I couldn't figure out what was going on.  Now I'm going to go see if this is the issue.  

Was able to reproduce the bug on my local AdventureWorks in no time at all.  Connect item voted up and reproduction noted.  Thanks so much for raising the issue and for using it to teach us all a thing or three.  

--Audrey

December 7, 2011 9:54 AM
 

JRStern said:

A little off-topic, but NUMA doesn't have anything to do with how many processors SQL Server heuristically estimates are available?

March 18, 2013 1:26 PM
 

Paul White said:

Hi Mr Stern,

No, not directly - estimated DOP is pretty much always potentially-available-DOP / 2. There is a little logic that prefers to schedule threads within the same NUMA node if possible, but that's a different thing.

April 17, 2013 7:32 PM
