
Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

SQL Server Bug: Slow T-SQL Sums and Averages

It’s a curious thing about SQL that the SUM or AVG of no items is not zero, it’s NULL.  In this post, you’ll see how this means your SUM and AVG calculations might run at half speed, or worse.  As with most of my blog entries though, today’s instalment is not so much about the result as about the journey we take to get there.

Before we get started on that, I just want to mention that there’s a problem with the Google Reader feed for this blog, so those of you who use it will have missed two recent entries: Seeking Without Indexes and Advanced TSQL Tuning: Why Internals Knowledge Matters.  Accessing the site directly always works, of course :)

Ok, on to today’s story.  Take a look at this query:

[Figure: SUM Of None Is NULL]
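The query itself appears only in the screenshot, so here is a reconstruction of the kind of statement being shown (the AdventureWorks table and column names are an assumption, not taken from the original image):

-- Assumed reconstruction: a scalar aggregate over a provably empty set
SELECT  AVG(P.ProductID)
FROM    Production.Product AS P
WHERE   1 = 0   -- contradiction: no row can ever qualify, so the result is NULL
;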

Notice the explicit NULL in the plan.  The optimizer can see that this query contains a contradiction (1=0) so there’s no need to access the table at all.  The query will always compute an average over an empty set of rows, and the answer will always be NULL.  The Constant Scan operator constructs a row in memory without accessing any base tables or needing to acquire any locks.  Let’s look at a less trivial example now.

[Figure: SUM Of None Is NULL 2]
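As before, the query is only visible in the image; it presumably has this general shape (a reconstruction, again assuming AdventureWorks):

-- Assumed reconstruction: scalar AVG over every row in the table
SELECT  AVG(P.ProductID)
FROM    Production.Product AS P
;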

Following the flow of data from right to left, the first thing to notice is that the optimizer has decided that the cheapest way to find all the ProductID values from the table is to scan the non-clustered index AK_Product_rowguid.  As the index name suggests, this is a non-clustered index over the rowguid column. Because ProductID forms the unique clustered index for the table, the non-clustered index also includes the ProductID keys at the leaf level of the index.  The non-clustered index is much narrower than the clustered index (because the cluster contains all in-row table columns at the leaf level) so all ProductID values in the table are stored more densely in the non-clustered index.  The higher density (and smaller size) of the non-clustered index means it is the lowest cost access method if all we need are ProductID values.

The output from the Index Scan operator is a stream of rows containing a single column – ProductID.  The Stream Aggregate consumes all the rows from the Index Scan to compute two values: the number of rows encountered (Count(*)); and the SUM of the same values (SUM(ProductID)).  The output from the Stream Aggregate is a single row with two columns: the result of the count is labelled [Expr1011] and the result of the sum is [Expr1012], though these labels may be different for you.

The final step to compute the average of ProductID values is performed by the Compute Scalar.  It computes a single expression ([Expr1003]) using a CASE expression, which first checks to see if the count of rows ([Expr1011]) is zero.  If it is zero, the result is defined as NULL; otherwise the average is computed as the SUM divided by the COUNT – [Expr1012] / [Expr1011].  Notice that the left-to-right evaluation guarantee of the CASE expression ensures that a divide-by-zero error never occurs.
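Written out as T-SQL, the Compute Scalar’s logic is equivalent to something like the following paraphrase (the local variables stand in for the plan expressions described above; this is not the literal text of the plan expression):

-- Paraphrase of the Compute Scalar's CASE expression for AVG
DECLARE @Expr1011 BIGINT = 0,       -- row count from the Stream Aggregate
        @Expr1012 INT    = NULL;    -- SUM(ProductID) from the Stream Aggregate

SELECT  [Expr1003] =
        CASE
            WHEN @Expr1011 = 0 THEN NULL        -- empty set: the result is NULL
            ELSE @Expr1012 / @Expr1011          -- otherwise SUM divided by COUNT
        END
;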

The plan for the same query using SUM instead of AVG is very similar:

[Figure: SUM Of None Is NULL 3]
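For reference, the SUM form of the query would look like this (reconstructed):

-- Assumed reconstruction: scalar SUM over every row in the table
SELECT  SUM(P.ProductID)
FROM    Production.Product AS P
;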

The only difference is in the Compute Scalar: the ELSE clause now returns the SUM (if any rows were counted), rather than the average.  The interesting thing about both cases (SUM and AVG) is that the optimizer produces a plan which counts the rows as well as summing them.  In both cases, the count is needed so the Compute Scalar can produce a NULL if no rows are aggregated.

One might think that it would be more efficient for the Stream Aggregate to just set a bit flag indicating whether zero or more rows were seen, rather than counting all of them, but that is not the way the product works today.  The question is, does this extra counting operation add any significant overhead to the query execution?  The answer will surprise you – it isn’t as simple as it seems.  To demonstrate the effect, we will need a larger table than the AdventureWorks database can provide.

USE     master;
GO
IF      DB_ID(N'SumTest')
        IS NOT NULL
BEGIN
        ALTER DATABASE SumTest SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
        DROP DATABASE SumTest;
END;
GO
CREATE  DATABASE SumTest
        COLLATE LATIN1_GENERAL_CS_AS;
GO
ALTER   DATABASE SumTest
        MODIFY FILE
        (
        NAME = N'SumTest',
        SIZE = 1GB,
        MAXSIZE = 1GB
        );
GO
ALTER   DATABASE SumTest
        MODIFY FILE
        (
        NAME = N'SumTest_log',
        SIZE = 256MB,
        MAXSIZE = 256MB
        );
GO
USE     SumTest;
GO
ALTER   DATABASE SumTest SET ALLOW_SNAPSHOT_ISOLATION OFF;
ALTER   DATABASE SumTest SET AUTO_CLOSE OFF;
ALTER   DATABASE SumTest SET AUTO_SHRINK OFF;
ALTER   DATABASE SumTest SET AUTO_CREATE_STATISTICS ON;
ALTER   DATABASE SumTest SET AUTO_UPDATE_STATISTICS ON;
ALTER   DATABASE SumTest SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
ALTER   DATABASE SumTest SET PARAMETERIZATION SIMPLE;
ALTER   DATABASE SumTest SET READ_COMMITTED_SNAPSHOT OFF;
ALTER   DATABASE SumTest SET RECOVERY SIMPLE;
ALTER   DATABASE SumTest SET RESTRICTED_USER;
GO
CREATE  TABLE dbo.Test 
        (
        id          BIGINT IDENTITY (1,1) NOT NULL,
        padding     CHAR(7999) NOT NULL,
        
        CONSTRAINT [PK dbo.Test (id)]
            PRIMARY KEY CLUSTERED (id),
        )
;
GO
-- Minimally-logged in 2008 onward
INSERT  INTO dbo.Test
        WITH (TABLOCKX)
        (
        padding
        )
SELECT  TOP (100000)
        padding = REPLICATE(CHAR(65 + (Data.n % 26)), 7999)
FROM    (
        SELECT  TOP (100000)
                n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
        FROM    master.sys.columns C1,
                master.sys.columns C2,
                master.sys.columns C3
        ORDER   BY 
                n ASC
        ) AS Data
ORDER   BY
        Data.n ASC
;

That script creates a new database containing a single table with 100,000 very wide rows.  The id column is BIGINT IDENTITY, and the padding column is defined as CHAR(7999):

[Figure: Sample Data]

To make the test CPU-intensive, we’ll calculate CHECKSUM values for the REVERSE of the strings in the padding column, and SUM the result.  The natural way to write that query follows.  The convert to BIGINT is just to avoid an arithmetic overflow.

[Figure: SUM Test 1]
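The query is shown in the screenshot; the natural form described in the text is along these lines (a reconstruction; the MAXDOP 1 hint is an assumption, added to match the single-threaded tests later in the post):

-- Natural form: SUM an expensive per-row expression over the whole table
SELECT  SUM(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding))))
FROM    dbo.Test AS T
OPTION  (MAXDOP 1)
;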

The plan is essentially the same as the ones seen previously, with the addition of an extra Compute Scalar to calculate the new processing-intensive expression.  This query runs for around 20 seconds, even with the table data entirely in memory.  If we run a similar query that uses a MIN or MAX aggregate instead of SUM, the query returns results in 10 seconds – half the time:

[Figure: SUM Test 2]
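The MIN/MAX comparison query would be something like this (again a reconstruction; no BIGINT conversion is needed here because MAX cannot overflow the way SUM can):

-- MAX version: the aggregate only needs the expression itself, with no
-- separate row count to decide about a NULL return
SELECT  MAX(CHECKSUM(REVERSE(T.padding)))
FROM    dbo.Test AS T
OPTION  (MAXDOP 1)
;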

This is a very similar plan, but without the final Compute Scalar (the one that decides whether to return NULL or not) seen in the SUM and AVG plans.  Some people might compare the MAX and SUM queries and conclude that the difference in execution time is down to the missing Compute Scalar.  That would be wrong because the 10 seconds of extra CPU time cannot possibly be caused by a simple scalar computation operating on the single row present at that stage of the plan.

In fact, the problem is in the Stream Aggregate, which in the SUM plan is computing two expressions:

[Expr1009] = COUNT_BIG([Expr1003])
[Expr1010] = SUM([Expr1003])

[Expr1003] is our compute-intensive calculation, contained in the preceding Compute Scalar.  You would think that by the time the Stream Aggregate processes a row, the preceding Compute Scalar has already assigned a result value to [Expr1003] for that row, right?  Wrong!  Compute Scalars are implemented a little differently from other plan operators.  In general, a Compute Scalar operator does not perform the calculation as rows flow through it: the work is deferred until a later plan operator needs the value.  In most cases, this is a useful optimization because rows may be eliminated by a later join or filter before the result value is really needed.

In this case, the Stream Aggregate is the first downstream operator to need the value of [Expr1003] – and it references it twice: once for the count, and once for the sum.  The shocking truth is that the plan evaluates the whole CHECKSUM(REVERSE(padding)) expression twice – and one of those times it is just counting rows so it knows whether to return NULL later on or not.

We can verify this is the case by writing the query a little differently, to ensure that the expensive expression is calculated once before the Stream Aggregate receives the row:

[Figure: SUM Test 3]
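The rewrite in the screenshot has this general shape (a sketch; the apply alias and column name are assumptions):

-- Sketch of the OUTER APPLY rewrite: the expensive expression is evaluated
-- inside the apply (a Constant Scan in the plan) before the Stream Aggregate
SELECT  SUM(CONVERT(BIGINT, CA.checksum_value))
FROM    dbo.Test AS T
OUTER APPLY
        (
        SELECT  checksum_value = CHECKSUM(REVERSE(T.padding))
        ) AS CA
OPTION  (MAXDOP 1)
;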

Now, the expensive expression is calculated in a Constant Scan operator rather than a Compute Scalar.  The Stream Aggregate still references the expression twice, once for the COUNT and once for the SUM, but it is counting and summing the result of the calculation, not causing the calculation to be performed twice.  This query form produces the same result as the 20-second query, but in 10 seconds.  Both tests push a single CPU to 100% utilization, in case you were wondering about that.  To really make the point, consider this query:

SELECT  SUM(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))),
        SUM(CONVERT(BIGINT, 0 + CHECKSUM(REVERSE(T.padding))))
FROM    dbo.Test AS T
OPTION  (MAXDOP 1)
;

In the resulting plan, the Stream Aggregate references the CHECKSUM(REVERSE(padding)) expression four times, and the query runs for 40 seconds at 100% CPU!

You might also wonder why the rewrite uses an OUTER APPLY instead of CROSS APPLY.  The reason is that with a CROSS APPLY, the optimizer sees through the trick and transforms the query plan to exactly the same form as before – and the query runs for 20 seconds as a result.  The OUTER APPLY is enough to defeat an optimizer transformation rule, resulting in a plan featuring a Left Outer Join.  Before you go off to rewrite all your SUMs and AVGs using OUTER APPLY, consider:

  1. The OUTER APPLY trick is a hack that depends on implementation details – it might not work tomorrow.
  2. This problem only occurs if the first operator to need an expression result references it more than once.
  3. This bug is only noticeable if the expression is expensive.

So this is a confirmed bug in SQL Server, and the good news is that it has been fixed – but only in a private build of Denali (SQL11).  The current public Denali build (CTP 1) does not contain the fix, and there is no word on whether it will be made available for users of earlier SQL Server versions.

Finally for today, I want to show you another good reason not to use the OUTER APPLY pattern as a fix.  Let’s assume we don’t know about this problem, and need to optimize the original CHECKSUM(REVERSE(padding)) query.  The obvious solution is to add a computed column, and index it:

ALTER   TABLE 
        dbo.Test 
ADD     [Expr1003] 
        AS CHECKSUM(REVERSE(padding))
;
-- Index it
CREATE  INDEX 
        [IX dbo.Test Expr1003]
ON      dbo.Test
        (
        [Expr1003]
        )
WITH    (
        FILLFACTOR = 100,
        MAXDOP = 1,
        SORT_IN_TEMPDB = ON
        )
;

Notice in passing that the new computed column is not persisted, so it doesn’t require any extra storage in the table itself, and does not fragment the clustered index.  Re-running the natural form of the query:

[Figure: SUM Test 4]

It now completes in 30ms – quite an improvement over 20,000ms!  The Stream Aggregate still performs the two SUM and COUNT operations, but now the expression result is available directly from the index – it isn’t calculated at all, let alone twice.

If we re-run the OUTER APPLY query, we find that it does not use the index and uses exactly the same plan as it did previously, again taking 10 seconds to run.  Ironically, it fails to take advantage of the index for the same reason it previously performed better: the trick prevents the optimizer from matching the computed expression to the index.  Attempting to force the use of the index results in a very unhappy plan:

[Figure: SUM Test 5]
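The hinted query would look something like this (again a sketch of the OUTER APPLY form from earlier, with an index hint added; the exact syntax used to produce the screenshot is an assumption):

-- Sketch: forcing the nonclustered index on the OUTER APPLY form
SELECT  SUM(CONVERT(BIGINT, CA.checksum_value))
FROM    dbo.Test AS T WITH (INDEX ([IX dbo.Test Expr1003]))
OUTER APPLY
        (
        SELECT  checksum_value = CHECKSUM(REVERSE(T.padding))
        ) AS CA
OPTION  (MAXDOP 1)
;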

The optimizer still can’t make the transformation to use the pre-computed value from our index – but it honours the hint in the only way it can, by using the index to retrieve id column values (the unique clustering key included in the non-clustered index).  It then performs a bookmark lookup for every row into the clustered index to retrieve the padding column, and finally performs the expensive calculation once per row in the Constant Scan.

The Sort operator is there to ‘optimize’ the bookmark lookups: by pre-sorting on the clustering key (id) it hopes to turn the normally random I/O associated with a bookmark lookup into sequential I/O.  With the data all in memory, this plan still executes in ten seconds, but if the data needed to be fetched in from disk…well you try it.

So, how important is this bug?  Well, it depends.  You would have to take quite a deep look into each one of your production query plans to see how often a plan operator references a Compute Scalar expression more than once, where the expression is expensive to compute.  That’s probably not practical (though you could write an XQuery to search the plan cache for examples I suppose) but it is something to be aware of, just in case you come across a query that performs much slower than you might expect.
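As a very rough starting point for that kind of search (this is only a sketch: it lists cached plans that contain both a Compute Scalar and a Stream Aggregate as candidates for manual inspection, rather than implementing the full ‘referenced more than once’ test):

-- Rough sketch only: candidate plans for manual inspection
WITH    XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT  TOP (50)
        cp.usecounts,
        qp.query_plan
FROM    sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE   qp.query_plan.exist('//RelOp[@PhysicalOp="Compute Scalar"]') = 1
AND     qp.query_plan.exist('//RelOp[@PhysicalOp="Stream Aggregate"]') = 1
ORDER   BY cp.usecounts DESC
;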

Anyway, the point of this post, again, is that deep internals knowledge can come in very useful!

© 2011 Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

Published Sunday, February 27, 2011 1:07 PM by Paul White


Comments

 

Alejandro Mesa said:

Paul,

Great post, as always.

There is no doubt about your last statement, and I always encourage database developers to kick it up a notch by learning about the internals as much as they can.

--

AMB

February 27, 2011 2:52 PM
 

Fabiano Amorim said:

So far the best post I read in the year, thanks for that.

February 28, 2011 2:55 PM
 

Fabiano Amorim said:

Paul, I just realized that it only happens with "Scalar Aggregations"... because it has to deal with the NULL return... so the question is, WHY do scalar aggregations have to return NULL, if NULL doesn't mean anything? :-(

If it needs to return something why not return zero?

Do you know if it is a Standard thing?

Thanks

February 28, 2011 3:13 PM
 

Alexander Kuznetsov said:

Very interesting read, Paul, thanks for writing! I was wondering if it was a real life problem that prompted this research?

Regarding Fabiano's observation: "because it has to deal with the NULL return", maybe COALESCE(SUM(...),0) instead of SUM(...) might eliminate this counting, because there is no need to return a NULL?

February 28, 2011 10:31 PM
 

Paul White said:

Hey Fabiano,

Thanks very much - yes, it is in the SQL standard that a SUM of an empty set is NULL.  There's probably a good reason for it, but perhaps it is just a 'convenience' thing, I'm not really sure.

Paul

February 28, 2011 10:55 PM
 

Paul White said:

Hi Alex,

No, it wasn't a real-life problem - it came out of some research I was doing for an article on parallelism.  One test (with MIN or MAX) ran twice as fast as when I used SUM, so I started to look deeper...

You would think that ISNULL or COALESCE would eliminate the unnecessary count, but it doesn't.  This is something I tested at the time - the optimizer does not realize that the count is just there to decide about a NULL return, and simply wraps the final scalar computation with ISNULL (or the CASE expression COALESCE expands to).

As I mentioned in the main text, a better solution might have been for the aggregate to set a flag to true as soon as it encounters the first row.  There's no need to perform a full count - the final Compute Scalar just needs to know if at least one row was seen.

There is no current build of SQL Server that does not suffer from this bug, so it is something to be aware of when using SUM or AVG on a computed expression, especially if it is CPU-intensive.

Thanks for the comment.

Paul

February 28, 2011 11:02 PM
 

Brad Schulz said:

I've been incredibly busy during the last several weeks and have not done any blog reading at all, until today.  Naturally yours was the first one I caught up on, and your last 6 articles are brilliant as always.

Interesting about the Compute Scalar not really executing its expressions as rows flow through it... perhaps this is why estimated query plans always have the thinnest arrow possible flowing out of the left of the Compute Scalar even if the flow coming into its right is huge... it's really just kind of an "adjunct" of the upline operator.

Thanks again for all the incredible content!

Best...

--Brad

March 1, 2011 3:30 PM
 

Kendra Little said:

I always enjoy reading your posts--  I learn not only about the specific topic, but also about how to read plans better.

Your code samples and images are really presented beautifully, by the way. Not only looks good, but makes it easier to follow along.

March 1, 2011 8:35 PM
 

Paul White said:

Hi Brad,

Compute Scalar is a bit of an odd-ball: it doesn't really exist in the executable plan in the same way as more 'real' iterators like Sort or Join, so there are no actual-row statistics for it, and you only ever see the estimated rows inherited from the operator below it in the tree.

Paul

March 2, 2011 4:30 AM
 

Paul White said:

Hey Kendra - I just noticed I forgot to reply to you!  I'm becoming more aware of the need to explain how execution plans work in my posts, so thanks for mentioning that; I'll certainly bear it in mind for the future.

The comments about presentation are especially appreciated - I am still in total awe of the talent you showed in the amazing Isolation Levels broadsheet you did!

Paul

March 3, 2011 7:12 AM
