It’s a curious thing about SQL that the SUM or AVG of no items is not zero, it’s NULL. In this post, you’ll see how this means your SUM and AVG calculations might run at half speed, or worse. As with most of my blog entries though, today’s instalment is not so much about the result, but the journey we take to get there.
Before we get started on that, I just want to mention that there’s a problem with the Google Reader feed for this blog, so those of you that use that will have missed two recent entries: Seeking Without Indexes and Advanced TSQL Tuning: Why Internals Knowledge Matters. Accessing the site directly always works of course :)
Ok, on to today’s story. Take a look at this query:
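-- A representative reconstruction – the AdventureWorks table and
-- column names are assumed from the discussion that follows
SELECT AVG(P.ProductID)
FROM Production.Product AS P
WHERE 1 = 0;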
Notice the explicit NULL in the resulting query plan. The optimizer can see that this query contains a contradiction (1=0) so there’s no need to access the table at all. The query will always compute an average over an empty set of rows, and the answer will always be NULL. The Constant Scan operator constructs a row in memory without accessing any base tables, or needing to acquire any locks. Let’s look at a less trivial example now.
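In sketch form (the table is again AdventureWorks Production.Product, inferred from the index names discussed next), the new query is a straight AVG over the whole table:

SELECT AVG(P.ProductID)
FROM Production.Product AS P;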
Following the flow of data from right to left, the first thing to notice is that the optimizer has decided that the cheapest way to find all the ProductID values from the table is to scan the non-clustered index AK_Product_rowguid. As the index name suggests, this is a non-clustered index on the rowguid column. Because ProductID forms the unique clustered index for the table, the non-clustered index also includes the ProductID keys at the leaf level of the index. The non-clustered index is much narrower than the clustered index (because the clustered index contains all in-row table columns at its leaf level), so all the ProductID values in the table are stored more densely in the non-clustered index. The higher density (and smaller size) of the non-clustered index makes it the lowest-cost access method when all we need are ProductID values.
The output from the Index Scan operator is a stream of rows containing a single column – ProductID. The Stream Aggregate consumes all the rows from the Index Scan to compute two values: the number of rows encountered (COUNT(*)) and the sum of the ProductID values (SUM(ProductID)). The output from the Stream Aggregate is a single row with two columns: the result of the count is labelled [Expr1011] and the result of the sum is [Expr1012], though these labels may be different for you.
The final step, computing the average of the ProductID values, is performed by the Compute Scalar. It computes a single expression ([Expr1003]) using a CASE expression, which first checks whether the count of rows ([Expr1011]) is zero. If it is, the result is defined as NULL; otherwise, the average is computed as the sum divided by the count – [Expr1012] / [Expr1011]. Notice that the guaranteed left-to-right evaluation order of the CASE expression ensures that a divide-by-zero error can never occur.
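In T-SQL terms, the Compute Scalar’s definition corresponds to a CASE expression of the following shape. The runnable query underneath is a hand-written equivalent of AVG(P.ProductID), included here just to illustrate the logic:

-- The Compute Scalar defines roughly:
--   [Expr1003] = CASE WHEN [Expr1011] = 0 THEN NULL
--                     ELSE [Expr1012] / [Expr1011] END
-- A hand-written equivalent of AVG(P.ProductID):
SELECT avg_by_hand =
    CASE
        WHEN COUNT_BIG(*) = 0 THEN NULL
        ELSE SUM(P.ProductID) / COUNT_BIG(*)
    END
FROM Production.Product AS P;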
The plan for the same query using SUM instead of AVG is very similar:
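That query is just the SUM form of the sketch above:

SELECT SUM(P.ProductID)
FROM Production.Product AS P;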
The only difference is in the Compute Scalar: the ELSE clause now returns the SUM (if any rows were counted), rather than the average. The interesting thing about both cases (SUM and AVG) is that the optimizer produces a plan which counts the rows as well as summing them. In both cases, the count is needed so the Compute Scalar can produce a NULL if no rows are aggregated.
One might think that it would be more efficient for the Stream Aggregate to just set a bit flag indicating whether zero or more rows were seen, rather than counting all of them, but that is not the way the product works today. The question is, does this extra counting operation add any significant overhead to the query execution? The answer will surprise you – it isn’t as simple as it seems. To demonstrate the effect, we will need a larger table than the AdventureWorks database can provide.
USE master;
GO
-- Drop any previous copy of the test database
IF DB_ID(N'SumTest') IS NOT NULL
BEGIN
    ALTER DATABASE SumTest SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE SumTest;
END;
GO
CREATE DATABASE SumTest;
GO
-- Fix the data and log file sizes up front
ALTER DATABASE SumTest
MODIFY FILE
(
    NAME = N'SumTest',
    SIZE = 1GB,
    MAXSIZE = 1GB
);
ALTER DATABASE SumTest
MODIFY FILE
(
    NAME = N'SumTest_log',
    SIZE = 256MB,
    MAXSIZE = 256MB
);
GO
ALTER DATABASE SumTest SET ALLOW_SNAPSHOT_ISOLATION OFF;
ALTER DATABASE SumTest SET AUTO_CLOSE OFF;
ALTER DATABASE SumTest SET AUTO_SHRINK OFF;
ALTER DATABASE SumTest SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE SumTest SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE SumTest SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
ALTER DATABASE SumTest SET PARAMETERIZATION SIMPLE;
ALTER DATABASE SumTest SET READ_COMMITTED_SNAPSHOT OFF;
ALTER DATABASE SumTest SET RECOVERY SIMPLE;
ALTER DATABASE SumTest SET RESTRICTED_USER;
GO
USE SumTest;
GO
CREATE TABLE dbo.Test
(
    id BIGINT IDENTITY (1,1) NOT NULL,
    padding CHAR(7999) NOT NULL,

    CONSTRAINT [PK dbo.Test (id)]
        PRIMARY KEY CLUSTERED (id)
);
GO
-- Load 100,000 rows (minimally-logged in 2008 onward)
INSERT INTO dbo.Test WITH (TABLOCK)
    (padding)
SELECT TOP (100000)
    padding = REPLICATE(CHAR(65 + (Data.n % 26)), 7999)
FROM
(
    SELECT TOP (100000)
        n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
    FROM master.sys.columns C1,
         master.sys.columns C2
) AS Data;
GO
That script creates a new database containing a single table with 100,000 very wide rows. The id column is a BIGINT IDENTITY, and the padding column is defined as CHAR(7999), so each row occupies essentially a full 8KB data page.
To make the test CPU-intensive, we’ll calculate CHECKSUM values for the REVERSE of the strings in the padding column, and SUM the result. The natural way to write that query follows. The convert to BIGINT is just to avoid an arithmetic overflow.
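-- The natural form of the query (MAXDOP 1 keeps execution on a
-- single thread, matching the CPU timings quoted below)
SELECT SUM(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding))))
FROM dbo.Test AS T
OPTION (MAXDOP 1);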
The plan is essentially the same as the ones seen previously, with the addition of an extra Compute Scalar to calculate the new processing-intensive expression. This query runs for around 20 seconds, even with the table data entirely in memory. If we run a similar query that uses a MIN or MAX aggregate instead of SUM, the query returns results in 10 seconds – half the time:
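-- The MAX version, for comparison
SELECT MAX(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding))))
FROM dbo.Test AS T
OPTION (MAXDOP 1);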
This is a very similar plan, but without the final Compute Scalar (the one that decides whether to return NULL or not) seen in the SUM and AVG plans. Some people might compare the MAX and SUM queries and conclude that the difference in execution time is down to the missing Compute Scalar. That would be wrong because the 10 seconds of extra CPU time cannot possibly be caused by a simple scalar computation operating on the single row present at that stage of the plan.
In fact, the problem is in the Stream Aggregate, which in the SUM plan is computing two expressions:
[Expr1009] = COUNT_BIG([Expr1003])
[Expr1010] = SUM([Expr1003])
[Expr1003] is our compute-intensive calculation, contained in the preceding Compute Scalar. You would think that by the time the Stream Aggregate processes a row, the preceding Compute Scalar has already assigned a result value to [Expr1003] for that row, right? Wrong! Compute Scalars are implemented a little differently from other plan operators. In general, a Compute Scalar operator does not perform the calculation as rows flow through it: the work is deferred until a later plan operator needs the value. In most cases, this is a useful optimization because rows may be eliminated by a later join or filter before the result value is really needed.
In this case, the Stream Aggregate is the first downstream operator to need the value of [Expr1003] – and it references it twice: once for the count, and once for the sum. The shocking truth is that the plan evaluates the whole CHECKSUM(REVERSE(padding)) expression twice – and one of those times it is just counting rows so it knows whether to return NULL later on or not.
We can verify this is the case by writing the query a little differently, to ensure that the expensive expression is calculated once before the Stream Aggregate receives the row:
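-- A sketch of the rewrite (the alias and column names are illustrative)
SELECT SUM(Calc.checksum_value)
FROM dbo.Test AS T
OUTER APPLY
(
    SELECT checksum_value = CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))
) AS Calc
OPTION (MAXDOP 1);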
Now, the expensive expression is calculated in a Constant Scan operator rather than a Compute Scalar. The Stream Aggregate still references the expression twice, once for the COUNT and once for the SUM, but it is counting and summing the result of the calculation, not causing the calculation to be performed twice. This query form produces the same result as the 20-second query, but in 10 seconds. Both tests push a single CPU to 100% utilization, in case you were wondering about that. To really make the point, consider this query:
SELECT
    SUM(CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))),
    SUM(CONVERT(BIGINT, 0 + CHECKSUM(REVERSE(T.padding))))
FROM dbo.Test AS T
OPTION (MAXDOP 1);
In the resulting plan, the Stream Aggregate references the CHECKSUM(REVERSE(padding)) expression four times – a count and a sum for each of the two aggregates – and the query runs for 40 seconds at 100% CPU! (The 0 + in the second column simply stops the optimizer from recognising the two expressions as identical and sharing a single computation.)
You might also wonder why the rewrite uses an OUTER APPLY instead of a CROSS APPLY. The reason is that with a CROSS APPLY (sketched after the list below), the optimizer sees through the trick and transforms the query plan back to exactly the same form as before – and the query runs for 20 seconds as a result. The OUTER APPLY is enough to defeat the optimizer transformation rule, resulting in a plan featuring a Left Outer Join. Before you go off to rewrite all your SUMs and AVGs using OUTER APPLY, consider:
- The OUTER APPLY trick is a hack that depends on implementation details – it might not work tomorrow.
- This problem only occurs if the first operator to need an expression result references it more than once.
- This bug is only noticeable if the expression is expensive.
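For reference, this is the CROSS APPLY form that the optimizer transforms back into the original 20-second plan (again, the alias names are illustrative):

SELECT SUM(Calc.checksum_value)
FROM dbo.Test AS T
CROSS APPLY
(
    SELECT checksum_value = CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))
) AS Calc
OPTION (MAXDOP 1);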
So this is a confirmed bug in SQL Server, and the good news is that it has been fixed – but only in a private build of Denali (SQL11). The current public Denali build (CTP 1) does not contain the fix, and there is no word on whether it will be made available for users of earlier SQL Server versions.
Finally for today, I want to show you another good reason not to use the OUTER APPLY pattern as a fix. Let’s assume we don’t know about this problem, and need to optimize the original CHECKSUM(REVERSE(padding)) query. The obvious solution is to add a computed column, and index it:
-- Add the computed column (not persisted; the column name is illustrative)
ALTER TABLE dbo.Test
ADD expr1003 AS CONVERT(BIGINT, CHECKSUM(REVERSE(padding)));
GO
-- Index it
CREATE NONCLUSTERED INDEX [IX dbo.Test Expr1003]
ON dbo.Test (expr1003)
WITH (FILLFACTOR = 100, MAXDOP = 1, SORT_IN_TEMPDB = ON);
GO
Notice in passing that the new computed column is not persisted, so it doesn’t require any extra storage in the table itself, and does not fragment the clustered index. Re-running the natural form of the query gives a very different result.
It now completes in 30ms – quite an improvement over 20,000ms! The Stream Aggregate still performs the two SUM and COUNT operations, but now the expression result is available directly from the index – it isn’t calculated at all, let alone twice.
If we re-run the OUTER APPLY query, we find that it does not use the index and uses exactly the same plan as it did previously, again taking 10 seconds to run. Ironically, it fails to take advantage of the index for the same reason it previously performed better: the trick prevents the optimizer from matching the computed expression to the index. Attempting to force the use of the index results in a very unhappy plan:
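The hinted query looks something like this (using the index name from the script above):

SELECT SUM(Calc.checksum_value)
FROM dbo.Test AS T
    WITH (INDEX([IX dbo.Test Expr1003]))
OUTER APPLY
(
    SELECT checksum_value = CONVERT(BIGINT, CHECKSUM(REVERSE(T.padding)))
) AS Calc
OPTION (MAXDOP 1);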
The optimizer still can’t make the transformation to use the pre-computed value from our index – but it honours the hint in the only way it can, by using the index to retrieve id column values (the unique clustering key included in the non-clustered index). It then performs a bookmark lookup for every row into the clustered index to retrieve the padding column, and finally performs the expensive calculation once per row in the Constant Scan.
The Sort operator is there to ‘optimize’ the bookmark lookups: by pre-sorting on the clustering key (id) it hopes to turn the normally random I/O associated with a bookmark lookup into sequential I/O. With the data all in memory, this plan still executes in ten seconds, but if the data needed to be fetched in from disk…well you try it.
So, how important is this bug? Well, it depends. You would have to take quite a deep look into each one of your production query plans to see how often a plan operator references a Compute Scalar expression more than once, where the expression is expensive to compute. That’s probably not practical (though you could write an XQuery to search the plan cache for examples I suppose) but it is something to be aware of, just in case you come across a query that performs much slower than you might expect.
Anyway, the point of this post, again, is that deep internals knowledge can come in very useful!
© 2011 Paul White