
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Is Distinct Aggregation Still Considered Harmful?

Back in 2008, Marc Friedman of the SQL Server Query Processor Team wrote a blog entry entitled “Distinct Aggregation Considered Harmful”, in which he shows a way to work around the poor performance that often results simply from adding the keyword DISTINCT to an otherwise perfectly reasonable aggregate function in a query.  This post is an update to that one, presenting a query optimizer enhancement in SQL Server 2012 RC0 that reduces the need to perform the suggested rewrite manually.  First, though, it makes sense to cover some broader aggregation topics.

Query Plan Options for DISTINCT

There are two main strategies to extract distinct values from a stream of rows.  The first option is to keep track of unique values in a hash table.  The second is to sort the incoming rows into groups and return one value from each group.  SQL Server uses the Hash Match operator to implement the hash table option, and Stream Aggregate or Sort Distinct for the sorting option.

Hash Match Aggregate

The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s).  Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism.  Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group).  Hash matching does not require or preserve the order of the incoming row stream.  The query and plan below show a Hash Match Aggregate building a hash table on values in the Quantity column:

SELECT DISTINCT 
    th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

[Query plan image]

The standard Hash Match Aggregate is blocking: it produces no output until it has processed its entire input stream.  If we restrict the number of rows using TOP or SET ROWCOUNT the aggregate can operate in a streaming fashion, producing new unique values as soon as they are encountered.  This streaming mode is known as Flow Distinct, and is activated depending on the estimated number of unique values in the stream.  In the example above, the estimated number of output rows from the Hash Match Aggregate is 455 (based on the statistics on my test machine).  Limiting the query to 455 or fewer rows using TOP or ROWCOUNT produces a Hash Match Aggregate running in Flow Distinct mode:

SET ROWCOUNT 10
 
SELECT DISTINCT 
    th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)
 
SET ROWCOUNT 0

[Query plan image]

This plan is interesting because it limits the output to 10 rows without including a specific TOP operator for that purpose.  TOP is generally preferred to ROWCOUNT for the reasons set out in the Books Online topic “Limiting Result Sets by Using TOP and PERCENT”.  Ignore the part that says “because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query” – we have just seen how it can affect a query plan.

For completeness, this is the TOP query and execution plan:

SELECT DISTINCT TOP (10)
    th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

[Query plan image]

Stream Aggregate

Performing a DISTINCT is the same as applying a GROUP BY on every expression in the SELECT list.  A stream of rows sorted on these GROUP BY expressions has the useful property that all rows with the same GROUP BY values appear together, so all we need to do is output a row of GROUP BY keys each time a change occurs.  If the required order can be obtained without an explicit sort, the query plan can use a Stream Aggregate directly:

SELECT DISTINCT 
    th.ProductID 
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

[Query plan image]

This query plan shows an ordered scan of an index on the ProductID column, followed by a Stream Aggregate with a GROUP BY on the same column.  The Stream Aggregate emits a new row each time a new group is encountered, similar to the Hash Match Aggregate running in Flow Distinct mode.  This operator does not require a memory grant because it only needs to keep track of one set of GROUP BY expression values at a time.
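
Since DISTINCT is just a GROUP BY on every expression in the SELECT list, writing the same query with an explicit GROUP BY should produce an identical plan (a quick equivalence check):

SELECT 
    th.ProductID 
FROM Production.TransactionHistory AS th
GROUP BY
    th.ProductID
OPTION (RECOMPILE)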

Sort Distinct

The third alternative is to perform an explicit Sort followed by a Stream Aggregate.  The optimizer can often collapse this combination into a single Sort operating in Sort Distinct mode:

SELECT DISTINCT
    p.Color
FROM Production.Product AS p 
OPTION (RECOMPILE)

[Query plan image]

To see the original Sort and Stream Aggregate combination, we can temporarily disable the GbAggToSort optimizer rule that performs this transformation:

DBCC RULEOFF('GbAggToSort')
 
SELECT DISTINCT
    p.Color
FROM Production.Product AS p 
OPTION (RECOMPILE)
 
DBCC RULEON('GbAggToSort')

The query plan now shows a regular (non-distinct) Sort followed by the Stream Aggregate:

[Query plan image]

Distinct Aggregates

The DISTINCT keyword is most commonly used with the COUNT and COUNT_BIG aggregate functions, though it can be specified with a variety of built-in and CLR aggregates.  The interesting thing is that SQL Server always processes distinct aggregates by first performing a DISTINCT (using any of the three methods shown previously) and then applying the aggregate (for example COUNT) as a second step:

SELECT 
    COUNT_BIG(DISTINCT p.Color) 
FROM Production.Product AS p 
OPTION (RECOMPILE)

[Query plan image]

One thing worth highlighting is that COUNT DISTINCT does not count NULLs.  The previous query that listed the distinct colours from the Product table produced 10 rows (9 colours and one NULL), but the COUNT DISTINCT query here returns the value 9.  The query plan uses a Distinct Sort to perform the DISTINCT (which includes the NULL group) and then counts the groups using a Stream Aggregate.  The aggregate expression uses COUNT(expression) rather than COUNT(*), which eliminates the NULL group produced by the Distinct Sort.  This second example shows a Hash Match Aggregate performing the DISTINCT, followed by a Stream Aggregate to count the non-NULL groups:

SELECT
    COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

[Query plan image]
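
As a quick, self-contained check of the NULL handling described above, here is a minimal sketch using a table value constructor (the alias v is mine, purely for illustration):

SELECT
    COUNT_BIG(*) AS all_rows,
    COUNT_BIG(v.x) AS non_null_x,
    COUNT_BIG(DISTINCT v.x) AS distinct_non_null_x
FROM (VALUES (1), (1), (2), (NULL)) AS v (x)

This returns 4, 3 and 2 respectively: COUNT_BIG(*) counts every row, COUNT_BIG(x) ignores the NULL, and COUNT_BIG(DISTINCT x) ignores both the NULL and the duplicate.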

So far, the counting step has always been performed by a Stream Aggregate.  This is because it has been a scalar aggregate – an aggregate without a GROUP BY clause – and SQL Server always implements scalar aggregates using Stream Aggregate (there would be no point using hashing since there will only be one group by definition).  If we add a GROUP BY clause, the final COUNT is no longer a single (scalar) result, so we can get a plan with two Hash Match Aggregates:

SELECT
    th.ProductID,
    COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
GROUP BY
    th.ProductID
OPTION (RECOMPILE)

[Query plan image]

In this plan, the rightmost aggregate is performing the DISTINCT, and the leftmost one implements the COUNT (with GROUP BY).

Combining Aggregates

The decision to always implement distinct aggregates as separate DISTINCT and aggregate steps makes life easier for the optimizer.  Separating the two operations makes it easier to plan and cost parallelism, partial aggregation, combining similar aggregates, moving aggregates around (for example pushing an aggregate below a join) and so on.  On the downside, it creates problems for queries that include multiple distinct aggregates, or combine a single distinct aggregate with an ‘ordinary’ aggregate.

Multiple Aggregates

As you probably know, there is no problem combining multiple ordinary aggregations into a single operator:

SELECT
    COUNT_BIG(*),
    MAX(th.ActualCost),
    STDEV(th.Quantity)
FROM Production.TransactionHistory AS th
GROUP BY
    th.TransactionType

[Query plan image]

The single Hash Match Aggregate operator computes the three separate aggregates concurrently on the incoming stream.

Multiple Distinct Aggregates

This next query does a similar thing, but with three DISTINCT aggregations on the stream:

SELECT 
    COUNT_BIG(DISTINCT th.ProductID), 
    COUNT_BIG(DISTINCT th.TransactionDate), 
    COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

It produces a more complex query plan:

[Query plan image]

The issue here is that once a DISTINCT has been performed on a stream, the stream no longer contains the columns necessary to perform the other DISTINCT operations.  To work around this, the optimizer builds a plan that reads the source stream once per DISTINCT, computes the aggregates separately, and then combines the results using a cross join (which is safe because these are all scalar aggregates, guaranteed to produce one row each).  The same basic pattern is employed if the query contains an outer GROUP BY clause, but instead of cross joins there will be inner joins on the GROUP BY columns.
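
To make that shape concrete, a hand-written query in roughly the same form as the plan might look like the sketch below (the derived table aliases are mine; the optimizer builds this pattern internally rather than requiring rewritten T-SQL):

SELECT
    p.c,
    d.c,
    q.c
FROM 
    (SELECT COUNT_BIG(DISTINCT th.ProductID) AS c 
     FROM Production.TransactionHistory AS th) AS p
CROSS JOIN 
    (SELECT COUNT_BIG(DISTINCT th.TransactionDate) AS c 
     FROM Production.TransactionHistory AS th) AS d
CROSS JOIN 
    (SELECT COUNT_BIG(DISTINCT th.Quantity) AS c 
     FROM Production.TransactionHistory AS th) AS q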

More often than not, the source of rows will not be an unrestricted table scan.  Where the source is complex (and therefore expensive to re-run for every distinct aggregate) or where a filter significantly reduces the number of rows, the query optimizer may choose to Eager Spool the source rows and replay them once per distinct aggregate:

SELECT 
    COUNT_BIG(DISTINCT th.ProductID), 
    COUNT_BIG(DISTINCT th.TransactionDate), 
    COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
WHERE
    th.ActualCost < $5
GROUP BY
    th.TransactionType
OPTION (RECOMPILE)

[Query plan image]

This is a plan shape you are likely to encounter in the real world, since most queries have a filtering condition or a row source more complex than a simple scan of an index or table.  For queries over larger tables than Adventure Works provides, this plan is likely to perform poorly, as you might expect.  Aside from the obvious concerns, inserting rows into a spool has to be performed on a single thread (like all data modification operations in today’s SQL Server).  Another limitation is that the spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams).  In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.

Performing Multiple Distinct Aggregates Concurrently

If SQL Server were able to perform the whole COUNT DISTINCT aggregate in a single operator (instead of splitting the DISTINCT and COUNT into two steps) there would be no reason to split plans with spools as seen above.  This could not be done with a Stream Aggregate since that operator requires the stream to be sorted on the DISTINCT expression, and it is not possible to sort a single stream in more than one way at the same time.

On the other hand, the Hash Match Aggregate does not require sorted input (it keeps the distinct values in a hash table, remember), so it should be possible to design a Hash Match Aggregate that computes COUNT DISTINCT in a single operation.  We can test this idea with a User-Defined Aggregate (UDA) – SQL Server 2008 required:

using System;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;
 
[SqlUserDefinedAggregate
    (
    Format.UserDefined,
    IsInvariantToDuplicates = true,
    IsInvariantToNulls = true,
    IsInvariantToOrder = true,
    IsNullIfEmpty = false,
    MaxByteSize = -1
    )
]
public struct CountDistinctInt : IBinarySerialize
{
    // The hash table
    private Dictionary<int, object> dict;
 
    // Recreate the hash table for each new group
    public void Init()
    {
        dict = new Dictionary<int, object>();
    }
 
    // Ignore NULLs, store key values in the hash table
    public void Accumulate(SqlInt32 Data)
    {
        if (!Data.IsNull)
        {
            dict[Data.Value] = null;
        }
    }
 
    // Merge partial aggregates
    public void Merge(CountDistinctInt Group)
    {
        foreach (var item in Group.dict.Keys)
        {
            dict[item] = null;
        }
    }
 
    // Return the DISTINCT COUNT result
    public int Terminate()
    {
        return dict.Count;
    }
 
    // Required by SQL Server to serialize this object
    void IBinarySerialize.Write(BinaryWriter w)
    {
        w.Write(dict.Count);
 
        foreach (var item in dict.Keys)
        {
            w.Write(item);
        }
    }
 
    // Required by SQL Server to deserialize this object
    void IBinarySerialize.Read(BinaryReader r)
    {
        var recordCount = r.ReadInt32();
        dict = new Dictionary<int, object>(recordCount);
 
        for (var i = 0; i < recordCount; i++)
        {
            dict[r.ReadInt32()] = null;
        }
    }
}

This UDA does nothing more than create a hash table, add (non-NULL) values to it, and return the count of values when aggregation is complete (the hash table takes care of duplicates).  There is a little extra infrastructure code in there to allow SQL Server to serialize and reconstruct the hash table when needed (and merge partial aggregates), but the core of the function is just four lines of code.  The example above only aggregates integer values, but it is easy to extend the idea to other types.
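
Before the aggregates can be called from T-SQL, the compiled assembly and each UDA must be registered.  A minimal sketch follows; the assembly name and file path are my assumptions, as is the existence of a CountDistinctDateTime struct written on the same pattern as the integer version:

CREATE ASSEMBLY CountDistinct
FROM N'C:\Temp\CountDistinct.dll'
WITH PERMISSION_SET = SAFE
GO
 
CREATE AGGREGATE dbo.CountDistinctInt (@Data int)
RETURNS int
EXTERNAL NAME CountDistinct.CountDistinctInt
GO
 
-- Assumed: a datetime-keyed version of the same struct in the assembly
CREATE AGGREGATE dbo.CountDistinctDateTime (@Data datetime)
RETURNS int
EXTERNAL NAME CountDistinct.CountDistinctDateTime
GO

Anyway, armed with the integer and datetime versions of the UDA, I now return to the multiple-distinct-count query that caused all the spooling, with COUNT DISTINCT replaced by UDA references: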

SELECT 
    dbo.CountDistinctInt(th.ProductID), 
    dbo.CountDistinctDateTime(th.TransactionDate), 
    dbo.CountDistinctInt(th.Quantity)
FROM Production.TransactionHistory AS th
WHERE
    th.ActualCost < $5
GROUP BY
    th.TransactionType
OPTION (RECOMPILE)

Instead of all the Eager Spools, we now get this query plan:

[Query plan image]

You may be surprised to see that the three distinct count aggregates are being performed by a Stream Aggregate; after all, I just finished explaining why a Stream Aggregate could not possibly do what we want here.  The thing is that all CLR UDAs are interfaced to query plans using the Stream Aggregate model.  The fact that this UDA uses a hash table internally does not change that.

The Sort in this plan is there to ensure that groups of rows arrive at the ‘Stream Aggregate’ interface in the required GROUP BY order.  This is so SQL Server knows when to call the Init() and Terminate() methods on our UDA.  The COUNT DISTINCT aggregation that is happening inside the UDA for each group could not care less about ordering, of course.  (In case you were wondering, yes, the UDA produces the same results as the original T-SQL code).

The point here is to demonstrate that multiple DISTINCT COUNT operations can be performed within a single operator, not that UDAs are necessarily always a great way to do that in general.  As far as performance is concerned, the original spool query runs in around 220ms, and the UDA settles down around 160ms, which isn’t bad, all things considered.

We can also improve the performance of the T-SQL query by rewriting it to scan the source table three times, avoiding the spools (this version executes in around 75ms).  Part of the problem here (to go with the spool plan issues mentioned earlier) is that the optimizer assumes that all queries start executing with an empty data cache, and it does not account for the fact that the three scans complete sequentially, so the pages are extremely likely to be available from memory for the second and third scans.  The rewrite and query plan are below.

WITH 
    Stream AS 
    (
        SELECT 
            th.TransactionType, 
            th.ProductID, 
            th.TransactionDate, 
            th.Quantity 
        FROM Production.TransactionHistory AS th 
        WHERE 
            th.ActualCost < $5
    ),
    CountDistinctProduct AS
    (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT ProductID) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
    ),
    CountDistinctTransactionDate AS
    (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT TransactionDate) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
    ),
    CountDistinctQuantity AS
    (
        SELECT 
            TransactionType, 
            COUNT_BIG(DISTINCT Quantity) AS c
        FROM Stream 
        GROUP BY 
            TransactionType
    )
SELECT
    p.c,
    d.c,
    q.c
FROM CountDistinctProduct AS p
JOIN CountDistinctTransactionDate AS d ON
    d.TransactionType = p.TransactionType
JOIN CountDistinctQuantity AS q ON
    q.TransactionType = d.TransactionType

[Query plan image]

Combining a Single Distinct Aggregate with other Aggregates

Marc Friedman’s blog post presented a way to rewrite T-SQL queries that contain a single distinct aggregate and one or more non-distinct aggregates so as to avoid spools or reading the source of the rows more than once.  The essence of the method is to aggregate first by the GROUP BY expressions in the query and the DISTINCT expressions in the aggregate, and then to apply some relational math to aggregate those partial aggregates to produce the final result.  I encourage you to read the full post to see all the detail, but I will quickly work through an example here too:

SELECT
    dp.EnglishProductName,
    SUM(frs.SalesAmount),
    COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
FROM dbo.FactResellerSales AS frs
JOIN dbo.DimProduct AS dp ON
    frs.ProductKey = dp.ProductKey
GROUP BY
    dp.EnglishProductName
OPTION (MAXDOP 1, RECOMPILE)

This query contains a regular SUM aggregate and a COUNT DISTINCT, so as expected the query optimizer produces a plan with an Eager Spool (click to enlarge):

[Query plan image]

To the left of the spool, the top branch performs the DISTINCT followed by the COUNT per group, and the spool replay on the lower branch of the plan computes the SUM per group.  Finally, the two branches join on the common GROUP BY expression (EnglishProductName).

The rewrite starts by grouping on EnglishProductName (the GROUP BY expression) and SalesTerritoryKey (the DISTINCT expression) to produce a partial aggregate:

SELECT 
    dp.EnglishProductName, 
    frs.SalesTerritoryKey, 
    SUM(frs.SalesAmount) AS ssa
FROM dbo.DimProduct AS dp
JOIN dbo.FactResellerSales AS frs ON 
    frs.ProductKey = dp.ProductKey
GROUP BY 
    frs.SalesTerritoryKey, 
    dp.EnglishProductName

This query contains no distinct aggregates, so we get a plan with an ordinary join and an ordinary SUM aggregate:

[Query plan image]

To produce the results specified by the original query, we now need to SUM the partial SalesAmount sums, and COUNT (ignoring NULLs) the SalesTerritoryKey values.  The final rewrite looks like this:

WITH PartialAggregate AS
(
    SELECT 
        dp.EnglishProductName, 
        frs.SalesTerritoryKey, 
        SUM(frs.SalesAmount) AS ssa
    FROM dbo.DimProduct AS dp
    JOIN dbo.FactResellerSales AS frs ON 
        frs.ProductKey = dp.ProductKey
    GROUP BY 
        frs.SalesTerritoryKey, 
        dp.EnglishProductName
)
SELECT
    pa.EnglishProductName, 
    SUM(pa.ssa),
    COUNT_BIG(pa.SalesTerritoryKey)
FROM PartialAggregate AS pa
GROUP BY pa.EnglishProductName
OPTION (RECOMPILE)

The query plan adds another layer of aggregation on top of the partial aggregate plan above:

[Query plan image]

This plan avoids the Eager Spools seen earlier and improves execution time from 320ms to 95ms in this example.  On larger sets, where parallelism becomes important and the spools might need to use physical tempdb storage, the gains are likely to be much larger.

New for SQL Server 2012 RC0

The good news is that SQL Server 2012 RC0 adds a new optimizer rule (ReduceForDistinctAggs) to perform this rewrite automatically on the original form of the query.  This is particularly good because the rewrite, while ingenious, can be somewhat inconvenient to do in practice, and all too easy to get wrong (particularly ensuring that NULL partially-aggregated groups are handled correctly).  The new optimizer rule was not available in CTP3, so you will need RC0 to see SQL Server turn this T-SQL:

SELECT
    dp.EnglishProductName,
    SUM(frs.SalesAmount),
    COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
FROM dbo.FactResellerSales AS frs
JOIN dbo.DimProduct AS dp ON
    frs.ProductKey = dp.ProductKey
GROUP BY
    dp.EnglishProductName
OPTION (RECOMPILE)

…directly into this query plan:

[Query plan image]

This transformation is only valid for queries with a single distinct aggregate (and at least one non-distinct aggregate of course).  If your query contains multiple distinct aggregates, it may not help you directly, though you may be able to refactor the T-SQL to take advantage.

If you want to see SQL Server 2012 RC0 produce the Eager Spool plan instead (created by the long-standing ExpandDistinctGbAgg rule), disable the new rule with:

DBCC RULEOFF ('ReduceForDistinctAggs')

…and then recompile.  Don’t forget to enable it again afterward using RULEON or by reconnecting to the server.
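
For example, mirroring the RULEOFF syntax above:

DBCC RULEON ('ReduceForDistinctAggs')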

Even with the new rule enabled, you may still see the spool or multiple-scan plan from time to time.  As always, the optimizer explores many alternative plan forms and chooses the one that looks cheapest.  In some cases, the optimizer may still choose the spool plan, even where it is probably not the best-performing option…

Thanks for reading; please consider voting for the following Connect item (to allow CLR UDA hash aggregates).  Thank you.

http://connect.microsoft.com/SQLServer/feedback/details/629920/allow-option-hash-group-with-sqlclr-udas

© 2011 Paul White

Twitter: @SQL_Kiwi
Email: SQLkiwi@gmail.com

Published Sunday, December 04, 2011 7:30 AM by Paul White


Comments

 

tobi said:

I think it is a very good idea to encourage Connect item voting. If you did a post on the top 10 most important Connect items from your point of view I certainly would see them through.

December 4, 2011 1:18 PM
 

Gianluca Sartori said:

Great post, as usual!

Nice implementation of the CLR UDA also. This is something I never had the chance to practise, but it seems very promising.

Thanks for sharing, Paul.

December 5, 2011 5:20 AM
 

Fabiano Amorim said:

Hi Paul, that is a very nice post, you have my vote in the connect item.

What about a solution using DENSE_RANK to calculate the distinct count? I mean...

WITH CTE_1 AS
(
    SELECT
        dp.EnglishProductName,
        SUM(frs.SalesAmount) SUMSalesAmount,
        DENSE_RANK() OVER(PARTITION BY dp.EnglishProductName ORDER BY frs.SalesTerritoryKey) AS CountDistinctSalesTerritoryKey
    FROM dbo.FactResellerSales AS frs
    JOIN dbo.DimProduct AS dp ON
        frs.ProductKey = dp.ProductKey
    GROUP BY
        dp.EnglishProductName, frs.SalesTerritoryKey
)
SELECT 
    EnglishProductName,
    SUM(SUMSalesAmount),
    MAX(CountDistinctSalesTerritoryKey)
FROM CTE_1
GROUP BY EnglishProductName
ORDER BY 1
OPTION (RECOMPILE)

This is something that works since SQL2005...

December 5, 2011 10:07 PM
 

Paul White said:

Hi Fabiano,

Yes, I think someone left a similar suggestion on Marc's post; it's the same idea (partial aggregation) of course, just implemented using a different syntax.

If there is a downside to the DENSE_RANK approach, it is that it will always need a full sort of the stream, whereas the 2012 transformation or manual partial aggregation can use hashing.

Eliminating the sort in favour of a hash will not always be optimal, perhaps, but it would be nice to give UDAs the option - so thanks for the vote on that.

Cheers,

Paul

December 6, 2011 1:44 AM
 

Brad Schulz said:

Terrific post, as usual... I probably learned at least a dozen new things by reading it.  Awesome.  Thanks, Paul.

--Brad

December 7, 2011 10:53 AM
 

Kiran Ramaswamy said:

Awesome! Thanks for the very informative post!

I'm guessing that there's no way this feature is going to appear somehow in older versions of SQL Server, such as 2005 or 2008, so I guess I should start looking at queries where I'm doing similar things and see if I can re-write them in the more optimal manner.

December 12, 2011 10:19 AM
 

Paul White said:

Hi Kiran,

I would agree there is next to no chance of the new transformation being back-ported, so those of us not adopting 2012 the moment it hits the shelves are left with rewriting queries the hard way.

Paul

December 12, 2011 11:28 AM
 

Marc Friedman said:

Great article!

March 16, 2012 7:55 PM
 

Paul White said:

Thanks very much Marc - your original article was one that I have read many times over the years, so I'm pleased (and relieved) that you like this follow-on.

March 17, 2012 2:31 AM
 

Alan R. said:

Hi Paul,

Great post! I focused on the UDA since the alternatives didn't work well in my case. I have a slow query that goes like this: select a,b,c,d,e,f,g, sum(h), count(distinct i), count(distinct j), count(distinct k). The count distinct was killing it: 2min 20secs to run the query. By commenting out all three count(distinct) expressions it dropped to 50secs. By replacing count distinct with the UDA CountDistinctInt method, it went to 55secs. So from 2:25 to 0:55 it is a huge difference. Life saver!

Thanks!

Alan

July 12, 2012 9:49 AM
