
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Fun with Aggregates

There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten.

One possible query to meet that specification is:

SELECT
    p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    p.ProductID = th.ProductID
GROUP BY
    p.ProductID,
    p.Name
HAVING
    COUNT_BIG(*) < 10;

That query correctly returns 23 rows (execution plan and data sample shown below):

[Image: execution plan and data sample]

The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten.

This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering:

SELECT p.Name
FROM
(
    SELECT 
        th.ProductID, 
        cnt = COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th 
    GROUP BY
        th.ProductID
) AS q1
JOIN Production.Product AS p 
    ON p.ProductID = q1.ProductID
WHERE
    q1.cnt < 10;

Perhaps a little surprisingly, we get a slightly different execution plan:

[Image: execution plan with the Filter below the join]

The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above.

The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive.

Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID 
    HAVING COUNT_BIG(*) < 10
);

Unfortunately, this query produces an incorrect result (86 rows):

[Image: query results, 86 rows]

The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

-- Scalar aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th 
WHERE th.ProductID = 709;
 
-- Vector aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th 
WHERE th.ProductID = 709
GROUP BY th.ProductID;
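The same scalar-versus-vector behaviour is standard SQL, not a SQL Server quirk.  As a minimal sketch (illustrative only: an in-memory SQLite table standing in for AdventureWorks, since the real T-SQL cannot run outside SQL Server):

```python
import sqlite3

# Illustrative stand-in for Production.TransactionHistory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TransactionHistory (ProductID INTEGER)")
conn.executemany("INSERT INTO TransactionHistory VALUES (?)", [(1,), (1,), (2,)])

# Scalar aggregate: no GROUP BY, so the empty input set still produces
# exactly one row; COUNT of the empty set is zero.
scalar = conn.execute(
    "SELECT COUNT(*) FROM TransactionHistory WHERE ProductID = 709"
).fetchall()

# Vector aggregate: with GROUP BY there is no group to aggregate over,
# so no row is returned at all.
vector = conn.execute(
    "SELECT COUNT(*) FROM TransactionHistory WHERE ProductID = 709 "
    "GROUP BY ProductID"
).fetchall()

print(scalar)  # [(0,)]
print(vector)  # []
```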

The estimated execution plans for these two statements are almost identical:

[Image: estimated execution plans for the two statements]

You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away.

In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for a trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other a vector aggregate.  This is something I would like to see exposed in showplan output, so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime:

[Image: runtime results of the two statements]

The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID 
    HAVING COUNT_BIG(*) BETWEEN 1 AND 9
);

The query now returns the correct 23 rows:

[Image: query results, 23 rows]

Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY th.ProductID
    HAVING COUNT_BIG(*) < 10
);

Not only do we now get correct results (23 rows), this is the execution plan:

[Image: execution plan, identical to the optimal plan shown earlier]

I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
);

I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated).

The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):

WHERE Clause

SELECT
    p.Name
FROM Production.Product AS p
WHERE 
(
    SELECT COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
) < 10;

APPLY

SELECT p.Name
FROM Production.Product AS p
CROSS APPLY
(
    SELECT NULL
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
) AS ca (dummy);

FROM Clause

SELECT q1.Name 
FROM
(
    SELECT
        p.Name, 
        cnt = 
        (
            SELECT COUNT_BIG(*) 
            FROM Production.TransactionHistory AS th 
            WHERE th.ProductID = p.ProductID 
            GROUP BY ()
        )
    FROM Production.Product AS p
) AS q1 
WHERE
    q1.cnt < 10;

This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)

SELECT q.Name 
FROM
(
    SELECT
        p.Name, 
        cnt = 
        (
            SELECT SUM(1)
            FROM Production.TransactionHistory AS th 
            WHERE th.ProductID = p.ProductID
        )
    FROM Production.Product AS p
) AS q
WHERE q.cnt < 10;

[Image: execution plan for the SUM(1) query]
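As a hint, the key is how SUM behaves over an empty set.  A minimal sketch (illustrative only: an in-memory SQLite table standing in for AdventureWorks):

```python
import sqlite3

# Illustrative stand-in for Production.TransactionHistory, left empty to
# model a product with no history rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TransactionHistory (ProductID INTEGER)")

# Scalar COUNT of the empty set is zero, but scalar SUM of the empty set
# is NULL (None in Python).
empty_count = conn.execute("SELECT COUNT(*) FROM TransactionHistory").fetchone()[0]
empty_sum = conn.execute("SELECT SUM(1) FROM TransactionHistory").fetchone()[0]
print(empty_count, empty_sum)  # 0 None

# NULL < 10 evaluates to unknown, so a "cnt < 10" filter rejects products
# with no history rows -- which COUNT's zero would wrongly pass.
rows = conn.execute("SELECT 1 WHERE NULL < 10").fetchall()
print(rows)  # []
```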

The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both.

© 2012 Paul White

Twitter: @SQL_Kiwi
email: SQLkiwi@gmail.com

Published Monday, March 12, 2012 5:49 AM by Paul White


Comments

 

PassedBI said:

Good job! It was interesting to follow your examples. A dangerously subtle nuance, though I never skip the GROUP BY clause.

March 11, 2012 2:28 PM
 

Alejandro Mesa said:

Paul,

Nice reading for a Monday morning :)

Talking about fun, I wonder why the cardinality estimate coming out of the filter changes if we use SUM(1) instead of COUNT_BIG(*) in the following query.

SELECT
    p.Name
FROM
(
    SELECT th.ProductID
    FROM Production.TransactionHistory AS th
    GROUP BY th.ProductID
    HAVING SUM(1) < 10
) AS q1
JOIN Production.Product AS p
    ON p.ProductID = q1.ProductID;

--

AMB

March 12, 2012 10:05 AM
 

Paul White said:

PassedBI,

I'm not sure I see what you mean about never skipping the GROUP BY clause.  There are certainly times when a scalar aggregate is what is required (counting all the rows in a table for example).  There are also times when the scalar semantic is necessary for correct results.  It's all about being aware of the differences.

Paul

March 12, 2012 10:26 AM
 

Paul White said:

Hi AMB,

With COUNT_BIG(*), cardinality estimation calculates an estimate from (derived) statistics.  Substituting SUM(1) results in a guess of 132.3 rows (30% of the estimated 441 rows from the aggregate).

Paul

March 12, 2012 10:33 AM
 

Alejandro Mesa said:

Thanks, Paul!

March 12, 2012 11:29 AM
 

PassedBI said:

Paul White,

I never skip the GROUP BY clause when using a HAVING condition, is what I should have written.

I have just never had a case where I needed to use HAVING without GROUP BY.

March 12, 2012 2:48 PM
 

Rishabh K said:

Hi Paul,

I really enjoy reading your posts. I think that with SUM(1) the cardinality estimate is calculated as the 'All density' for ProductID multiplied by the number of rows after the Stream Aggregate, i.e. 0.002267574 * 441 ≈ 1. That is fine, but I wonder why there is a difference between COUNT_BIG and SUM with regard to cardinality estimation. I tried with COUNT(1) but still get the same estimates:

SELECT th.ProductID, COUNT_BIG(1), SUM(1)
FROM Production.TransactionHistory AS th
GROUP BY th.ProductID
HAVING COUNT_BIG(1) < 10;

Thanks for all your great articles

March 13, 2012 7:57 AM
 

AlexK said:

Paul,

I enjoyed reading your post, but I think the HAVING clause is redundant. As such, it should be eliminated; that would simplify and improve SQL. There would be fewer things to learn, and fewer chances to shoot ourselves in the foot.

What do you think?

March 13, 2012 10:01 AM
 

Paul White said:

Rishabh K,

Thanks.  It's the other way around: the cardinality estimate for COUNT is derived from statistics but SUM(1) results in a 30% guess.

Anyway, the reason for the difference is just that cardinality estimation knows how to predict a value for COUNT, but not for SUM(1), so it guesses.

I suppose cardinality estimation could treat the special case of SUM(1) as COUNT (though they produce different results on empty sets) but it doesn't.  More generally, SUM(1) is just a special case of SUM(constant) and there's no sensible way to handle statistics given a comparison on that result.

Finally, COUNT(1) and COUNT(*) are the same.  When you write COUNT(1) in a query, the query plan will show the aggregate as count star.

Paul

March 13, 2012 10:05 AM
 

Paul White said:

Hi Alex,

Good to hear from you.  I'm starting to think I over-emphasised the HAVING examples :)  Forgive me if I am misunderstanding you, but this post is not about HAVING - it is about the subtle semantic differences between aggregate queries with and without GROUP BY:

-- Try with the GROUP BY commented and uncommented
-- 23 versus 86 rows
-- No HAVING in sight!
SELECT
    p.Name
FROM Production.Product AS p
WHERE
(
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    --GROUP BY p.ProductID
) < 10;

Anyway, yes I agree HAVING is pure sugar; we can always replace it with a filter over a derived table.  Nevertheless, it is standard SQL and hardly likely to be removed, whatever I think about it.

Paul

March 13, 2012 10:15 AM
 

Praveen kumar pddi said:

Hi Paul,

Can you try this variation of the solution and provide your valuable input by comparing it to the existing solutions?

SELECT p.Name --,rn,cnt
FROM
(
    SELECT *
    FROM
    (
        SELECT *,
            ROW_NUMBER() OVER (PARTITION BY ProductID, cnt ORDER BY ProductID, cnt) AS rn
        FROM
        (
            SELECT
                th.ProductID,
                cnt = COUNT_BIG(th.ProductID) OVER (PARTITION BY th.ProductID)
            FROM Production.TransactionHistory AS th
        ) AS q1
        WHERE q1.cnt < 10 -- here, we are interested in ten or fewer
    ) AS q1
    -- the inner query returns multiple rows per ProductID, since we are
    -- fetching from a transactional table; rn = 1 keeps one row per
    -- product, removing duplicates
    WHERE rn = 1
) AS q1
JOIN Production.Product AS p
    ON p.ProductID = q1.ProductID;

March 19, 2012 9:45 AM
