THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Undocumented Query Plans: The ANY Aggregate

As usual, here’s a sample table:

image

With some sample data:

image

And an index that will be useful shortly:

image

There’s a complete script to create the table and add the data at the end of this post.  There’s nothing special about the table or the data (except that I wanted to have some fun with values and data types).

The Task

We are asked to return distinct values of col1 and col2, together with any one value from the thing column (it doesn’t matter which) per group.  One possible result set is shown below:

image

As is so often the case, there are many ways to write a query to do this in SQL.  In fact, there are a surprising number of alternatives, a point I might return to in a future post.  For now, let’s take a look at a natural SQL query solution:

Using the MAX aggregate

imageimage

There’s absolutely nothing wrong with this query or execution plan.  The query is concise, returns the correct results, and executes quickly.  So why am I blogging about it?  It’s the aggregate function.  It bothers me that I have to use MAX (or MIN) here, when what I really want to write is something like:

image

The ANY aggregate

That’s not valid syntax of course – there’s no ANY aggregate in T-SQL.  But, just because we can’t use an ANY aggregate doesn’t mean the Query Processor can’t!

There’s an alternative way to think about the query we want to write: if we were to partition the input set on the grouping columns, and arbitrarily number the rows within each partition, we could just choose row #1 from each partition.  In T-SQL, we can use the ROW_NUMBER ranking function to do this:

image

If you are reasonably familiar with execution plans, you might expect that query to produce something like this:

image

…where the Index Scan produces rows ordered by col1 and col2, the Segment detects the start of each new group and sets a flag in the stream, the Sequence Project uses that flag to restart the row numbering for each group, and the Filter just restricts the output to rows where the row number is 1.  In fact, we don’t get a plan like that at all, we get:

image

This is the same plan produced for the query with the MAX aggregate!  Well, actually it isn’t quite the same.  If you click on the Stream Aggregate and take a look at its properties, you’ll see it isn’t doing a MAX.  The shots below show the values defined by the aggregate in the new query (shown on the left) and the original MAX query (shown on the right):

SNAGHTML2a9c2de2SNAGHTML2aaad7a9

Our new query form is using the ANY aggregate!

What Magic Is This?

SQL Server spotted that we didn’t really want to number the rows at all – we were just expressing the idea that we want one row per group, and we don’t care which one.

The optimizer can simplify a logical query tree like the Sequence Project plan shown above to use the ANY aggregate instead.  The simplification rule used to perform this is called SelSeqPrjToAnyAgg.  As the name suggests, it matches on a selection (filter) and a Sequence Project (specifically one that uses ROW_NUMBER) and replaces it with the ANY aggregate.  Also, as a simplification rule, it runs before full cost-based optimization even gets started, making it available even in trivial plans.

This particular optimization only matches very specific plan shapes, so you have to be careful with it.  The filter has to restrict the expression produced by the Sequence Project to be equal to one.  In our query, that correlates to the WHERE rn = 1 expression.  You also have to PARTITION BY and ORDER BY the grouping columns (though the order does not matter), and specifying a constant for the ORDER BY clause of the ranking function (to indicate that you don’t care about ordering) does not work:

image

That query (and ones like it that use (ORDER BY … SELECT 0, NULL, or NEWID() etc.) fail to match with the rule, resulting in the Segment, Sequence Project, and Filter plan.  (If you use SELECT <constant> you will see an extra Compute Scalar here):

image

If you want to take advantage of the ANY aggregate rewrite, you have to be careful to match the conditions for the rewrite exactly.  Even including the ROW_NUMBER column (rn) in the outer SELECT is enough to break the rule matching.  It really is quite sensitive, bless it.

Invisible ANY

Let’s drop the index we created earlier:

image

Now if we run the rule-matching ROW_NUMBER form of the query, we get this plan instead:

image

As I mentioned in my previous post on Row Goals and Grouping, a Sort followed by a Stream Aggregate can be transformed to a Sort running in Distinct Sort mode.  The Stream Aggregate (with its ANY aggregate) is subsumed by the Sort, and in the process, the ANY aggregation is lost to us – it’s logically still there, but not exposed in the query plan, not even in the XML:

image

The Sort produces the three columns needed (col1, col2, and thing), and performs Distinct ordered by col1 and col2, but there’s no longer any explicit reference to the ANY aggregate performed on the thing column.  To make it reappear, we need to temporarily disable the optimizer rule responsible for the transformation to Distinct Sort:

image

Now we see a plan with the separate Sort and a Stream Aggregate containing the ANY aggregate:

image

The final thing I want to show today is the ANY aggregate working in a Hash Match.  If we use an OPTION (HASH GROUP) query hint, we get this plan:

image

As promised, you can find the script for today’s entry below.  For more information on optimizer rules and internals, see my previous mini-series:

Inside the Optimizer: Constructing a Plan

© 2011 Paul White

email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

IF    OBJECT_ID(N'tempdb.bdpmet.#Example', N'U')
    IS NOT NULL
    DROP TABLE tempdb.penguin.#Example;
GO
    CREATE TABLE #Example
    (
        pk        NUMERIC IDENTITY PRIMARY KEY NONCLUSTERED,
        col1    SQL_VARIANT NULL,
        col2        SQL_VARIANT NULL,
        thing    SQL_VARIANT NULL,
    );
GO
INSERT #Example
    (col1, col2, thing)
VALUES
    ('A1', CONVERT(SQL_VARIANT, $100), CONVERT(SQL_VARIANT, PI())),
    ('A1', $100, {guid '1D008813-8E80-4821-A481-1A0DE5C4F4DC'}),
    ('A1', $100, 7.297352569824),
    ('A1', N'-U-', 1.3E8),
    ('A1', N'-U-', 9.10938291),
    ('A1', N'-U-', @@SERVICENAME),
    ('A2', {d '2011-07-11'}, 'aotearoa'),
    ('A2', {d '2011-07-11'}, 0xDEADBEEF),
    ('A2', {d '2011-07-11'}, N'संहिता'),
    ('A3', 1.054571726, {fn CURRENT_TIME}),
    ('A3', 1.054571726, RADIANS(RAND())),
    ('A3', 1.054571726, {fn DAYNAME (0)});
GO
CREATE INDEX nc1 ON #Example (col1, col2, thing);
GO
-- A natural query:
SELECT
    e.col1,
    e.col2,
    MAX(e.thing)
FROM #Example AS e
GROUP BY
    e.col1,
    e.col2;
GO
-- Would prefer:
SELECT
    e.col1,
    e.col2,
    ANY (e.thing)
FROM #Example AS e
GROUP BY
    e.col1,
    e.col2;
GO
-- Transformed to use ANY
SELECT
    e.col1,
    e.col2,
    e.thing
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY e2.col1, e2.col2)
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1;
GO
-- Not matched, no ANY aggregate
SELECT
    e.col1,
    e.col2,
    e.thing
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY (SELECT 0))
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1;
GO
-- Not matched - we use the rn column
SELECT
    e.col1,
    e.col2,
    e.thing,
    e.rn
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY e2.col1, e2.col2)
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1;
GO
DROP INDEX nc1 ON #Example;
GO
-- Distinct Sort with invisible ANY
SELECT
    e.col1,
    e.col2,
    e.thing
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY e2.col1, e2.col2)
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1;
GO
-- Disable rule
DBCC RULEOFF('GbAggToSort');
 
-- Same query, recompile to force re-optimization
SELECT
    e.col1,
    e.col2,
    e.thing
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY e2.col1, e2.col2)
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1
OPTION (RECOMPILE);
 
-- Enable rule again (*** IMPORTANT! ***)
DBCC RULEOFF('GbAggToSort');
GO
-- ANY aggregate with Hash Match
SELECT
    e.col1,
    e.col2,
    e.thing
FROM
(
    SELECT
        *,
        rn = ROW_NUMBER() OVER (
            PARTITION BY e2.col1, e2.col2
            ORDER BY e2.col1, e2.col2)
    FROM #Example AS e2
) AS e
WHERE
    e.rn = 1
OPTION (HASH GROUP);
Published Saturday, July 02, 2011 4:51 AM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Alejandro Mesa said:

Paul,

Great post, as always!

I know that things can change from version to version, or even CU to CU, but I would enjoy reading a book with all your findings.

Thanks for sharing it with us.

--

AMB

July 1, 2011 12:07 PM
 

KKline said:

What a great read, Paul. There is an ANY clause in the SQL standard, which is supported by other DBMS'es.

Btw, where can you even read about SelSeqPrjToAnyAgg?  Is that expressed in the query plan details somewhere?

Thanks and keep up the excellent work,

-Kev

July 1, 2011 12:59 PM
 

Meher said:

Very good post Paul. Thank you.

July 1, 2011 1:11 PM
 

Paul White said:

Hi AMB,

I'd love to put a book together one day - probably a PDF or something.  I'll get around to it I expect.  For the time being, I do love blogging here, and I have plenty to keep me busy these days :c)

Thanks so much for taking the time to comment - I see the read numbers go up, but it's still nice to hear from people so I know it's not just one person hitting F5!

Paul

July 1, 2011 1:24 PM
 

Paul White said:

Hey Kevin,

Thanks very much.  I wasn't sure whether ANY was a SQL standard thing or not, so cheers for clearing that up for me.

The optimizer rules are not at all documented anywhere, which is a great shame.  If you follow the link at the end of this entry, you'll see the stuff I have posted before about how to work out which rules are being called, though finding out what they do is trial-and-error, and really quite labour-intensive!  Fun though :)

Paul

July 1, 2011 1:35 PM
 

Gianluca Sartori said:

Great post, Paul, as usual!

@Kevin: I checked DB2, Oracle, PostgreSQL, MySQL, SQLLite and (ugh!) Access and none of these support the ANY aggregate, while most of them supports user-defined aggregate functions. Are you sure this is part of a SQL standard (92, 99, 2005)?

Thanks,

Gianluca

July 4, 2011 12:18 PM
 

Kevin Feasel said:

The ANY keyword in ANSI SQL isn't really an aggregate, at least in the same way as MIN, MAX, etc.  There are some examples at http://sqlnut.atw.hu/sqlnut2-chp-4-sect-2.html, but an Adventure Works example using ANY (SOME and ALL work similarly) would be:

select *

from HumanResources.Employee

where LoginID =

  ANY(select loginid from HumanResources.employee where managerid = 185)

Here, ANY is used basically like an IN or EXISTS statement:  where the main query matches any of the results in the subquery.  The execution plan is the same as:

select *

from HumanResources.Employee

where LoginID in (select LoginID from HumanResources.Employee where ManagerID = 185)

or:

select *

from HumanResources.Employee he

where exists (select LoginID from HumanResources.Employee e where he.LoginID = e.LoginID and e.ManagerID = 185)

And @Paul, I would make room on my shelf somewhere for whatever book you write, if only to catalog the insane engine quirks you've found over the years...

July 4, 2011 10:05 PM
 

Brad Schulz said:

Excellent, Paul...

I know I had seen that ANY() Aggregate before when digging into query plans in the past and I made a mental note to investigate further, but it fell through the cracks.  Thanks for the info!

--Brad

July 5, 2011 7:38 PM
 

Paul White said:

Hi Brad,

Yes the ANY 'aggregate' pops up all over the place in plans.  One other example is where SQL Server needs to check that a single value is returned by a subquery - just before the Assert.

Paul

July 6, 2011 12:30 AM
 

Rabia said:

LOL this is most excellent. My life would be easier if they just let me use the ANY aggregate in my query.

August 20, 2014 12:10 AM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement