As usual, here’s a sample table:

With some sample data:

And an index that will be useful shortly:

There’s a complete script to create the table and add the data at the end of this post. There’s nothing special about the table or the data (except that I wanted to have some fun with values and data types).
The Task
We are asked to return distinct values of col1 and col2, together with any one value from the thing column (it doesn’t matter which) per group. One possible result set is shown below:

As is so often the case, there are many ways to write a query to do this in SQL. In fact, there are a surprising number of alternatives, a point I might return to in a future post. For now, let’s take a look at a natural SQL query solution:
Using the MAX aggregate


There’s absolutely nothing wrong with this query or execution plan. The query is concise, returns the correct results, and executes quickly. So why am I blogging about it? It’s the aggregate function. It bothers me that I have to use MAX (or MIN) here, when what I really want to write is something like:

The ANY aggregate
That’s not valid syntax of course – there’s no ANY aggregate in T-SQL. But, just because we can’t use an ANY aggregate doesn’t mean the Query Processor can’t!
There’s an alternative way to think about the query we want to write: if we were to partition the input set on the grouping columns, and arbitrarily number the rows within each partition, we could just choose row #1 from each partition. In T-SQL, we can use the ROW_NUMBER ranking function to do this:

If you are reasonably familiar with execution plans, you might expect that query to produce something like this:

…where the Index Scan produces rows ordered by col1 and col2, the Segment detects the start of each new group and sets a flag in the stream, the Sequence Project uses that flag to restart the row numbering for each group, and the Filter just restricts the output to rows where the row number is 1. In fact, we don’t get a plan like that at all, we get:

This is the same plan produced for the query with the MAX aggregate! Well, actually it isn’t quite the same. If you click on the Stream Aggregate and take a look at its properties, you’ll see it isn’t doing a MAX. The shots below show the values defined by the aggregate in the new query (shown on the left) and the original MAX query (shown on the right):


Our new query form is using the ANY aggregate!
What Magic Is This?
SQL Server spotted that we didn’t really want to number the rows at all – we were just expressing the idea that we want one row per group, and we don’t care which one.
The optimizer can simplify a logical query tree like the Sequence Project plan shown above to use the ANY aggregate instead. The simplification rule used to perform this is called SelSeqPrjToAnyAgg. As the name suggests, it matches on a selection (filter) and a Sequence Project (specifically one that uses ROW_NUMBER) and replaces it with the ANY aggregate. Also, as a simplification rule, it runs before full cost-based optimization even gets started, making it available even in trivial plans.
This particular optimization only matches very specific plan shapes, so you have to be careful with it. The filter has to restrict the expression produced by the Sequence Project to be equal to one. In our query, that correlates to the WHERE rn = 1 expression. You also have to PARTITION BY and ORDER BY the grouping columns (though the order does not matter), and specifying a constant for the ORDER BY clause of the ranking function (to indicate that you don’t care about ordering) does not work:

That query (and ones like it that use (ORDER BY … SELECT 0, NULL, or NEWID() etc.) fail to match with the rule, resulting in the Segment, Sequence Project, and Filter plan. (If you use SELECT <constant> you will see an extra Compute Scalar here):

If you want to take advantage of the ANY aggregate rewrite, you have to be careful to match the conditions for the rewrite exactly. Even including the ROW_NUMBER column (rn) in the outer SELECT is enough to break the rule matching. It really is quite sensitive, bless it.
Invisible ANY
Let’s drop the index we created earlier:

Now if we run the rule-matching ROW_NUMBER form of the query, we get this plan instead:

As I mentioned in my previous post on Row Goals and Grouping, a Sort followed by a Stream Aggregate can be transformed to a Sort running in Distinct Sort mode. The Stream Aggregate (with its ANY aggregate) is subsumed by the Sort, and in the process, the ANY aggregation is lost to us – it’s logically still there, but not exposed in the query plan, not even in the XML:

The Sort produces the three columns needed (col1, col2, and thing), and performs Distinct ordered by col1 and col2, but there’s no longer any explicit reference to the ANY aggregate performed on the thing column. To make it reappear, we need to temporarily disable the optimizer rule responsible for the transformation to Distinct Sort:

Now we see a plan with the separate Sort and a Stream Aggregate containing the ANY aggregate:

The final thing I want to show today is the ANY aggregate working in a Hash Match. If we use an OPTION (HASH GROUP) query hint, we get this plan:

As promised, you can find the script for today’s entry below. For more information on optimizer rules and internals, see my previous mini-series:
Inside the Optimizer: Constructing a Plan
© 2011 Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi
IF OBJECT_ID(N'tempdb.bdpmet.#Example', N'U')
IS NOT NULL
DROP TABLE tempdb.penguin.#Example;
GO
CREATE TABLE #Example
(
pk NUMERIC IDENTITY PRIMARY KEY NONCLUSTERED,
col1 SQL_VARIANT NULL,
col2 SQL_VARIANT NULL,
thing SQL_VARIANT NULL,
);
GO
INSERT #Example
(col1, col2, thing)
VALUES
('A1', CONVERT(SQL_VARIANT, $100), CONVERT(SQL_VARIANT, PI())),
('A1', $100, {guid '1D008813-8E80-4821-A481-1A0DE5C4F4DC'}),
('A1', $100, 7.297352569824),
('A1', N'-U-', 1.3E8),
('A1', N'-U-', 9.10938291),
('A1', N'-U-', @@SERVICENAME),
('A2', {d '2011-07-11'}, 'aotearoa'),
('A2', {d '2011-07-11'}, 0xDEADBEEF),
('A2', {d '2011-07-11'}, N'संहिता'),
('A3', 1.054571726, {fn CURRENT_TIME}),
('A3', 1.054571726, RADIANS(RAND())),
('A3', 1.054571726, {fn DAYNAME (0)});
GO
CREATE INDEX nc1 ON #Example (col1, col2, thing);
GO
-- A natural query:
SELECT
e.col1,
e.col2,
MAX(e.thing)
FROM #Example AS e
GROUP BY
e.col1,
e.col2;
GO
-- Would prefer:
SELECT
e.col1,
e.col2,
ANY (e.thing)
FROM #Example AS e
GROUP BY
e.col1,
e.col2;
GO
-- Transformed to use ANY
SELECT
e.col1,
e.col2,
e.thing
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY e2.col1, e2.col2)
FROM #Example AS e2
) AS e
WHERE
e.rn = 1;
GO
-- Not matched, no ANY aggregate
SELECT
e.col1,
e.col2,
e.thing
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY (SELECT 0))
FROM #Example AS e2
) AS e
WHERE
e.rn = 1;
GO
-- Not matched - we use the rn column
SELECT
e.col1,
e.col2,
e.thing,
e.rn
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY e2.col1, e2.col2)
FROM #Example AS e2
) AS e
WHERE
e.rn = 1;
GO
DROP INDEX nc1 ON #Example;
GO
-- Distinct Sort with invisible ANY
SELECT
e.col1,
e.col2,
e.thing
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY e2.col1, e2.col2)
FROM #Example AS e2
) AS e
WHERE
e.rn = 1;
GO
-- Disable rule
DBCC RULEOFF('GbAggToSort');
-- Same query, recompile to force re-optimization
SELECT
e.col1,
e.col2,
e.thing
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY e2.col1, e2.col2)
FROM #Example AS e2
) AS e
WHERE
e.rn = 1
OPTION (RECOMPILE);
-- Enable rule again (*** IMPORTANT! ***)
DBCC RULEOFF('GbAggToSort');
GO
-- ANY aggregate with Hash Match
SELECT
e.col1,
e.col2,
e.thing
FROM
(
SELECT
*,
rn = ROW_NUMBER() OVER (
PARTITION BY e2.col1, e2.col2
ORDER BY e2.col1, e2.col2)
FROM #Example AS e2
) AS e
WHERE
e.rn = 1
OPTION (HASH GROUP);