THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

Row Goals and Grouping

You might recall (from my last post) that query plans containing a row goal tend to favour nested loops or merge join over hashing.  This is because a hash join has to fully process its build input (to populate its hash table) before it can start probing for matches from its second input.  Hash join therefore has a high start-up cost, which is balanced by a lower per-row cost once probing begins.  In this post, I’ll take a look at how row goals affect grouping operations.

Grouping Strategies

While the start-up cost of hash join often makes it unsuitable for plans with a row goal, there are times when hashing operations may feature in such plans, since the Hash Match iterator also supports a streaming mode.

As an example, say we are asked to list one hundred unique first names from the AdventureWorks Contacts table:

SELECT  TOP (100) FirstName
FROM AdventureWorks.Person.Contact
GROUP BY FirstName;

-- Or, equivalently
SELECT DISTINCT TOP (100) FirstName
FROM AdventureWorks.Person.Contact;

The optimiser has two strategies to choose from here.  The first is to sort the records by FirstName, and then return one row from each group.  Conceptually, this involves a sort followed by a stream aggregate – though in practice SQL Server collapses the whole operation into a single Sort iterator, running in Distinct Sort mode:

Distinct-Sort

If rows already sorted by FirstName are available (perhaps from a previous sorting operation earlier in the plan, or an index), only the Stream Aggregate part is required:

Stream-Aggregate

The second option is to hash the FirstName values, only storing a value if it has not already been encountered.  Once all input rows have been consumed, we simply scan the (unique) stored values to produce the required output:

Hash-Aggregate

The cheapest option is the Stream Aggregate, though it does require that pre-sorted rows are available.   Otherwise, the optimiser generally prefers hashing, especially for larger input sets with fewer groups, and where no particular output order is specified.  Sort Distinct is often the highest-cost option, unless a sorted output is required.

Blocking and Streaming

The Sort Distinct and Hash Match Aggregate iterators in the plans above are both blocking: they only start to produce output after consuming the entire input, and both require a memory grant at runtime.

The Sort iterator is blocking because it cannot determine which item sorts highest without examining all records, and it requires a memory grant to cache the rows it receives.  The Hash Match iterator requires a memory grant for its hash table.  The Stream Aggregate, as the name suggests, is a streaming (non-blocking) iterator: it consumes just as many rows from its child input as are necessary to produce the next distinct row.

Where a row goal is specified, there is another implementation option.  Unlike a hash join, a Hash Match aggregate can also operate in a streaming mode, returning a new value for FirstName every time it encounters a value that it has not seen before.  The use of a hash table makes it very cheap to check if the current value has been seen already.

When running in this streaming mode, the Hash Match iterator shows that it is performing a ‘Flow Distinct’ physical operation:

Flow-Distinct

This is the plan produced by the optimiser for our sample query: there are no blocking operations, though the Hash Match does still require a memory grant to ‘remember’ the values it has already seen.

Statistics

The query plan chosen by the optimiser is a cost-based decision, based in part on statistical information.  In general, the optimiser chooses a Flow Distinct where it determines that fewer output rows are required than there are distinct values in the input set.

To see how many distinct values (and hence how many groups) the optimiser expects, we can examine the estimated execution plan for the same query (but without the TOP clause):

Expected-Group-Count[5]

This shows that the optimiser expects 1018 unique values in the FirstName column, so the optimiser will choose a Flow Distinct implementation if the row goal is 1017 or less.  You can experiment with different TOP or OPTION (FAST n) values to verify that is the case.

To see how the optimiser ‘knows’ that 1018 groups exist, let’s look at the statistical information for the FirstName column:

DBCC SHOW_STATISTICS 
(
[Person.Contact], [FirstName]
)
WITH
STAT_HEADER,
DENSITY_VECTOR,
HISTOGRAM;

Partial output:

Statistics[7]

The DISTINCT_RANGE_ROWS column shows how many distinct values are contained in each step of the histogram, excluding the RANGE_HI_KEY values.  Adding all the DISTINCT_RANGE_ROWS values together results in a total of 818, to which we add the 200 unique RANGE_HI_KEY values, producing a total number of expected unique values of 1018 – exactly as shown in the estimated plan.

This information is also available directly from the all density value circled in green.  ‘All density’ is an estimate of 1/(number of distinct values) – so taking the reciprocal of 0.0009823183 also gives the expected number of unique values.  Thanks to Ben Nevarez for pointing this out in the comments.

The full calculation performed inside the optimiser takes into account the sampled number of rows compared to the current table cardinality (which might have changed since the statistics were recorded).  In this case, the 19,972 rows sampled by the statistics exactly matches the current table cardinality, so no adjustment was necessary.

That’s all for today, next time I’ll take a look at row goals with sorting operations.

Paul

email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

Published Sunday, August 22, 2010 7:48 PM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Paul White said:

Thank you, Brad - I appreciate it :)

Hash Match in Flow Distinct mode is one of those 'interesting rarities' that I find fascinating.  Good to know that at least one other person agrees!

I haven't fully nailed down my next post yet - but there's a better-than-even chance it will ring some bells for you...watch this space I guess.

Paul

August 22, 2010 10:49 AM
 

Page Free Space: Paul White said:

When you write a query to return the first few rows from a potential result set, you’ll often use the

August 26, 2010 9:34 AM
 

Rob Farley said:

I wrote a post recently about how query tuning isn’t just about how quickly the query runs – that if

March 7, 2011 7:01 PM
 

Paul White: Page Free Space said:

As usual, here’s a sample table: With some sample data: And an index that will be useful shortly: There’s

July 1, 2011 10:51 AM
 

Paul White: Page Free Space said:

This is a post for T-SQL Tuesday #43 hosted by my good friend Rob Farley . The topic this month is Plan

June 11, 2013 5:00 AM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement