THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Page Free Space

See also my SQL Server technical articles on

Partitioning and the Common Subexpression Spool

SQL Server 2005 introduced the OVER clause to enable partitioning of rowsets before applying a window function.  This post looks at how this feature may require a query plan containing a 'common subexpression spool'.  This query plan construction is required whenever an aggregate window function or the NTILE ranking window function is used.


To illustrate, here is a simple query based on the AdventureWorks sample database.  The AdventureWorks product warehouse is organised into shelves, with multiple bins per shelf.  Each bin can hold several different products.  We have been asked to produce a report with the following (partial) output:


Notice that the total_in_bin column contains the sum of product quantities, partitioned by shelf and bin.  We can meet the requirements using a query featuring the OVER clause:

   1: SELECT  INV.Shelf,
   2:         INV.Bin,
   3:         INV.ProductID,
   4:         INV.Quantity,
   5:         total_in_bin = 
   6:             SUM(INV.Quantity) OVER (
   7:                 PARTITION BY INV.Shelf, INV.Bin)
   8: FROM    Production.ProductInventory INV
   9: WHERE   INV.Shelf BETWEEN 'A' AND 'C'
  10: ORDER   BY
  11:         INV.Shelf,
  12:         INV.Bin;

That produces the correct results, with this interesting query plan:


A special kind of spool

The spools shown in the above plan are very similar to those you will find in wide (per-index) update plans, where a set of data changes is saved once, and replayed to multiple single-index-updating consumers.  This kind of spool is known as a Common Subexpression Spool, and the particular type in play here is called a Segment Spool.  A Segment Spool co-operates with its child Segment iterator to save and replay one group at a time (as defined by the PARTITION BY clause).

The Segment Spool iterator always appears as the immediate parent of a Segment iterator.  The two leaf-level Table Spool iterators shown in the plan are secondary spools, which only replay rows saved by the primary spool.  The way this query plan produces its results is perhaps not entirely intuitive (until you are familiar with the pattern) so we'll explore it a step at a time.

The execution plan

The Segment iterator (see my previous post) detects when a row arriving from the Index Seek belongs to a new group (as specified in the PARTITION BY expression).  Segment adds a new column to the row, which is used as a flag: true if the current row represents the first in a group, false otherwise.  The new column has an auto-generated name; in my case it was named [Segment1003] and I'll refer to it by that label from here on in.

Like all spools, the Segment Spool saves a copy of the rows it receives to a worktable backed by tempdb.  Unlike a regular spool though, the saved rows are available to multiple consumers, and it processes a group at a time, clearing its worktable between groups.  The Segment Spool lazily writes rows into its worktable, until the start of a new group is signalled.  Once the segment spool has a complete group in its worktable, one row (not the whole group) is returned to its parent (the top-level Nested Loops operator in this case).

The data values stored in this one row are not important: they do not contribute to the final result.  The point is that this single row is received on the outer input of the parent Nested Loops iterator.  This causes the iterator to execute its inner input once per group.

Per-group processing

Executing the inner input causes the spooled rows (one complete group, remember) to be replayed into the Stream Aggregate, which computes the SUM of values in the Quantity column for the group.  This gives us the SUM(Quantity) value we need for the current group.  The problem is that the Stream Aggregate is destructive: the rows that contributed to the sum are no longer present in the flow, so we can't just add the result column directly.

The solution is to replay the saved group rows for a second time, from the secondary Table Spool at the lowest level of the plan. The replayed rows are joined with the single column result of the Stream Aggregate, producing the original rows plus a column containing the group sum (in every row) - exactly what is needed.

This set of rows is passed back up to the top-level Nested Loops operator, which joins it with the 'dummy' row produced by the Segment Spool to begin with.  This process is repeated for every group, building up the final output a group at a time.  It’s important to realise that it is the top-level Nested Loops iterator that builds the final output - the Segment Spool clears its worktable between groups.

The final group

You might be wondering how the Segment Spool ensures that the final group is processed.  There is no first row of a following group to set the [Segment1003] flag, so how does this last group end up in the query output?  The answer is that the Segment Spool emits an extra dummy row when it reaches the end of its input.  This extra row triggers the Nested Loops iterator to execute its inner input one last time, so picking up the final group.

There are a couple of interesting side-effects to this.  In general, the Index Seek produces rows which partition into n groups, so the extra dummy row means that the segment spool will pass n+1 rows to its parent.  The iterators on the second level of the plan (the level that contains the Stream Aggregate) will also execute n+1 times.  The Stream Aggregate will produce n rows, and the lowest secondary spool will also execute n times.

This is particularly interesting when the Index Seek produces no rows at all (n = 0).  The Segment iterator also produces no rows, but the Segment Spool produces one row, and the second-level plan iterators also execute once as a result.  You can demonstrate this for yourself by running the sample query for shelf 'Z' and examining the actual row counts between iterators.

Related reading:
RANK, DENSE_RANK & NTILE - Craig Freedman
Optimised per-index maintenance plans - Craig Freedman

© Paul White
twitter: @SQL_Kiwi

Published Wednesday, July 28, 2010 4:36 AM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



Page Free Space - Paul White said:

In my last post I showed how SQL Server 2005 and later can use a Segment Spool to efficiently implement

July 28, 2010 5:12 AM

Martin said:

Hi Paul,

Great Article.

Have you any idea why the logical reads reported by "STATISTICS IO" for these spools are so much greater (about double) than when I try and simulate what is going on manually?

Example Script Here:

March 4, 2011 8:24 AM

Paul White said:

Hi Martin,

(copied from my answer to your Stack Overflow question)

Logical reads are counted differently for worktables: there is one 'logical read' per row read. This does not mean that worktables are somehow less efficient than a 'real' spool table (quite the reverse); the logical reads are just in different units.

I believe the thinking was that counting hashed pages for worktable logical reads would not be very useful because these structures are internal to the server. Reporting rows spooled in the logical reads counter makes the number more meaningful for analysis purposes.

This insight should make the reason your formula works clear. The two secondary spools are fully read twice (2 * COUNT(*)), and the primary spool emits (number of group values + 1) rows as explained in my blog entry, giving the (COUNT(DISTINCT CustomerID) + 1) component. The plus one is for the extra row emitted by the primary spool to indicate the final group has ended.


March 4, 2011 9:20 AM

Paul White: Page Free Space said:

This is a post for T-SQL Tuesday #43 hosted by my good friend Rob Farley . The topic this month is Plan

June 11, 2013 5:00 AM

Leave a Comment

Privacy Statement