THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Adam Machanic

Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.

Medians, ROW_NUMBERs, and performance

A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005 using the ROW_NUMBER function in conjunction with the COUNT aggregate. This method (credited to Itzik Ben-Gan) is interesting, but I discovered an even better way to attack the problem in Joe Celko's Analytics and OLAP in SQL.

Rather than using a COUNT aggregate in conjunction with the ROW_NUMBER function, Celko's method uses ROW_NUMBER twice: Once with an ascending sort, and again with a descending sort. The output rows can then be matched based on the ascending row number being within +/- 1 of the descending row number.  This becomes clearer with a couple of small examples:

A

1

4

B

2

3

C

3

2

D

4

1

 

A

1

5

B

2

4

C

3

3

D

4

2

E

5

1


In the first table (even number of rows), the median rows are B and C. These can be matched based on [Ascending Column] IN ([Descending Column] + 1, [Descending Column] - 1). In the second table (odd number of rows), the median row is C, which is matched where [Ascending Column] = [Descending Column]. Note that in the second table, the match criteria for the first table does not apply -- so the generic expression to match either case is the combination of the two:  [Ascending Column] IN ([Descending Column], [Descending Column] + 1, [Descending Column] - 1).


We can apply this logic within the AdventureWorks database to find the median of the "TotalDue" amount in the Sales.SalesOrderHeader table, for each customer:


SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

 


The equivalent logic using Itzik Ben-Gan's method follows:



SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue) AS RowNum,
       COUNT(*) OVER (
          PARTITION BY CustomerId) AS RowCnt
   FROM Sales.SalesOrderHeader
) x
WHERE
   RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY CustomerId
ORDER BY CustomerId;

 


Taking a look at the estimated execution plans for these two queries, we might believe that Ben-Gan's method is superior: Celko's algorithm requires an expensive intermediate sort operation and has an estimated cost of 4.96, compared to 3.96 for Ben-Gan's.


Remember that these are merely estimates. And as it turns out, this is one of those times that the Query Optimizer's cost estimates are are totally out of line with the reality of what happens when you actually run the queries. Although the performance difference is not especially noticeable on a set of data as small as that in Sales.SalesOrderHeader, check out the STATISTICS IO output. Celko's version does 703 logical reads; Ben-Gan's does an astonishing 140110!


There is a good lesson to be learned from this: Cost-based optimization is far from perfect! Never completely trust what estimates tell you; they've come a long way, but clearly there is still some work to do in this area. The only way to actually determine that one query is better than another is to run it against a realistic set of data and look at how much IO and CPU time is actually used.


In this case, Ben-Gan's query probably should perform better than Celko's. It seems odd that the Query Processor can't collect the row counts at the same time it processes the row numbers. Regardless, as of today this is the best way to solve this problem... Not that I've ever needed a median in any production application I've worked on. But I suppose that's beside the point!

Published Monday, December 18, 2006 3:52 PM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Adam Machanic said:

Paul Nielsen asked me for clarification on the final paragraph and why Ben-Gan's "should" perform better.  Here's my response:

Ben-Gan's -should- perform better, because the QP shouldn't have to do a table spool to get the row count at the same time it does the row number.  Row numbering is essentially the same operation as row counting (a walk over the rows that satisfy a given predicate, except that row numbering has to be ordered/sequential), so I would expect a shortcut by the QP in that case -- but the SQL Server team has obviously not implemented anything along those lines.  If the
QP did behave that way, Ben-Gan's query would be much better than Celko's because it wouldn't have to do an intermediate sort (assuming a supporting index existed).
December 18, 2006 4:46 PM
 

Peter DeBetta's SQL Programming Blog said:

I was going to comment on Adam's post "Medians, ROW_NUMBERs, and performance" (which was a follow-up to...
December 20, 2006 6:18 AM
 

Peter DeBetta's SQL Programming Blog said:

I was going to comment on Adam's post " Medians, ROW_NUMBERs, and performance " (which was a follow-up

January 2, 2007 4:12 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has been involved in dozens of SQL Server implementations for both high-availability OLTP and large-scale data warehouse applications, and has optimized data access layer performance for several data-intensive applications. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "Expert SQL Server 2005 Development" (Apress, 2007) and "Inside SQL Server 2005: Query Tuning and Optimization" (Microsoft Press, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server and a Microsoft Certified IT Professional (MCITP).
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement