THE SQL Server Blog Spot on the Web


Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

# Medians, ROW_NUMBERs, and performance

A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005 using the ROW_NUMBER function in conjunction with the COUNT aggregate. This method (credited to Itzik Ben-Gan) is interesting, but I discovered an even better way to attack the problem in Joe Celko's Analytics and OLAP in SQL.

Rather than using a COUNT aggregate in conjunction with the ROW_NUMBER function, Celko's method uses ROW_NUMBER twice: once with an ascending sort, and again with a descending sort. The output rows can then be matched based on the ascending row number being within +/- 1 of the descending row number. This becomes clearer with a couple of small examples:

| Value | Ascending | Descending |
| --- | --- | --- |
| A | 1 | 4 |
| B | 2 | 3 |
| C | 3 | 2 |
| D | 4 | 1 |

| Value | Ascending | Descending |
| --- | --- | --- |
| A | 1 | 5 |
| B | 2 | 4 |
| C | 3 | 3 |
| D | 4 | 2 |
| E | 5 | 1 |

In the first table (even number of rows), the median rows are B and C. These can be matched based on [Ascending Column] IN ([Descending Column] + 1, [Descending Column] - 1). In the second table (odd number of rows), the median row is C, which is matched where [Ascending Column] = [Descending Column]. Note that in the second table, the match criterion for the first table does not apply -- so the generic expression to match either case is the combination of the two: [Ascending Column] IN ([Descending Column], [Descending Column] + 1, [Descending Column] - 1).
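The matching logic is easier to see outside of SQL. Here's a hypothetical Python sketch (not from the post) of the same idea: rank the sorted values ascending and descending, keep the rows whose two ranks are within one of each other, and average them. Positional indexing in the sorted list plays the role of the unique-key tie-break in the ORDER BY:

```python
def median_by_row_numbers(values):
    """Median via Celko's dual-row-number matching."""
    ordered = sorted(values)
    n = len(ordered)
    middle = [
        v
        for i, v in enumerate(ordered, start=1)  # i = ascending row number
        if i in (n - i + 1, n - i, n - i + 2)    # n - i + 1 = descending row number
    ]
    # Odd count: one matching row; even count: two matching rows to average.
    return sum(middle) / len(middle)
```

With an even count the ascending and descending ranks of the two middle rows differ by exactly one, so both survive the filter; with an odd count only the single middle row has equal ranks.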

We can apply this logic within the AdventureWorks database to find the median of the "TotalDue" amount in the Sales.SalesOrderHeader table, for each customer:

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
    FROM Sales.SalesOrderHeader
) x
WHERE
    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

The equivalent logic using Itzik Ben-Gan's method follows:

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue) AS RowNum,
        COUNT(*) OVER (
            PARTITION BY CustomerId) AS RowCnt
    FROM Sales.SalesOrderHeader
) x
WHERE
    RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY CustomerId
ORDER BY CustomerId;
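The WHERE clause relies on integer division, just as T-SQL's / operator does for integer operands: (RowCnt + 1) / 2 and (RowCnt + 2) / 2 collapse to the same middle row when the count is odd, and to the two middle rows when it's even. A hypothetical Python sketch of that index math (not from the post):

```python
def median_by_count(values):
    """Median via Ben-Gan's row-number-plus-count matching."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer division mirrors T-SQL's behavior on integers:
    # n = 5 -> positions 3 and 3 (one middle row, counted twice);
    # n = 4 -> positions 2 and 3 (the two middle rows).
    middle = [ordered[(n + 1) // 2 - 1], ordered[(n + 2) // 2 - 1]]
    return sum(middle) / 2
```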

Taking a look at the estimated execution plans for these two queries, we might believe that Ben-Gan's method is superior: Celko's algorithm requires an expensive intermediate sort operation and has an estimated cost of 4.96, compared to 3.96 for Ben-Gan's.

Remember that these are merely estimates. And as it turns out, this is one of those times when the Query Optimizer's cost estimates are totally out of line with the reality of what happens when you actually run the queries. Although the performance difference is not especially noticeable on a set of data as small as that in Sales.SalesOrderHeader, check out the STATISTICS IO output. Celko's version does 703 logical reads; Ben-Gan's does an astonishing 140110!

There is a good lesson to be learned from this: cost-based optimization is far from perfect! Never completely trust what the estimates tell you; they've come a long way, but clearly there is still some work to do in this area. The only way to actually determine that one query is better than another is to run both against a realistic set of data and look at how much IO and CPU time each actually uses.

In this case, Ben-Gan's query probably should perform better than Celko's. It seems odd that the Query Processor can't collect the row counts at the same time it processes the row numbers. Regardless, as of today this is the best way to solve this problem... Not that I've ever needed a median in any production application I've worked on. But I suppose that's beside the point!
Published Monday, December 18, 2006 3:52 PM by Adam Machanic

Paul Nielsen asked me for clarification on the final paragraph and why Ben-Gan's "should" perform better.  Here's my response:

Ben-Gan's -should- perform better, because the QP shouldn't have to do a table spool to get the row count at the same time it does the row number. Row numbering is essentially the same operation as row counting (a walk over the rows that satisfy a given predicate, except that row numbering has to be ordered/sequential), so I would expect a shortcut by the QP in that case -- but the SQL Server team has obviously not implemented anything along those lines. If the QP did behave that way, Ben-Gan's query would be much better than Celko's because it wouldn't have to do an intermediate sort (assuming a supporting index existed).
December 18, 2006 4:46 PM

#### Peter DeBetta's SQL Programming Blog said:

I was going to comment on Adam's post "Medians, ROW_NUMBERs, and performance" (which was a follow-up to...
December 20, 2006 6:18 AM


#### Michelle Ufford said:

Thank you for this post, Adam!  :)

March 16, 2011 11:48 AM

#### Mike L said:

This is a very, very cool solution. :)  Thanks for sharing it.

August 23, 2011 10:50 AM

#### Fkoffee said:

thanks a ton!!!

May 14, 2012 1:56 PM

#### Mike said:

Shouldn't the row_number(s) be labeled the opposite order?

As shown here below?

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowDesc, --Changed
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowAsc --Changed
    FROM Sales.SalesOrderHeader
) x
WHERE
    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

May 24, 2013 3:27 PM

Mike,

Not unless I've woken up in Bizarro World.

May 25, 2013 10:54 AM

#### Mike said:

I thought it odd too... so I tried a few more things.  Here is what I found, in case it interests you.

select
    JobID,
    Age,
    row_number() over (partition by JobID order by Age asc, EmployeeID asc) as RowAsc,
    row_number() over (partition by JobID order by Age desc, EmployeeID desc) as RowDesc
from MyTable

The output shows the RowAsc column in descending order (instead of ascending order) and vice versa for the RowDesc.

If I change the SQL to this:

select
    JobID,
    Age,
    row_number() over (partition by JobID order by Age desc, EmployeeID desc) as RowDesc,
    row_number() over (partition by JobID order by Age asc, EmployeeID asc) as RowAsc
from MyTable

Then the output properly shows the RowAsc column as ascending order and vice versa for RowDesc.

I have no idea why this would matter.

May 28, 2013 2:12 PM

#### Fox said:

Is the Where statement correct?

If RowCnt=4, ((RowCnt + 1) / 2, (RowCnt + 2) / 2) is (2.5, 3) and will return value in row 3.

I think it should be

WHERE

RowNum IN ( RowCnt  / 2, (RowCnt + 2) / 2)

Am I missing anything?

Thanks

July 24, 2013 1:02 PM

#### Fox said:

Sorry, never mind.  I understand now.

July 24, 2013 1:17 PM

#### Speedbird186 said:

Unfortunately, this method does not work if there are numerous duplicates in the value column (in your case, the TotalDue). Consider this with scores. Many students end up with the same score. Then SQL Server does something really odd (this table is going to look bad in the comment, sorry):

| Score | RowAsc | RowDesc |
| --- | --- | --- |
| 71 | 10 | 1 |
| 68 | 9 | 2 |
| 64 | 6 | 3 |
| 64 | 7 | 4 |
| 64 | 8 | 5 |
| 63 | 3 | 6 |
| 63 | 4 | 7 |
| 63 | 5 | 8 |
| 62 | 2 | 9 |
| 61 | 1 | 10 |

Clearly, the median should calculate as 63.5. However, there is no result in this case.

In other words, if there is an even number of rows, there is not a row where RowAsc IN (RowDesc - 1, RowDesc + 1) because the RowDesc and RowAsc vary by more than 1 in all cases.

July 26, 2013 9:00 PM

#### Speedbird186 said:

Note on my comment above, Itzik Ben-Gan's method does produce the correct result.

July 26, 2013 9:07 PM

@speedbird186: Yes, it does work. You just need to break the tie in the ORDER BY clause by using a unique key.

July 27, 2013 4:06 PM
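The tie-break fix above can be sketched in Python (a hypothetical illustration, not from the thread): when the same unique key appears in both ORDER BY clauses, a row's descending rank is always n minus its ascending rank plus 1, so the +/-1 match recovers the middle rows even when values repeat.

```python
def middle_rows(rows):
    """rows: list of (unique_key, value) pairs. Returns the middle values
    found by matching ascending and descending ranks within +/-1."""
    n = len(rows)
    # The unique key breaks ties in BOTH sorts, mirroring SalesOrderId
    # in the two ORDER BY clauses.
    asc = sorted(rows, key=lambda r: (r[1], r[0]))
    desc = sorted(rows, key=lambda r: (r[1], r[0]), reverse=True)
    asc_rank = {r[0]: i for i, r in enumerate(asc, 1)}
    desc_rank = {r[0]: i for i, r in enumerate(desc, 1)}
    return [v for k, v in rows if asc_rank[k] - desc_rank[k] in (-1, 0, 1)]
```

Run on Speedbird186's score list (with an arbitrary unique key per row), this yields the two middle rows 63 and 64, whose average is the expected median of 63.5.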

#### ino_js said:

I am not able to understand the RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2) logic

August 5, 2013 5:23 PM

ino_js: If you write a couple of sequences out on paper I'm sure you'll see how the math works. It's just a way of matching the row number to the middle of the set.

August 7, 2013 11:25 AM

#### Bishop said:

You sir are a legend, great solution and extremely fast!

But I tend to agree with speedbird186 over duplicates; with a slight rework of your example the problem is fixed:

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        RowDesc,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue ASC, SalesOrderId ASC, RowDesc DESC) AS RowAsc
    FROM (
        SELECT
            CustomerId,
            TotalDue,
            ROW_NUMBER() OVER (
                PARTITION BY CustomerId
                ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
        FROM Sales.SalesOrderHeader
    ) x
) y
WHERE
    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

By using RowDesc to order RowAsc you make sure duplicates are ordered in such a way that "RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)" will still work perfectly.

Thanks again

October 21, 2013 9:47 AM

#### Paul Lundgren said:

Adam, you've given me a large head start on a project at work, with a very easy-to-understand explanation. I appreciate it greatly.

July 30, 2014 3:14 PM

#### Brian Nordberg said:

Speedbird186's comment about dupes is very important. This method works because you have included a key. For those altering this, ensure you retrieve something to keep the dataset unique. Otherwise duplicate values will get ignored and you could significantly skew your results.

September 5, 2014 12:11 AM

#### CarefulMe said:

In addition to ensuring you have a unique field in the order by of the row_number functions, you need to be careful in both approaches whether there are NULLs in the 'value' field that you are calculating the median over. NULL values are still given row numbers, but should be excluded from the calculation of a median.

Otherwise thanks for the great post, this is much better than the CLR options that suffer from memory issues. I had already figured out Itzik Ben-Gan's method myself when I was trying to find a better solution.

I also developed an extended piece of code that can calculate any percentile using a similar logic but it gets a little more messy.

October 1, 2014 8:49 PM


#### RenzoSQL said:

I am having a hard time getting the right results in a scenario where there are multiple duplicates and maybe one other value different than the rest. What would I get as the median in the following scenario? I am always getting the higher number in every situation; in this case I get 693.

The records are exactly in the following order:

658.00
658.00
693.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00
658.00

November 19, 2015 4:29 PM

#### n8 said:

Thanks for the correction, Bishop!  Works like a charm.

March 21, 2016 6:29 PM
