THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Aaron Bertrand

Take the high road to middle ground

A long, long, time ago, I wrote an article on calculating the median in a table.

http://www.aspfaq.com/2506

Of course, this was against SQL Server 2000; long before we had seen but glimpes of upcoming features in "Yukon" like Common Table Expressions (CTEs) and ROW_NUMBER().

Well, now SQL Server 2005 has been with us for a while, and I am going back to see how many of my "solutions" could be improved upon.

Last night, I went to a talk by Itzik Ben-Gan at the monthly meeting for the New England SQL User Group, and he reminded me how powerful ROW_NUMBER() could be in a situation like calculating the median. So, we have a good candidate here, right?

Let's take this trivial table:

USE tempdb;
GO

CREATE TABLE dbo.MedianTest
(
	ClientID INT,
	Requests INT
);
GO

CREATE CLUSTERED INDEX c 
	ON dbo.MedianTest(ClientID, Requests);
GO

SET NOCOUNT ON;
GO

INSERT dbo.MedianTest 
	SELECT 1, 50
	UNION ALL SELECT 1, 40
	UNION ALL SELECT 1, 24
	UNION ALL SELECT 2, 25
	UNION ALL SELECT 2, 75
	UNION ALL SELECT 3, 10
	UNION ALL SELECT 3, 2
	UNION ALL SELECT 3, 7
	UNION ALL SELECT 3, 12
	UNION ALL SELECT 4, 22
	UNION ALL SELECT 4, 26
	UNION ALL SELECT 4, 31;
GO

Now, by following the example in the article above, I would get the median for each ClientID as follows:

SELECT DISTINCT
	ClientID,
	Median = (
		 (SELECT MAX(Requests) FROM
		   (SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tA
			WHERE tA.ClientID = T.ClientID
			ORDER BY Requests) TopHalf)
		 +
		 (SELECT MIN(Requests) FROM
		   (SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tB
			WHERE tB.ClientID = T.ClientID
			ORDER BY Requests DESC) BottomHalf)
		) / 2 
	FROM dbo.MedianTest T
	ORDER BY ClientID;
GO

To do this in SQL Server 2005, we can get a huge increase in performance by using a CTE and ROW_NUMBER():

WITH cte AS
(
	SELECT
		ClientID,
		Requests,
		rn = ROW_NUMBER() OVER
		(
			PARTITION BY ClientID
			ORDER BY Requests
		),
		rc = COUNT(*) OVER
		(
			PARTITION BY ClientID
		)
	FROM
		dbo.MedianTest
)
	SELECT
		ClientID, 
		Median = AVG(Requests)
	FROM
		cte
	WHERE
		rn IN ((rc + 1) / 2, (rc + 2) / 2)
	GROUP BY
		ClientID
	ORDER BY
		ClientID;
GO

On this size of data, you're not going to see a noticeable difference. But imagine a table much, much larger. You may be astonished if you run these together and turn on the actual execution plan. The "old" solution uses three independent sort operations, and on my hardware the cost ratio for the overall queries is 98(old):2(new).

Don't forget to clean up:

DROP TABLE dbo.MedianTest;
GO

Published Friday, December 15, 2006 1:31 PM by AaronBertrand

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Aaron said:

Adam Machanic posted me offline, suggesting that I investigate what he called "Celko's method."  He said that it gave better performance against the Sales.SalesOrderHeader table in AdventureWorks.  I don't happen to have the sample database installed here, so I can't test right now, and will have to take his word for it that Celko's method had a fraction of the logical reads compared to the above.  Here is the approach adapted to my sample here:

SELECT 
ClientID,
Median = AVG(Requests)
FROM
(
SELECT
ClientID,
Requests,
RowAsc = ROW_NUMBER() OVER
(
PARTITION BY ClientID
ORDER BY Requests
),
RowDesc = ROW_NUMBER() OVER
(
PARTITION BY ClientID
ORDER BY Requests DESC
)
FROM
dbo.MedianTest
) x
WHERE
RowAsc IN
(
RowDesc,
RowDesc - 1,
RowDesc + 1
)
GROUP BY
ClientID
ORDER BY
ClientID;
December 18, 2006 6:25 AM
 

Adam Machanic said:

A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005...
December 18, 2006 3:30 PM
 

Peter DeBetta's SQL Programming Blog said:

I was going to comment on Adam's post "Medians, ROW_NUMBERs, and performance" (which was a follow-up to...
December 20, 2006 6:18 AM
 

Peter DeBetta's SQL Programming Blog said:

I was going to comment on Adam's post " Medians, ROW_NUMBERs, and performance " (which was a follow-up

January 2, 2007 4:12 PM
 

Adam Machanic said:

A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005

January 8, 2007 1:17 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement