A long, long, time ago, I wrote an article on calculating the median in a table.
http://www.aspfaq.com/2506
Of course, this was against SQL Server 2000; long before we had seen but glimpes of upcoming features in "Yukon" like Common Table Expressions (CTEs) and ROW_NUMBER().
Well, now SQL Server 2005 has been with us for a while, and I am going back to see how many of my "solutions" could be improved upon.
Last night, I went to a talk by Itzik Ben-Gan at the monthly meeting for the New England SQL User Group, and he reminded me how powerful ROW_NUMBER() could be in a situation like calculating the median. So, we have a good candidate here, right?
Let's take this trivial table:
USE tempdb;
GO
CREATE TABLE dbo.MedianTest
(
ClientID INT,
Requests INT
);
GO
CREATE CLUSTERED INDEX c
ON dbo.MedianTest(ClientID, Requests);
GO
SET NOCOUNT ON;
GO
INSERT dbo.MedianTest
SELECT 1, 50
UNION ALL SELECT 1, 40
UNION ALL SELECT 1, 24
UNION ALL SELECT 2, 25
UNION ALL SELECT 2, 75
UNION ALL SELECT 3, 10
UNION ALL SELECT 3, 2
UNION ALL SELECT 3, 7
UNION ALL SELECT 3, 12
UNION ALL SELECT 4, 22
UNION ALL SELECT 4, 26
UNION ALL SELECT 4, 31;
GO
Now, by following the example in the article above, I would get the median for each ClientID as follows:
SELECT DISTINCT
ClientID,
Median = (
(SELECT MAX(Requests) FROM
(SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tA
WHERE tA.ClientID = T.ClientID
ORDER BY Requests) TopHalf)
+
(SELECT MIN(Requests) FROM
(SELECT TOP 50 PERCENT Requests FROM dbo.MedianTest tB
WHERE tB.ClientID = T.ClientID
ORDER BY Requests DESC) BottomHalf)
) / 2
FROM dbo.MedianTest T
ORDER BY ClientID;
GO
To do this in SQL Server 2005, we can get a huge increase in performance by using a CTE and ROW_NUMBER():
WITH cte AS
(
SELECT
ClientID,
Requests,
rn = ROW_NUMBER() OVER
(
PARTITION BY ClientID
ORDER BY Requests
),
rc = COUNT(*) OVER
(
PARTITION BY ClientID
)
FROM
dbo.MedianTest
)
SELECT
ClientID,
Median = AVG(Requests)
FROM
cte
WHERE
rn IN ((rc + 1) / 2, (rc + 2) / 2)
GROUP BY
ClientID
ORDER BY
ClientID;
GO
On this size of data, you're not going to see a noticeable difference. But imagine a table much, much larger. You may be astonished if you run these together and turn on the actual execution plan. The "old" solution uses three independent sort operations, and on my hardware the cost ratio for the overall queries is 98(old):2(new).
Don't forget to clean up:
DROP TABLE dbo.MedianTest;
GO