THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
 in Peter Larsson (Entire Site) Search

Thinking outside the box
• Median and Weighted Median calculations

 Hi all!This is my first blog post here at sqlblog.com after some convincing arguments made by Adam Machanic. Some of you know me as Peso, and I have mostly blogged at SQLTeam.com before.Yesterday Joe Celko posted a query on microsoft.public.sqlserver.programming newsgroup of how to write an elegant query for Weighted Median calculation.He managed to get the correct results but also ended up with some pretty ugly code. Joe also wrote he had a feeling of an elegant query should exist but was not seeing it.Anyway, since some of Celko's books have helped me in the past, I thought I should return the favor.Consider this sample dataDECLARE @Foo TABLE        (            x SMALLMONEY NOT NULL        )INSERT  @FooVALUES  (1),        (2),        (2),        (3),        (3),        (3)The most common approach to calculate the median value I have seen isSELECT  AVG(x)FROM    (            SELECT  x,                    ROW_NUMBER() OVER (ORDER BY x DESC) AS a,                    ROW_NUMBER() OVER (ORDER BY x) AS b            FROM    @Foo        ) AS dWHERE   b - a BETWEEN -1 AND 1 It's a great method and it works in almost all cases. Yes, in almost all cases. There are circumstances where SQL Server will not give the correct result.Adam Machanic has described the problem in this Connect issue. It seems the table need some key to work properly, but as seen with Celko's sample data, there is no key in this scenario.x       a  b  a-b1.0000  6  1    52.0000  4  2    22.0000  5  3    23.0000  1  4   -33.0000  2  5   -33.0000  3  6   -3As you can see, the difference calculated by a-b suddenly doesn't match!How can we overcome this behaviour? How can we write a piece of code that works all times?It's not that hard. See this query.-- Median by Peso SELECT  AVG(x)FROM    (            SELECT  x,                    2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y            FROM    @Foo        ) AS dWHERE   y BETWEEN 0 AND 2 Yes it works! And now how to do the weighted median? Well, we follow the same approach and write this piece of code.-- Weighted Median by PesoSELECT  SUM(y) / SUM(t)FROM    (            SELECT  SUM(x) OVER (PARTITION BY x) AS y,                    2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z,                    COUNT(*) OVER (PARTITION BY x) AS t            FROM    @Foo        ) AS dWHERE   z BETWEEN 0 AND 2I think the query qualifies as elegant. Let's see if Celko thinks the same.As you can see, the query is only slightly more complicated than the normal Median. I leave the math behind it as an exercise to the reader.//Peso Posted Friday, September 18, 2009 10:02 AM by Peso | 13 Comments Filed under: Algorithm, Math, Optimization