SQLblog.com - The SQL Server blog spot on the web

Peter Larsson

Thinking outside the box

Median and Weighted Median calculations

Hi all!

This is my first blog post here at sqlblog.com after some convincing arguments made by Adam Machanic.
Some of you know me as Peso, and I have mostly blogged at SQLTeam.com before.

Yesterday Joe Celko posted a question on the microsoft.public.sqlserver.programming newsgroup asking how to write an elegant query for a weighted median calculation.
He managed to get the correct results but ended up with some pretty ugly code. Joe also wrote that he had a feeling an elegant query should exist, but that he was not seeing it.

Anyway, since some of Celko's books have helped me in the past, I thought I should return the favor.

Consider this sample data:


DECLARE @Foo TABLE
        (
            x SMALLMONEY NOT NULL
        )

INSERT  @Foo
VALUES  (1),
        (2),
        (2),
        (3),
        (3),
        (3)

The most common approach I have seen for calculating the median value is this:

SELECT  AVG(x)
FROM    (
            SELECT  x,
                    ROW_NUMBER() OVER (ORDER BY x DESC) AS a,
                    ROW_NUMBER() OVER (ORDER BY x) AS b
            FROM    @Foo
        ) AS d
WHERE   b - a BETWEEN -1 AND 1

It's a great method and it works in almost all cases. Yes, in almost all cases: there are circumstances where SQL Server will not give the correct result.
Adam Machanic has described the problem in this Connect issue. It seems the table needs some kind of key for the two ROW_NUMBER computations to line up, but as Celko's sample data shows, there is no key in this scenario.

x       a  b  a-b
1.0000  6  1    5
2.0000  4  2    2
2.0000  5  3    2
3.0000  1  4   -3
3.0000  2  5   -3
3.0000  3  6   -3

As you can see, the difference calculated by a - b suddenly doesn't match: with this numbering, no row passes the b - a filter at all, so the query returns no median.
How can we overcome this behaviour? How can we write a piece of code that works every time?
It's not that hard. See this query:

-- Median by Peso
SELECT  AVG(x)
FROM    (
            SELECT  x,
                    2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y
            FROM    @Foo
        ) AS d
WHERE   y BETWEEN 0 AND 2
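For readers who want to convince themselves that the filter picks exactly the middle row (or middle two rows), here is the same trick sketched in Python. The sketch and the function name are my additions; they simply mirror the SQL arithmetic.

```python
def median_peso(values):
    """Median via the single-ROW_NUMBER trick: y = 2*row_number - count,
    keep rows with 0 <= y <= 2, then average them."""
    ordered = sorted(values)
    n = len(ordered)
    middle = [x for rn, x in enumerate(ordered, start=1) if 0 <= 2 * rn - n <= 2]
    return sum(middle) / len(middle)

print(median_peso([1, 2, 2, 3, 3, 3]))  # 2.5 (mean of the 3rd and 4th rows)
print(median_peso([1, 2, 3, 4, 5]))     # 3.0 (single middle row)
```

For an even count, y hits both 0 and 2 (the two middle rows); for an odd count, only the single middle row gives y = 1.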

Yes, it works! Now, how do we calculate the weighted median? We follow the same approach and write this piece of code:

-- Weighted Median by Peso
SELECT  SUM(y) / SUM(t)
FROM    (
            SELECT  SUM(x) OVER (PARTITION BY x) AS y,
                    2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z,
                    COUNT(*) OVER (PARTITION BY x) AS t
            FROM    @Foo
        ) AS d
WHERE   z BETWEEN 0 AND 2

I think the query qualifies as elegant. Let's see if Celko thinks the same.
As you can see, the query is only slightly more complicated than the normal Median. I leave the math behind it as an exercise to the reader.
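For anyone who wants to check their answer to the exercise, here is the arithmetic on the sample data sketched in Python. The sketch and its name are mine; it mirrors the SQL exactly: y is the per-value group sum, t is the per-value group count, and the result is SUM(y) / SUM(t) over the middle rows.

```python
def weighted_median_peso(values):
    """Weighted median per the query above: for each middle row keep the
    per-value group sum (y) and group count (t), return SUM(y) / SUM(t)."""
    ordered = sorted(values)
    n = len(ordered)
    sum_y = sum_t = 0
    for rn, x in enumerate(ordered, start=1):
        if 0 <= 2 * rn - n <= 2:           # same filter as the median query
            sum_y += x * ordered.count(x)  # SUM(x) OVER (PARTITION BY x)
            sum_t += ordered.count(x)      # COUNT(*) OVER (PARTITION BY x)
    return sum_y / sum_t

print(weighted_median_peso([1, 2, 2, 3, 3, 3]))  # 2.6
```

On the sample data the middle rows have x = 2 (group sum 4, count 2) and x = 3 (group sum 9, count 3), so the result is 13 / 5 = 2.6.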

//Peso

Published Friday, September 18, 2009 10:02 AM by Peso

Comments

 

Adam Machanic said:

Welcome, Peter!!

September 18, 2009 9:42 AM
 

Ralph Wilson said:

Peter,

If I remember my statistics correctly, the following definition from Wikipedia is correct:

In probability theory and statistics, a median is described as the number separating the higher half of a sample, a population, or a probability distribution, from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, the median is not unique, so one often takes the mean of the two middle values.

In the case of the data you provided as your example, your SQL that you said "works all the time" actually returns a value of 2.5 . . . which, technically, is _not_ the Median.  The Median, in this case, is _2_ values: 3 and 4.

If you order the list by the value in question and find the ROUND((N/2), 0) and the ROUND(((N + 1.0)/2.0), 0) entries, then you will consistently get the Median.  In the case of an even number of entries, ROUND((N/2), 0) and ROUND(((N + 1.0)/2.0), 0) will provide 2 distinct answers, e.g. for 6 entries you get 3 and 4.  In the case of an odd number of entries, it will provide only 1 distinct answer, e.g. for 7 entries you get 4 and 4, or a _distinct_ answer of 4.

Not having seen Joe C.'s SQL, I can only offer my first try at producing the desired results and comment that, like Joe, I, too, would like for it to be more elegant. ;-)

SELECT X,
       RowNbr
FROM   (
        SELECT F.X,
               ROW_NUMBER() OVER (ORDER BY F.X) AS RowNbr,
               MinRow,
               MaxRow
        FROM   FOO F,
               (
                SELECT ROUND(((COUNT(*) + 1) / 2.0), 0) AS MinRow,
                       ROUND((((COUNT(*) + 1) + 1) / 2.0), 0) AS MaxRow
                FROM   FOO
               ) Y
       ) Y
WHERE  RowNbr BETWEEN MinRow AND MaxRow

September 22, 2009 3:10 PM
 

Peso said:

The same wikipedia article about MEDIAN also states

" For example, if a < b < c, then the median of the list {a, b, c} is b, and if a < b < c < d, then the median of the list {a, b, c, d} is the mean of b and c, i.e. it is (b + c)/2. "

September 22, 2009 3:24 PM
 

George said:

After days of searching, this blog post is the closest thing I have seen to the perfect answer to my question. Those queries are indeed the most elegant median calculations I have seen thus far.

I am currently working on a query which would need to perform weighted median calculations on a series of financial characteristics grouped by characteristic name. So for example, if the table contains values for P/E, Market Cap, Price/Book, etc. with many values for each category, how could I find the weighted median of each category in one swoop?

I'd also like to find the most elegant way to do this over 2 different tables and insert into two separate columns. (ex. Wtd Median of Market Cap from Portfolio Table and Wtd Median of Market Cap from Benchmark Table inserted as Portfolio and Benchmark columns in a temp table respectively.)

I'm not really expecting an answer on a thread that has been dead for so long, but I have my fingers crossed.

Thanks guys.

August 16, 2010 10:28 AM
 

Peso said:

Yes, of course. Add the type to the query like this:

DECLARE @Foo TABLE
        (
            x SMALLMONEY NOT NULL,
            typ CHAR(1) NOT NULL
        )

INSERT  @Foo
VALUES  (1, 'a'),
        (2, 'a'),
        (2, 'a'),
        (3, 'a'),
        (3, 'a'),
        (3, 'a'),
        (1, 'b'),
        (2, 'b'),
        (3, 'b'),
        (4, 'b')

SELECT      typ,
            SUM(y) / SUM(t)
FROM        (
                SELECT  typ,
                        SUM(x) OVER (PARTITION BY typ, x) AS y,
                        2 * ROW_NUMBER() OVER (PARTITION BY typ ORDER BY x) - COUNT(*) OVER (PARTITION BY typ) AS z,
                        COUNT(*) OVER (PARTITION BY typ, x) AS t
                FROM    @Foo
            ) AS d
WHERE       z BETWEEN 0 AND 2
GROUP BY    typ
ORDER BY    typ
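As a quick check of the grouped version, here is a Python sketch (my own addition, mirroring the PARTITION BY typ logic) that computes the weighted median independently for each group:

```python
from collections import defaultdict

def weighted_median(values):
    """Weighted median of one group: middle rows by y = 2*rn - n,
    then SUM(group sums) / SUM(group counts)."""
    ordered = sorted(values)
    n = len(ordered)
    sy = st = 0
    for rn, x in enumerate(ordered, start=1):
        if 0 <= 2 * rn - n <= 2:
            sy += x * ordered.count(x)
            st += ordered.count(x)
    return sy / st

rows = [(1, 'a'), (2, 'a'), (2, 'a'), (3, 'a'), (3, 'a'), (3, 'a'),
        (1, 'b'), (2, 'b'), (3, 'b'), (4, 'b')]
groups = defaultdict(list)
for x, typ in rows:
    groups[typ].append(x)

result = {typ: weighted_median(xs) for typ, xs in sorted(groups.items())}
print(result)  # {'a': 2.6, 'b': 2.5}
```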

August 16, 2010 10:35 AM
 

George said:

That worked. I can't thank you enough. You sir have just saved me hours of agony and, not to mention, inspired me to get working on sharpening my SQL skills. Thank you again for your help.

August 16, 2010 1:19 PM
 

Jeremy said:

Is there a way to do this if your values are in one column and their count is in another column?

For example

1000, 1

1500, 1

2000, 4

2500, 1

I need to get back 2000 not 1750.

January 15, 2011 5:24 PM
 

Peso said:

So you want the median, and not the weighted median?

January 16, 2011 5:14 AM
 

Peso said:

Here is one way. It is nowhere near optimal, but it works.

DECLARE @Sample TABLE
        (
            Value INT,
            Items INT
        )

INSERT  @Sample
VALUES  (1000, 1),
        (1500, 1),
        (2000, 4),
        (2500, 1)

SELECT      AVG(s.Value)
FROM        @Sample AS s
OUTER APPLY (
                SELECT  SUM(x.Items)
                FROM    @Sample AS x
                WHERE   x.Value < s.Value
            ) AS f(Items)
CROSS APPLY (
                SELECT  SUM(x.Items)
                FROM    @Sample AS x
                WHERE   x.Value <= s.Value
            ) AS t(Items)
INNER JOIN  (
                SELECT  FLOOR((SUM(Items) + 1) / 2E) AS Low,
                        CEILING((SUM(Items) + 1) / 2E) AS High
                FROM    @Sample
            ) AS d
        ON  d.Low BETWEEN COALESCE(f.Items, 0) + 1 AND t.Items
        OR  d.High BETWEEN COALESCE(f.Items, 0) + 1 AND t.Items
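To sanity-check the result on Jeremy's data, here is the same positional logic sketched in Python (the function name is mine): each (value, count) row covers a range of positions in the ordered expansion, and we keep the value(s) whose range contains the Low or High middle position.

```python
import math

def median_from_counts(pairs):
    """Median of pre-aggregated (value, count) rows, mirroring the
    FLOOR/CEILING positions over cumulative counts."""
    total = sum(c for _, c in pairs)
    low = math.floor((total + 1) / 2)
    high = math.ceil((total + 1) / 2)
    picked = []
    cum = 0
    for value, count in sorted(pairs):
        lo_pos, hi_pos = cum + 1, cum + count   # positions this row covers
        if lo_pos <= low <= hi_pos or lo_pos <= high <= hi_pos:
            picked.append(value)
        cum += count
    return sum(picked) / len(picked)

print(median_from_counts([(1000, 1), (1500, 1), (2000, 4), (2500, 1)]))  # 2000.0
```

With 7 items in total, both middle positions are 4, which falls inside the range covered by 2000, so the answer is 2000 rather than 1750.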

January 16, 2011 5:26 AM
 

Jeremy said:

Thank you very much Peso! If I wanted to expand the @Sample table and include two additional columns, city and transid, and restrict the query to where city = '1' and do a join to table @transactions where @sample.transid = @transactions.transid... is that possible?

I tried adding something similar to this to your code but I'm not able to get it to run.

For example, I changed

FROM @Sample AS s

to

FROM @Sample s LEFT JOIN @transactions ON s.transid = @transactions.transid
WHERE @transactions.region = '9' AND s.city = '1'

January 16, 2011 1:29 PM
 

Jacob said:

I am working on a project where we need to see fourth quintile or 80th percentile of a population.  I found a small enhancement to Peso's Median which facilitates this.

-- Percentile by Jacob
DECLARE @percentile FLOAT
SET @percentile = 80

SELECT  AVG(x)
FROM    (
            SELECT  x,
                    ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () * @percentile / 100 AS y
            FROM    @Foo
        ) AS d
WHERE   y BETWEEN 0 AND 1
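Jacob's variant generalizes the median filter: y = row_number - count * percentile/100, keeping rows with 0 <= y <= 1, so percentile = 50 reduces to the original median query. A Python sketch of the same arithmetic (function name is mine):

```python
def percentile_peso(values, percentile):
    """Percentile via y = row_number - count * percentile/100, keep 0 <= y <= 1."""
    ordered = sorted(values)
    n = len(ordered)
    picked = [x for rn, x in enumerate(ordered, start=1)
              if 0 <= rn - n * percentile / 100 <= 1]
    return sum(picked) / len(picked)

print(percentile_peso([1, 2, 2, 3, 3, 3], 80))  # 3.0
print(percentile_peso([1, 2, 2, 3, 3, 3], 50))  # 2.5 (matches the median query)
```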

May 17, 2011 2:21 PM
 

mbs said:

Peso, can you please help me? I have the following code to get the weighted median by mbr_cnt. It executes and gives me numbers; however, I did some manual checks and they are off. Thank you so much in advance!

-mbs

DECLARE @CA_SA_RPT_BASE_PA TABLE
        (
            REPORTING_YEAR INT,
            REPORTING_MONTH INT,
            CDHP_PRIM VARCHAR(6),
            CDHP_SCND VARCHAR(6),
            CDHP_THRD VARCHAR(6),
            SIC_DESC_L1 VARCHAR(2000),
            SIC_DESC_L2 VARCHAR(2000),
            SIC_DESC_L3 VARCHAR(2000),
            SIC_DESC_L4 VARCHAR(2000),
            DIVISION VARCHAR(25),
            ACCT_NAME VARCHAR(50),
            BNFT_AGRMT VARCHAR(4),
            MARKET_SEGMENT_L1 VARCHAR(50),
            MARKET_SEGMENT_L2 VARCHAR(50),
            FUNDING_TYPE VARCHAR(20),
            FUNDING_TYPE_DETAIL VARCHAR(20),
            PRODUCER VARCHAR(100),
            BROKER VARCHAR(100),
            CLUSTER VARCHAR(100),
            AE VARCHAR(100),
            SUB_CNT INT,
            MBR_CNT INT,
            IND_IN_DED_AMT DECIMAL(15,2),
            IND_OUT_DED_AMT DECIMAL(15,2),
            FMY_IN_DED_AMT DECIMAL(15,2),
            FMY_OUT_DED_AMT DECIMAL(15,2),
            COINS_IN_RT DECIMAL(5,2),
            COINS_OUT_RT DECIMAL(5,2),
            IND_OPX_IN_AMT DECIMAL(15,2),
            IND_OPX_OUT_AMT DECIMAL(15,2),
            FMY_OPX_IN_AMT DECIMAL(15,2),
            FMY_OPX_OUT_AMT DECIMAL(15,2)
        )

INSERT INTO @CA_SA_RPT_BASE_PA
SELECT TOP 1000 * FROM CA_SA_RPT_BASE_PA

/*
SELECT  *
FROM    @CA_SA_RPT_BASE_PA
WHERE   REPORTING_YEAR = 2009
        AND REPORTING_MONTH = 2
        AND DIVISION = 'OKLAHOMA'
*/

SELECT      REPORTING_YEAR,
            REPORTING_MONTH,
            DIVISION,
            'DIVISION' AS RPT_TYPE,
            'MBR_CNT' AS WEIGHTED_BY,
            'Median' AS MEASURE,
            AVG(s.IND_IN_DED_AMT) AS IND_IN_DED,
            AVG(s.IND_OUT_DED_AMT) AS IND_OON_DED,
            AVG(s.FMY_IN_DED_AMT) AS FAM_IN_DED,
            AVG(s.FMY_OUT_DED_AMT) AS FAM_OON_DED,
            AVG(s.COINS_IN_RT) AS IN_COINS,
            AVG(s.COINS_OUT_RT) AS OON_COINS,
            AVG(s.IND_OPX_IN_AMT) AS IND_IN_OPX,
            AVG(s.IND_OPX_OUT_AMT) AS IND_OON_OPX,
            AVG(s.FMY_OPX_IN_AMT) AS FAM_IN_OPX,
            AVG(s.FMY_OPX_OUT_AMT) AS FAM_OON_OPX
FROM        @CA_SA_RPT_BASE_PA AS s
OUTER APPLY (
                SELECT  SUM(x.mbr_cnt)
                FROM    @CA_SA_RPT_BASE_PA AS x
                WHERE   x.IND_IN_DED_AMT < s.IND_IN_DED_AMT
            ) AS f(mbr_cnt)
CROSS APPLY (
                SELECT  SUM(x.mbr_cnt)
                FROM    @CA_SA_RPT_BASE_PA AS x
                WHERE   x.IND_IN_DED_AMT <= s.IND_IN_DED_AMT
            ) AS t(mbr_cnt)
INNER JOIN  (
                SELECT  FLOOR((SUM(mbr_cnt) + 1) / 2E) AS Low,
                        CEILING((SUM(mbr_cnt) + 1) / 2E) AS High
                FROM    @CA_SA_RPT_BASE_PA
            ) AS d
        ON  d.Low BETWEEN COALESCE(f.mbr_cnt, 0) + 1 AND t.mbr_cnt
        OR  d.High BETWEEN COALESCE(f.mbr_cnt, 0) + 1 AND t.mbr_cnt
GROUP BY    REPORTING_YEAR,
            REPORTING_MONTH,
            DIVISION
May 15, 2012 11:35 AM
 

aman rastogi said:

I didn't understand any of it; it all went right over my head.

December 18, 2012 11:53 AM
