
Rob Farley

- Owner/Principal with LobsterPot Solutions (an MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer, and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

Medians pre-SQL 2012

Hi! - Great that you've found this page, but it's no longer here! You can find the content over at: http://blogs.lobsterpot.com.au/2015/01/27/medians-pre-sql-2012/

Published Tuesday, January 27, 2015 2:59 PM by Rob Farley

Comments

 

Joe Celko said:

The median was a hot topic in SQL in the early 1990s. There were two newsstand magazines for databases, DBMS and DATABASE PROGRAMMING & DESIGN. They started at two separate publishers but, thanks to mergers and buyouts, found themselves under CMP and were later merged into Intelligent Enterprise. I wrote columns in both magazines. Chris Date got my old position when I moved to DBMS. He has collected his old columns into a series of books from Addison-Wesley.

Chris would write a column, then I would write a reply, much like the famous Fred Allen & Jack Benny “feud” on early radio (http://www.otrcat.com/jack-benny-and-fred-allen-feud-p-48618.html). We wanted to sell both magazines!

Chris and I both finished a column with an SQL puzzle or problem. The median became the hot topic in 1992 and 1993, after the SQL-92 standard came out. Date proposed two different solutions for the median. His first solution was based on the fact that if you duplicate every row in a table, the median stays the same. The duplication guarantees that you always work with a table that has an even number of rows. The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions. Here is a corrected version of his first solution, with his famous Parts table.

CREATE VIEW Temp1
AS SELECT weight FROM Parts
   UNION ALL
   SELECT weight FROM Parts;

CREATE VIEW Temp2
AS SELECT weight
   FROM Temp1
   WHERE (SELECT COUNT(*) FROM Parts)
         <= (SELECT COUNT(*)
             FROM Temp1 AS T1
             WHERE T1.weight >= Temp1.weight)
     AND (SELECT COUNT(*) FROM Parts)
         <= (SELECT COUNT(*)
             FROM Temp1 AS T2
             WHERE T2.weight <= Temp1.weight);

SELECT AVG(DISTINCT weight) AS median
FROM Temp2;

This involves the construction of a doubled table of values, which can be expensive in terms of both time and storage space. You will also find that this requires a good implementation of the SQL-92 standard that allows you to use a correlated scalar subquery in place of a scalar value expression. Most SQL implementations are not yet that sophisticated.

The use of AVG(DISTINCT x) is important because leaving it out would return the simple average instead of the median. Consider the set of weights (12, 17, 17, 14, 12, 19). The doubled table, Temp1, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19). But because of the duplicated values, Temp2 becomes (14, 14, 17, 17, 17, 17), not just (14, 17). The simple average is (96 / 6.0) = 16; it should be (31 / 2.0) = 15.5 instead.
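
If you want to check that arithmetic yourself, a minimal Parts table is enough. This setup is my own sketch, not Date's actual table (his has more columns), and the pno key values are invented:

-- a minimal Parts table for testing; pno values are invented
CREATE TABLE Parts
(pno INTEGER NOT NULL PRIMARY KEY,
 weight REAL NOT NULL);

INSERT INTO Parts VALUES (1, 12);
INSERT INTO Parts VALUES (2, 17);
INSERT INTO Parts VALUES (3, 17);
INSERT INTO Parts VALUES (4, 14);
INSERT INTO Parts VALUES (5, 12);
INSERT INTO Parts VALUES (6, 19);

-- sorted: (12, 12, 14, 17, 17, 19); the median is (14 + 17) / 2.0 = 15.5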

My first solution involved a slight modification of Date's solution to avoid the use of a doubled table, but it depends on an SQL-92 implementation that has a CEILING() function, or on a particular vendor's rounding and truncation functions.

SELECT MIN(weight)  -- smallest value in upper half
FROM Parts
WHERE weight IN (SELECT P1.weight
                 FROM Parts AS P1, Parts AS P2
                 WHERE P2.weight >= P1.weight
                 GROUP BY P1.weight
                 HAVING COUNT(*) <=
                        (SELECT CEILING(COUNT(*) / 2.0)
                         FROM Parts))
UNION
SELECT MAX(weight)  -- largest value in lower half
FROM Parts
WHERE weight IN (SELECT P1.weight
                 FROM Parts AS P1, Parts AS P2
                 WHERE P2.weight <= P1.weight
                 GROUP BY P1.weight
                 HAVING COUNT(*) <=
                        (SELECT CEILING(COUNT(*) / 2.0)
                         FROM Parts));

or using the same idea and a CASE expression:

SELECT AVG(weight) AS median
FROM (SELECT MAX(weight)
      FROM Parts AS B1
      WHERE (SELECT COUNT(*) + 1
             FROM Parts
             WHERE weight < B1.weight)
            <= (SELECT CEILING(COUNT(*) / 2.0)
                FROM Parts)
      UNION ALL
      SELECT MAX(weight)
      FROM Parts AS B2
      WHERE (SELECT COUNT(*) + 1
             FROM Parts
             WHERE weight < B2.weight)
            <= CASE (SELECT MOD(COUNT(*), 2)
                     FROM Parts)
               WHEN 0
               THEN (SELECT CEILING(COUNT(*) / 2.0) + 1
                     FROM Parts)
               ELSE (SELECT CEILING(COUNT(*) / 2.0)
                     FROM Parts)
               END) AS Medians(weight);

At this point, we got more and more answers. And the code got more and more complicated. If you want to look at it, it is in Chapter 23 of the Third Edition of SQL FOR SMARTIES. I would not use any of them today, but I have a discussion and a detailed explanation of how I improved my code.

Date's second solution in 1995 was based on Celko's median, folded into one query where I had used VIEWs for readability (CTEs did not exist yet).

SELECT AVG(DISTINCT Parts.weight) AS median
FROM Parts
WHERE Parts.weight IN
      (SELECT MIN(weight)
       FROM Parts
       WHERE Parts.weight IN
             (SELECT P2.weight
              FROM Parts AS P1, Parts AS P2
              WHERE P2.weight <= P1.weight
              GROUP BY P2.weight
              HAVING COUNT(*)
                     <= (SELECT CEILING(COUNT(*) / 2.0)
                         FROM Parts))
       UNION
       SELECT MAX(weight)
       FROM Parts
       WHERE Parts.weight IN
             (SELECT P2.weight
              FROM Parts AS P1, Parts AS P2
              WHERE P2.weight >= P1.weight
              GROUP BY P2.weight
              HAVING COUNT(*)
                     <= (SELECT CEILING(COUNT(*) / 2.0)
                         FROM Parts)));

Date mentions that this solution will return a NULL for an empty table and that it assumes there are no NULLs in the column. If there are NULLs, the WHERE clauses should be modified to remove them.
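
For example (my sketch, not Date's code), you could strip the NULLs once, up front, and point the median query at this view instead of at Parts:

-- remove NULLs once, then compute the median over this view
CREATE VIEW NonNullParts (weight)
AS SELECT weight
   FROM Parts
   WHERE weight IS NOT NULL;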

Rory Murchison was a regular contributor to puzzle answers. He came up with the idea of concatenating the key to each value to make sure that every value is seen as a unique entity. Selecting the middle values is then a special case of finding the n-th item in the table.
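
I do not have Murchison's exact code to hand, but the idea looks something like this sketch, assuming Parts has a unique key pno. The pair (weight, pno) is unique even when weights repeat, so every row has an unambiguous position, and the median is the average of the middle position or positions.

-- rank each row by (weight, pno), then average the middle position(s);
-- integer division is assumed in (COUNT(*) + 1) / 2 and (COUNT(*) + 2) / 2
SELECT AVG(P1.weight) AS median
FROM Parts AS P1
WHERE (SELECT COUNT(*)
       FROM Parts AS P2
       WHERE P2.weight < P1.weight
          OR (P2.weight = P1.weight AND P2.pno <= P1.pno))
      IN ((SELECT (COUNT(*) + 1) / 2 FROM Parts),
          (SELECT (COUNT(*) + 2) / 2 FROM Parts));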

My Second Median used a working table with the values and a tally of their occurrences from the original table. This working table should be quite a bit smaller than the original table, and very fast to construct if there is an index on the target column. An awful self-join then gave you the counts of rows before and after each row's value. Finding the middle was easy.

-- construct a table of cumulative tallies
CREATE TABLE Summary
(weight REAL NOT NULL,
 occurs INTEGER NOT NULL,      -- number of occurrences
 pretally INTEGER NOT NULL,    -- cumulative tally before
 posttally INTEGER NOT NULL);  -- cumulative tally through this value
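
The statement that populated Summary did not make it into this comment; a sketch of it, using the same kind of self-join tallies described above, would be:

-- one row per distinct weight: its frequency, the count of rows
-- strictly below it, and the running count up through it
INSERT INTO Summary (weight, occurs, pretally, posttally)
SELECT P1.weight,
       COUNT(*),
       (SELECT COUNT(*) FROM Parts AS P2 WHERE P2.weight < P1.weight),
       (SELECT COUNT(*) FROM Parts AS P2 WHERE P2.weight <= P1.weight)
FROM Parts AS P1
GROUP BY P1.weight;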

SELECT AVG(S3.weight) AS median
FROM Summary AS S3
WHERE (S3.posttally > (SELECT MAX(posttally) / 2.0 FROM Summary)
       AND S3.pretally < (SELECT MAX(posttally) / 2.0 FROM Summary))
   OR S3.pretally = (SELECT MAX(posttally) / 2.0 FROM Summary)
   OR S3.posttally = (SELECT MAX(posttally) / 2.0 FROM Summary);

A simple median technique based on all of these methods was proposed by Philip Vaughan of San Jose, CA. It derives a VIEW with unique weights and number of occurrences and then a VIEW of the middle weights.
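
Vaughan's original is not reproduced here, but the shape of it is something like this sketch (the view names are mine): a weight is "middle" when at least half the rows sit at or below it and at least half sit at or above it.

CREATE VIEW WeightTally (weight, occurs)
AS SELECT weight, COUNT(*)
   FROM Parts
   GROUP BY weight;

CREATE VIEW MiddleWeights (weight)
AS SELECT W1.weight
   FROM WeightTally AS W1
   WHERE (SELECT SUM(W2.occurs)
          FROM WeightTally AS W2
          WHERE W2.weight <= W1.weight)
         >= (SELECT COUNT(*) FROM Parts) / 2.0
     AND (SELECT SUM(W2.occurs)
          FROM WeightTally AS W2
          WHERE W2.weight >= W1.weight)
         >= (SELECT COUNT(*) FROM Parts) / 2.0;

SELECT AVG(weight) AS median
FROM MiddleWeights;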

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the median (SQL Forum 1993, 1994). They define a characteristic function, which they call delta, using the SIGN() function. The delta or characteristic function accepts a Boolean expression as an argument and returns one if it is TRUE and zero if it is FALSE or UNKNOWN. In SQL-92 we have the CASE expression, new to the standard, which can be used to construct the delta function.
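
As a quick illustration, here is delta written with CASE and summed into a tally, using the bin column discussed below:

-- delta(P2.bin <= P1.bin) is 1 when TRUE, 0 when FALSE or UNKNOWN;
-- summing it over the self-join counts the rows at or below each bin
SELECT P1.bin,
       SUM(CASE WHEN P2.bin <= P1.bin THEN 1 ELSE 0 END) AS tally
FROM Parts AS P1, Parts AS P2
GROUP BY P1.bin;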

The authors also distinguish between the statistical median, whose value must be a member of the set, and the financial median, whose value is the average of the middle two members of the set. A statistical median exists when there is an odd number of items in the set. If there is an even number of items, you must decide if you want to use the highest value in the lower half (they call this the left median) or the lowest value in the upper half (they call this the right median).

The left statistical median of a unique column can be found with this query, assuming we have a column called bin that represents the storage location of a part.

SELECT P1.bin
FROM Parts AS P1, Parts AS P2
GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin <= P1.bin) THEN 1 ELSE 0 END)
       = (COUNT(*) / 2.0);

Changing the direction of the theta test in the HAVING clause will let you pick the right statistical median. Getting the other medians involves playing with this self-join query and the CASE expressions.
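
To be concrete, the right statistical median version is the same query with the comparison flipped:

SELECT P1.bin
FROM Parts AS P1, Parts AS P2
GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin >= P1.bin) THEN 1 ELSE 0 END)
       = (COUNT(*) / 2.0);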

The statistical median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap.
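
One way to make that adjustment, as a sketch of my own rather than the published answer: instead of demanding an exact n/2 tally on one side, require at least half the rows at or below the candidate and at least half at or above it, which tolerates ties straddling the middle. Wrap the result in MIN() or MAX() to pick the left or right median.

-- returns the one or two distinct middle values, even with duplicates
SELECT P1.bin
FROM Parts AS P1, Parts AS P2
GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin <= P1.bin) THEN 1 ELSE 0 END)
       >= (COUNT(*) / 2.0)
   AND SUM(CASE WHEN (P2.bin >= P1.bin) THEN 1 ELSE 0 END)
       >= (COUNT(*) / 2.0);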

My Third Median

Another approach made easier with SQL-92 involves looking at a picture of a line of sorted values and seeing where the median would fall. Every value in column weight partitions the table into three sections: the values that are less than it, equal to it, or greater than it. We can get a profile of each value with a tabular subquery expression.

Now the question is how to define a median in terms of the partitions. Clearly, the definition of a median means that if (lesser = greater) then weight is the median.

SELECT AVG(DISTINCT weight)
FROM (SELECT P1.pno, P1.weight,
             SUM(CASE WHEN P2.weight < P1.weight
                 THEN 1 ELSE 0 END),
             SUM(CASE WHEN P2.weight = P1.weight
                 THEN 1 ELSE 0 END),
             SUM(CASE WHEN P2.weight > P1.weight
                 THEN 1 ELSE 0 END)
      FROM Parts AS P1, Parts AS P2
      GROUP BY P1.pno, P1.weight)
     AS Partitions (pno, weight, lesser, equal, greater)
WHERE lesser = greater
   OR (lesser <= (SELECT COUNT(*) FROM Parts) / 2.0
       AND greater <= (SELECT COUNT(*) FROM Parts) / 2.0);

It is also worth noting that you can use either AVG(DISTINCT i) or AVG(i) in the SELECT clause. AVG(DISTINCT i) will return the usual median when there are two middle values. This happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3), which has (2, 3) as its distinct middle values, giving 2.5 for the median. AVG(i) will return the weighted median instead, which also factors in the number of times each of the two middle values occurs. The same table would have (2, 2, 3, 3, 3) in the middle, which gives 2.6 for the weighted median. The weighted median is a more accurate description of the data.
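
You can see the two averages side by side with a toy derived table (my own quick check, not part of the original puzzle):

-- the middle of (1, 2, 2, 3, 3, 3) is (2, 2, 3, 3, 3) weighted, (2, 3) distinct
SELECT AVG(weight) AS weighted_median,         -- 13.0 / 5 = 2.6
       AVG(DISTINCT weight) AS plain_median    -- 5.0 / 2 = 2.5
FROM (SELECT 2.0 UNION ALL SELECT 2.0
      UNION ALL SELECT 3.0 UNION ALL SELECT 3.0
      UNION ALL SELECT 3.0) AS Middle(weight);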

I sent this attempt to Richard Romley, who made it quite a bit simpler; let me take you through the steps so you can see the reasoning.

Look at the WHERE clause. It could use some algebra, and since it deals only with aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
      FROM Parts AS P1, Parts AS P2
      GROUP BY P1.pno, P1.weight
      HAVING SUM(CASE WHEN P2.weight = P1.weight
                 THEN 1 ELSE 0 END)
             >= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
                             WHEN P2.weight > P1.weight THEN -1
                             ELSE 0 END)))
     AS Partitions;

If you prefer to use functions instead of a CASE expression, then use this version of the query:

SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
      FROM Parts AS P1, Parts AS P2
      GROUP BY P1.pno, P1.weight
      HAVING SUM(ABS(1 - SIGN(P1.weight - P2.weight)))
             >= ABS(SUM(SIGN(P1.weight - P2.weight))))
     AS Partitions;

February 1, 2015 2:48 PM
 

Rob Farley said:

Thank you Joe - terrific fun to read this.

February 1, 2015 3:08 PM
 

TheSQLGuru said:

Joe, please run your versions against the others using Aaron Bertrand's 1M row test harness and report back on their performance.

February 9, 2015 10:31 AM
