SQLblog.com - The SQL Server Blog Spot on the Web

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Who's On First? Solving the Top per Group Problem (Part 1: Technique)

Relative comparison is a simple matter of human nature. From early childhood we compare and contrast what we see in the world around us, building a means by which to rate what we experience. And as it turns out, this desire to discover top and bottom, rightmost and leftmost, or best and worst happens to extend quite naturally into business scenarios. Which product is the top seller? How about the one that's simply not moving off the shelves? Which of our customers has placed the most expensive order? What are the most recent orders placed at each of our outlets?

In the world of common business questions, the edge cases are generally of most interest. What's in the middle is unimportant; it's often too difficult for the mind to compare and comprehend when there are hundreds, thousands, or even millions of items, transactions, or facts that are all within a similar range. Instead, we focus on those that stick out in some extraordinary way.

Those of us who work with SQL products on a regular basis face this same problem time and again as we work through various business requirements. Over time, I have noticed four basic query patterns that can be used to solve it; each is logically equivalent to the others (within certain restrictions -- more on that later), but they can have surprisingly different performance characteristics depending on the data being queried. In this first post, I will outline the available patterns/methods. In the following posts, I will show the results of testing each pattern against a variety of scenarios, in an attempt to discover where and when each should be used.

The four basic patterns are outlined below. Each of the methods is illustrated using a query to show all customers' names, plus their most recent order date, and the amount of that order. I've included notes that indicate where logic differences can arise among the various methods.

 

Method 1: Join to full group and use correlated subquery with a MIN/MAX aggregate to filter

In this method we use an inner join to get all required columns, then filter the resultant set using a correlated subquery in the WHERE clause.

SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE
    o.OrderDate =
    (
        SELECT MAX(o1.OrderDate)
        FROM Orders o1
        WHERE
            o1.CustomerId = o.CustomerId
    )

Logic notes: With this method ties are automatically included in the output, unless a tiebreaker is specified (which can be tricky given that you only have one column to work with). This method does not allow you to pull back an arbitrary number of rows, such as top 10 per customer; you are limited to the edge and any ties that might exist.
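For illustration, one way to work a tiebreaker into Method 1 -- a sketch only, using the same Customers/Orders schema as above and arbitrarily choosing OrderAmount as the tiebreaking column -- is a second correlated MAX, itself correlated on the date:

```sql
SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE
    o.OrderDate =
    (
        SELECT MAX(o1.OrderDate)
        FROM Orders o1
        WHERE
            o1.CustomerId = o.CustomerId
    )
    -- Tiebreaker: among orders sharing the max date, keep the largest amount
    AND o.OrderAmount =
    (
        SELECT MAX(o2.OrderAmount)
        FROM Orders o2
        WHERE
            o2.CustomerId = o.CustomerId
            AND o2.OrderDate = o.OrderDate
    )
```

Note that even this still returns multiple rows if two of a customer's orders share both the date and the amount -- which is exactly why tiebreaking with this method gets tricky.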

 

Method 1a: Join to full group and use correlated subquery with TOP(n) and ORDER BY to filter

This method is almost identical to Method 1 (which is why it is classified here as 1a), but the TOP and ORDER BY allow for a bit more flexibility than the aggregates.

SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE
    o.OrderDate =
    (
        SELECT TOP(1)
            o1.OrderDate
        FROM Orders o1
        WHERE
            o1.CustomerId = o.CustomerId
        ORDER BY
            o1.OrderDate DESC
    )

Logic notes: With this method you can more easily integrate a tiebreaker than with Method 1; the comparison column can be anything, including a primary key, and you can still order on whatever column makes most sense. In addition, you can take more rows than with Method 1 by using IN instead of = in the WHERE clause, and increasing the argument value to TOP.
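As a sketch of that variation -- assuming Orders has a primary key column named OrderId, which does not appear in the queries above -- a "three most recent orders per customer" version might look like:

```sql
SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE
    -- IN plus a primary key comparison column allows more than one row per group
    o.OrderId IN
    (
        SELECT TOP(3)
            o1.OrderId
        FROM Orders o1
        WHERE
            o1.CustomerId = o.CustomerId
        ORDER BY
            o1.OrderDate DESC
    )
```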

 

Method 2: CROSS APPLY to ordered TOP(n)

In this method, SQL Server 2005's CROSS APPLY operator is used. This operator allows us to essentially create a table-valued correlated subquery -- something that was impossible in previous versions of SQL Server. By using TOP in conjunction with ORDER BY, we can get as many rows per group as needed.

SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
CROSS APPLY
(
    SELECT TOP(1)
        o.OrderDate,
        o.OrderAmount
    FROM Orders o
    WHERE
        o.CustomerId = c.CustomerId
    ORDER BY
        o.OrderDate DESC
) x

Logic notes: From a logic point of view, this method is almost identical to Method 1a modified to use IN on a primary key column. With both methods, WITH TIES can be added to the TOP clause in order to get ties.
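For example, a sketch of the tie-including variant, using the same schema as above:

```sql
SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
CROSS APPLY
(
    -- WITH TIES returns every row that matches the last qualifying ORDER BY value
    SELECT TOP(1) WITH TIES
        o.OrderDate,
        o.OrderAmount
    FROM Orders o
    WHERE
        o.CustomerId = c.CustomerId
    ORDER BY
        o.OrderDate DESC
) x
```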

 

Method 3: Join to derived table that uses a partitioned, ordered windowing function, and filter in the outer query based on the row number

In this method a derived table or CTE is used, in conjunction with a windowing function partitioned based on the required grain of the final query. So for the "most recent order per customer" query, the row number is partitioned based on the customer. This gives us a count starting at 1 for each customer, which can be filtered in the outer query.

SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
INNER JOIN
(
    SELECT
        o.OrderDate,
        o.OrderAmount,
        o.CustomerId,
        ROW_NUMBER() OVER
        (
            PARTITION BY o.CustomerId
            ORDER BY o.OrderDate DESC
        ) AS r
    FROM Orders o
) x ON
    x.CustomerId = c.CustomerId
    AND x.r = 1

Logic notes: If ties are important, use DENSE_RANK instead of ROW_NUMBER. ROW_NUMBER is good for an arbitrary TOP(n), similar to Method 2. Unlike the previously described methods, this method in conjunction with DENSE_RANK can return an arbitrary TOP(n) rows, all of which can include ties. So if you would like to see the three most recent order dates, and each happens to have multiple orders, this method will be able to return them all by simply filtering on x.r <= 3. This would not be directly possible with any of the other methods described here.
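As a sketch, the "three most recent order dates per customer, including ties" variant swaps in DENSE_RANK and loosens the filter:

```sql
SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
INNER JOIN
(
    SELECT
        o.OrderDate,
        o.OrderAmount,
        o.CustomerId,
        -- DENSE_RANK assigns the same rank to orders sharing a date,
        -- with no gaps, so rank <= 3 means "the three most recent dates"
        DENSE_RANK() OVER
        (
            PARTITION BY o.CustomerId
            ORDER BY o.OrderDate DESC
        ) AS r
    FROM Orders o
) x ON
    x.CustomerId = c.CustomerId
    AND x.r <= 3
```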

 

Method 4: "Carry-along sort"

This is the only "tricky" method, and not one that I recommend using, except as a last resort. I'm including it here only for completeness and comparison because it happens to be a very high performance method in some cases. This method involves converting each of the required inner columns into a string, concatenating them, then applying an aggregate to the string as a whole. By putting the "sort" column first, the other data is "carried along" -- thus the name for the method. The concatenated data is then "unpacked" in the outer query. This can be surprisingly efficient from an I/O standpoint, but the resultant code is a maintenance nightmare and it is quite easy to introduce errors. In addition, this method can only return the top 1 per group -- no ties or multiple return items are supported.

SELECT
    c.FirstName,
    c.LastName,
    CONVERT(DATETIME, SUBSTRING(x.OrderInfo, 1, 8)) AS OrderDate,
    CONVERT(MONEY, SUBSTRING(x.OrderInfo, 9, 15)) AS OrderAmount
FROM Customers c
INNER JOIN
(
    SELECT
        o.CustomerId,
        MAX
        (
            CONVERT(CHAR(8), o.OrderDate, 112) +
            CONVERT(CHAR(15), o.OrderAmount)
        ) OrderInfo
    FROM Orders o
    GROUP BY
        o.CustomerId
) x ON
    x.CustomerId = c.CustomerId

 

This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.
 

Published Friday, February 08, 2008 7:09 PM by Adam Machanic


Comments

 

Dave Markle said:

Adam:

I use the top-per-group problem (off of the Northwind database) to help evaluate potential developer candidates.  We give them one simple top-per-group question, BOL, and Query Analyzer, and NO time limit.  You'd be shocked at how FEW candidates can get any one of these solutions.  I'd say 5-10% of the people that make it through the phone interview get past this question.  I'd normally curse you for giving our secret sauce away, but the sad truth is that none of the 90-95% of people that don't get it will be reading this...

February 9, 2008 9:56 PM
 

Gil said:

In the method 3 discussion, did you mean x.r<=3  ?

February 10, 2008 1:21 AM
 

Adam Machanic said:

Gil: Absolutely!  Thanks for pointing that out.

February 10, 2008 1:59 AM
 

Adam Machanic said:

And in retrospect, I think that discussion is slightly incorrect -- you should get the same results using Method 1A, by using IN in the outer query and SELECT DISTINCT TOP(3) in the inner query.  I'll make sure to revisit that in a later part of the series.

February 10, 2008 2:02 AM
 

Adam Machanic said:

David: Having done a fair amount of interviewing myself, I can only say that I feel your pain and agree that this post will not make the slightest bit of difference :-)

February 10, 2008 2:05 AM
 

jerryhung said:

Good post, I have done 1, 1a, 3 myself, same logic, different usages

I really should *think* of using cross join often, but it's new, and so bad (n x m size) that I normally don't think about it

February 11, 2008 2:07 PM
 

steve dassin said:

For all those who would fail this developer test fret not. The whole weight of MS is not to further the 5% who pass but to enlighten the rest. I will be driving home this point to the AZ user group, http://www.azsqlserver.com/. Isn't fair and balanced wonderful :)

February 12, 2008 12:05 AM
 

Christopher Tan said:

It's ok

February 23, 2008 7:45 AM
 

Prabhakaran.S said:

I have read the blog post about the "Carry-along sort" which was written by Adam.

I am looking for more information about this sorting in more detail including performance as well.

Could you provide me some link or any pdf to get to know more about this sorting method

Expecting your response on this is much appreciated.

Regards,

Prab

March 23, 2008 12:53 AM
 

Peter said:

Thanks for the reminder about CROSS and OUTER APPLY.  I was struggling with some really hard to read code where there were a ton of correlated subqueries to pull in values for about 20-30 columns, all against the same data.  No really good way of changing it because it is a view, but the OUTER APPLY let me pull in the appropriate data with little trouble and about 1/2 the cost of the original query.

April 21, 2008 4:14 PM
 

Ben said:

Thanks for this info. Very useful.

But I would like to say I had a MASSIVE improvement in query runtimes when using the "tricky" method: from 25 seconds using Method 1a to under one second using "tricks". Well, ok, perhaps the indexes were not THAT great.

A tip for all trickers: I use bit-math instead of strings. The highest priority goes into the highest bits. Plenty of space for states and such in a BigInt.
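A purely illustrative sketch of the bit-math idea, using the post's Orders schema and assuming OrderAmount fits in 32 bits when expressed as cents:

```sql
SELECT
    x.CustomerId,
    -- Unpack: high 32 bits are days since 1900-01-01, low 32 bits are cents
    DATEADD(DAY, x.Packed / 4294967296, '19000101') AS OrderDate,
    (x.Packed % 4294967296) / 100.0 AS OrderAmount
FROM
(
    SELECT
        o.CustomerId,
        -- Pack: sort key (date) in the high bits, payload (amount) in the low bits
        MAX
        (
            CONVERT(BIGINT, DATEDIFF(DAY, '19000101', o.OrderDate)) * 4294967296
            + CONVERT(BIGINT, o.OrderAmount * 100)
        ) AS Packed
    FROM Orders o
    GROUP BY
        o.CustomerId
) x
```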

All good wishes to you all

Ben

November 10, 2010 8:26 AM
 

Joseph said:

Thank you so much for this!  Great content!

September 6, 2012 4:50 PM
 

Steve said:

I'm a developer, not a DBA, but I've worked with SQL Server for 15 years.  I'd like to know if my solution is correct for this problem.  My head hurt when I tried to understand your solutions so I went back to basics and came up with something that seems simpler to me.  

We have a table of employees where the INACTIVE field can be 1 or 0.  Sometimes an employee moves to a different store and (right or wrong) HR sets the old record to INACTIVE and creates a new active record with a new EMPLOYID (which is the primary key).  I want to find the most recent EMPLOYID that is inactive to update the user's account in ADSI.  (I think that's a TOP 1 of GROUP problem, but I'm not sure)  So I added a PRIOREMPID field to hold the most recent EMPLOYID into my query of active employees.  Here's how I did it and it seems to work:

SELECT A.SOCSCNUM, A.EMPLOYID, B.EMPLOYID AS PRIOREMPID
FROM EMPLOYEES A
LEFT JOIN
(
    SELECT SOCSCNUM, EMPLOYID, MAX(DEMPINAC) AS MAXDATE
    FROM EMPLOYEES
    WHERE INACTIVE = 1
    GROUP BY SOCSCNUM, EMPLOYID
) B ON A.SOCSCNUM = B.SOCSCNUM
WHERE A.INACTIVE = 0

I doubt that I would pass Dave Markle's candidate test, but I think he'd be passing on a good developer.  Anyway, I'm not available right now.

www.stevegaines.us

October 26, 2012 11:32 AM
 

Steve said:

After glancing at your solutions again, I think mine looks a lot like Method 4.  Do you agree?  You say it can only return the top 1 per group, which is exactly what I wanted.  Multiples would be a problem in my case.

Thanks

October 26, 2012 11:44 AM
 

Adam Machanic said:

Hi Steve,

What you're describing does sound like a top 1 per group problem, but your solution doesn't, to me, look correct. It's unclear to me what the purpose is of finding the maximum "DEMPINAC" per social security number and employee ID combination. Also, what if HR marks TWO employee IDs inactive for a given SSN? Then there will be three different employee IDs, and your query will return two rows for that SSN. Is that correct?

Either way, this is not similar to method 4. Probably closest to method 1 -- finding the max value of some correlation column and then using it to identify the "top" row(s). But that's not exactly what your query is doing.

--Adam

October 27, 2012 11:45 AM
 

crokusek said:

This is also a hard/fun problem to optimize when the top(n) values need to be distinct.  

It might be an interesting article to see various ways of dealing with "distinct top(n) P, V" efficiently.

One way around this is to use row_number() on partition P atop a distinct non-top sub query.  From my experience, SQL Server is not able to end the subquery for distinct values early enough (e.g. predicate pushing of row_number()).

March 7, 2013 1:47 PM
 

dakra said:

If you only need one row per group, just do:

SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)

FROM Table1

GROUP BY  ColA

ORDER BY  ColA;

I included the MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE) to show that you can mix and match aggregation functions in one query.

March 12, 2013 12:37 PM
 

Adam Machanic said:

@dakra: That's not the same thing. You risk mixing and matching values from different rows if you do that. That's the whole point of these techniques -- to get exactly one row, with only that row's values. No mix and match allowed.

March 12, 2013 1:54 PM
 

Larry Rosenfeld said:

this was very helpful.  thanks so much!

June 24, 2013 6:14 PM
 

Allan said:

Great job, I've worked endlessly to solve this with queries not as effective. Cheers

December 13, 2013 1:30 PM
 

Brad Stiritz said:

Hi Adam,

Thanks for the great post. I'm sorry, I couldn't find your promised follow-up :

>This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.

If you did actually post that follow-up, would you mind please posting a link here? I'm very curious to see description & analysis of performance figures for your various solutions.

Thanks for your consideration

May 8, 2014 9:18 PM
 

Ariski said:

Where might I find the second part of this nice work, please? I really need to dig into which one has the lowest I/O usage. Big thanks!

May 29, 2014 10:19 PM
 

Maciej said:

I have not tested this, but:

The query returns only customers that have orders, not "all customers' names".

What if many orders have the same date? Method 1 returns all of them; does Method 4 return only the one with the max amount?

July 29, 2014 10:02 AM
 

What about the First in Group when there are two or more criteria (filter) said:

Searching the internet for a solution.

There are two columns in a table that must only contain specific values (using an IN statement)

Now that the filter is applied, only the top 1 for the FK is required.

Looking for a FK with two null columns or a FK with any of the two filtered values.

For now I am using the general format of:

SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)

FROM Table1

GROUP BY  ColA

ORDER BY  ColA;

since I somewhat don't care about specific values being mixed, it is more of a rule check that there is a record with this criteria or there isn't.

When I take your excellent examples and try to add the filters, I just can't get it to work. Any advice or suggestions would be very welcome.

August 29, 2014 9:59 AM


About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.
