I was going to comment on Adam's post "Medians, ROW_NUMBERs, and performance" (which was a follow-up to Aaron's post "Take the high road to middle ground") and I realized that I needed to show some code and alas, comments are not so nice about formatting code, so I am posting a follow-up post...
At the PASS Summit back in November, Itzik and I were examining this issue of the QP doing the table spool when one would think it shouldn't need to and we were both stumped as to why. But further examination revealed some strangeness: Even though the logical reads for Ben-Gan's solution is quite high, the actual query cost is less than Celko's solution. It makes one wonder: Which is the better indicator of better perfromance - actual query cost or statistics such as time and IO? I have seen people use actual query cost as the basis for choosing one solution over another. Is using actual query cost a valid way of deciding which solution to choose? Like I said, it makes one wonder....
For Ben-Gan's solution, I also was dissatisfied with the QP's choice in doing the table spool. And so I decided to create a variation on Ben-Gan's solution that calculated the count for each CustomerID separately. Although the logical reads is higher than Celko's solution, it is only 748 as compared to the 703 of Celko's solution; and better yet, the execution time is [on average] just slightly better and the query cost is always better (37% versus 63%). Put a supporting index in place (CustomerID, TotalDue) and this solution's logical reads drops to 121 (versus 76 for Celko's) but is consistently better in execution time and actual query cost (12% versus 88%)
Here is the query:
;
with c as
(SELECT CustomerId , TotalDue,
Row_Number()
OVER (PARTITION BY CustomerId ORDER BY TotalDue) as rn
FROM Sales.SalesOrderHeader)
, d as
(SELECT CustomerId ,
(COUNT(*)+1) /2 as cnt1,
(COUNT(*)+2) /2 as cnt2
FROM Sales.SalesOrderHeader
GROUP BY CustomerID)
SELECT c.CustomerId, Avg(c.TotalDue) AS Median
FROM C
INNER JOIN d
ON c.CustomerId = d.CustomerId AND c.rn IN (d.cnt1, d.cnt2)
GROUP BY c.CustomerId
I agree with Adam that cost-based optimization is far from perfect. But I don't like the inconsistency in the various results of STATISTICS IO, STATISTICS TIME, and actual query cost. And so what is the best way to decide which query performs better and which will scale better: STATISTICS IO, STATISTICS TIME, actual query cost, a combination of all of these, or perhaps something else (such as a detailed analysis of the actual query plans)?