Originally posted
here.
I keep seeing questions on newsgroups about paging in stored
procedures, and whether there will be a better way in SQL Server 2005.
However, aside from a few answers in newsgroups, I haven't seen any
content on how to do it. So I'd like to spend a few minutes and share
with you the new features that will make paging stored procedures both
easier to build and a lot more performant...
But first, since this is a performance-related blog, let's generate a bunch of big test data!
Since I don't have AdventureWorks installed on my test SQL
Server 2005 server at the moment and am too lazy to track it down, the
data is somewhat ugly... Anyway, start this up and let it run for a
while (takes around 35 minutes on my test box):
SELECT DISTINCT A.Name + B.Name + C.Name AS SomewhatLargeString
INTO #BigTableOfStrings
FROM master..spt_values A,
master..spt_values B,
master..spt_values C
WHERE a.TYPE NOT IN ('P', 'R', 'F', 'F_U')
AND b.TYPE NOT IN ('P', 'R', 'F', 'F_U')
CREATE CLUSTERED INDEX CI_LargeString ON #BigTableOfStrings(SomewhatLargeString)
Coffee (or preferably, beer) time! See you in 35 minutes.
... All set?
Okay, now let's pretend we're in a web app with a SQL Server
2000 backend and we want the second page of data... Rows 11-20, ordered
by SomewhatLarge. How might we do this in SQL Server 2000..? Here's one
way:
SELECT A.SomewhatLargeString
FROM #BigTableOfStrings A
JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
GROUP BY A.SomewhatLargeString
HAVING COUNT(*) BETWEEN 11 AND 20
... Still waiting for that to return? Keep waiting. Maybe get
another coffee, or go home for the night. On my system, that query has
quite possibly the biggest estimated cost I've ever seen, a staggering
1,523,110,700. Yeah, that sucks. So let's change it around a bit:
SELECT x.SomewhatLargeString
FROM (
SELECT TOP 20 A.SomewhatLargeString, COUNT(*) AS TheCount
FROM #BigTableOfStrings A
JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
GROUP BY A.SomewhatLargeString
ORDER BY A.SomewhatLargeString ) x
WHERE x.TheCount BETWEEN 11 AND 20
Much nicer than before, with an estimated cost on my system of 23.1,
and a virtually instant return time. But wait, we have an over-zealous
user who wants rows 20,001 - 20,010!
SELECT x.SomewhatLargeString
FROM (
SELECT TOP 20010 A.SomewhatLargeString, COUNT(*) AS TheCount
FROM #BigTableOfStrings A
JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
GROUP BY A.SomewhatLargeString
ORDER BY A.SomewhatLargeString ) x
WHERE x.TheCount BETWEEN 20001 AND 20010
Oops, cost skyrocketed to 50516, with a return time of 1:30... Can
you feel your users abandoning your sinking ship of an app and heading
towards the competitors? Probably not, since they won't actually click
through 20,000 rows, but it makes a really good contrived example, so
let's roll with it!
So how to solve this problem? In SQL Server 2000, the answer
is, find some other paging mechanism, probably using a middle tier. But
in SQL Server 2005, we have new and better toys to play with. Allow me
to introduce your new paging best friend, the ROW_NUMBER() function.
For those readers who are slow on the uptake, this function does
exactly what its name implies; it creates a surrogate row number for
each row in a result set. So now, instead of the very painfully
inefficient COUNT(*) methods, we can let SQL Server do all the work as
it builds the result set we actually want...
SELECT x.SomewhatLargeString
FROM (
SELECT TOP 20010 A.SomewhatLargeString, ROW_NUMBER() OVER(ORDER BY A.SomewhatLargeString) AS TheCount
FROM #BigTableOfStrings A
ORDER BY A.SomewhatLargeString) x
WHERE x.TheCount BETWEEN 20000 AND 20010
I guess we can call that a tiny improvement. Total estimated cost: 0.202. That's only about a 25 MILLION PERCENT difference.
So why is ROW_NUMBER so much more efficient? It's a combination
of the COUNT(*) method itself being inefficient and the query optimizer
probably not handling it as well as it should. The COUNT(*) method
requires an ordered-forward clustered index scan of the table, followed
by correlation of each of the top 20010 rows that we asked for to all
of the rows less than or equal to that row in the same table, for the
count. Alas, the optimizer chooses a nested loop for that, which causes
the cost to shoot up as the operation is repeated over and over.
The ROW_NUMBER method, on the other hand, requires only the
scan of the TOP 20010 rows, followed by computation of the row number
itself, which is nothing more than a scalar calculation. Then, just
filter the rows. Simple, easy on the optimizer, easy for the system,
and good for your customers. Definitely a great feature that I'm
looking forward to using!