THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Running sums, redux

Siddhartha Gautama, the Buddha, taught us to understand that the key to enlightenment is following the Middle Path.  And today I learned a valuable lesson in extremes.  You can file this one in the "Doh!  Wrong again!" category...

A fairly common question on SQL Server forums is, "how can I get the running sum of the data in this column?"  Being the fan of set-based queries that I am, I always answer the exact same way.  I show the person asking the question how to do a self-join on the grouped column, getting all of the "previous" values to create a running sum.  The following example shows how you might do this against the AdventureWorks Production.TransactionHistory table:

SELECT
    TH1.TransactionID,
    TH1.ActualCost,
    SUM(TH2.ActualCost) AS RunningTotal
FROM Production.TransactionHistory TH1
JOIN Production.TransactionHistory TH2 ON TH2.TransactionID <= TH1.TransactionID
GROUP BY TH1.TransactionID, TH1.ActualCost
ORDER BY TH1.TransactionID

Pretty simple query.  For each row of the "TH1" alias, every row with a lesser-or-equal TransactionID will be summed.  Thereby creating a running total for every row of the table.  Note, I've used the IDENTITY column instead of the date column.  I'd generally suggest not doing so because, e.g., you might need to insert some post-dated rows at some point and relying on the IDENTITY for a time sequence will thereby not work.  But in this case it's a lazy solution because the TransactionDate column isn't indexed, and it's also not unique.  I want to test a lot of rows (TransactionHistory has around 113,000), but I don't want to skew the test by forcing a table scan on every iteration!

But I digress.  The point is, I've given this answer more than a few times and, well, I'd like to apologize.  Just now I went ahead and ran this query on my powerful test server--err, my laptop. 

As you might guess, since I'm performance-minded I also happen to be extremely impatient--so I went ahead and killed the query at the five-minute mark.  SSMS's result grid showed the first 26,568 rows, so obviously there was a long way to go to hit that 113,000 mark.  And with an estimated cost of 38,086 for the query, I can't say I'm surprised.

A few moments of head scratching and the following re-write was issued:

SELECT
    TH1.TransactionID,
    TH1.ActualCost,
    (
        SELECT SUM(TH2.ActualCost)
        FROM Production.TransactionHistory TH2
        WHERE TH2.TransactionID <= TH1.TransactionID
    ) AS RunningTotal
FROM Production.TransactionHistory TH1
ORDER BY TH1.TransactionID

With an estimated cost of only 6,630, I had high hopes for this one.  Alas, once again I was forced to cancel the query at the five-minute mark.  27,683 rows.  Not much better, I'm afraid.  And, as an aside, I'm starting to wonder about these estimated costs.  But that's another post for another day.

So where am I going with all of this?  Well, there's a reason I haven't given any indication up until this point in the post.  You see, it's utterly painful to write this, but...

In this case, a cursor is faster.

Yes, I said it.  That evil construct which we as database developers despise, the cursor.  Thanks to Paul Nielsen, who revealed this ugly fact to me in a conversation today, I was forced to test this for myself (hoping to prove him wrong, of course).  Which is why I started playing around with the solution that I've given so many times on forums.  Unfortunately, he is correct.

My next test query, using the first cursor I've written in several years:

DECLARE RunningTotalCursor
CURSOR LOCAL FAST_FORWARD FOR
    SELECT TransactionID, ActualCost
    FROM Production.TransactionHistory
    ORDER BY TransactionID

OPEN RunningTotalCursor

DECLARE @TransactionID INT
DECLARE @ActualCost MONEY

DECLARE @RunningTotal MONEY
SET @RunningTotal = 0

DECLARE @Results TABLE
(
    TransactionID INT NOT NULL PRIMARY KEY,
    ActualCost MONEY,
    RunningTotal MONEY
)

FETCH NEXT FROM RunningTotalCursor
INTO @TransactionID, @ActualCost

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RunningTotal = @RunningTotal + @ActualCost

    INSERT @Results
    VALUES (@TransactionID, @ActualCost, @RunningTotal)

    FETCH NEXT FROM RunningTotalCursor
    INTO @TransactionID, @ActualCost
END

CLOSE RunningTotalCursor

DEALLOCATE RunningTotalCursor

SELECT *
FROM @Results
ORDER BY TransactionID

What's really unfortunate about the cursor approach is that you need to use a temporary table if you want to return a single result set to the client. I figured the additional I/O due to the temp table would balance any improvement gains from the cursor approach, thereby rendering my forum responses correct, and Paul wrong.  Well, 14 seconds and 113,443 rows later, SSMS and my laptop declared Paul the undisputed Champion of the Cursor.

This cursor makes a lot of sense in this case.  The set-based query works by looping over each row of the table, taking a sum of every previous row.  So for the 10th row, 10 previous rows also need to be visited.  For the 1000th row, 1000 previous rows need to be visited.  And so on.  The larger the set gets, the worse performance will be--and that's not going to be a merely linear decrease in performance.  Think about this:  Using the set-based method to find the running sum over a set of 100 rows, 5050 total rows need to be visited.  For a set of 200 rows, the query processor needs to visit 20100 total rows -- a four-fold increase in the amount of work that must be done to satisfy the query (O((N^2)/2), for those who are a bit more algorithmically minded.)

The cursor, on the other hand, needs to visit each row exactly once (O(N)). By maintaining the running count in a variable, there is no need to re-visit previous rows.  And as my laptop was so happy to show me, the I/O cost due to the temp table does not overshadow the performance improvement of having to visit so many less rows.

So what have we learned today?  In my set-based singlemindedness I failed to realize that the cursor does, indeed, have utility.  Everything in moderation.

Next steps?  I get the feeling that this can be made even faster by employing a CLR routine.  Pull the data into a DataReader and loop over that instead, which will completely eliminate the need for a temporary table.  Watch for that experiment coming to this space soon.

And next time you hear someone mention how horrible cursors are, remind that person that there is a time and place for everything (and it's called college).


Published Wednesday, July 12, 2006 10:51 PM by Adam Machanic
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Dennis Allen said:

There is still overhead in the cursor that you can escape.  On way to perform this running total style calculation in a "set" oriented way that avoid the heavy recursion of the first two examples is...

SET NOCOUNT ON

DECLARE @hist TABLE ( TransactionID INT PRIMARY KEY CLUSTERED, ActualCost money, RunningTotal money )

DECLARE @total money; SET @total = 0.0

DECLARE @start datetime, @stop datetime

SET @start = GETDATE()

INSERT INTO @hist ( TransactionID, ActualCost, RunningTotal )

      SELECT TransactionID, ActualCost, 0.0

      FROM Production.TransactionHistory

      ORDER BY TransactionID

UPDATE @hist

      SET @total = RunningTotal = ActualCost + @total

SET @stop = GETDATE()

PRINT 'Time to calculate running total is ' + CONVERT(VARCHAR(10), DATEDIFF(ms, @start, @stop)) + 'ms'

SELECT TransactionID, ActualCost, RunningTotal

FROM @hist

February 21, 2007 2:24 PM
 

Dennis Allen said:

Sorry about the double post here, but I found it very interesting to turn on IO statistics for both so you can see the main differences.  I took some liberties in moving around declarations in the original cursor code to better accommodate timing the calculation of the running total.

If you run these, you will immediately notice the difference...

/***** ORIGINAL *****/

SET NOCOUNT ON

SET STATISTICS IO ON

DECLARE RunningTotalCursor

CURSOR LOCAL FAST_FORWARD FOR

   SELECT TransactionID, ActualCost

   FROM Production.TransactionHistory

   ORDER BY TransactionID

DECLARE @TransactionID INT

DECLARE @ActualCost MONEY

DECLARE @RunningTotal MONEY

SET @RunningTotal = 0

DECLARE @Results TABLE

(

   TransactionID INT NOT NULL PRIMARY KEY,

   ActualCost MONEY,

   RunningTotal MONEY

)

DECLARE @start datetime, @stop datetime

SET @start = getdate()

OPEN RunningTotalCursor

FETCH NEXT FROM RunningTotalCursor

INTO @TransactionID, @ActualCost

WHILE @@FETCH_STATUS = 0

BEGIN

   SET @RunningTotal = @RunningTotal + @ActualCost

   INSERT @Results

   VALUES (@TransactionID, @ActualCost, @RunningTotal)

   FETCH NEXT FROM RunningTotalCursor

   INTO @TransactionID, @ActualCost

END

CLOSE RunningTotalCursor

DEALLOCATE RunningTotalCursor

SET @stop = getdate()

PRINT 'Time to calculate running total is ' + convert(varchar(10), datediff(ms, @start, @stop)) + 'ms'

SELECT *

FROM @Results

ORDER BY TransactionID

/***** ALTERNATE *****/

SET NOCOUNT ON

SET STATISTICS IO ON

DECLARE @hist table ( TransactionID int primary key clustered, ActualCost money, RunningTotal money )

DECLARE @total money; SET @total = 0.0

DECLARE @start datetime, @stop datetime

SET @start = getdate()

INSERT INTO @hist ( TransactionID, ActualCost, RunningTotal )

SELECT TransactionID, ActualCost, 0.0

FROM Production.TransactionHistory

ORDER BY TransactionID

UPDATE @hist

SET @total = RunningTotal = ActualCost + @total

SET @stop = getdate()

PRINT 'Time to calculate running total is ' + convert(varchar(10), datediff(ms, @start, @stop)) + 'ms'

SELECT TransactionID, ActualCost, RunningTotal

FROM @hist

February 21, 2007 3:11 PM
 

Adam Machanic said:

I found Linchi's recent post on use of cursors in the TPC-E test to be quite interesting. The question

October 13, 2007 10:06 AM
 

GRM said:

I have tried to come up with a solution that didn't require a cursor to calculate a running total,the following solution ran in 57 secs on my machine and does not use a cursor.

I have used the production.TransactionHistory table in the AdventureWorks database.

Declare @iCounter int

Declare @i int

Declare @mintrans int, @TransId int, @Count1 int

SET NOCOUNT ON;

Set @i = 1

Select @iCounter = Count(*) From production.TransactionHistory

Select @mintrans = Min(transactionId) From production.TransactionHistory

While @i <= @iCounter

Begin

Select @transId = TransactionId From production.TransactionHistory  where Transactionid = @mintrans

Update th1

Set th1.RT = ISNULL(th1.ActualCost + th2.RT, th1.ActualCost)

From production.TransactionHistory th1

Join production.TransactionHistory th2 on th2.TransactionId = th1.TransactionId - 1

Where th1.TransactionId = @transId

Set @mintrans = @mintrans + 1

Set @i = @i+1

End

Select th1.TransactionId, th1.ActualCost, ISNULL(th1.ActualCost + th2.RT, th1.ActualCost) AS RT

From production.TransactionHistory th1

Join production.TransactionHistory th2 on th2.TransactionId = th1.TransactionId - 1

There are improvements that can be made to this - but this was my first attempt at trying to do this without a cursor, even though a look is used to perform the update statement, which is the first stage of this task.

November 27, 2007 11:45 AM
 

GRM said:

Forgot to mention, that you will have to add a new column to the table, i have called it RT and it's type is Money.

November 27, 2007 6:16 PM
 

Adam Machanic said:

Back again! Fourth post for the month of February, making this my best posting month in, well, months.

January 25, 2009 3:01 PM
 

Shivani said:

Hi Adam,

What about doing the Running Sum calculations in reporting application?  Is it better idea?

I used SSRS with follwoing query

SELECT SalesOrderID,SalesOrderDetailID,

OrderQty

FROM  AdventureWorks.Sales.SalesOrderDetail

and the RunningValue function - =RunningValue(Fields!OrderQty.Value,Sum,Nothing)

Just looking at it I see RunningValue function performing better than if I write T-SQL code.

What's your thought?

March 20, 2009 8:55 AM
 

Adam Machanic said:

Shivani: Absolutely!  The application is 100%, without a doubt, the right place to do this kind of work.  But sometimes we need to make exceptions even to 100% without a doubt rules :-)

March 20, 2009 10:18 AM
 

Mike said:

I know this is a really old article but if you ever want examples of where window functions in 2012 really clean up it's this sort of thing.

SELECT

   TH1.TransactionID,

   TH1.ActualCost,

   TotalAmount = SUM(TH1.ActualCost) OVER (ORDER BY TH1.TransactionID ROWS UNBOUNDED PRECEDING)

FROM Production.TransactionHistory TH1

ORDER BY TH1.TransactionID

Runs in under 1 second.

August 29, 2012 8:39 AM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement