THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul Nielsen

www.SQLServerBible.com

Cumulative Totals Screencast

Prompted by a daily "SQL Tip" email a few days ago that recommended the correlated subquery to solve the cumulative total problem, this 5 minute screencast (5:02) demonstrates that the cursor solution indeed out-scales the set-based correlated subquery for this particular problem.
Published Thursday, December 06, 2007 6:49 PM by Paul Nielsen

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Knyazev Alexey said:

CURSOR must die  :)))

--00:02 Execution Time

SET NOCOUNT ON

GO

DECLARE @ComulativeTotal MONEY

SET @ComulativeTotal=0

UPDATE Sales.SalesOrderHeader

SET @ComulativeTotal=ComulativeTotal=@ComulativeTotal+ISNULL(TotalDue, 0)

December 7, 2007 12:34 AM
 

DW said:

Greate solution, Knyazev!

December 7, 2007 2:30 AM
 

Steve Morris said:

But how do you enforce any sort of ordering ?

A cumulative total normally needs to follow certain ordering rules to be useful.

December 7, 2007 3:46 AM
 

Knyazev Alexey said:

OK!!!

--00:03 Execution Time

SET NOCOUNT ON

GO

CREATE TABLE #T (ID INT IDENTITY, SalesOrderID INT, TotalDue MONEY, ComulativeTotal MONEY)

INSERT INTO #T

SELECT SalesOrderID, TotalDue, ComulativeTotal

FROM Sales.SalesOrderHeader

ORDER BY SalesOrderID

DECLARE @ComulativeTotal MONEY

SET @ComulativeTotal=0;

UPDATE #T

SET @ComulativeTotal=ComulativeTotal=@ComulativeTotal+ISNULL(TotalDue, 0)

UPDATE t1

SET t1.ComulativeTotal=t2.ComulativeTotal

FROM Sales.SalesOrderHeader t1 INNER JOIN #T t2

ON t1.SalesOrderID=t2.SalesOrderID

DROP TABLE #T

December 7, 2007 6:02 AM
 

Paul Nielsen said:

Very cool use of the Multiple Assignment Variable, Knyazev!

To compare apples to apples, on my machine the whole script runs in just under 2 seconds and the update to #T alone runs in under 1 second. The result is identical to the cursor so in this case there is no ordering problem and it produces the correct result.

I'm still cautious of the fact that with the multiple assignment variable method there's no absolute proof of the ordering – you have to trust that the query plan to pull the data using the clustered index order. If I was implementing cumulative totals at a bank, I’d use the cursor to be certain. But that doesn’t take away from your slam dunk on the performance.

December 7, 2007 8:54 AM
 

Adam Machanic said:

December 7, 2007 10:24 AM
 

Kalman Toth said:

Wow Knyazev!

Set-based programming at its greatest!

December 7, 2007 10:58 AM
 

david wei said:

I love the Multiple Assignment set solution, Knyazev.

You actually don't need use #temp table to enforce the ordering.

Your original script has guranteed the ordering (by storage engine), because the query plan starts with an ordered clustered index scan.

UPDATE Sales.SalesOrderHeader

SET @ComulativeTotal=ComulativeTotal=@ComulativeTotal+ISNULL(TotalDue, 0)

The first operator shows:

                          |--Clustered Index Scan(OBJECT:([AdventureWorks].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]), ORDERED FORWARD)

December 11, 2007 3:31 AM
 

Adam Machanic said:

David: Can you guarantee that the storage engine will always use an ordered scan?

December 11, 2007 9:22 AM
 

david wei said:

Yes, because of the potential risk of page split issue on clustered index table , storage engine will always use ordered scan (even sometimes the paln does NOT show the ordered!). The only exception is you use Read UNCOMMITTED isolation.

Itzik presented this at devTeach Vancouver conference, I remember you were in that room too. :-)

SQL412 What's Between Index Internals, Isolation Levels and Data Consistency

December 11, 2007 12:37 PM
 

Paul Nielsen said:

Hi David Wei, I disagree that "original script has guranteed the ordering (by storage engine)". Instead I'd say that the current behavior of the engine uses a clustered index and produces the correct order, but the T-SQL command does not declare the order so there is no guarantee of the same behavior in the future. It’s a question of whether you want to trust in the continued behavior of the engine to produce the desired result, or specify in code the desired result. Because the row order is paramount to a cumulative total, you can’t risk missing the correct order sometime in the future.

I love, too, Multiple Assignment Variables, and I’ve written about them before. But in this case, until Microsoft allows a syntax that can specify the order by for Multiple Assignment Variables, the safest T-SQL solution is the cursor, or Adam’s CLR cursor. It makes a great demo, but it’s not production safe.

December 11, 2007 5:29 PM
 

Adam Machanic said:

David,

I think you might be confusing this with something else Itzik said.  I didn't remember hearing that, so I checked in with Itzik this afternoon and asked.  He confirmed that there is definitely no guarantee of ordering without on ORDER BY clause.

December 11, 2007 5:47 PM
 

david wei said:

Ok, I agree that "if you don't use order by, it up to storage engine to decide how to access the data. So ordered scan on clustered index table is not guaranteed by any supported document."

Based on above discussions, Knyazev’s 2nd script still not guarantees the ordering when update #t (because no order by)

To force an ordered scan, we need join the same table again, the MERGE join hint requires the ordered scan on both input.

(Without the hint, query optimizer also choose merge join, but we explicit use the hint to GUARANTEE the ordering)

DECLARE @CumulativeTotal MONEY

SET @CumulativeTotal=0

UPDATE s1

SET @CumulativeTotal=s1.CumulativeTotal=@CumulativeTotal+ISNULL(s1.TotalDue, 0)

from Sales.SalesOrderHeader s1

inner merge join Sales.SalesOrderHeader s2 on s1.SalesOrderID = s2.SalesOrderID

December 12, 2007 1:48 AM
 

Kevin Boles said:

So what happens to this query when it gets parallelized?  Now you really don't have any control over the splits of the load, right?  And if a run is spread across more than one thread then you are hosed it would seem.  I would think that the MERGE join trick is also succeptable to this issue.  

Bottom line is to remember (as I understand it):  in set-based processing there is NO guaranteed ordering of sets except on output after an order by clause.

December 14, 2007 10:38 AM
 

Kalman Toth said:

I redid Paul Nielsen's original WHILE loop.

It runs in 2 sec.  Much faster than the 80 sec correlated subquery!

However, it is not a clean query anymore since update is used.

======================================

DBCC dropcleanbuffers

SET NoCount ON

DECLARE

 @SalesOrderID INT,

 @TotalDue MONEY,

 @CumuTot MONEY

SELECT SalesOrderID, OrderDate, TotalDue, CumuTot = convert(money,0.0)

into #Accum

FROM Sales.SalesOrderHeader

ORDER BY OrderDate, SalesOrderID

create index #idxAccum on #Accum( SalesOrderID)

SET @CumuTot = 0

 DECLARE curRunTot CURSOR FAST_FORWARD

   FOR

     SELECT SalesOrderID, TotalDue

       FROM #Accum

       ORDER BY OrderDate, SalesOrderID

 OPEN curRunTot

 FETCH curRunTot INTO @SalesOrderID, @TotalDue  

 WHILE @@Fetch_Status = 0

   BEGIN

     SET @CumuTot = @CumuTot + @TotalDue

     UPDATE #Accum

       SET CumuTot = @CumuTot

       WHERE SalesOrderID = @SalesOrderID    

     FETCH curRunTot INTO @SalesOrderID, @TotalDue  

   END

 CLOSE curRunTot

  DEALLOCATE curRunTot

SELECT SalesOrderID, TotalDue, CumuTot

 FROM #Accum

 ORDER BY OrderDate, SalesOrderID

December 14, 2007 12:23 PM
 

david wei said:

Kevin: (1st) update is never executed using parallelism, but if you want to GUARANTEE the "Guarantee" , add OPTION (MAXDOP 1) as you wish.

(2nd) don't confued by the name "SET", set-based is still processed internally ROW by ROW. each iterator calls it's child getRow method repeatedly until it returned all rows. The Merge join read sorted 2 inputs, compare row by row. If two rows are matched, it output it and move to next; if not match, it discards the lower one, and move to next. The output ordering of the merge join is reserved. (for ex: when your query uses order by on the merge key, you will never see sql server use a sort iterator after the merge, that's the reason). The next iterator, comput scalar also processes ROW by ROW for each row output from the MERGE. This is perfect safe for the ordering.

Kalman: your script is an improvement by moving only few columns into temp table, thus, reduced the read, but this still can not compete with the beautiful SET solution. (BTW, you still need update back the original table by join the #accum).

Also, we have been talking a lot about the dfault behaior of order scan on clustered index, I think it worth to past a link to Paul Randal's blog : When can allocation order scans be used? (Though it is not officially documented :)

http://blogs.msdn.com/sqlserverstorageengine/archive/2006/11/09/when-can-allocation-order-scans-be-used.aspx

December 14, 2007 1:52 PM
 

Paul Nielsen said:

Re: Multiple Assignment Variable for Cumulative Totals

SQL is a declarative query and data manipulation language. As such, the whole point of SQL is that the query describes the question and the Query Engine is free to solve that question however it deems best. Anytime you code depending on the behavior of the engine rather than completely describing the question, you're asking for trouble. Even if it's blazingly fast, cool, w00t to the max, and a demo that blows everyone away, it's not good SQL.

December 14, 2007 2:12 PM
 

david wei said:

We are working on SQL server database, not just T-SQL language. There are situations we need the advanced hint to guide the query engine to boost performance. Personally, I don't agree that "CURSOR must die", but in this particular case, I vote the set solution,especially when you are dealing with large tables :-)

Cheers.

December 14, 2007 3:50 PM
 

Kevin Boles said:

I am aware that set-based processing is actually row-by-row when it gets down to the nitty gritty inside the engine.  :-)  My comment about parallelization was directed at the numerous examples of this technique I have seen on the web that are SELECTs vice UPDATEs.  Report queries often have a need for this.  Does anyone know definitively if my supposition is correct - that parallelization of a query can also lead to "out-of-order" processing and thus incorrect results in a query such as the one in this thread?

December 15, 2007 11:39 AM

Leave a Comment

(required) 
(required) 
Submit

About Paul Nielsen

Paul Nielsen believes SQL is the romance language of data. As such he’s a hands-on database developer, Microsoft SQL Server MVP, trainer, and author of SQL Server Bible series (Wiley). As a data architect, he developed the concepts of Smart Database Design and Nordic – an open source O/R dbms for SQL Server. He lives in Colorado Springs.

This Blog

Syndication

News

news item test
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement