THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Don’t turn the dataflow into a cursor [SSIS]

I saw a thread on the SSIS forum today that went something like this:


I have the following dataset:

AccountNo Date DailyMovement
00000001 28/08/2010 10
00000001 29/08/2010 5
00000001 30/08/2010 7
00000002 28/08/2010 8
00000002 29/08/2010 6

for which I want to compute a running total per [AccountNo] & [Date] like so:

AccountNo Date DailyMovement RunningTotal
00000001 28/08/2010 10 10
00000001 29/08/2010 5 15
00000001 30/08/2010 7 22
00000002 28/08/2010 8 8
00000002 29/08/2010 6 14

How do I do that using a script component?


That last comment:

How do I do that using a script component?

is a fairly common question. People assume that if a calculated value is dependant on prior rows then a script component needs to be involved because that running total needs to be tracked somewhere, but that isn’t necessarily the case. Ask yourself, how would you do this if the data were residing in a database table, would you do this?:

SELECT '00000001' AS [AccountNo], CONVERT(date,'20080828') AS [Date],10 AS [DailyMovement]
INTO  
#balances
UNION 
ALL
SELECT '00000001' AS [AccountNo], CONVERT(date,'20080829') AS [Date],5 AS
[DailyMovement]
UNION 
ALL
SELECT '00000001' AS [AccountNo], CONVERT(date,'20080830') AS [Date],7 AS
[DailyMovement]
UNION 
ALL
SELECT '00000002' AS [AccountNo], CONVERT(date,'20080828') AS [Date],8 AS
[DailyMovement]
UNION 
ALL
SELECT '00000002' AS [AccountNo], CONVERT(date,'20080829') AS [Date],6 AS
[DailyMovement]

SELECT
[AccountNo]
,     
[Date]
,     
[DailyMovement]
,      (   SELECT  SUM(DailyMovement
)
          
FROM    #balances
b2
          
WHERE   b1.[AccountNo] =
b2.[AccountNo]
          
AND     b1.[Date] >= b2.[Date]
)
FROM   #balances b1;

 

OK that works, and here’s a screenshot to prove it:

image

But really, would any of you actually do this? Hopefully not, the use of the correlated subquery has simply turned what should be a nice set operation into a cursor-in-disguise (something I have talked about before) because that subquery will be getting executed for every row in the table. Instead you could run the following on that same dataset:

SELECT b1.[AccountNo]
,     
b1.[Date]
,     
b1.[DailyMovement]
,      SUM(b2.[DailyMovement]) AS
[RunningTotal]
FROM   #balances
b1
INNER  JOIN #balances b2 ON  b1.[AccountNo] =
b2.[AccountNo]
                          
AND b1.[Date] >=
b2.[Date]
GROUP  BY b1.[AccountNo],b1.[Date],b1.[DailyMovement];

and you get the same result but with a much more amenable execution plan (execute with SET STATISTICS IO ON if you don’t believe me)!

The same principle applies in a SSIS dataflow. Often there is no need to resort to a script component, the same result can be achieved using some smart dataflow design. You can use, for example, the same approach as in the second SQL query (above) using a combination of SSIS’ Sort, Aggregate, Merge Join and Conditional Split components. That is not to say that one will be quicker than the other but at least you’ll have something that is more intuitive and arguably more maintainable. Of course if performance gain is your primary goal then the correct advice is, as always, “test and measure, test and measure, test and measure”!!!

Ask yourself “If I could, how would I solve this using T-SQL?” and see if that influences your dataflow design at all. Invariably script components should be the last choice, not the first!

@jamiet

Published Tuesday, August 31, 2010 11:22 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

SAinCA said:

Do excuse this seemingly dumb question, but I thought that SQL Server doesn't guarantee the sequence of rows returned from an operation. Does the GROUP BY in your 2nd query act as an ORDER BY so that the RunningTotal is ALWAYS the sum of current and preceding GROUP BY key-data in ascending sequence?  Thanks in advance.

August 31, 2010 6:20 PM
 

jamiet said:

SAinCA,

You are correct that SQL Server never guarantees the order.

In this case, no, it is not the GROUP BY that guarantees the RunningTotal is always the sum of current and preceding rows, that particular action is guaranteed by the join predicate:

"AND b1.[Date] >= b2.[Date]"

Hope that makes sense.

-Jamie

September 1, 2010 4:07 AM
 

Paul White said:

Wouldn't it be nice if SSIS included a specialized Running Totals aggregate?

BTW for T-SQL running totals I use the Quirky Update method (http://www.sqlservercentral.com/articles/T-SQL/68467/) or Hugo Kornelis' innovative set-based iteration method (MVP Deep Dives Chapter 4).  They're both much, much more efficient than a triangular join.

Thanks for a great read.

Paul

September 1, 2010 9:18 AM
 

JR said:

I have found that using ELT (Extract, Load, Transform) can work better than writing scripts for SSIS. With ELT you load the data into the database and just write T-SQL to transform the data as needed. This is usually easier and faster. This is a good example of this.

September 7, 2010 12:54 PM
 

SSIS Junkie said:

Introduction The Dataflow task is one of the core components (if not the core component) of SQL Server

January 13, 2011 5:04 AM
 

Stefan_G said:

Hi

I apologize for being blunt, but this article is just silly. Your other articles indicate that you usually know what you are talking about so this article is a bit surprising.

Your two SQL statements both generate what is known as a triangular join. Both are extremely slow if you are generating a running sum over a large number of rows - say 10.000 rows or so.

Generating a running sum with a script is EXTREMELY efficient. If the data is already sorted, all you need is a single pass over the data.

I would also really like to see your hinted SSIS implementation with "Sort, Aggregate, Merge Join and Conditional Split components"

Such a solution would probably be at least a thousand times slower than the script solution and also a lot more complex and error prone.

Calculating a running sum is a perfect example of where a script transform should be used.

/SG

March 14, 2011 6:43 PM
 

jamiet said:

Stefan,

I don't really take kindly to my blogs being referred to as silly but you raise a good point and so I do owe you a reply.

The explicit aim of the blog post was to make the point that there are often options available when trying to accomplish something using SSIS - many people, for whatever reason, don't like to resort to scripting and hence I point out that there are alternate options.

I didn't make any claims about performance and in fact explicitly stated that if high performance *was* one's goal then one should test to find the most performant method.

Thanks for giving me the chance to reiterate and for triggering a discussion.

Regards

Jamie

March 14, 2011 6:59 PM
 

Stefan_G said:

Thanks for the reply!

I apologize again for calling the article silly.

When I read the article it gives the impression that you are saying that using a script for a running total will "turn the dataflow into a cursor", but that it can be avoided by using a clever dataflow instead.

From the article it sounds like "turning the dataflow into a cursor" will cause a performance problem.

So, the general impression of the article is that you are saying that you should not use a script to calculate a running total, but you should use a clever dataflow instead.

The truth is that using a script to calculate a running total is probably the fastest possible method. Using a dataflow or a triangular join in SQL can be a thousand times slower.

By the way, in my tests, your second SQL query is a LOT slower than the first.

I use this script to generate some test data:

if object_id('tempdb..#balances') is not null drop table #balances

create table #balances (AccountNo varchar(100), [Date] date, [DailyMovement] decimal(20,4))

declare @i int=0

set nocount on

begin tran

while @i<10000 begin

insert into #balances (AccountNo, [date], [DailyMovement]) (select '000000001', dateadd(day, rand()*1000, '20110101'), rand()*10)

set @i=@i+1

end

commit

March 15, 2011 5:08 AM
 

jamiet said:

"the article gives the impression that you are saying that using a script for a running total will "turn the dataflow into a cursor", "

Yes, re-reading it I can see why one might think that - maybe I need to rephrase. The article should say something along the lines of

"using a script for a running total *might* turn the dataflow into a cursor"

"By the way, in my tests, your second SQL query is a LOT slower than the first."

Good to know, thanks. And on your dataset I see the same. Therein lies an important point - one must test and measure in order to get the optimum result. More accurately, you must test and measure with a representative dataset.

Thanks again for the comments.

-Jamie

March 15, 2011 7:21 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement