I saw a thread on the SSIS forum today that went something like this:
I have the following dataset:
| AccountNo | Date | DailyMovement |
| 00000001 | 28/08/2010 | 10 |
| 00000001 | 29/08/2010 | 5 |
| 00000001 | 30/08/2010 | 7 |
| 00000002 | 28/08/2010 | 8 |
| 00000002 | 29/08/2010 | 6 |
for which I want to compute a running total per [AccountNo] & [Date] like so:
| AccountNo | Date | DailyMovement | RunningTotal |
| 00000001 | 28/08/2010 | 10 | 10 |
| 00000001 | 29/08/2010 | 5 | 15 |
| 00000001 | 30/08/2010 | 7 | 22 |
| 00000002 | 28/08/2010 | 8 | 8 |
| 00000002 | 29/08/2010 | 6 | 14 |
How do I do that using a script component?
That last comment:
How do I do that using a script component?
is a fairly common question. People assume that if a calculated value is dependant on prior rows then a script component needs to be involved because that running total needs to be tracked somewhere, but that isn’t necessarily the case. Ask yourself, how would you do this if the data were residing in a database table, would you do this?:
SELECT '00000001' AS [AccountNo], CONVERT(date,'20080828') AS [Date],10 AS [DailyMovement]
INTO #balances
UNION ALL
SELECT '00000001' AS [AccountNo], CONVERT(date,'20080829') AS [Date],5 AS [DailyMovement]
UNION ALL
SELECT '00000001' AS [AccountNo], CONVERT(date,'20080830') AS [Date],7 AS [DailyMovement]
UNION ALL
SELECT '00000002' AS [AccountNo], CONVERT(date,'20080828') AS [Date],8 AS [DailyMovement]
UNION ALL
SELECT '00000002' AS [AccountNo], CONVERT(date,'20080829') AS [Date],6 AS [DailyMovement]
SELECT [AccountNo]
, [Date]
, [DailyMovement]
, ( SELECT SUM(DailyMovement)
FROM #balances b2
WHERE b1.[AccountNo] = b2.[AccountNo]
AND b1.[Date] >= b2.[Date])
FROM #balances b1;
OK that works, and here’s a screenshot to prove it:

But really, would any of you actually do this? Hopefully not, the use of the correlated subquery has simply turned what should be a nice set operation into a cursor-in-disguise (something I have talked about before) because that subquery will be getting executed for every row in the table. Instead you could run the following on that same dataset:
SELECT b1.[AccountNo]
, b1.[Date]
, b1.[DailyMovement]
, SUM(b2.[DailyMovement]) AS [RunningTotal]
FROM #balances b1
INNER JOIN #balances b2 ON b1.[AccountNo] = b2.[AccountNo]
AND b1.[Date] >= b2.[Date]
GROUP BY b1.[AccountNo],b1.[Date],b1.[DailyMovement];
and you get the same result but with a much more amenable execution plan (execute with SET STATISTICS IO ON if you don’t believe me)!
The same principle applies in a SSIS dataflow. Often there is no need to resort to a script component, the same result can be achieved using some smart dataflow design. You can use, for example, the same approach as in the second SQL query (above) using a combination of SSIS’ Sort, Aggregate, Merge Join and Conditional Split components. That is not to say that one will be quicker than the other but at least you’ll have something that is more intuitive and arguably more maintainable. Of course if performance gain is your primary goal then the correct advice is, as always, “test and measure, test and measure, test and measure”!!!
Ask yourself “If I could, how would I solve this using T-SQL?” and see if that influences your dataflow design at all. Invariably script components should be the last choice, not the first!
@jamiet