THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Andy Leonard

SSIS and ETL
Thoughts about Database and Software Development, and the tools of the trade.

SSIS Design Pattern - Incremental Loads

Introduction
 
Loading data from a data source to SQL Server is a common task. It's used in Data Warehousing, but increasingly data is being staged in SQL Server for non-Business-Intelligence purposes.
 
Maintaining data integrity is key when loading data into any database. A common way of accomplishing this is to truncate the destination and reload from the source. While this method ensures data integrity, it also loads a lot of data that was just deleted.
 
Incremental loads are a faster and use less server resources. Only new or updated data is touched in an incremental load.
 
When To Use Incremental Loads
 
Use incremental loads whenever you need to load data from a data source to SQL Server.
 
Incremental loads are the same regardless of which database platform or ETL tool you use. You need to detect new and updated rows - and separate these from the unchanged rows.
 
Incremental Loads in Transact-SQL
 
I will start by demonstrating this with T-SQL:
 
0. (Optional, but recommended) Create two databases: a source and destination database for this demonstration:
 

CREATE DATABASE [SSISIncrementalLoad_Source]

CREATE DATABASE [SSISIncrementalLoad_Dest]

1. Create a source named tblSource with the columns ColID, ColA, ColB, and ColC; make ColID is a primary unique key:
 
USE SSISIncrementalLoad_Source
GO
CREATE TABLE dbo.tblSource
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL constraint df_ColB default (getDate())
,ColC int NULL
,constraint PK_tblSource primary key clustered (ColID))
 
2. Create a Destination table named tblDest with the columns ColID, ColA, ColB, ColC:
 
USE SSISIncrementalLoad_Dest
GO
CREATE TABLE dbo.tblDest
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL
,ColC int NULL)
 
3. Let's load some test data into both tables for demonstration purposes:
 
USE SSISIncrementalLoad_Source
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)

4. You can view new rows with the following query:

SELECT
s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID
WHERE d.ColID IS NULL

This should return the "new" row - the one loaded earlier with ColID = 2 and ColA = 'N'. Why? The LEFT JOIN and WHERE clauses are the key. Left Joins return all rows on the left side of the join clause (SSISIncrementalLoad_Source.dbo.tblSource in this case) whether there's a match on the right side of the join clause (SSISIncrementalLoad_Dest.dbo.tblDest in this case) or not. If there is no match on the right side, NULLs are returned. This is why the WHERE clause works: it goes after rows where the destination ColID is NULL. These rows have no match in the LEFT JOIN, therefore they must be new.

This is only an example. You occasionally find database schemas that are this easy to load. Occasionally. Most of the time you have to include several columns in the JOIN ON clause to isolate truly new rows. Sometimes you have to add conditions in the WHERE clause to refine the definition of truly new rows.

Incrementally load the row ("rows" in practice) with the following T-SQL statement:

INSERT INTO SSISIncrementalLoad_Dest.dbo.tblDest
(ColID, ColA, ColB, ColC)
SELECT s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID
WHERE d.ColID IS NULL

5. There are many ways by which people try to isolate changed rows. The only sure-fire way to accomplish it is to compare each field. View changed rows with the following T-SQL statement:

SELECT d.ColID, d.ColA, d.ColB, d.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE (
(d.ColA != s.ColA)
OR (d.ColB != s.ColB)
OR (d.ColC != s.ColC)
)

This should return the "changed" row we loaded earlier with ColID = 1 and ColA = 'C'. Why? The INNER JOIN and WHERE clauses are to blame - again. The INNER JOIN goes after rows with matching ColID's because of the JOIN ON clause. The WHERE clause refines the resultset, returning only rows where the ColA's, ColB's, or ColC's don't match and the ColID's match. This is important. If there's a difference in any or some or all the rows (except ColID), we want to update it.

Extract-Transform-Load (ETL) theory has a lot to say about when and how to update changed data. You will want to pick up a good book on the topic to learn more about the variations.

To update the data in our destination, use the following T-SQL: 

UPDATE d
SET
d.ColA = s.ColA
,d.ColB = s.ColB
,d.ColC = s.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE (
(d.ColA != s.ColA)
OR (d.ColB != s.ColB)
OR (d.ColC != s.ColC)
)
 
Incremental Loads in SSIS 
 
Let's take a look at how you can accomplish this in SSIS using the Lookup Transformation (for the join functionality) combined with the Conditional Split (for the WHERE clause conditions) transformations.
 
Before we begin, let's reset our database tables to their original state using the following query:

USE SSISIncrementalLoad_Source
GO

TRUNCATE TABLE dbo.tblSource

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

TRUNCATE TABLE dbo.tblDest

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)

Next, create a new project using Business Intelligence Development Studio (BIDS). Name the project SSISIncrementalLoad:

Once the project loads, open Solution Explorer and rename Package1.dtsx to SSISIncrementalLoad.dtsx:

When prompted to rename the package object, click the Yes button. From the toolbox, drag a Data Flow onto the Control Flow canvas:

 

Double-click the Data Flow task to edit it. From the toolbox, drag and drop an OLE DB Source onto the Data Flow canvas: 

 

Double-click the OLE DB Source connection adapter to edit it:

 

Click the New button beside the OLE DB Connection Manager dropdown:

Click the New button here to create a new Data Connection:

Enter or select your server name. Connect to the SSISIncrementalLoad_Source database you created earlier. Click the OK button to return to the Connection Manager configuration dialog. Click the OK button to accept your newly created Data Connection as the Connection Manager you wish to define. Select "dbo.tblSource" from the Table dropdown:

 

Click the OK button to complete defining the OLE DB Source Adapter.

Drag and drop a Lookup Transformation from the toolbox onto the Data Flow canvas. Connect the OLE DB connection adapter to the Lookup transformation by clicking on the OLE DB Source and dragging the green arrow over the Lookup and dropping it. Right-click the Lookup transformation and click Edit (or double-click the Lookup transformation) to edit:

 

When the editor opens, click the New button beside the OLE DB Connection Manager dropdown (as you did earlier for the OLE DB Source Adapter). Define a new Data Connection - this time to the SSISIncrementalLoad_Dest database. After setting up the new Data Connection and Connection Manager, configure the Lookup transformation to connect to "dbo.tblDest":

 

Click the Columns tab. On the left side are the columns currently in the SSIS data flow pipeline (from SSISIncrementalLoad_Source.dbo.tblSource). On the right side are columns available from the Lookup destination you just configured (from SSISIncrementalLoad_Dest.dbo.tblDest). Follow the following steps:

1. We'll need all the rows returned from the destination table, so check all the checkboxes beside the rows in the destination. We need these rows for our WHERE clauses and for our JOIN ON clauses.

2. We do not want to map all the rows between the source and destination - we only want to map the columns named ColID between the database tables. The Mappings drawn between the Available Input Columns and Available Lookup Columns define the JOIN ON clause. Multi-select the Mappings between ColA, ColB, and ColC by clicking on them while holding the Ctrl key. Right-click any of them and click "Delete Selected Mappings" to delete these columns from our JOIN ON clause.

3. Add the text "Dest_" to each column's Output Alias. These rows are being appended to the data flow pipeline. This is so we can distinguish between Source and Destination rows farther down the pipeline:

Next we need to modify our Lookup transformation behavior. By default, the Lookup operates as an INNER JOIN - but we need a LEFT (OUTER) JOIN. Click the "Configure Error Output" button to open the "Configure Error Output" screen. On the "Lookup Output" row, change the Error column from "Fail component" to "Ignore failure". This tells the Lookup transformation "If you don't find an INNER JOIN match in the destination table for the Source table's ColID value, don't fail." - which also effectively tells the Lookup "Don't act like an INNER JOIN, behave like a LEFT JOIN":

Click OK to complete the Lookup transformation configuration.

From the toolbox, drag and drop a Conditional Split Transformation onto the Data Flow canvas. Connect the Lookup to the Conditional Split as shown. Right-click the Conditional Split and click Edit to open the Conditional Split Editor:

 

Expand the NULL Functions folder in the upper right of the Conditional Split Transformation Editor. Expand the Columns folder in the upper left side of the Conditional Split Transformation Editor. Click in the "Output Name" column and enter "New Rows" as the name of the first output. From the NULL Functions folder, drag and drop the "ISNULL( <<expression>> )" function to the Condition column of the New Rows condition:

Next, drag Dest_ColID from the columns folder and drop it onto the "<<expression>>" text in the Condition column. "New Rows" should now be defined by the condition "ISNULL( [Dest_ColID] )". This defines the WHERE clause for new rows - setting it to "WHERE Dest_ColID Is NULL".

Type "Changed Rows" into a second Output Name column. Add the expression "(ColA != Dest_ColA) || (ColB != Dest_ColB) || (ColC != Dest_ColC)" to the Condition column for the Changed Rows output. This defines our WHERE clause for detecting changed rows - setting it to "WHERE ((Dest_ColA != ColA) OR (Dest_ColB != ColB) OR (Dest_ColC != ColC))". Note "||" is used to convey "OR" in SSIS Expressions:

 

Change the "Default output name" from "Conditional Split Default Output" to "Unchanged Rows":

Click the OK button to complete configuration of the Conditional Split transformation.

Drag and drop an OLE DB Destination connection adapter and an OLE DB Command transformation onto the Data Flow canvas. Click on the Conditional Split and connect it to the OLE DB Destination. A dialog will display prompting you to select a Conditional Split Output (those outputs you defined in the last step). Select the New Rows output:

Next connect the OLE DB Command transformation to the Conditional Split's "Changed Rows" output:

 

 Your Data Flow canvas should appear similar to the following:

Configure the OLE DB Destination by aiming at the SSISIncrementalLoad_Dest.dbo.tblDest table:

 

Click the Mappings item in the list to the left. Make sure the ColID, ColA, ColB, and ColC source columns are mapped to their matching destination columns (aren't you glad we prepended "Dest_" to the destination columns?):

 

Click the OK button to complete configuring the OLE DB Destination connection adapter.

Double-click the OLE DB Command to open the "Advanced Editor for OLE DB Command" dialog. Set the Connection Manager column to your SSISIncrementalLoad_Dest connection manager:

 

Click on the "Component Properties" tab. Click the elipsis (button with "...") beside the SQLCommand property:

 The String Value Editor displays. Enter the following parameterized T-SQL statement into the String Value textbox:

UPDATE dbo.tblDest
SET
ColA = ?
,ColB = ?
,ColC = ?
WHERE ColID = ?

 

 The question marks in the previous parameterized T-SQL statement map by ordinal to columns named "Param_0" through "Param_3". Map them as shown below - effectively altering the UPDATE statement for each row to read:

UPDATE SSISIncrementalLoad_Dest.dbo.tblDest
SET
ColA = SSISIncrementalLoad_Source.dbo.ColA
,ColB = SSISIncrementalLoad_Source.dbo.ColB
,ColC = SSISIncrementalLoad_Source.dbo.ColC
WHERE ColID = SSISIncrementalLoad_Source.dbo.ColID

Note the query is executed on a row-by-row basis. For performance with large amounts of data, you will want to employ set-based updates instead.

 Click the OK button when mapping is completed.

Your Data Flow canvas should look like that pictured below:

 

If you execute the package with debugging (press F5), the package should succeed and appear as shown here:

 

Note one row takes the "New Rows" output from the Conditional Split, and one row takes the "Changed Rows" output from the Conditional Split transformation. Although not visible, our third source row doesn't change, and would be sent to the "Unchanged Rows" output - which is simply the default Conditional Split output renamed. Any row that doesn't meet any of the predefined conditions in the Conditional Split is sent to the default output.

That's all! Congratulations - you've built an incremental database load! [:)]

Get the code! (Free registration required)

:{> Andy


Published Monday, July 09, 2007 3:13 PM by andyleonard

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Jason Haley said:

July 10, 2007 9:09 AM
 

Jason Haley said:

July 10, 2007 9:10 AM
 

Alberto Ferrari said:

Andy, maybe you are interested in taking a look at the TableDifference component I published at http://www.sqlbi.eu.

It is an all-in-one and completely free SSIS component that handles these kind of situations without the need to cache data in the Lookup. Lookups are nice but - in real situaton - they may shortly lead to out of memory situations (think at a hundred million rows table... it simply cannot be cached in memory).

Beware that - for huge table comparison - you will need both TableDifference AND the FlowSync component that you can find at the same site.

I'll be glad to hear your comments about it.

Alberto

July 12, 2007 5:21 AM
 

andyleonard said:

Thanks Alberto! Checking it out now.

:{> Andy

July 13, 2007 9:30 PM
 

David R Buckingham said:

Thank you greatly Andy.  This couldn't have come at a better time as I just started using Integration Services for the first time on Friday to handle eight different data loads (all for a single client).  Four of the data loads are straight appends, but the other four are incremental.

This approach is vastly superior to loading the incremental data into a temporary table and then processing it against the destination table.  In fact, it proved to be more efficient than both set-based insert/updates or a cursor-based approach.  Yes, I tested both approaches prior to implementing yours.  Your approach was faster than the set-based insert/updates even though I tested it across the WAN which suprised me greatly.

I also created a script to assist with the creation of the Conditional Split "Changed Rows" condition which follows (be sure your results aren't being truncated when you have a table with many columns):

--- BEGIN SCRIPT ---

DECLARE @Filter varchar(max)

SET @Filter = ''

-- ((ISNULL(<ColumnName>)?"":<ColumnName>)!=(ISNULL(Dest_<ColumnName>)?"":Dest_<ColumnName>)) ||

SELECT @Filter = @Filter + '((ISNULL(' + c.[name] + ')?"":' + c.[name] + ')!=(ISNULL(Dest_' + c.[name] + ')?"":Dest_' + c.[name] + ')) || '

FROM sys.tables t

INNER JOIN sys.columns c

ON t.[object_id] = c.[object_id]

WHERE SCHEMA_NAME( t.[schema_id] ) = 'GroupHealth'

AND t.[name] = 'ConsumerDetail'

AND c.[is_identity] = 0

AND c.[is_rowguidcol] = 0

ORDER BY

c.[column_id]

SET @Filter = LEFT( @Filter, LEN( @Filter ) - 2 )

SELECT @Filter

--- END SCRIPT ---

Again, thanks greatly.  I now have 2 SSIS books on there way to me.  I am eager to learn as much as I can.

July 17, 2007 3:52 PM
 

Bill Mo said:

Hello,Andy!Thanks a lot for your incremental process!I'm doing SSIS project!

July 17, 2007 9:47 PM
 

david boston said:

Thanks this worked a treat for my SSIS project.

July 20, 2007 5:01 AM
 

andyleonard said:

Hi David, Bill, and David,

  Thanks for the feedback!

:{> Andy

August 8, 2007 7:14 PM
 

saul said:

Hi Andy !!  Great work... I was scared because of this Incremental load... and you saved my weekend... now I can enjoy it .... :-)

September 7, 2007 5:56 PM
 

Steve Hall said:

Anyone had a problem with the insert and update commands locking each other out?

Didn't happen at first but does now.  Update gets blocked by the insert and it just hangs.

Steve

September 18, 2007 1:18 PM
 

andyleonard said:

Thanks Saul!

Steve, are you sure there's not something more happening on the server that's causing this?

If this is repeatable, please provide more information and I'll be happy to take a look at it.

SQL Server does a fair job of detecting and managing deadlocks when they occur. I haven't personally seen SQL Server "hang" since 1998 - and then it was due to a failing I/O controller.

:{> Andy

September 27, 2007 6:57 PM
 

Bill Mo said:

Hi,Andy! I have a same problem with Steve,it is block. When bulk insert and update happen,Update gets blocked by the insert and it just hangs!Insert's wait type is ASYNC_NETWORK_IO.

October 8, 2007 4:15 AM
 

Bobby said:

Thx 4 the trick with Fail -> Left Join ! I was thinking how to do it whole day :o)

October 18, 2007 1:23 AM
 

Andy Leonard said:

Introduction This post is part of a series of posts on ETL Instrumentation. In Part 1 we built a database

November 18, 2007 10:53 PM
 

Michael Ross said:

Steve,

This most certainly can be the case with larger datasets.  In my case, I ran into this issue with large FACT table loads.  Either consider dumping the contents of the insert into a temp table or SSIS RAW datafile and complete the insert in a separate dataflow task or modify the isolationlevel of the package.  Be warned, make sure you research the IsolationLevel property thoroughly before making such a change.

November 26, 2007 12:03 PM
 

Michael said:

What happens when a field is NULL in the destination or source when determining changed rows? Don't we need special checks to ensure if a destination field is NULL the source should also be? Thus a change has occured and the record should be updated?

December 26, 2007 10:26 AM
 

andyleonard said:

Hi Michael,

  Excellent question! This post was intended to cover the principles of Incremental Loads, and not as a demonstration of production-ready code. </CheesyExcuse>

  There are a couple approaches to handling NULLs in the source or destination, each with advantages and disadvantages. In my opinion, the chief consideration is data integrity and the next-to-chief consideration is metadata integrity.

  A good NULL trap can be tricky because NULL == NULL should never evaluate to True. I know NULL == NULL can evaluate to True with certain settings, but these settings also have side-effects. And then there's maintenance to consider... basically, there's no free lunch.

  A relatively straightforward method involves identifying a value for the field that the field will never contain (i.e. -1, "(empty)", or even the string "NULL") and using that value as a substitute for NULL. In the SSIS expression language you can write a change-detection expression like:

(ISNULL(Dest_ColA) ? -1 : Dest_ColA) != (ISNULL(ColA) ? -1 : ColA)

  But again, if ColA is ever -1 this will evaluate as a change and fire an update. Why does this matter? Some systems include "number of updated rows" as a validation metric.

:{> Andy

December 26, 2007 12:50 PM
 

Michael said:

Hi Andy,

Thanks for this great article!

Do you have any hints for implementing your design with an Oracle Source. I am attempting to incrementally update from a table with 7 million rows with ~50 fields. The Lookup Task failed when I attempted to use it like you described above due to a Duplicate Key error...cache is full. I googled this and found an article suggesting enabling restrictions and enabling smaller cache amounts. However it is now extremely slow. Do you have any experience/advice on tweaking the lookup task for my environment?

Is there value in attempting to port this solution to an Oracle to SQL environment?

Is there a way to speed things up/replace the lookup task by using a SQL Execution Task which calls a left outer join?

Is there major difference\impact in having multiple primary keys?

Thanks Again

December 26, 2007 1:47 PM
 

Andy Leonard said:

Now that our 5-month old son - Riley Cooper - is on the mend , I am hitting the speaking trail again!

January 6, 2008 6:16 PM
 

Jigs said:

Hi AndY looks great and work also great but if there are more records to update than it just hangs while doing insert and update so what should i do ..is there any workaround by which we can avoid hanging od SSIS pacage. Please Suggest

Thanks

Jigu

January 15, 2008 3:36 PM
 

andyleonard said:

Hi Bill and Jigu,

Although I mention set-based updates here I did not demonstrate the principle because I felt the post was already too long - my apologies.

I have since written more on Design Patterns. Part 3 of my series on ETL Instrumentnation (http://sqlblog.com/blogs/andy_leonard/archive/2007/11/18/ssis-design-pattern-etl-instrumentation-part-3.aspx#SetBasedUpdates) demonstrates set-based updates.

I need to dedicate a post to set-based updates.

:{> Andy

January 16, 2008 7:10 AM
 

Jai said:

Hi Andy

Thanks you did great help to understand data update through SSIS

package

April 5, 2008 6:16 PM
 

Kenneth said:

Hi Andy,

I have a hard time following your instructions. Can you send me your sample project

Thank You

Kenneth

aspken@msn.com

July 29, 2008 1:44 PM
 

andyleonard said:

Hi Kenneth,

  Sorry to hear you're having a hard time with my instructions.

  One of the last instructions is a link at the bottom of the page called "Get the code". It points to this URL: http://vsteamsystemcentral.com/dnn/Demos/IncrementalLoads/tabid/94/Default.aspx.

Hope this helps,

Andy

July 29, 2008 1:59 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About andyleonard

Andy Leonard is a SQL Server MVP, Solid Quality Mentor, database developer, engineer, husband, and father. At the time of this writing (August 2007) Andy lives in Farmville, Virginia with his lovely bride Christy Lynn and their three children: Stevie Ray, Emma Grace, and Riley Cooper. Andy's two older children - Manda and Penny - are grown and married. When he's not working (and Andy works a lot), he enjoys spending time with his family, working on their old farmhouse, fishing, and reading older sci-fi series.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement