THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Andy Leonard

Andy Leonard is CSO of Linchpin People and SQLPeople, an SSIS Trainer, Consultant, and developer; a Business Intelligence Markup Language (Biml) developer; SQL Server database and data warehouse developer, community mentor, engineer, and farmer. He is a co-author of SQL Server 2012 Integration Services Design Patterns. His background includes web application architecture and development, VB, and ASP. Andy loves the SQL Server Community!
Note: Comments are moderated. Spam shall not pass! </GandalfVoice>

Introducing Change Data Capture, SSIS, and SQL Server 2008 CTP5 (Nov 2007)

Introduction 

On Thursday, 24 Jan 2008, I presented New Features In SSIS 2008 to the Richmond SQL Server Users Group.

Most of the presentation was dedicated to demonstrating Change Data Capture (CDC) interacting with SQL Server 2008 Integration Services. I started seriously working on this demo the first week of January, thinking I'd put 2 - 6 hours into it to get it running using the detailed instructions in Books Online. Things were going relatively well working through the demo until I hit calls from SSIS to table-valued functions created by CDC.

The TVFs didn't exist. Indifferent

Well, that's not exactly true. Their renamed stubs exist in CTP5, but they only return 0. Bummer. Or is it? This is what's cool about CTPs - you get a feel for where the development is and where it will likely go. And while working through some other stuff related to CDC I'd learned enough to cobble together a solution, so I did. What I built will demonstrate the principles of CDC.

<As_Is>

I assure you there will be differences in the RTM and would wager good money this post will be obsolete with CTP6. So if I get any questions / comments the first thing I will ask is "Which CTP are you using?" If it's not CTP5 this post is not appropriate.
</As_Is>

You can build your own virtual server or PC and play along, or you can download a prebuilt virtual hard drive (vhd) file here. I built my own. Once I got the SQL Server 2008 November CTP up and running I logged in and began tinkering. Here's what I did and what I learned:

Introducing CDC 101

The idea of Change Data Capture is pretty cool. The data is stored in a format similar to that used by some database engine transaction logs. In a transaction log changes are respresented in row-based before- and after-images Inserts have empty before-images and the inserted data in the after-image. This makes sense because there's no data there before, only after the insert. Deletes have deleted data in the before-image and empty after-images. Updates have the existing row (or affected columns in the row) in the before-image and the updated row (or affected columns in the row) in the after-image.

In SQL Server 2008's implementation of CDC, changes are stored in a table. The table is created in a special schema named "cdc" built in the database when you enable Change Data Capture. I had a cynical thought: I wonder if the Center for Disease Control uses SQL Server... Wink

A table in the cdc schema is created for each table for which Change Tracking is enabled. The table is named cdc.schema_table_CT and contains a column for each column in the Tracked table plus metadata columns. In my example I enabled Change Tracking on a table named dbo.Contact. The table created in the cdc schema is cdc.dbo_Contact_CT.

One of the metadata columns is __$operation and it identifies the record (image) type. Inserts and Deletes have no empty images. Deletes are respresented in the row by an __$operation column value of 1; Inserts by an __$operation column value of 2. The before-image of an Update in the row by an __$operation column value of 3; the after-image by an __$operation column value of 4.

Setting It Up

I'm going to demonstrate using Change Data Capture to develop high-performance ETL with SSIS 2008. I started with the Improving Incremental Loads with Change Data Capture - accessible from Change Data Capture In Integration Services - topic in SQL Server 2008 CTP5 Books Online (ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10is_1devconc/html/c4aaba1b-73e5-4187-a97b-61c10069cc5a.htm).

I first create two databases: CDCSource and CDCTarget. I use the following script:

use master
go
-- create CDCSource database...
if not exists(select name
from sys.databases
where name = 'CDCSource')
Create Database CDCSource
go
use CDCSource
go

-- create CDCTarget database...
if not exists(select name
from sys.databases
where name = 'CDCTarget')
Create Database CDCTarget
go

Next I create a table in each database to serve as my ETL source and destination. I use a portion of the AdventureWorks Person.Contact table:

use CDCTarget
go

-- create and populate dbo.Contact table...
if exists(select name
from sys.tables
where name = 'Contact')
Drop Table dbo.Contact
go

select ContactID
,NameStyle
,Title
,FirstName
,MiddleName
,LastName
,Suffix
,EmailAddress
,EmailPromotion
,Phone
,ModifiedDate
into dbo.Contact
from AdventureWorks.Person.Contact
go

use CDCSource
go

-- create and populate dbo.Contact table...
if exists(select name
from sys.tables
where name = 'Contact')
Drop Table dbo.Contact
go

select ContactID
,NameStyle
,Title
,FirstName
,MiddleName
,LastName
,Suffix
,EmailAddress
,EmailPromotion
,Phone
,ModifiedDate
into dbo.Contact
from AdventureWorks.Person.Contact
go

Change Tracking

Change Tracking is enabled on the database and then on individual tables. To enable Change Tracking in CTP5 use the following T-SQL statement:

-- enable CDC on database...
if not exists(select name, is_cdc_enabled
              from master.sys.databases
              where name = 'CDCSource'
               and is_cdc_enabled = 1)
 exec sys.sp_cdc_enable_db_change_data_capture
go

Once Change Tracking is enabled on the database you can enable Change Data Capture on individual tables using a script similar to the following:

-- enable CDC on dbo.Contact table...
exec sys.sp_cdc_enable_table_change_data_capture
  @source_schema
= 'dbo'
, @source_name = 'Contact'
, @role_name = 'cdc_admin';
go

Making It Work

Books Online has us using a table-valued function to access the cdc.dbo_Contact_CT table, but that function is merely a stub in CTP5. Here's where I got creative to make it work.

Create a table-valued function dbo.uf_Contact:

-- build CDC table-valued function...
if exists(select name
          from sys.objects
          where name = 'uf_Contact')
 drop function cdc.uf_Contact
go

CREATE function cdc.uf_Contact (
@start_time datetime
,@end_time datetime
)
returns @Contact table (
ContactID int
,NameStyle bit
,Title nvarchar(8)
,FirstName nvarchar(50)
,MiddleName nvarchar(50)
,LastName nvarchar(50)
,Suffix nvarchar(10)
,EmailAddress nvarchar(50)
,EmailPromotion int
,Phone nvarchar(25)
,ModifiedDate datetime
,CDC_OPERATION varchar(1)
)
as
begin
declare @from_lsn binary(10), @to_lsn binary(10)
if (@start_time is null)
select @from_lsn = sys.fn_cdc_get_min_lsn('Contact')
else
select @from_lsn = sys.fn_cdc_increment_lsn(sys.fn_cdc_map_time_to_lsn('largest less than or equal',@start_time))
if (@end_time is null)
select @to_lsn = sys.fn_cdc_get_max_lsn()
else
select @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal',@end_time)
if (@from_lsn = sys.fn_cdc_increment_lsn(@to_lsn))
return
-- Query for change data
insert into @Contact
select ContactID
,NameStyle
,Title
,FirstName
,MiddleName
,LastName
,Suffix
,EmailAddress
,EmailPromotion
,Phone
,ModifiedDate
,case __$operation
 when 1 then 'D'
 when 2 then 'I'
 when 4 then 'U'
 else null
end as CDC_OPERATION
from cdc.dbo_Contact_CT
where __$start_lsn between @from_lsn and @to_lsn
 and __$operation != 3 --'all'
return
end

go

This is the function we will query from SSIS to get changes.

Change Tracking is now in place for CDCSource and dbo.Contact. Changes made to the dbo.Contact will be stored in the cdc.dbo_Contact_CT table.

The SSIS Package

Let's build an SSIS package to take advantage of Change Tracking. Create a new SSIS 2008 project and rename the default package ChangeDataCapture.dtsx. Again, I'm using the Improving Incremental Loads with Change Data Capture - accessible from Change Data Capture In Integration Services - topic in SQL Server 2008 CTP5 Books Online (ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10is_1devconc/html/c4aaba1b-73e5-4187-a97b-61c10069cc5a.htm).

Add the following package-scoped variables:

Name                     Data Type            Value
ExtractStartTime     DateTime             
ExtractEndTime      DateTime             
DataReady             Int32                    2
DelaySeconds        Int32                    10
IntervalID                Int32                    0
TimeoutCount         Int32                    0
TimeoutCeiling        Int32                    20
SQLDataQuery       String                  SELECT * FROM cdc.uf_Contact(null, '2008-01-16 01:00:00')

Right-click in the Connection Managers space at the bottom of the package and select" New ADO.NET Connection...". Connect to an instance of SQL Server 2008 CTP5. My instance is the default instance on a virtual server named VPCSQL2K8CTP5. In the "Select or enter a database name" textbox enter CDCSource:

Click the Test Connection button to make sure all is well with your connectivity. Correct any issues and click the OK button to close the editor.

You are free to leave the connection manager named as is, but I like to have the names make sense to me. Right-click the Connection Manager and rename it "CDCSource.ADO".

Create two more Connection Managers. Make them OLEDB Connection Managers. Set up one for CDCSource and name it CDCSource.OLEDB. Set up the other for CDCTarget and name it (wait for it...) CDCTarget.OLEDB. Wink

Drag an Execute SQL Task onto the Control Flow canvas and rename it Get Interval. Set the ConnectionType property to the OLEDB and the Connection property CDCSource.OLEDB. Set the SQLStatement property to:

SELECT DATEADD(hh,-2, GETDATE()) AS ExtractStartTime,
GetDate() AS ExtractEndTime

Set the ResultSet property to Single Row. On the Result Set page, add two results:

Result Name     Variable Name
0                      User::ExtractStartTime
1                      User::ExtractEndTime

Drag and drop a For Loop Container onto the Control Flow and connect the Get Interval Execute SQL Task to it using a Success precedence constraint. Double-click the For Loop to open the editor. Set the InitExpression property to @DataReady = 0 and the EvalExpression property to @DataReady == 0:

Drag an Execute SQL Task into the For Loop Container. Change the name to Check Change Data Status and double-click the Task to open the editor. Set the Connection to CDCSource.OLEDB and the SQLStatement property to the following script:

declare @DataReady int
declare
@TimeoutCount int
if
not exists (select tran_end_time
                  
from cdc.lsn_time_mapping
                   where tran_end_time > ? )
select @DataReady = 0
else
if ? = 0
select @DataReady = 3
else
if not exists (select tran_end_time
                  
from cdc.lsn_time_mapping
                   where tran_end_time <= ? )
select @DataReady = 1
else
select @DataReady = 2
select @TimeoutCount = ?
if (@DataReady = 0)
select @TimeoutCount = @TimeoutCount + 1
else
select @TimeoutCount = 0
if (@TimeoutCount > ?)
select @DataReady = 5
select @DataReady as DataReady, @TimeoutCount as TimeoutCount 

Set the Parameters as shown:

Set the ResultSet property to Single Row and the Result Set page as follows:

Click the OK button to close the editor. Drag a Script Task into the For Loop Container with the Execute SQL Task. Double-click it to open the editor. Set the Script Language to Microsoft Visual Basic 2005 (you can now also use Microsoft Visual C# 2005) and click the ellipsis in the ReadOnlyVariables property and select User::DelaySeconds:

I have to pause here and say "Kudos!" to the SSIS team for this interface. I usually fat-finger the variable names when typing them. This is much, much nicer. Thanks!

Click the OK button to close the Select Variables form. Click the Edit Script button to open the VSTA editor. Double-click the ScriptMain.vb class to open the script editor. Edit Public Sub Main() so it reads as follows:

Public Sub Main()
'
' Add your code here
'
System.Threading.Thread.Sleep(CType(Dts.Variables("DelaySeconds").Value, Integer) * 1000)
Dts.TaskResult = ScriptResults.Success
End Sub

Close the VSTA editor and click the OK button to close the Script Task editor.

Connect the Execute SQL Task to the Script Task with a Success Precedence Constraint. Double-click the constraint to open the editor. Dropdown the Evaluation operation and select "Expression and Constraint". Make sure the Value is set to Success. Enter the following expression into the Expression textbox: "@DataReady == 0 && @TimeoutCount <= @TimeoutCeiling" as shown:

Click the Test button to test the validity of the expression. Click the OK buttont to close the Precedence Constraint editor.

I like the visual effects available for precedence constraints. When you hover over the Fx box you get a tooltip that displays the Expression value:

But note: if you select the precedence constraint and view properties (a quick shortcut is to press the F4 key) you can set the Annotation property from AsNeeded to ConstraintOptions, which will display the Expression property value always - how cool:

Add a second Script Task inside the For Loop Container and double-click to open the editor. Select Microsoft Visual Basic 2005 as the language and select User::DataReady and User::ExtractStartTime as ReadOnlyVariables. Click the Edit Script button and open ScriptMain.vb. Replace the code in Public Sub Main() with the following VB.Net:

Public Sub Main()
'
' Add your code here
'
' User variables.
Dim dataReady As Integer = _
CType(Dts.Variables("DataReady").Value, Integer)
Dim extractStartTime As Date = _
CType(Dts.Variables("ExtractStartTime").Value, DateTime)
' System variables.
Dim packageName As String = _
Dts.Variables(
"PackageName").Value.ToString()
Dim executionStartTime As Date = _
CType(Dts.Variables("StartTime").Value, DateTime)
Dim eventMessage As New System.Text.StringBuilder()
If dataReady = 1 OrElse dataReady = 5 Then
If dataReady = 1 Then
eventMessage.AppendLine("Start Time Error")
Else
eventMessage.AppendLine("Timeout Error")
End If
With eventMessage
.Append(
"The package ")
.Append(packageName)
.Append(
" started at ")
.Append(executionStartTime.ToString())
.Append(
" and ended at ")
.AppendLine(DateTime.Now().ToString())
If dataReady = 1 Then
.Append("The specified ExtractStartTime was ")
.AppendLine(extractStartTime.ToString())
End If
End With
System.Windows.Forms.MessageBox.Show(eventMessage.ToString())
Dts.Log(eventMessage.ToString(), 0,
Nothing)
Dts.TaskResult = ScriptResults.Failure
  Else
    Dts.TaskResult = ScriptResults.Success
  End If
End Sub

Close the editor and click the OK button to close the Script Task editor.

Connect the Execute SQL Task to the new Script Task with a Success Precedence Constraint. Edit the Precedence Constraint, setting the Evaluation Operation to "Success and Constraint" and the Expression to "@DataReady == 1 || @DataReady == 5". Click the OK button to close the editor.

And Now For Something Completely Different

This next part isn't included in the Books Online walk-through. I needed to add it because I was stopping a lot during my presentations. It doesn't hurt to have this in your package. So here goes.

We're now done with the For Loop Container. Add a new Execute SQL Task to the Control Flow canvas below the For Loop Container, and connect the For Loop Container to it using a Success Precedence Constraint. Double-click it to open the editor. Change the Name property to Update EndTime and the ResultSet property to Single Row. Set the Connection property to CDCSource.OLEDB and add the following statement to the SQLStatement property:

select GETDATE() as ExtractEndTIme

Click the Result Set page and add a Result. Set the Name to 0 and the Variable Name property to User::ExtractEndTime. Click the OK button to close the editor.

Why did I do this? The scripts that retrieve the changed rows use time. You'll see when we walk through the demo portion, you can wait several minutes between the time you initially set the ExtractEndTime variable in the original Execute SQL Task and the time you actually update data in the Change-Tracked table. This task makes certain that the value of ExtractEndTime is current when you exit the loop.

Back To Our Regularly Scheduled Programming

Next, add a Script Task and double-click it to open the editor. This time select Microsoft Visual C# as the ScriptLanguage property. Add User::DataReady, User::ExtractEndTime, and User::ExtractStartTime in the ReadOnlyVariables property, and User:SQLDataQuery in the ReadWriteVariables property:

Click the Edit Script button to open the VSTA editor and double-click ScriptMain.cs to open the class. Replace the public void Main() method code with the following code:

public void Main()
  {
    //string sEST;
    //sEST = Dts.Variables["ExtractStartTime"].Value.ToString();
    //System.Windows.Forms.MessageBox.Show(sEST);
    int dataReady;
    System.
DateTime extractStartTime;
    System.
DateTime extractEndTime;
    string sqlDataQuery;
    dataReady = (
int)Dts.Variables["DataReady"].Value;
    extractStartTime = (System.
DateTime)Dts.Variables["ExtractStartTime"].Value;
    extractEndTime = (System.
DateTime)(Dts.Variables["ExtractEndTime"].Value);

    //string sExtractStart;
    //sExtractStart = extractStartTime.ToString();
    //System.Windows.Forms.MessageBox.Show(sExtractStart);
    if (dataReady == 2)
      {
        sqlDataQuery =
"SELECT * FROM cdc.uf_Contact('" + string.Format("{0:yyyy-MM-dd hh:mm:ss}",
extractStartTime) +
"', '" + string.Format("{0:yyyy-MM-dd hh:mm:ss}", extractEndTime) + "')";
      }
    else
      {
        sqlDataQuery =
"SELECT * FROM cdc.uf_Contact(null" + ", '" + string.Format("{0:yyyy-MM-dd hh:mm:ss}", extractEndTime) + "')";
      }
    Dts.Variables[
"SQLDataQuery"].Value = sqlDataQuery;
    //system.windows.forms.messagebox[sqlDataQuery];
    System.Windows.Forms.MessageBox.Show(sqlDataQuery);
    Dts.TaskResult = (
int)ScriptResults.Success;
  }

Close the VSTA editor and click the OK button to close the Script Task editor.

Add a Data Flow Task and connect the former Script Task to it via a Success Precedence Constraint. Double-click the Data Flow Task to open the editor. Drag an OLE DB Data Adapter onto the Data Flow canvas. Set the Connection Manager property to CDCSource.OLEDB, the Data Access Mode property to "SQL command from variable", and the Variable name property to User::SQLDataQuery:

 

Click the OK button to close the editor.

As in SSIS 2005, a Conditional Split transformation accepts a single input and allows you to define multiple outputs using the SSIS Expression Language. Rows at the input are redirected to one and only one of the outputs. If you are familiar with the switch function in C-ish languages or the Select Case function in VB-ish languages, the Conditional Split operates much the same way. The else from these programming functions is accomplished via the Default Output, which you can also label with a name.

Drag a Conditional Split Transformation onto the canvas. Add an output named Inserts with the Condition expression set to CDC_OPERATION == "I". Add another output named Updates with the Condition expression set to CDC_OPERATION == "U". Add a third output named Deletes with the Condition expression set to CDC_OPERATION == "D":

These three outputs will divide our data stream into rows to be inserted, updated, and deleted (respectively). For the Inserts, drag an OLEDB Destination Adapter onto the Data Flow canvas. For the Updates and Deletes, drag two OLEDB Command transformations onto the canvas. Rename the OLEDB Destination Adapter "Land Inserts". Rename the OLEDB Command transformations "Apply Updates" and "Apply Deletes" (respectively).

Connect a data flow from the Conditional Split to the Destination Adapter. Because there are multiple outputs available from the Conditional Split, you will be prompted to select one. Select the Inserts output:

 

Click the OK button to close the Input Output Selection form. 

Double-click the Destination Adapter to open the editor. Set the OLEDB Connection Manager property the to CDCTarget.OLEDB connection manager. Select the dbo.Contact table from the "Name of the table or view" dropdown:

Click the OK button to close the editor. 

Drag a second data flow from the Conditional Split to the Apply Updates OLEDB Command transformation and select Updates when prompted for an output. Double-click the Apply Updates OLEDB Command transformation to open the Advanced Editor for Apply Updates (OLEDB Command transformations do not have a standard editor - at least not in CTP5). Select the CDCTarget.OLEDB connection manager on the Connection Managers tab. On the Component Properties tab, set the SqlCommand property to the following T-SQL statement:

Update dbo.Contact
 set
 
 NameStyle = ?
 ,Title = ?
 ,FirstName = ?
 ,MiddleName = ?
 ,LastName = ?
 ,Suffix = ?
 ,EmailAddress = ?
 ,EmailPromotion = ?
 ,Phone = ?
 ,ModifiedDate = ?
where ContactID = ?

Click the Column Mappings tab and map Available Input Columns to Available Destination Columns in one of two ways: 

  1. On the top half of the editor drag individual Input Columns and drop them on the desired Destination Columns.
  2. On the lower half of the editor select Input Columns to match to Destination Columns.

The question marks in the SqlCommand statement map to Paramater values (Param_n) by ordinal. The first question mark maps to Param_0, the second to Param_1, and so on. Our first question mark assigns a value to the NameStyle column, so map the NameStyle Input to Param_0 in the Destination. Continue this, mapping the following:

Input                                      Destination
NameStyle                             Param_0
Title                                       Param_1
FirstName                              Param_2
MiddleName                           Param_3
LastName                              Param_4
Suffix                                     Param_5
EmailAddress                         Param_6
EmailPromotion                      Param_7
Phone                                    Param_8
ModifiedDate                          Param_9
ContactID                               Param_10

 

Click the OK button to close the Advanced Editor.

Drag another output from the Conditional Split to the Apply Deletes OLEDB Command transformation and select Deletes when prompted to select an output. Double-click the OLEDB Command to open its Advanced Editor and select CDCTarget.OLEDB for the Connection Manager. On the Component Properties tab, enter the following statement in the SqlCommand property:

delete dbo.Contact where ContactID = ?

On the Column Mappings tab, map the ContactID input to the Param_0 Destination and click the OK button to close the Advanced Editor.

The Data Flow is complete and should appear similar to the following:

Go, Go, Change Data Capture!

Execute the SSIS package and view the Control Flow. If you do nothing more, you will see the For Loop execute 20 times, each time incrementing the @TimeoutCount variable value and checking to see if it has reached the @TimeoutCeiling variable value:

 However, if you open SQL Server Management Studio (SSMS) and execute a query that changes the underlying data in the CDCSource database, the SSIS package will detect the change and break out of this loop.

Restart the SSIS package (to reset the @TimeoutCount variable value). Use the following query to execute changes in the CDCSource.dbo.Contact table:

-----------------------
-- Execute Changes --
-----------------------

-- Update --
update dbo.Contact
set EmailAddress = '_' + EmailAddress
where ContactID % 5 = 0

-- Insert --
insert into dbo.Contact
(NameStyle
,Title
,FirstName
,MiddleName
,LastName
,Suffix
,EmailAddress
,EmailPromotion
,Phone
,ModifiedDate)
values
(0
,
'Mr.'
,'Andy'
,'Mortimer'
,'Leonard'
,null
,'aleonard@solidq.com'
,0
,
'123-456-6789'
tdate()
)

-- Delete --
delete dbo.Contact
where ContactID % 27 = 0

Return to SSIS and note that on the next pass through the loop more code is executed:

If you didn't comment it out, the C# script should display a message box containing the query (stored in @User::SqlDataQuery) that will serve as the source of the OLEDB Source Adapter in the Data Flow:

Click the OK button to dismiss the non-modal message box.

Click the Data Flow tab to observe changes as they are applied to CDCTarget. When complete, the Data Flow should look similar to the following:

You can also look at the table containing the changes by executing the following statement against CDCSource in SSMS:

select
__$start_lsn
,__$end_lsn
,__$seqval
,__$operation
,__$update_mask
,ContactID
,NameStyle
,Title
,FirstName
,MiddleName
,LastName
,Suffix
,EmailAddress
,EmailPromotion
,Phone
,ModifiedDate
from cdc.dbo_Contact_CT

In my database the change tracking table appears as shown:

You will recall from (building the function) that the __$operation column is used to determine whether the change is an insert, update, or delete. I mentioned this earlier: The value in this column captures the type of change.

__$operation         Change Type
1                          Delete
2                          Insert
3                          Update (Before)
4                          Update (After)

The changes in the image above are all Updates.

Conclusion

This post represents one method to use SQL Server 2008 CTP5 (Nov 2007) Change Data Capture with SSIS 2008. CTP6 will be out soon and things will likely change. I will write an updated version at that time, and endeavor to keep my readers posted on other changes to this technology between now and SQL Server 2008 Release To Manufacturing (RTM).

:{> Andy

Published Saturday, February 09, 2008 6:57 AM by andyleonard

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Tod McKenna said:

Great post Andy. CDC is a critical part of data integration and it is good to see that SQL Server 2008 has methods to implement it. As MS has been bragging about this for a while, its pretty cool to see an actual example using the technology. I'll have to try this out later...

For years, I've used an approach that relies heavily on staging tables, SQL, and even the file system. I have posted a lot of this code/theory on my blog Tod means Fox (blog.todmeansfox.com). It will be quite interesting for me to go back and revisit some of my old techniques to see how the new CDC abilities of SQL Server 2008 can help. The reality is that source systems tend not to be SQL Server, so it may be necessary to first import all source data into a SQL Server staging database and then process CDC between the stage and the destination. In some of my scenarios, this would be an extra step.

Anyway, thanks for the time/info/example!

-Tod

February 9, 2008 8:20 AM
 

Shan said:

Have you managed to get this to run on RC0 yet. MS seems to have changed the SP names etc.

July 9, 2008 1:42 AM
 

andyleonard said:

Hi Shan,

  Not yet. I am working on it this week.

  I'm not surprised they renamed the stored procedures. They were renamed between CTP5 and CTP6 too.

:{> Andy

July 9, 2008 5:31 AM
 

Dodiet said:

I want to ask you, how about if we have 3 primary key on our table AccountNo, Month, Year...if there is a row with same AccountNo, Month, Year, SSIS will update it, and if there is a row with same AccountNo but with different month and year it will insert how write the "conditional split" condition...???

Thx

Dodiet

August 4, 2008 3:26 AM
 

andyleonard said:

Hi Dodiet,

  Compound primary keys like the one you describe are common in source data. Although I do not demonstrate it here, the principle for detection and loading is the same.

  Together, these fields comprise the "business key" and must be treated - together - like the source's ContactID in this post. There are a few ways to accomplish this. The quickest (and easiest to understand) is to convert the values to varchar and concatenate the values, separated by a delimiter (I usually use a pipe character "|").

  You would then use this field to uniquely identify rows in the source and destinations: left joining on <AccountNo>|<Month>|<Year>. You would also leave these columns out of the update statement's SET clause. You do not update business keys. You update everything except business keys. This is true in the example: note the ContactID is used in the WHERE clause but not in the list of fields in the SET clause.

  One thing to note: concatenation is the simplest method to understand compound unique keys - but it is also the worst performing method. For larger data loads you will want to explore other means of hashing the data. I encourage you to search for articles about data profiling as this is one use of data profiling.

Hope this helps,

Andy

August 4, 2008 9:40 AM
 

Andrei Ignat said:

See also CDC Helper  - a GUI for CDC at

http://www.codeplex.com/CDCHelper

August 6, 2008 4:36 AM
 

Andrew said:

I have a similar setup to this, however I am using an xml file as my input prior to the conditional split. In my case the xml rows are presorted into inserts, updates and deletes and then bulk fired through the conditions. This has an undesired effect in that if I Insert a REcord, Delete that record and then reinsert it again, the SSIS Package deals with all Inserts first and then all deletes meaning you lose the final insert.  My question is, can you instruct it to process by row and not resort into groups?

July 2, 2009 5:52 AM
 

Nirmal Kumar said:

I need a sample dtsx package for the above instructed package so that I can check where went wrong.

Regards,

Nirmal

June 14, 2011 9:12 AM
 

HsInAz said:

I am planning to use this excellent post for a current project I am working on. My platform is still going to be SQL Server 2008, but given that this post wans written in 2008, I am wondering if all of this is still valid and up to date.

June 1, 2012 4:51 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

My Company


Other Blog

Check out my personal blog...
http://andyleonard.me

Contact Me

Twitter: @AndyLeonard
Email: andy.leonard@gmail.com

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement