THE SQL Server Blog Spot on the Web


Eric Johnson

My Thoughts on SQL Server and Database Related Technologies.

Slowly Changing Dimensions - Duplicate Data Issues

The Slowly Changing Dimension (SCD) transformations are pretty handy in SSIS packages. They let you compare a new set of data to an existing table and insert or update as needed. Alternatively, you can have them insert new rows and mark old rows as "Expired" instead of updating rows when a change occurs. This is great if you want to maintain history. Now these are meant for dimensions in a data warehouse, but you can use them against any table in your database. The reason for this post is to look at an issue that can occur with SCDs when you have duplicate data in your incoming data set.

When an SCD transformation analyzes a set of data, it looks at each row and compares it to the destination. It decides whether the row will be inserted or updated (or inserted, with the old row marked inactive). Notice I said this comparison is between the incoming data and the destination. The SCD does not check for duplicates that may exist within the incoming data itself. So if you have duplicates that also happen to be new rows when compared to the destination, all of the duplicate rows get "flagged" for insertion. You can probably guess what happens next: the first row gets inserted, but the second and subsequent duplicates cause a primary key violation when they attempt to insert the now-duplicate record.
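To see why this happens, here is a minimal sketch (in Python, not SSIS) of the comparison logic described above. The `flag_rows` function and its sample data are hypothetical; the point is that each incoming row is compared only against the destination, never against earlier rows in the same batch:

```python
# Hypothetical sketch of the SCD comparison described above.
# 'destination' stands in for the target table, keyed on the business key.
def flag_rows(incoming, destination):
    """Return (key, action) pairs the way the SCD would flag them."""
    actions = []
    for key, value in incoming:
        if key not in destination:
            actions.append((key, "insert"))   # new row -> insert
        elif destination[key] != value:
            actions.append((key, "update"))   # changed row -> update/expire
        # unchanged rows are ignored
    return actions

destination = {1: "Alice"}
incoming = [(1, "Alicia"), (2, "Bob"), (2, "Bob")]  # duplicate key 2

print(flag_rows(incoming, destination))
# -> [(1, 'update'), (2, 'insert'), (2, 'insert')]
```

Both copies of key 2 are flagged for insertion, because neither exists in the destination at comparison time. The first insert succeeds; the second one hits the primary key violation.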

Now this might seem like a glitch to some, but this behavior is "by design". The bottom line is that the data should be de-duplicated before it is inserted into the destination. In a perfect world, the data would come from a source where it lived in a normalized, and therefore de-duplicated, state. If you don't have the luxury of living in a perfect world, you can build a manual de-duplication process or use a Sort transformation. One of the options of the Sort transformation is "Remove Rows with Duplicate Sort Values". Just drop one in, check the box, and your values are magically de-duplicated.
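The effect of that checkbox can be sketched as follows. This is a hypothetical Python stand-in, not the Sort transformation itself: sort on the key columns and keep only the first row for each distinct key value.

```python
# Hypothetical sketch of "Remove Rows with Duplicate Sort Values":
# sort by the key, then keep the first row seen for each key value.
def dedup(rows, key):
    seen = set()
    result = []
    for row in sorted(rows, key=key):
        k = key(row)
        if k not in seen:      # keep the first row per key, drop the rest
            seen.add(k)
            result.append(row)
    return result

rows = [(2, "Bob"), (1, "Alice"), (2, "Bobby")]
print(dedup(rows, key=lambda r: r[0]))
# -> [(1, 'Alice'), (2, 'Bob')]
```

With the duplicates gone, each key reaches the SCD exactly once, and the primary key violation described above cannot occur.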

Published Tuesday, May 05, 2009 4:14 PM by ejohnson2010




bxero said:

Using the SCD transformation for anything but the smallest of dimension tables is almost an abuse. It's one of the slowest and most costly data flow tasks to implement on even a moderately sized data set.

I wouldn't recommend using the Sort transformation to de-duplicate data either, for the same reason: both of these tasks are costly to the speed of the pipeline.

You can make creative use of the Lookup task combined with the conditional split or a script task to hash values to do insert/update operations on a table. I agree though that data should be de-duplicated in separate processes outside of the data flow task to preserve speed.
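The hash-based approach the comment describes can be sketched roughly like this. The function names and sample data are hypothetical, and the routing stands in for what a Conditional Split would do in the data flow: look up each incoming key, compare a hash of the non-key columns, and send the row down an insert or update path.

```python
# Hypothetical sketch of lookup + hash-compare change detection.
import hashlib

def row_hash(values):
    """Hash the non-key columns so change detection is one equality check."""
    return hashlib.sha256("|".join(map(str, values)).encode()).hexdigest()

def route(incoming, existing_hashes):
    """Split rows into insert and update paths, as a Conditional Split would."""
    inserts, updates = [], []
    for key, *values in incoming:
        h = row_hash(values)
        if key not in existing_hashes:
            inserts.append((key, *values))   # key not found -> insert path
        elif existing_hashes[key] != h:
            updates.append((key, *values))   # hash differs -> update path
    return inserts, updates

existing = {1: row_hash(["Alice", "CO"])}
incoming = [(1, "Alice", "TX"), (2, "Bob", "CA")]
inserts, updates = route(incoming, existing)
print(inserts)  # [(2, 'Bob', 'CA')]
print(updates)  # [(1, 'Alice', 'TX')]
```

Comparing one stored hash per row avoids comparing every column individually, which is why this pattern tends to scale better than the SCD transformation on larger tables.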

May 6, 2009 8:49 AM



About ejohnson2010

Eric (Microsoft SQL MVP) is the co-founder of Consortio Services, and the primary Database Technologies Consultant. His background in Information Technology is diverse, ranging from operating systems and hardware to specialized applications and development. He has even done his fair share of work on networks. Since IT is really just a way to support business processes, he has also acquired his MBA. All in all, he has 10 years of experience with IT, a great amount of which has been working with Microsoft SQL Server. Eric has managed and designed databases of all shapes and sizes. He has delivered numerous SQL Server training classes and webcasts as well as presentations at national technology conferences. Most recently, he presented at TechMentor on SQL Server 2005 Replication, Reporting Services, and Integration Services. In addition, he is active in the local SQL Server Community, serving as the President of the Colorado Springs SQL Server Users Group.