THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SQLBI - Marco Russo

SQLBI is a blog dedicated to building Business Intelligence solutions with SQL Server.
You can follow me on Twitter: @marcorus

Surrogate key generation in SSIS

In the last few days I tested several ways to generate surrogate keys when loading a dimension into a data mart.

The classical approach with SQL Server is to rely on a INT IDENTITY column. It works fine, but you don't know what surrogate key has been generated for a given application key until you read the row from the table. Not a big issue with DTS, because you can't do very much (in a way faster than SQL itself) even if you know this number, but with SSIS if you want to push performances to the limit, you need to load a fact table within the dimension table(s) without reading what you just writed into a table.

The first step in this direction is to generate surrogate keys inside a SSIS package. If you do a complete dimension load at each run, you can solve the whole problem using a scripting component inside a data flow task. Read this article from Jamie Thomson to see how you can implement it. But life is never as easy as you could desire, so when you have to incrementally load a dimension table then you need to give to this script the right initial value (the last ID currently used instead of 0).

After several approaches discussed on the newsgroup, I came to the conclusion that the most affordable way to implement it is a not-so-desirable one:

  • define a Execute SQL Task which execute a SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID FROM Dimension and put the single row ResultSet into a variable (let's say we use User::ID_Dimension)

  • define a Data Flow Task in the same way Jamie did in his article

  • put the User::ID_Dimension variable into the ReadOnly Variables collection in the Script Component

  • define PreExecute method inside the script to initialize the value for counter you will use to enumerate new surrogate keys (see code below)

Public Class ScriptMain

    Inherits UserComponent

    Dim counter As Int32

    Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)

        counter = counter + 1

        Row.IDDimension = counter

    End Sub

    Public Overrides Sub PreExecute()

        MyBase.PreExecute()

        counter = Variables.IDDimension

    End Sub

End Class

That's it. You can download the sample package here.

Now, don't pretend this solution to be faster than a SQL Server INT IDENTITY column... by now, it isn't, but we can't judge this with current CTPs. But don't desperate: it could be better in the future and, more important, if you really need to use the generated keys in other transformations you could already take advantage from this architecture.

Why it's a not-so-desirable way? Well, I don't like to rely on a variable defined in the control flow while all the dimension-load logic is inside a data flow task. Imagine how ugly could become a SSIS package if you load 20 or 30 dimensions; of course you can group two or more tasks together, but I consider this only as a work-around. There is another issue if you would like to update your somewhere-in-the-package variable with the last generated surrogate key, because you can't use the same variable as ReadOnly and ReadWrite (in the Script Transform) and you can't read a variable passed as ReadWrite outside the PostExecute method (I'm right: you can only read/write it inside PreExecute overridden method, you can read it in the PreExecute methos - as in my sample - only if variable is passed as ReadOnly).

I thought other ways to find a better solution (most of them has been suggested by kindly people on the newsgroup):

  • Script transformation with two inputs: not supported (I would like to pass the real data source as Input1 and the last used ID as Input2)
  • Cross Join between two inputs (defined as the previous idea)
  • Custom data flow task (with one input and a property with table/column to look for the last ID) - it needs to be written

There are many other possible variations on those themes. But I have to say that SSIS could be improved very much to make life easier for the BI developer: a real "definitive" surrogate key generation task could be the right answer to this kind of needs.

Published Monday, May 30, 2005 2:47 PM by Marco Russo (SQLBI)

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Kuldeep said:

Does it slowdown database if we create trigger to create surrogate key? what are possible advantages and disadvantages

April 25, 2010 12:35 AM
 

Marco Russo (SQLBI) said:

Triggers have the same side effect of an IDENTITY column: SSIS doesn't know what is the number generated. The point of the whole post is to generate surrogate keys that can be known by the SSIS package.

April 25, 2010 3:04 AM

Leave a Comment

(required) 
(required) 
Submit

About Marco Russo (SQLBI)

Marco Russo is a consultant, writer and trainer specialized in Business Intelligence with Microsoft technologies. He runs the SQLBI.COM website, which is dedicated to distribute resources useful for BI developers, like Integration Services components, Analysis Services models, tools, technical information and so on. Marco is certified as MCT, MCDBA, MCSD.NET, MCSA, MCSE+I.

This Blog

Syndication

Archives

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement