
Jorg Klein

Microsoft Business Intelligence consultant from the Netherlands

SSIS - Decrease your fact table loading time up to 40%

Replace the multiple "Lookup Error Output" Derived Columns and Union All's with a single Derived Column and get a performance boost...

Almost every BI developer needs to perform lookups while loading the Data Warehouse. When loading the fact tables, for example, lookups are used to retrieve the (surrogate) primary keys of the dimension tables that are connected to the fact table.

I, and I think most developers, perform the DWH fact-dimension lookups in the following way:
1. Each fact record contains business keys to the different dimensions.
2. Each dimension record contains a business key and a surrogate key (integer).
3. Each dimension contains one unknown member, with a surrogate key of value 0 (see the sketch after this list).
4. Match the dimension business key in the fact record to the business key in the dimension to retrieve the dimension surrogate key, using a Lookup Component. If the lookup finds no match, point to the unknown member in the dimension (surrogate key 0).
5. Store the fact record in the DWH fact table with the retrieved surrogate keys of the dimensions.
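The unknown member from step 3 is typically just a fixed row in each dimension table. A minimal T-SQL sketch, assuming a hypothetical DimCustomer table with an IDENTITY surrogate key (the table and column names are illustrative, not from the package shown below):

SET IDENTITY_INSERT dbo.DimCustomer ON;
-- Surrogate key 0 is reserved for the unknown member
INSERT INTO dbo.DimCustomer (CustomerSK, CustomerBK, CustomerName)
VALUES (0, N'N/A', N'Unknown');
SET IDENTITY_INSERT dbo.DimCustomer OFF;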

Loading a fact table in the way described above would typically look like this:

[Screenshot: a data flow with a Lookup per dimension, each followed by a "Lookup Error Output" Derived Column and a Union All]
I have tested this solution with 313.341 records. This took 1 minute and 23 seconds.
This way of loading the fact table contains a lot of semi-blocking components: the Union All's. These components have an unnecessary negative impact on performance; read more about semi-blocking components in this blog post, which I wrote a little while ago.

There is a much more efficient way to load a fact table in a data warehouse:
1. Set all the Lookup Error Outputs to "Ignore Failure".
2. Delete all the Union All and Derived Column components.
3. Add a Derived Column component to the end of the flow and add the unknown surrogate keys like this:

[Screenshot: Derived Column configuration that adds the unknown surrogate keys]
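Because the lookups are now set to Ignore Failure, a fact record that finds no match simply gets NULL in the corresponding surrogate key column, and the single Derived Column replaces those NULLs with 0. A sketch of the expressions, assuming hypothetical surrogate key columns CustomerSK and ProductSK and using "Replace" as the Derived Column behaviour:

ISNULL(CustomerSK) ? 0 : CustomerSK
ISNULL(ProductSK) ? 0 : ProductSK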

The data flow now looks like this:

[Screenshot: the simplified data flow with a single Derived Column after the lookups]

Loading exactly the same number of records (313.341) now took just 45 seconds! This is a performance boost of almost 40%. Loading your fact table like this not only decreases your loading time, it also takes less development time. So a win-win situation!

A minor downside might be that you can't easily see how many records have failed a particular Lookup component, but it's not so hard to make a simple (SSRS) report which gives you this overview. It might be a good opportunity to include the information for all your fact tables in one single report; it will give you a much better overview of lookup failures during the load of your fact tables than all the separate SSIS data flow tasks.
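Such a report query can be as simple as counting the unknown keys per dimension in the fact table. A rough T-SQL sketch with hypothetical fact and column names:

SELECT  COUNT(*) AS TotalRows,
        SUM(CASE WHEN CustomerSK = 0 THEN 1 ELSE 0 END) AS FailedCustomerLookups,
        SUM(CASE WHEN ProductSK = 0 THEN 1 ELSE 0 END) AS FailedProductLookups
FROM    dbo.FactSales;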

Published Wednesday, October 22, 2008 11:47 PM by jorg

Comments

 

hari said:

I want to know what synchronous and asynchronous transformations are.

August 25, 2010 7:38 AM
 

Dimitris said:

Hello,

nice article.

I want to ask a question.

You described the easy and fast solution. In what way do we handle the load of the fact table in SSIS, considering that our dimension has more than one surrogate key for one business key? (Meaning that we have different surrogate keys for the same id for different time periods.)

Thanks in advance!

September 2, 2010 5:42 AM
 

Daniel Barkman said:

Do you have a download of this sample?

March 1, 2011 12:31 PM
 

Emine said:

Great idea Jorg! Simple, yet it saves a lot of time.

October 8, 2011 2:55 PM
 

Ritz said:

How do you handle Late Arriving Dimensions? Do you insert Inferred Dimensions during fact load?

August 17, 2012 1:35 AM
 

Ritz said:

To be able to get the most current surrogate key, you need to use a SQL query in the Lookup component to select only the active/current record; otherwise you will start getting multiple surrogate keys if the records have historical data.
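A minimal sketch of such a lookup query, assuming a hypothetical SCD2 dimension with an IsCurrent flag:

SELECT CustomerSK, CustomerBK
FROM dbo.DimCustomer
WHERE IsCurrent = 1;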

August 17, 2012 1:37 AM
 

Gary said:

Just FYI... I haven't tested this exact number of rows, nor do I have your table layout, so I can't repeat this test, but I'd bet money that if you simply outer joined all the dim tables to the fact query in the source of the data flow, using CASE WHEN statements to substitute the unknown dimension rows, the runtime would drop to something like 1 second. I'm guessing your rows and dimensions are very small with that kind of throughput, since you're driving those rows through the memory of the ETL tool through multiple tasks. My tests of that with the data I have were about as fast as watching paint dry. When I pushed it to the database it screamed.

Rule #1 I tell all new ETL developers is: never ever make the ETL tool do something that can be done exponentially faster in the database if you need scalability. That applies to all ETL tools. Of course, if you're only loading 300K rows, you can probably do anything you want and who cares? If you're loading a billion rows like I do, the above scenario is going to take a mighty long time.
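A minimal sketch of that source-query approach, with hypothetical table and column names:

SELECT  f.SalesAmount,
        CASE WHEN dc.CustomerSK IS NULL THEN 0 ELSE dc.CustomerSK END AS CustomerSK,
        CASE WHEN dp.ProductSK IS NULL THEN 0 ELSE dp.ProductSK END AS ProductSK
FROM    stg.FactSales AS f
LEFT JOIN dbo.DimCustomer AS dc ON dc.CustomerBK = f.CustomerBK
LEFT JOIN dbo.DimProduct AS dp ON dp.ProductBK = f.ProductBK;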

May 7, 2013 1:07 AM
