THE SQL Server Blog Spot on the Web

Jorg Klein

Microsoft Data & Analytics consultant and Microsoft Data Platform MVP from the Netherlands

  • SSIS - Connect to Oracle on a 64-bit machine (Updated for SSIS 2008 R2)

    We recently had a few customers where a connection to Oracle from a 64 bit machine was necessary. A quick search on the internet showed that this can be a big problem: I found all kinds of blog and forum posts from developers complaining about it. A lot of developers will recognize the following error message:

    Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation.
    Provider is unable to function until these components are installed.

    After a lot of searching, trying and debugging I think I found the right way to do it!


    Because BIDS is a 32 bit application, on 32 bit as well as on 64 bit machines, it cannot see the 64 bit driver for Oracle. Because of this, connecting to Oracle from BIDS on a 64 bit machine will never work when you install only the 64 bit Oracle client.

    Another problem is the "Microsoft Provider for Oracle": this driver only exists in a 32 bit version and Microsoft has no plans to create a 64 bit one in the near future.

    The last problem I know of is in the Oracle client itself: it seems that a connection will never work with the instant client, so always use the full client.
    There are also a lot of problems with the 10G client. One of them is that this driver can't handle the "(x86)" in the path of SQL Server. So using the 10G client is not an option!


    • Download the Oracle 11G full client.
    • Install the 32 AND the 64 bit version of the 11G full client (Installation Type: Administrator) and reboot the server afterwards. The 32 bit version is needed for development from BIDS, which is 32 bit; the 64 bit version is needed for production with the SQLAgent, which is 64 bit.
    • Configure the Oracle clients (both 32 and 64 bit) by editing the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do this.
    • Use the "Oracle provider for OLE DB" from SSIS, don't use the "Microsoft Provider for Oracle" because a 64 bit version of it does not exist.
    • Schedule your packages with the SQLAgent.

    Background information

    • Visual Studio (BI Dev Studio) is a 32bit application.
    • SQL Server Management Studio is a 32bit application.
    • dtexecui.exe is a 32bit application.
    • dtexec.exe has both 32bit and 64bit versions.
    • There are x64 and x86 versions of the Oracle provider available.
    • SQLAgent is a 64bit process.
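
    Which Oracle client gets loaded depends on whether the calling process is 32 or 64 bit. The sketch below is only an illustration and not part of the original solution; the module name BitnessCheck is made up. It relies on IntPtr.Size, which is 4 in a 32 bit process and 8 in a 64 bit process. The same expression could be used inside an SSIS Script Task to find out in which mode a package is running.

```vb
Imports System

'Hypothetical console module, only meant to illustrate the check.
Module BitnessCheck
    Sub Main()
        'IntPtr.Size is the pointer size in bytes: 4 in a 32 bit process, 8 in a 64 bit process.
        Dim bits As Integer = IntPtr.Size * 8
        Console.WriteLine("This process runs as a " & bits.ToString() & " bit process.")
    End Sub
End Module
```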

    My advice to BI consultants is to get an Oracle DBA or professional for the installation and configuration of the 2 full clients (32 and 64 bit). Tell the DBA to download the biggest client available; this way you are sure that they pick the right one ;-)

    Testing if the clients have been installed and configured in the right way can be done with Windows ODBC Data Source Administrator:
    Administrative tools...
    Data Sources (ODBC)


    It seems that, unfortunately, some additional steps are necessary for SQL Server 2008 R2 installations:

    1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 64 bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI
    Make sure the following values are entered:


    2. Next, search for the following entry (for the 32 bit driver; on 64 bit Windows the Wow6432Node branch holds the settings of 32 bit applications): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI
    Make sure the same values as above are entered.

    3. Reboot your server.

  • Replication Services as ETL extraction tool

    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
    “Snapshot Replication can also be useful in BI environments. If you don’t need a near real-time copy of the database, you can choose to use this form of replication. Besides being an alternative for Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and if it really works as well for this purpose as I expect, but I want to try this out soon!”

    Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS.
    What I did was the following:

    1. Configure snapshot replication for some Adventure Works tables. This was quite simple and straightforward.
    2. Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
      This is something that you can’t do with out-of-the-box functionality. While configuring the snapshot replication, two SQL Agent Jobs are created: one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous, which means that when you execute them they immediately report back whether the job started successfully; they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

    Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

    Configure snapshot replication

    The first step is to create a publication on the database you want to replicate.
    Connect to the server with SQL Server Management Studio, right-click Replication and choose New > Publication…

    The New Publication Wizard appears, click Next

    Choose your “source” database and click Next

    Choose Snapshot publication and click Next

    You can now select tables and other objects that you want to publish

    Expand Tables and select the tables that are needed in your ETL process

    In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

    It's possible to filter out rows and/or columns. In this example I did not apply any filters.

    Schedule the Snapshot Agent to run at a desired time. By doing this a SQL Agent Job is created, which we need to execute from an SSIS package later on.

    Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

    In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

    On the next screen choose to create the publication at the end of the wizard

    Give the publication a name (SnapshotTest) and complete the wizard

    The publication is created and the articles (tables in this case) are added

    Now that the publication has been created successfully, it's time to create a new subscription for this publication.

    Expand the Replication folder in SSMS, right-click Local Subscriptions and choose New Subscriptions

    The New Subscription Wizard appears

    Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

    You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

    Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose New database

    Give the database the desired name, set the desired options and click OK

    You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

    You now need to set the security settings for the Distribution Agent. Click on the …. button

    Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

    Click Next

    Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

    Initialize the subscription at first synchronization

    Select the first box to create the subscription when finishing this wizard

    Complete the wizard by clicking Finish

    The subscription will be created

    In SSMS you see that a new database has been created, the subscriber. There are no tables or other objects in the database yet because the replication jobs have not run yet.

    Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

    Rename this job to “CreateSnapshot”

    Now search for the job that distributes the snapshot:

    Rename this job to “DistributeSnapshot”

    Create an SSIS package that executes the snapshot replication

    We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started, the package needs to wait until it's finished before the package execution finishes.
    The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back whether the job started successfully; it does not report whether the job actually completed with success or failure. This is because these jobs are asynchronous.

    The SSIS package I’ve created does the following:

    1. It runs the CreateSnapshot job
    2. It checks every 5 seconds if the job is completed with a for loop
    3. When the CreateSnapshot job is completed it starts the DistributeSnapshot job
    4. And again it waits until the snapshot is delivered before the package will finish successfully
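
    The same start-and-wait logic can also be sketched in plain VB.NET. This is not the downloadable package, only a rough illustration under a few assumptions: the module name, the function RunJobAndWait and the connection string are made up, the job names are the ones chosen earlier in this post, and the account needs permission to start jobs and read the msdb job tables. It starts the job with msdb.dbo.sp_start_job, polls msdb.dbo.sysjobactivity every 5 seconds and finally reads the last job outcome from msdb.dbo.sysjobhistory. The first check only happens after 5 seconds to give the Agent time to register the start; on a very busy server the history row might still be written a moment later than the check.

```vb
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Threading

'Hypothetical module, only meant to illustrate the polling approach.
Module AgentJobRunner

    'Starts a SQL Agent job and blocks until it is no longer running.
    'Returns True when the last logged job outcome is "succeeded".
    Function RunJobAndWait(ByVal connectionString As String, ByVal jobName As String) As Boolean
        'Counts the job as running when it has started but not stopped in the current Agent session.
        Const runningSql As String = _
            "SELECT COUNT(*) FROM msdb.dbo.sysjobactivity a " & _
            "JOIN msdb.dbo.sysjobs j ON j.job_id = a.job_id " & _
            "WHERE j.name = @job_name " & _
            "AND a.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions) " & _
            "AND a.start_execution_date IS NOT NULL " & _
            "AND a.stop_execution_date IS NULL"
        'step_id 0 holds the overall job outcome.
        Const outcomeSql As String = _
            "SELECT TOP 1 h.run_status FROM msdb.dbo.sysjobhistory h " & _
            "JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id " & _
            "WHERE j.name = @job_name AND h.step_id = 0 " & _
            "ORDER BY h.instance_id DESC"

        Using conn As New SqlConnection(connectionString)
            conn.Open()

            'sp_start_job returns as soon as the start request has been accepted.
            Using startCmd As New SqlCommand("msdb.dbo.sp_start_job", conn)
                startCmd.CommandType = CommandType.StoredProcedure
                startCmd.Parameters.AddWithValue("@job_name", jobName)
                startCmd.ExecuteNonQuery()
            End Using

            'Check every 5 seconds whether the job is still running.
            Dim running As Boolean = True
            While running
                Thread.Sleep(5000)
                Using checkCmd As New SqlCommand(runningSql, conn)
                    checkCmd.Parameters.AddWithValue("@job_name", jobName)
                    running = CInt(checkCmd.ExecuteScalar()) > 0
                End Using
            End While

            'run_status 1 means the job was logged as succeeded.
            Using outcomeCmd As New SqlCommand(outcomeSql, conn)
                outcomeCmd.Parameters.AddWithValue("@job_name", jobName)
                Dim status As Object = outcomeCmd.ExecuteScalar()
                Return status IsNot Nothing AndAlso status IsNot DBNull.Value AndAlso CInt(status) = 1
            End Using
        End Using
    End Function

    Sub Main()
        'Placeholder connection string; adjust server and security settings to your environment.
        Dim connectionString As String = "Data Source=localhost;Initial Catalog=msdb;Integrated Security=SSPI;"

        'The second job only starts when the first one reported success.
        If RunJobAndWait(connectionString, "CreateSnapshot") AndAlso _
           RunJobAndWait(connectionString, "DistributeSnapshot") Then
            Console.WriteLine("Snapshot created and distributed.")
        Else
            Console.WriteLine("One of the replication jobs did not report success.")
        End If
    End Sub
End Module
```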


    Quite simple, and the package is ready to use as a standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:


    Download the SSIS package here (SSIS 2008)


    In this example I only replicated 5 tables; I could create an SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically, which means your entire extract phase won't break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good.

    A disadvantage of using snapshot replication as an extraction tool is the limitation on source systems: only SQL Server or Oracle databases can act as a publisher.

    So if you plan to build an extract phase for your ETL process that involves a lot of tables, think about replication services. It would save you a lot of time, and thanks to the Extract SSIS package I’ve created it fits perfectly in your usual SSIS ETL process.

  • Replication Services in a BI environment

    In this blog post I will explain the principles of SQL Server Replication Services without too much detail and I will take a look at the BI capabilities that Replication Services could offer in my opinion.

    SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformation; they do not offer capabilities to transform data or apply business rules, like ETL tools do.
    The only “transformations” Replication Services offers are filters that leave records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE clauses like this:
    SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60

    There are three types of replication:

    Transactional Replication

    Transactional replication components and data flow

    This type replicates data on a transactional level. The Log Reader Agent reads the transaction log of the source database (Publisher) directly and clones the transactions to the Distribution Database (Distributor). This database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber.
    The Distribution Agent can either run at scheduled intervals or continuously which offers near real-time replication of data!

    So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same.

    Snapshot Replication

    Snapshot replication components and data flow
    This type of replication makes an initial copy of the database objects that need to be replicated; this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication needs an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency.

    The Snapshot Agent copies the schemas of the tables that will be replicated to files that are stored in the Snapshot Folder, which is a normal folder on the file system. When all the schemas are ready, the data itself is copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber; if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature: when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).

    Merge Replication
    Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.

    Replication Services in a BI environment
    Transactional Replication can be very useful in BI environments. In my opinion you never want users to run custom (SSRS) reports or PowerPivot solutions directly on your production database: it can slow down the system and can cause deadlocks in the database, which can cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.

    Snapshot Replication can also be useful in BI environments. If you don’t need a near real-time copy of the database, you can choose to use this form of replication. Besides being an alternative for Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and if it really works as well for this purpose as I expect, but I want to try this out soon!

    I got a question regarding the supported Replication Services features in the different editions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!

  • SSIS Denali as part of “Enterprise Information Management”

    When watching the SQL PASS session “What’s Coming Next in SSIS?” by Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

    Why is SSIS thought of to be BI, when we use it so frequently for other sorts of data problems?

    Steve's answer was that he breaks the world of data work into three parts:

    • Process of inputs

    • BI
    • Enterprise Information Management
      All the work you have to do when you have a lot of data to make it useful and clean and get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.

    Next, Steve said that Microsoft is developing SSIS as part of a large push in all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.

    I'm interested in the different ways people use SSIS. I've basically used it for ETL, data migrations and processing inputs. In which ways did you use SSIS?

  • Analysis Services Roadmap for SQL Server “Denali” and Beyond

    Last week Microsoft announced the “BI Semantic Model” (BISM). I wrote a blog post about this and now the Analysis Services team wrote an article named: Analysis Services – Roadmap for SQL Server “Denali” and Beyond.


  • Will SSAS, Cubes and MDX be abandoned because of the BI Semantic Model?

    At the PASS Summit that is happening in Seattle at the moment Microsoft announced the “BI Semantic Model” (BISM).

    It looks like BISM is something like the UDM that we now know from SSAS. While the UDM was the bridge from relational data to multidimensional data, BISM is the bridge from relational data to the column-based Vertipaq engine. Some compare BISM to Business Objects universes.

    The next version of SSAS will be able to either run in the old “UDM” mode or in “BISM” mode, a combination is not possible. Of course this will have some radical consequences, because there are a few major differences between the two modes:

    • The switch from multidimensional cubes to the in-memory Vertipaq engine
    • The switch from MDX to DAX

    So multidimensional cubes and MDX will be deprecated? No, not really. SSAS as we know it now will remain a product in the future and will remain supported. But it looks like Microsoft will concentrate on BISM, mainly because multidimensional cubes and MDX are very difficult to learn. Microsoft wants to make BI more approachable and less difficult, just like with Self Service BI.
    I would say that it’s really time to start learning PowerPivot and DAX right now, if you have not already started. If Microsoft focuses on the new BISM/Vertipaq technology, that will be the future if you ask me.

    Chris Webb wrote an interesting article about BISM and it looks like he is not very enthusiastic about the strategy Microsoft takes here because this could be the end of SSAS cubes within a few years: “while it’s not true to say that Analysis Services cubes as we know them today and MDX are dead, they have a terminal illness. I’d give them two, maybe three more releases before they’re properly dead, based on the roadmap that was announced yesterday.”

    What’s also very interesting is the comprehensive comment on this article from Amir Netz. He explains BISM and UDM will live together in Analysis Services in the future and MOLAP is here to stay: “Make no mistake about it – MOLAP is still the bread and butter basis of SSAS, now and for a very long time. MDX is mature, functional and will stay with us forever.”

    Read the article from Chris Webb here and make sure you don’t miss the comment from Amir!

  • SQL Server code-named 'Denali' - Community Technology Preview 1 (CTP1)

    SQL Server Denali (SQL Server 2011) CTP1 has been released!

    Download it here

    SQL 2011 is expected to be ready in the third quarter of 2011! I’ve already blogged about a few new SSIS features here

    I will keep you posted!

  • SQL Azure Reporting is announced!


    With SQL Azure Reporting Services you can use SSRS as a service on the Azure platform with all the benefits of Azure and most of the features and capabilities of the on-premises version. It’s also possible to embed your reports in your Windows or Azure applications.

    Benefits of the Azure platform for Azure Reporting Services are:

    • Highly available, the cloud services platform has built-in high availability and fault tolerance
    • Scalable, the cloud services platform automatically scales up and down
    • Secure, your reports and SQL Azure databases are in a safe place in the cloud
    • Cost effective, you don’t have to set up servers and you don’t have to invest in managing servers
    • Use the same tools you use today to develop your solutions. Just develop your reports in BIDS or Report Builder and deploy to Azure

    Disadvantages are:

    • SQL Azure databases are the only supported data sources in the first version, more data sources are expected to come
    • No developer extensibility in the first version, so no custom data sources, assemblies, report items or authentication
    • No subscriptions or scheduled delivery
    • No Windows Authentication, only SQL Azure username/password is supported in the first version, similar to SQL Azure database. When SQL Azure database gets Windows Authentication, Azure Reporting will follow

    Despite the disadvantages of the first version I think SQL Azure Reporting Services offers great capabilities and can be extremely useful for a lot of organizations.
    I’m really curious about the CTP, which will be available before the end of this year. You can sign up for the SQL Azure Reporting CTP here

    Read more about SQL Azure Reporting here

  • MCITP – I passed the 70-455 “Upgrade: Transition Your MCITP SQL Server 2005 BI Developer to MCITP SQL Server 2008 BI Developer” exam!

    Recently I passed the 70-455 exam. This exam upgrades your SQL 2005 MCTS and MCITP certifications to SQL 2008.


    The exam contains 2 sections (basically separate exams), each with 25 questions:
    - A part which covers exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
    - A part which covers exam 70-452: PRO: Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008

    You need to pass both sections with a score of at least 700. If you fail one section, you fail the entire exam.


    How did I study

    I searched the internet and the conclusion was that there is no preparation material available for the 70-452 exam, but fortunately there was a self-paced training kit for the 70-448 exam, which also covers this exam. So I bought the book, scanned it for subjects that needed attention and fortunately that was enough for me to pass the exam.

    For the entire list of preparation materials for the 70-448 and 70-452 exams follow the links below:

    70-448 preparation materials

    70-452 preparation materials 


    My Current Transcript


  • The next version of SSIS is coming!

    The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow and the cached lookup were the most thrilling new features. The release of SQL 2008 R2 only gave us the ability to use a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

    Fortunately Matt Masson from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

    - Undo/Redo support. Finally, this should have been added a long time ago ;-)

    - Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!

    - Data flow sequence container

    - New icons and rounded corners for tasks and transformations

    - Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow that it needs more data on the other input

    - The Toolbox window will automatically locate and show newly installed custom tasks

    I’m curious about the first CTP!

  • SSIS – Delete all files except for the most recent one

    Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name.

    Currently I work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or somebody puts a new zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

    Usage is quite simple, just copy/paste the code in your script task and create two SSIS variables:

    • SourceFolder (type String): The folder that contains the (zip) files
    • DateInFilename (type Boolean): A flag, set it to True if your filename ends with the date YYYYMMDD, set it to false if creation date of the files should be used

    Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task

    Public Sub Main()
        'Use this piece of code to loop through a set of files in a directory
        'and delete all files except for the most recent one, based on a date in the filename or on the creation date.
        'Requires "Imports System.IO" at the top of the script for DirectoryInfo and FileInfo.
        Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
        Dim mostRecentFile As String = ""
        Dim currentFileDate As Integer
        Dim mostRecentFileDate As Integer
        Dim currentFileCreationDate As Date
        Dim mostRecentFileCreationDate As Date
        Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not set to False.
        If dateInFilename Then
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
                currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (the last 12 characters are expected to be YYYYMMDD.zip)
                If currentFileDate > mostRecentFileDate Then
                    mostRecentFileDate = currentFileDate
                    mostRecentFile = fi.Name
                End If
            Next
        Else 'Date is not in filename, use creation date
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
                currentFileCreationDate = fi.CreationTime 'Get creation date of current file
                If currentFileCreationDate > mostRecentFileCreationDate Then
                    mostRecentFileCreationDate = currentFileCreationDate
                    mostRecentFile = fi.Name
                End If
            Next
        End If
        'Delete all files except the most recent one
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
            If fi.Name <> mostRecentFile Then
                fi.Delete() 'Deletes the file by its full path, regardless of a trailing backslash in SourceFolder
            End If
        Next
        Dts.TaskResult = ScriptResults.Success
    End Sub
  • SSIS - Package design pattern for loading a data warehouse - Part 2

    Since my last blog post about an SSIS package design pattern I’ve received quite some positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that there is quite some attention for this subject.

    The feedback I received was mainly about two things:
    1. Can you visualize the process or make it clearer without the whole technical story, so it's easier to understand?
    2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?

    In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.

    SSIS package design pattern for loading a data warehouse

    Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite some benefits and has been advised by Kimball since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.

    When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table in multiple packages. I have thought about a design pattern with the benefits of a single modular package approach and still having all the flexibility and debugging functionalities developers need.

    If you have a little bit of programming knowledge you must have heard about classes and functions. Now think about your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be some nice functionality to have, but unfortunately this is not possible within SSIS by default.
    To realize this functionality in SSIS I thought about SSIS Sequence Containers as functions and SSIS packages as classes.
    I personally always use four Sequence Containers in my SSIS packages:
    - SEQ Extract (extract the necessary source tables to a staging database)
    - SEQ Transform (transform these source tables to a dimension or fact table)
    - SEQ Load (load this table into the data warehouse)
    - SEQ Process (process the data warehouse table to the cube)

    The technical trick that I performed - you can read about the inner workings in my previous post - makes it possible to execute only a single Sequence Container within a package, just like calling a single function of a class in program code.
    The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

    1 - [Execute Package Task] DimCustomer.Extract
    2 - [Execute Package Task] DimCustomer.Transform
    3 - [Execute Package Task] DimCustomer.Load
    4 - [Execute Package Task] DimCustomer.Process

    The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.
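    To make the class/function analogy concrete, here is a minimal Python sketch. It is not SSIS code and not the template package from my previous post; the class, the method names and the `only` parameter are assumptions made purely for illustration. The package is modelled as a class, each Sequence Container as a method, and the master package calls the same package four times, one phase per call.

```python
# Hypothetical model: an SSIS package as a class, its Sequence Containers as methods.
PHASES = ("Extract", "Transform", "Load", "Process")


class DimCustomer:
    """Stand-in for one dim/fact package; each method mimics a Sequence Container."""

    def __init__(self):
        self.log = []

    def extract(self):
        self.log.append("DimCustomer.Extract")

    def transform(self):
        self.log.append("DimCustomer.Transform")

    def load(self):
        self.log.append("DimCustomer.Load")

    def process(self):
        self.log.append("DimCustomer.Process")

    def execute(self, only=None):
        """Run every phase, or just the one named in `only`."""
        for phase in PHASES:
            if only is None or phase == only:
                getattr(self, phase.lower())()


def run_master(package):
    """Mimic the master package: four calls, one phase each."""
    for phase in PHASES:
        package.execute(only=phase)
    return package.log


master_log = run_master(DimCustomer())
print(master_log)
```

    Calling `execute()` without arguments corresponds to running the package standalone: all four phases run in order.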

    If we look at this in a UML sequence diagram we see the following:

    I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.

    How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

    One of the questions that came up when using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times, which is, of course, undesirable.

    Coincidentally I was reading the book “SQL Server 2008 Integration Services: Problem – Design - Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages would have a very simple data flow from the source table to the destination table within the staging area.
    Of course this approach takes more time to build than one big extract package with all table extracts in it, but fortunately it also gives you some benefits:
    - Debugging: sometimes a source changes, e.g. a column is renamed or dropped. The error that SSIS logs when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package fails and needs to be edited, while the others can still execute and remain unharmed.
    - Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).

    I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
    - one with the dimension and fact table packages
    - one with only the extract packages
    I named the extract packages using the convention Source_Table.dtsx and deployed them to a separate SSIS folder. This way the extract packages won’t clutter the overview during development.
    A tip here is to use BIDS Helper; it has a great feature for deploying one or more packages from BIDS.

    Merging this approach into the design pattern gives the following result:
    - The Extract Sequence Containers of the dimension and fact table packages no longer contain Data Flow Tasks; instead they contain Execute Package Tasks that point to the extract packages.
    - The Extract Sequence Container of the master package executes all the necessary extract packages at once.

    This way a single source table will always get extracted only one time when executing your ETL from the master package and you still have the possibility to unit test your entire dimension or fact table packages.
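    The idea of extracting each source table only once from the master package can be sketched as follows. This is a plain Python illustration, not SSIS; the dependency map and the package names in it are made up for the example and only follow the Source_Table naming convention mentioned above.

```python
# Hypothetical source-table dependencies per dim/fact package (names are illustrative).
DEPENDENCIES = {
    "DimCustomer": ["Source_Customer", "Source_Address"],
    "FactSales": ["Source_Order", "Source_Customer"],
}


def extract_packages_for_master(dependencies):
    """Return each extract package once, in first-seen order."""
    seen = []
    for tables in dependencies.values():
        for table in tables:
            if table not in seen:
                seen.append(table)
    return seen


def extract_packages_for_single(dependencies, package):
    """A standalone dim/fact run still extracts everything it needs."""
    return list(dependencies[package])


master_extracts = extract_packages_for_master(DEPENDENCIES)
single_extracts = extract_packages_for_single(DEPENDENCIES, "FactSales")
print(master_extracts)
print(single_extracts)
```

    In this sketch Source_Customer is needed by both packages, but the master run lists it only once, while a standalone run of FactSales still gets both of its source tables.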
    Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):

    And like this with a run of a single Dimension package:

    Overall, the design pattern will now always look like this when executed from a master package:

    I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

  • SSIS – Package design pattern for loading a data warehouse

    I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

    These developers even created multiple packages per single dimension/fact table:

    • One extract package where the extract(E) logic of all dim/fact tables is stored
    • One dim/fact package with the transform(T) logic of a single dim/fact table
    • One dim/fact package with the load(L) logic of a single dim/fact table

    I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
    I asked them why they chose this solution and there were multiple reasons:

    • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
    • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

    To me these are good reasons. Running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
    Keeping the loading window on the source system as short as possible is critical in some projects.

    Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

    • All the logic is in one place
    • It is easier to understand
    • It enables unit testing
    • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
    • You can see your packages as separate ETL “puzzle pieces” that are reusable
    • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
    • It keeps the overview in BIDS clear; having an enormous number of packages does not make it clearer ;-)
    • Simplifies deployment after changes have been made
    • Changes are easier to track in source control systems
    • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

    So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.


    The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 

    Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

    To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:



    Each of these configurations is set on the Disable property of one of the Sequence Containers:

    This technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
    Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package, without the need to change anything.
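    The effect of the Parent package variable configurations on the Disable properties can be mimicked in a few lines of Python. This is only a sketch of the logic, not something SSIS executes. The names vDisableExtract and vDisableTransform appear later in this post; vDisableLoad and vDisableProcess are assumed to follow the same naming.

```python
PHASES = ("Extract", "Transform", "Load", "Process")


def run_child(parent_variables=None):
    """Run a child package; return the phases whose container was enabled.

    `parent_variables` mimics the Parent package variable configurations,
    e.g. {"vDisableExtract": False, "vDisableTransform": True, ...}.
    Variable names other than vDisableExtract / vDisableTransform are assumed.
    """
    # Default Disable = False for every container, as in a standalone run.
    disable = {phase: False for phase in PHASES}
    for phase in PHASES:
        name = "vDisable" + phase
        if parent_variables and name in parent_variables:
            disable[phase] = parent_variables[name]
    return [phase for phase in PHASES if not disable[phase]]


standalone = run_child()
extract_only = run_child({
    "vDisableExtract": False,
    "vDisableTransform": True,
    "vDisableLoad": True,
    "vDisableProcess": True,
})
print(standalone)
print(extract_only)
```

    Without parent variables every container keeps Disable = False and the whole package runs; with them, only the containers that are left enabled run.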

    Ok, so far, so good. But how do I execute only one phase of all the dimension and fact packages simultaneously? Well, it is quite simple:

    First add 4 Sequence Containers to the master package, one for each phase of the ETL, just like in the child packages.

    Then add Execute Package Tasks for all your packages in every Sequence Container.


    If you would execute this master package now, every child package would run 4 times as there are 4 Execute Package Tasks that run the same package in every sequence container.
    To get the required functionality I have created 4 variables inside each Sequence Container (Scope). These will be used as parent variables to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

    Variables for the EXTRACT Sequence Container (vDisableExtract False):

    Variables for the TRANSFORM Sequence Container (vDisableTransform False):

    The variables of the LOAD and PROCESS Sequence Containers are based on the same technique.
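    As a rough illustration of the 4 x 4 variable setup, the sketch below builds all 16 values in Python. It assumes that within each scope only the variable of the matching phase is False and the other three are True, and that the LOAD and PROCESS variables are named vDisableLoad and vDisableProcess; both are assumptions based on the pattern described above.

```python
PHASES = ("Extract", "Transform", "Load", "Process")


def master_variables():
    """Build the 4 x 4 variable matrix: one scope per master Sequence Container."""
    return {
        scope: {"vDisable" + phase: phase != scope for phase in PHASES}
        for scope in PHASES
    }


variables = master_variables()
total = sum(len(scoped) for scoped in variables.values())
for scope, scoped in variables.items():
    print(scope, scoped)
print(total)
```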


    Run all phases of a standalone package: Just execute the package:

    Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired Sequence Container in the master package:


    Run a single phase of a single package from the master package:

    Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:


    Run all the child packages in the right order from the master package:
    When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 

    When pressing Continue the package completes: 


    This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

    With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

    Of course having a single package per dimension or fact table will not be the right choice in all cases but I think it is a good standard approach.
    The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases, which is fine, you can still use the same technique.

    Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!

    Download for SSIS 2008

    Download for SSIS 2005

    If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!

    ATTENTION: See Part-2 on this subject for more background information!


    How to: Use the Values of Parent Variables in a Child Package:

  • SSIS - Blowing-out the grain of your fact table

    Recently I had to create a fact table with a lower grain than the source database. My source database contained order lines with a start and end date and monthly revenue amounts.

    To create reports that showed overall monthly revenue per year, lowering the grain was necessary. Because the lines contained revenue per month I decided to blow out the grain of my fact table to monthly records for all the order lines of the source database. For example, an order line with a start date of 1 January 2009 and an end date of 31 December 2009 should result in 12 order lines in the fact table, one line for each month.

    To achieve this result I exploded the source records against my DimDate. I used a standard DimDate:

    The query below did the job; use it in an SSIS source component and it will explode the order lines to a monthly grain:

    Code Snippet
    SELECT OL.LineId
          ,DD.ActualDate
          ,OL.StartDate
          ,OL.EndDate
      FROM OrderLine OL
      INNER JOIN DimDate DD
          -- DD.Month and both bounds are integers in YYYYMM format
          ON DD.Month
          BETWEEN
          (YEAR(OL.StartDate)*100+MONTH(OL.StartDate))
          AND
          (YEAR(OL.EndDate)*100+MONTH(OL.EndDate))
      -- keep one DimDate row per month: the first day
      WHERE DD.DayOfMonth = 1


    Some explanation of this query:

    · I always want to connect a record to the first day of the month in DimDate, that’s why this WHERE clause is used:

    Code Snippet
    WHERE DD.DayOfMonth = 1

    · Because I want to join on the month (format: YYYYMM) of DimDate, I need to format the start and end date in the same way (YYYYMM):

    Code Snippet
    (YEAR(OL.StartDate)*100+MONTH(OL.StartDate))
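
    To check the logic of the join outside the database, the same blow-out can be sketched in Python. This is only an illustration of what the query does, not a replacement for it; the function names are made up, and instead of joining to DimDate it simply generates the first day of every month between the start and end date.

```python
from datetime import date


def month_key(d):
    """Format a date as the integer YYYYMM, like YEAR(d)*100 + MONTH(d)."""
    return d.year * 100 + d.month


def explode_to_months(line_id, start, end):
    """Return one (line_id, first_day_of_month) row per month in [start, end]."""
    rows = []
    year, month = start.year, start.month
    while year * 100 + month <= month_key(end):
        rows.append((line_id, date(year, month, 1)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return rows


rows = explode_to_months(1, date(2009, 1, 1), date(2009, 12, 31))
print(len(rows))
print(rows[0], rows[-1])
```

    For the example order line running from 1 January 2009 to 31 December 2009 this gives 12 rows, one for each month.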

    The source, order lines with a start and end date:

    The Result, monthly order lines:

  • SSIS - Let the Excel connection manager pick the right column data types from an Excel source

    The Excel connection manager scans only the first 8 rows to determine the data type of a column in your SSIS source component. So if an Excel sheet column has integers in the first 8 rows and a string value in the 9th row, your Data Flow Task will fail when executed because SSIS expects integers.

    Fortunately you can change the number of rows that Excel will scan with the TypeGuessRows registry value.

    Change TypeGuessRows:

    1. Start Registry Editor by typing "regedit" in the run bar of the Start menu.

    2. Search the registry (CTRL+F) for "TypeGuessRows".

    3. Double click "TypeGuessRows" and edit the value.

    Todd McDermid (MVP) added the following useful comment:
    "Unfortunately, that reg key only allows values from 1 to 16 - yes, you can only increase the number of rows Excel will "sample" to 16."

    Robbert Visscher commented:
    "The reg key also allows the value 0. When this value is set, the excel connection manager scans every row to determine the data type for a column in your SSIS source component."

    Thanks Robbert, I think setting it to 0 can be very powerful in some scenarios!

    So the conclusion of the comments of Todd and Robbert is that a value from 0 to 16 is possible:
    • TypeGuessRows 0: All rows will be scanned. This might hurt performance, so only use it when necessary.
    • TypeGuessRows 1-16: The number of rows that will be scanned. This is the normal range for this reg key; use a value in this range in normal scenarios.
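
    The effect of TypeGuessRows can be imitated with a small Python sketch. It is a deliberately simplified model (the sample is "int" only if every sampled value is an integer) and not the actual logic of the Excel driver; the function name and its parameter are made up for the example.

```python
def guess_column_type(values, type_guess_rows=8):
    """Guess 'int' or 'str' from a sample of the column, Excel-driver style.

    type_guess_rows = 0 means "scan every row"; otherwise only the first
    N rows are sampled. This is a simplification, not the driver's real logic.
    """
    sample = values if type_guess_rows == 0 else values[:type_guess_rows]
    return "int" if all(isinstance(v, int) for v in sample) else "str"


column = [1, 2, 3, 4, 5, 6, 7, 8, "n/a"]
default_guess = guess_column_type(column)       # samples the first 8 rows
full_scan_guess = guess_column_type(column, 0)  # samples all 9 rows
print(default_guess, full_scan_guess)
```

    With the default of 8 rows the string in row 9 is never seen, which is exactly the situation that makes the data flow fail; with 0 every row is sampled and the string is detected.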