
Jorg Klein

Microsoft Data & Analytics consultant and Microsoft Data Platform MVP from the Netherlands

  • SSIS Denali CTP3 – What’s new?

    Last week Microsoft released CTP3 of SQL Server Integration Services (SSIS), code name: Denali. In this blog post I will look into the new key features and some of the minor improvements in SSIS Denali.

    1. Development

    Shared Connection Managers
    Connection Managers can now be shared at SSIS project level.
    You can create them in the Solution Explorer, in the Connection Managers folder:
    clip_image001

    Once created, they will appear automatically in all your SSIS packages. Their names are shown in bold so you can distinguish them from your package-level connection managers:
    clip_image002

    You can also create shared Cache Connection Managers for your cached lookups. This will be very useful for lookups that are performed multiple times, for example when you look up dimension tables multiple times from your fact table ETL. You can convert a local cache connection manager simply by right clicking it:
    clip_image003

    What happened to the Shared Data Sources we knew from prior SSIS versions? Well, they are no more. Shared Data Sources only existed at design time, not at runtime, so they were not really useful anyway. Shared Connection Managers do live at runtime and offer even more possibilities, as I will show later on, so they replace the old Shared Data Sources.

    Data Flow - Column mappings
    SSIS has always mapped columns from sources to transformations or destinations with the help of lineage ids. Every column had a unique metadata ID that was known by all components in the data flow. If something changed in the source, this would break the lineage ids and raise error messages like: The external metadata column collection is out of synchronization with the data source columns.
    To fix this error you would re-map all broken lineage ids with the “Restore Invalid Column References Editor”.
    In Denali, lineage ids are no longer used. Mappings are done on column names, which is great because you can now use auto map on column names and even copy/paste pieces of another data flow and connect them by mapping the corresponding column names.

    Data Flow - Flexible order of authoring
    This improvement helps you edit data flow components even when they don’t have an input attached. Theoretically you can build your data flow backwards; start with the destination and track back to the source.

    Data Flow - Groupings
    You can now group data flow components. Select the components you wish to group, right click and select Group:
    clip_image004 


    The result is some sort of a data flow sequence container:
    clip_image005

    By clicking the arrow it will collapse:
    clip_image006

    Data flow groups are 100% eye candy; you can’t set any properties on them.

    Data Flow - Data Quality Services Cleansing transformation
    With this transformation you can apply data quality rules in the data flow. This is done by using a Knowledge Base, which you can create yourself or download from the Windows Azure Marketplace. For example, you could apply a rule that checks whether a given postal code column is valid for a particular town column in your record.

    Data Flow - Data Tap
    In Denali we have the possibility to attach a “tap” to a data flow path (arrow). This tap captures all data coming through and dumps it in CSV files.
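
    In the released SSIS catalog this shows up as a stored procedure; the sketch below is a hedged example of adding a data tap to an execution from T-SQL (the folder, project, package, task and path names are made up, and the exact procedure may differ in CTP3):

    DECLARE @execution_id BIGINT, @data_tap_id BIGINT;

    -- Create an execution for a deployed package (names are hypothetical).
    EXEC SSISDB.catalog.create_execution
         @folder_name  = N'Test',
         @project_name = N'Test',
         @package_name = N'Package.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- Tap a data flow path and dump the rows that pass through it to a CSV file.
    EXEC SSISDB.catalog.add_data_tap
         @execution_id            = @execution_id,
         @task_package_path       = N'\Package\Data Flow Task',
         @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
         @data_filename           = N'DataTapOutput.csv',
         @data_tap_id             = @data_tap_id OUTPUT;

    EXEC SSISDB.catalog.start_execution @execution_id;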

    2. Configuration

    SSIS Denali is fundamentally different from its predecessors when it comes to the concept and usage of configurations. SSIS package configurations are obsolete (they will still be available if you really want to use them ;-)) and parameters and environments are the new kids in town.

    Parameters
    SSIS Parameters look a lot like SSIS variables but there are some differences. There are two types of these parameters:
    1. Package Parameters:
    Look at Package Parameters as C# parameters, which are passed as input to a C# function (= your package). You can set them when executing (calling) a package, and the lifecycle of the parameters is limited to the SSIS package itself.

    The difference with SSIS variables? You can set the parameters while executing a package with SSMS or an Execute Package Task.
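
    Setting a parameter at execution time from SSMS boils down to T-SQL against the SSIS catalog; a hedged sketch based on the released catalog procedures (folder, project, package and parameter names are made up):

    DECLARE @execution_id BIGINT;

    EXEC SSISDB.catalog.create_execution
         @folder_name  = N'Test',
         @project_name = N'Test',
         @package_name = N'Package.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- object_type 30 = package parameter, 20 = project parameter.
    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id    = @execution_id,
         @object_type     = 30,
         @parameter_name  = N'LoadDate',        -- hypothetical package parameter
         @parameter_value = N'2011-07-15';

    EXEC SSISDB.catalog.start_execution @execution_id;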

    Define your Package Parameters at the Package Parameters tab:
    clip_image007

    2. Project Parameters:
    Identical to Package Parameters except for the scope: these parameters exist at project level and can be referenced throughout all SSIS packages in your project, while Package Parameters only live inside a single SSIS package.

    You can define Project Parameters in the solution explorer within your SSIS project:
    clip_image008

    Both Package and Project Parameters can be referenced from your SSIS packages; you recognize them by their prefix, $Package or $Project:
    clip_image009

    Setting parameters in an Execute Package Task is achieved with the new Parameter bindings tab:
    clip_image010

    It’s also possible to parameterize SSIS tasks on the Control Flow by right clicking them and choosing Parameterize:
    clip_image011

    Loads of capabilities here! I now realize that I have created a workaround for Package Parameters with my SSIS Package design pattern for loading a data warehouse where I (mis)used Package Configurations as Package Parameters. Creating a new package design pattern for Denali definitely goes on my TODO list!

    Environments
    Environments are collections of SSIS package settings that can be defined on the SSIS Server. At runtime, the selected environment overrides these settings in the SSIS packages. You can create multiple environments, and when you run your SSIS packages you can select which environment they should use. It’s also possible to let multiple SSIS projects run under one environment, so there is plenty of flexibility.

    To make the principle of Environments clear right away, I have created a simple example that you will commonly use: two Environments, one with development settings and one with production settings.

    I have deployed a Test SSIS project to SSMS which contains one SSIS package with one Shared Connection Manager. Notice the new Integration Services folder structure:
    clip_image012

    Next right click Environments and choose Create Environment:
    clip_image013

    Type Development as name for the Environment and click OK:
    clip_image014

    Now double click the Development Environment:
    clip_image015

    Click on the Variables tab and create an Environment Variable that will override the Shared Connection Manager. Type in the connection string for the development server as the Value and click OK:
    clip_image017

    Next create another Environment with the name Production and also create an Environment Variable with the name SCM_EnvironmentVar. The only difference between these two variables should be the value of the variable; it contains the production server connection string instead of the development value.
    You now have two Environments with one Environment Variable for the Shared Connection Manager each:
    clip_image018

    We now need to reference the Development and Production Environments in the Test project so they can be used. Right click the Test project and choose Configure:
    clip_image019

    Go to the references page and click Add..
    clip_image021

    Add both Environments:
    clip_image023

    Both environments are now referenced from the Test SSIS project and are available. Let’s configure the Shared Connection Manager so it will use the value specified in one of the environments. Click on the Parameters page, then the Connection Managers tab, and then the … button next to the ConnectionString property of the Shared Connection Manager:
    clip_image025

    Now select Use Environment Variable and select SCM_EnvironmentVar. Click OK:
    clip_image026

    The name of our Environment Variable is shown in the Value box of the ConnectionString property:
    clip_image028

    We have now successfully attached the Environment Variable to override the ConnectionString property with a value from either the Development or the Production Environment. Executing the package under one of both Environments is very easy. To do so, right click the package and choose Run:
    clip_image029

    The Run Package dialog box appears and you instantly see a message that reminds you to assign a value to the ConnectionString property of connection manager SharedConnectionManager:
    clip_image030

    At the bottom of the dialog box you can select the Environment your package should use. Select the .\Development Environment and click OK:
    clip_image031

    In the overview report you see that the package has successfully run under the Development Environment:
    clip_image032

    If you change the package to use the Production Environment and you run the package again, you’ll get the following execution information:
    clip_image033

    Pretty straightforward and easy to use. The big difference? All the configuration work has been done in SQL Server Management Studio, exactly the place where it should be done. In the current SSIS versions most of this work is done within Visual Studio and the configuration is done by editing tables. So unfriendly! Big improvement, great work MS!
    In this example I have set a property of a Shared Connection Manager, but of course it is also possible to set Package or Project Parameters with Environment Variables.
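
    The same Environment setup can also be scripted against the catalog with T-SQL; a hedged sketch based on the released catalog procedures (the catalog folder name and connection strings are assumptions):

    EXEC SSISDB.catalog.create_environment
         @folder_name = N'Test', @environment_name = N'Development';

    EXEC SSISDB.catalog.create_environment_variable
         @folder_name      = N'Test',
         @environment_name = N'Development',
         @variable_name    = N'SCM_EnvironmentVar',
         @data_type        = N'String',
         @sensitive        = 0,
         @value            = N'Data Source=DEVSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;';

    EXEC SSISDB.catalog.create_environment
         @folder_name = N'Test', @environment_name = N'Production';

    EXEC SSISDB.catalog.create_environment_variable
         @folder_name      = N'Test',
         @environment_name = N'Production',
         @variable_name    = N'SCM_EnvironmentVar',
         @data_type        = N'String',
         @sensitive        = 0,
         @value            = N'Data Source=PRODSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;';

    The environment references and the binding of the ConnectionString property to SCM_EnvironmentVar would still have to be added, as shown in the SSMS walkthrough above.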

    3. Deployment

    We have had the option to deploy SSIS packages from Visual Studio for a long time thanks to BIDS Helper. Microsoft finally added this functionality out of the box in Denali. They even did more: you can now also build your projects from within Visual Studio!

    There are two ways to deploy (push) your projects to the new SSIS Server: directly from Visual Studio, or by using an Integration Services Project Deployment File. Both use the new Integration Services Deployment Wizard.
    Deployments directly from Visual Studio can be done by right clicking your project and choosing Deploy. The Deployment Wizard will then pop straight up:
    clip_image034

    The other way is to build your project first. Building was already possible in SSIS versions prior to Denali, but it did not really serve a useful purpose.

    In Denali this is different: building your project generates an .ispac file which contains your entire project and can be used by the Deployment Wizard. Again a very nice new feature.
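
    For completeness, the released catalog also lets you push an .ispac from T-SQL; a hedged sketch (the file path and folder name are assumptions, and the catalog folder must already exist):

    DECLARE @project_binary VARBINARY(MAX), @operation_id BIGINT;

    -- Read the .ispac file that was produced by the build.
    SELECT @project_binary = BulkColumn
    FROM OPENROWSET(BULK 'C:\Projects\Test\bin\Development\Test.ispac', SINGLE_BLOB) AS ispac;

    EXEC SSISDB.catalog.deploy_project
         @folder_name    = N'Test',
         @project_name   = N'Test',
         @project_stream = @project_binary,
         @operation_id   = @operation_id OUTPUT;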

    Handing over a new release of an SSIS solution to a customer’s IT department for deployment can now be done in a very nice, professional manner. Ispac files can also be opened by Visual Studio to import your SSIS project.
    Building your project can be done by right clicking your project and choosing Build:
    clip_image035

    The output window displays the following:
    clip_image036

    When looking at the bin\Development folder we see the new Test.ispac file:
    clip_image037

    clip_image038

    Double clicking this file (or choosing Deploy in Visual Studio) will start the new Integration Services Deployment Wizard:

    You can now select the project that you wish to deploy. You can either use the .ispac project deployment file or deploy an existing project located in an Integration Services catalog. The last option is useful when you wish to deploy, for example, a TEST SSIS solution that is already located in a catalog to an ACCEPTANCE environment in another catalog.
    clip_image039

    The project is loaded and validated:
    clip_image040

    Next, you need to select the destination, which has to be an SSIS Server:
    clip_image041

    Review your selections and click on Deploy to start the actual deployment:
    clip_image042

    The results show a successful deployment. Notice the protection level has been changed. What happened here?
    The wizard clears the protection level, as we know from Visual Studio, to prevent “cannot decrypt password” errors.

    Instead of using protection levels, passwords will be stored in clear text. The entire package (with the passwords in it) will be stored encrypted in the SSIS Server tables you are deploying to.
    clip_image043

    The project has now been deployed to the server:
    clip_image044

    When you right click the Test project and choose Versions, you are able to see the current version of your project:
    clip_image046

    If you deploy a newer version later on, the existing version(s) stay on the server and you can easily roll back to a previous version if you’ve made a bad deployment.
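
    In the released catalog this rollback is exposed through a stored procedure as well; a hedged sketch (folder and project names are assumptions, and the version LSN comes from the catalog.object_versions view):

    -- List the available versions of the project.
    SELECT object_version_lsn, object_name, created_time
    FROM   SSISDB.catalog.object_versions
    WHERE  object_name = N'Test'
    ORDER BY created_time DESC;

    -- Restore the version you picked from the list above (LSN value is made up).
    EXEC SSISDB.catalog.restore_project
         @folder_name        = N'Test',
         @project_name       = N'Test',
         @object_version_lsn = 12345;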

    The diagram below shows the entire deployment life cycle (source: TechNet):
    clip_image048

    4. Management

    The new SSIS Server is the central storage and administration point of your SSIS solutions. It is not a standalone server that you need to manage; basically it is a database with stored procedures.

    You now have an Integration Services node available in SSMS when you connect to the database engine:
    clip_image049

    Under the Integration Services node you will find your SSISDB catalog, which holds all your SSIS solutions with their packages, environments, etc. These objects are physically stored in a SQL Server database with the same name as the catalog:
    clip_image050

    This database also contains the stored procedures that hold all the programming code for the SSIS Server:
    clip_image051

    With the help of these stored procedures you can manage your SSIS Server: for example your parameter values and connection managers, and you can override properties by using environments.
    In addition to the configuration functionality I discussed earlier, you can also implement security to control access to the catalog, both on folder level and package level. When you right click a folder or package in the catalog and choose Properties, you’ll get the following window where you can manage security access:
    clip_image052

    Finally, you get an out-of-the-box reporting dashboard which is built on the out-of-the-box logging functionality that SSIS offers. On this dashboard you get information about the execution history of your package and its sub packages, see which parameters were used, view specific messages and get a performance report over time.
    All this information is logged automatically when any package runs on the server. The information is very detailed; you can even get the row counts between SSIS data flow transformations!
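
    In the released catalog those row counts are queryable as well; a hedged sketch against the execution data statistics view (the execution id is a made-up value, and the view is only populated for executions that ran with a detailed enough logging level):

    SELECT package_name,
           task_name,
           source_component_name,
           destination_component_name,
           rows_sent
    FROM   SSISDB.catalog.execution_data_statistics
    WHERE  execution_id = 12345
    ORDER BY created_time;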

    When running a package you are able to select a logging level on the Advanced tab:
    clip_image053
    You can choose from:
    - None: turns logging off for performance reasons
    - Basic: error and warning logging
    - Performance: detailed trace information
    - Verbose: diagnostics and fault debugging
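
    When you start an execution from T-SQL instead of the Run dialog, the logging level is (in the released catalog) just a system parameter on the execution; a hedged sketch that repeats the execution pattern shown earlier (names are made up):

    DECLARE @execution_id BIGINT;

    EXEC SSISDB.catalog.create_execution
         @folder_name  = N'Test',
         @project_name = N'Test',
         @package_name = N'Package.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- object_type 50 = system parameter; 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose.
    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id    = @execution_id,
         @object_type     = 50,
         @parameter_name  = N'LOGGING_LEVEL',
         @parameter_value = 3;

    EXEC SSISDB.catalog.start_execution @execution_id;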

    When I run my Test package I’ll get the following execution dashboard. There are some hyperlinks that navigate to more detailed reports.
    clip_image055

    So there is no need to create your own logging framework anymore; it’s all out of the box!

  • SSIS - Connect to Oracle on a 64-bit machine (Updated for SSIS 2008 R2)

    We recently had a few customers where a connection to Oracle on a 64-bit machine was necessary. A quick search on the internet showed that this could be a big problem; I found all kinds of blog and forum posts from developers complaining about it. A lot of developers will recognize the following error message:

    Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation.
    Provider is unable to function until these components are installed.


    After a lot of searching, trying and debugging I think I found the right way to do it!

    Problems

    Because BIDS is a 32-bit application, on both 32-bit and 64-bit machines, it cannot see the 64-bit driver for Oracle. Because of this, connecting to Oracle from BIDS on a 64-bit machine will never work when you install only the 64-bit Oracle client.

    Another problem is the "Microsoft Provider for Oracle": this driver only exists in a 32-bit version and Microsoft has no plans to create a 64-bit one in the near future.

    The last problem I know of is in the Oracle client itself: it seems that a connection will never work with the instant client, so always use the full client.
    There are also a lot of problems with the 10G client, one of them being the fact that this driver can't handle the "(x86)" in the path of SQL Server. So using the 10G client is not an option!

    Solution

    • Download the Oracle 11G full client.
    • Install the 32-bit AND the 64-bit version of the 11G full client (Installation Type: Administrator) and reboot the server afterwards. The 32-bit version is needed for development from BIDS, which is 32-bit; the 64-bit version is needed for production with the SQL Agent, which is 64-bit.
    • Configure the Oracle clients (both 32-bit and 64-bit) by editing the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do it.
    • Use the "Oracle provider for OLE DB" from SSIS, don't use the "Microsoft Provider for Oracle" because a 64 bit version of it does not exist.
    • Schedule your packages with the SQLAgent.

    Background information

    • Visual Studio (BI Dev Studio) is a 32-bit application.
    • SQL Server Management Studio is a 32-bit application.
    • dtexecui.exe is a 32-bit application.
    • dtexec.exe has both 32-bit and 64-bit versions.
    • There are x64 and x86 versions of the Oracle provider available.
    • SQLAgent is a 64-bit process.

    My advice to BI consultants is to get an Oracle DBA or professional for the installation and configuration of the two full clients (32-bit and 64-bit). Tell the DBA to download the biggest client available; this way you can be sure they pick the right one ;-)

    Testing if the clients have been installed and configured in the right way can be done with Windows ODBC Data Source Administrator:
    Start...
    Programs...
    Administrative tools...
    Data Sources (ODBC)


    ADDITIONAL STEPS FOR SSIS 2008 R2

    It seems that, unfortunately, some additional steps are necessary for SQL Server 2008 R2 installations:

    1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 32-bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI
    Make sure the following values are entered:

    image

    2. Next, search for the following entry (for the 64-bit driver): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI
    Make sure the same values as above are entered.

    3. Reboot your server.

  • Replication Services as ETL extraction tool

    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
    “Snapshot Replication can also be useful in BI environments; if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to being an alternative to Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that acts as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and whether it really works as well for this purpose as I expect, but I want to try this out soon!”

    Well, I have tried it out and I must say it worked well. I was able to let Replication Services do the work in a fraction of the time it would have cost me to do the same in SSIS.
    What I did was the following:

    1. Configure snapshot replication for some Adventure Works tables; this was quite simple and straightforward.
    2. Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
      This is something that you can’t do with out-of-the-box functionality. While configuring the snapshot replication, two SQL Agent jobs are created: one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous, which means that when you execute them they immediately report back whether the job started successfully or not; they do not wait for completion and report the result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

    Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

    Configure snapshot replication

    0 
    The first step is to create a publication on the database you want to replicate.
    Connect to SQL Server Management Studio, right-click Replication and choose New… Publication…

    1 
    The New Publication Wizard appears, click Next

    2
    Choose your “source” database and click Next

    3
    Choose Snapshot publication and click Next

    4 
    You can now select tables and other objects that you want to publish

    5
    Expand Tables and select the tables that are needed in your ETL process

    6
    In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

    7
    It’s possible to filter out rows and/or columns. In this example I did not apply any filters.

    8
    Schedule the Snapshot Agent to run at a desired time. By doing this, a SQL Agent job is created, which we need to execute from an SSIS package later on.

    9
    Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

    10 
    In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

    11 
    On the next screen choose to create the publication at the end of the wizard

    12
    Give the publication a name (SnapshotTest) and complete the wizard

    13 
    The publication is created and the articles (tables in this case) are added


    Now that the publication has been created successfully, it’s time to create a new subscription for it.

    14 
    Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions

    15 
    The New Subscription Wizard appears

    16 
    Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

    17 
    You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

    18
    Of course we need a database for the subscription, and fortunately the wizard can create it for you. Choose New database

    19 
    Give the database the desired name, set the desired options and click OK

    20
    You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

    21 
    You now need to set the security settings for the Distribution Agent. Click on the …. button

    22
    Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

    23 
    Click Next

    24 
    Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

    25
    Initialize the subscription at first synchronization

    26
    Select the first box to create the subscription when finishing this wizard

    27
    Complete the wizard by clicking Finish

    28
    The subscription will be created

    29
    In SSMS you see that a new database has been created: the subscriber. There are no tables or other objects in the database yet because the replication jobs have not run yet.

    Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

     30
    Rename this job to “CreateSnapshot”

    Now search for the job that distributes the snapshot:

     31
    Rename this job to “DistributeSnapshot”
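
    If you prefer not to depend on the generated job names, both jobs can also be located by their job category; a hedged T-SQL sketch (category names as replication normally assigns them):

    SELECT j.name AS job_name, c.name AS category
    FROM   msdb.dbo.sysjobs       AS j
    JOIN   msdb.dbo.syscategories AS c ON c.category_id = j.category_id
    WHERE  c.name IN (N'REPL-Snapshot', N'REPL-Distribution');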

    Create an SSIS package that executes the snapshot replication

    We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started, the package needs to wait until it is finished before the package execution completes.
    The Execute SQL Server Agent Job Task is designed to execute SQL Agent jobs from SSIS. Unfortunately this SSIS task only executes a job and reports back whether the job started successfully or not; it does not report whether the job actually completed with success or failure. This is because these jobs are asynchronous.
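
    For reference, the start-and-wait logic the package implements boils down to something like the following T-SQL; a hedged sketch (a production version would also filter on the most recent agent session):

    DECLARE @job_name SYSNAME = N'CreateSnapshot';

    -- Start the job; sp_start_job returns as soon as the job has been requested.
    EXEC msdb.dbo.sp_start_job @job_name = @job_name;

    -- Poll every 5 seconds until the job reports a stop time.
    WHILE EXISTS (SELECT 1
                  FROM msdb.dbo.sysjobactivity AS a
                  JOIN msdb.dbo.sysjobs        AS j ON j.job_id = a.job_id
                  WHERE j.name = @job_name
                    AND a.start_execution_date IS NOT NULL
                    AND a.stop_execution_date IS NULL)
    BEGIN
        WAITFOR DELAY '00:00:05';
    END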

    The SSIS package I’ve created does the following:

    1. It runs the CreateSnapshot job
    2. It checks every 5 seconds whether the job has completed, using a For Loop
    3. When the CreateSnapshot job is completed it starts the DistributeSnapshot job
    4. Again, it waits until the snapshot is delivered before the package finishes successfully

    33

    Quite simple, and the package is ready to use as a standalone extract mechanism. After executing the package, the replicated tables are added to the subscriber database and are filled with data:

     32

    Download the SSIS package here (SSIS 2008)


    Conclusion

    In this example I only replicated 5 tables; I could have created an SSIS package that does the same in approximately the same amount of time. But if I had replicated all 70+ AdventureWorks tables, I would have saved a lot of time and boring work! With Replication Services you also benefit from the fact that schema changes are applied automatically, which means your entire extract phase won’t break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so performance will be quite good.

    A disadvantage of using snapshot replication as an extraction tool is the limitation on source systems: only SQL Server or Oracle databases can act as a publisher.

    So if you plan to build an extract phase for your ETL process that involves a lot of tables, think about Replication Services; it could save you a lot of time, and thanks to the extract SSIS package I’ve created you can fit it perfectly into your usual SSIS ETL process.

  • Replication Services in a BI environment

    In this blog post I will explain the principles of SQL Server Replication Services without too much detail, and I will take a look at the BI capabilities that Replication Services could offer in my opinion.

    SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformation; they do not offer capabilities to transform data or apply business rules, like ETL tools do.
    The only “transformations” Replication Services offers are filtering records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE statements like this:
    SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60


    There are three types of replication:

    Transactional Replication

    Transactional replication components and data flow

    This type replicates data on a transactional level. The Log Reader Agent reads directly from the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor); this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber.
    The Distribution Agent can either run at scheduled intervals or continuously, which offers near real-time replication of data!

    So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same.


    Snapshot Replication

    Snapshot replication components and data flow
    This type of replication makes an initial copy of the database objects that need to be replicated; this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication needs an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency.

    The Snapshot Agent copies the schemas of the tables that will be replicated to files that will be stored in the Snapshot Folder which is a normal folder on the file system. When all the schemas are ready, the data itself will be copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber, if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature, when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).


    Merge Replication
    Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like with mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.


    Replication Services in a BI environment
    Transactional Replication can be very useful in BI environments. In my opinion you never want users to run custom (SSRS) reports or PowerPivot solutions directly on your production database; it can slow down the system and cause deadlocks in the database, which can cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.

    Snapshot Replication can also be useful in BI environments; if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to being an alternative to Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that acts as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and whether it really works as well for this purpose as I expect, but I want to try this out soon!

    Update:
    I got a question regarding the supported Replication Services features in the different editions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!

  • SSIS Denali as part of “Enterprise Information Management”

    When watching the SQL PASS session “What’s Coming Next in SSIS?” by Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

    Why is SSIS thought of as BI, when we use it so frequently for other sorts of data problems?

    Steve’s answer was that he breaks the world of data work into three parts:

    • Processing of inputs

    • BI

    • Enterprise Information Management
      All the work you have to do when you have a lot of data to make it useful and clean and get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.
      image


    Next, Steve said that Microsoft is developing SSIS as part of a large push in all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.

    I'm interested in the different ways people use SSIS; I've basically used it for ETL, data migrations and processing inputs. In which ways have you used SSIS?

  • Analysis Services Roadmap for SQL Server “Denali” and Beyond

    Last week Microsoft announced the “BI Semantic Model” (BISM). I wrote a blog post about this and now the Analysis Services team wrote an article named: Analysis Services – Roadmap for SQL Server “Denali” and Beyond.

    BiSemanticModel

  • Will SSAS, Cubes and MDX be abandoned because of the BI Semantic Model?

    At the PASS Summit that is happening in Seattle at the moment Microsoft announced the “BI Semantic Model” (BISM).

    It looks like BISM is something like the UDM that we now know from SSAS. While the UDM was the bridge between relational data and multidimensional data, BISM is the bridge between relational data and the column-based Vertipaq engine. Some compare BISM to Business Objects universes.

    The next version of SSAS will be able to run in either the old “UDM” mode or the new “BISM” mode; a combination is not possible. Of course this will have some radical consequences, because there are a few major differences between the two modes:

    • The switch from multidimensional cubes to the in-memory Vertipaq engine
    • The switch from MDX to DAX

    So will multidimensional cubes and MDX be deprecated? No, not really; SSAS as we know it now will remain a product in the future and will remain supported. But it looks like Microsoft will concentrate on BISM, mainly because multidimensional cubes and MDX are very difficult to learn. Microsoft wants to make BI more approachable and less difficult, just like with Self-Service BI.
    I would say that it’s really time to start learning PowerPivot and DAX right now, if you have not already started. If Microsoft focuses on the new BISM/Vertipaq technology, that will be the future if you ask me.

    Chris Webb wrote an interesting article about BISM and it looks like he is not very enthusiastic about the strategy Microsoft takes here because this could be the end of SSAS cubes within a few years: “while it’s not true to say that Analysis Services cubes as we know them today and MDX are dead, they have a terminal illness. I’d give them two, maybe three more releases before they’re properly dead, based on the roadmap that was announced yesterday.”

    What’s also very interesting is the comprehensive comment on this article from Amir Netz. He explains BISM and UDM will live together in Analysis Services in the future and MOLAP is here to stay: “Make no mistake about it – MOLAP is still the bread and butter basis of SSAS, now and for a very long time. MDX is mature, functional and will stay with us forever.”

    Read the article from Chris Webb here and make sure you don’t miss the comment from Amir!

  • SQL Server code-named 'Denali' - Community Technology Preview 1 (CTP1)

    SQL Server Denali (SQL Server 2011) CTP1 has been released!

    Download it here

    SQL 2011 is expected to be ready in the third quarter of 2011! I’ve already blogged about a few new SSIS features here

    I will keep you posted!

  • SQL Azure Reporting is announced!

    sql-azure-logo-lg

    With SQL Azure Reporting Services you can use SSRS as a service on the Azure platform, with all the benefits of Azure and most of the features and capabilities of the on-premise version. It’s also possible to embed your reports in your Windows or Azure applications.

    Benefits of the Azure platform for Azure Reporting Services are:

    • Highly available, the cloud services platform has built-in high availability and fault tolerance
    • Scalable, the cloud services platform automatically scales up and down
    • Secure, your reports and SQL Azure databases are in a safe place in the cloud
    • Cost effective, you don’t have to set up servers or invest in managing them
    • Use the same tools you use today to develop your solutions. Just develop your reports in BIDS or Report Builder and deploy to Azure

    Disadvantages are:

    • SQL Azure databases are the only supported data sources in the first version, more data sources are expected to come
    • No developer extensibility in the first version, so no custom data sources, assemblies, report items or authentication
    • No subscriptions or scheduled delivery
    • No Windows Authentication, only SQL Azure username/password is supported in the first version, similar to SQL Azure database. When SQL Azure database gets Windows Authentication, Azure Reporting will follow

    Despite the disadvantages of the first version I think SQL Azure Reporting Services offers great capabilities and can be extremely useful for a lot of organizations.
    I’m really curious about the CTP, which will be available before the end of this year. You can sign up for the SQL Azure Reporting CTP here

    Read more about SQL Azure Reporting here

  • MCITP – I passed the 70-455 “Upgrade: Transition Your MCITP SQL Server 2005 BI Developer to MCITP SQL Server 2008 BI Developer” exam!

    Recently I passed the 70-455 exam. This exam upgrades your SQL 2005 MCTS and MCITP certifications to SQL 2008.

    image

    The exam contains 2 sections (basically separate exams), each with 25 questions:
    - A part that covers exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
    - A part that covers exam 70-452: PRO: Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008

    You need to pass both sections with a score of at least 700. If you fail one section, you fail the entire exam.

     

    How did I study

    I searched the internet and the conclusion was that there is no preparation material available for the 70-452 exam, but fortunately there is a self-paced training kit for the 70-448 exam, which also covers this exam. So I bought the book, scanned it for subjects that needed attention, and fortunately that was enough for me to pass the exam.

    For the entire list of preparation materials for the 70-448 and 70-452 exams follow the links below:

    70-448 preparation materials

    70-452 preparation materials 

     

    My Current Transcript

    image

  • The next version of SSIS is coming!

    The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow and the cached lookup were the most thrilling new features. The release of SQL 2008 R2 only gave us the ability to use a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

    Fortunately Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

    - Undo/Redo support. Finally, this should have been added a long time ago ;-)

    - Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!

    - Data flow sequence container

    - New icons and rounded corners for tasks and transformations

    - Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow that it needs more data on the other input

    - The Toolbox window will automatically locate and show newly installed custom tasks

    I’m curious about the first CTP!

  • SSIS – Delete all files except for the most recent one

    Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip

    Currently I work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or somebody put a new zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

    Usage is quite simple: just copy/paste the code into your Script Task and create two SSIS variables:

    • SourceFolder (type String): the folder that contains the (zip) files
    • DateInFilename (type Boolean): a flag; set it to True if your filename ends with the date in YYYYMMDD format, set it to False if the creation date of the files should be used

    Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task

    'Note: this Script Task code requires "Imports System.IO" at the top of the ScriptMain class.
    Public Sub Main()
    
        'Use this piece of code to loop through a set of files in a directory
        'and delete all files except for the most recent one based on a date in the filename.
    
        'File name example:
        'DataExport_20100413.zip
    
        Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
        Dim mostRecentFile As String = ""
        Dim currentFileDate As Integer
        Dim mostRecentFileDate As Integer
        Dim currentFileCreationDate As Date
        Dim mostRecentFileCreationDate As Date
    
        Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not, set it to False.
    
    
        If dateInFilename Then
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)
    
                If currentFileDate > mostRecentFileDate Then
    
                    mostRecentFileDate = currentFileDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        Else 'Date is not in filename, use creation date
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileCreationDate = fi.CreationTime 'Get creation date of current file
    
                If currentFileCreationDate > mostRecentFileCreationDate Then
    
                    mostRecentFileCreationDate = currentFileCreationDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        End If
    
    
        'Delete all files except the most recent one
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
            If fi.Name <> mostRecentFile Then
    
                File.Delete(rootDirectory.ToString + "\" + fi.Name)
    
            End If
    
        Next
    
        Dts.TaskResult = ScriptResults.Success
    End Sub
  • SSIS - Package design pattern for loading a data warehouse - Part 2

    Since my last blog post about an SSIS package design pattern I’ve received quite a few positive reactions and some feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that there is quite some attention for this subject.

    The feedback I received was mainly about two things:
    1. Can you visualize the process or make it clearer without the whole technical story, so it's easier to understand?
    2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimension and/or fact tables?

    In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.


    SSIS package design pattern for loading a data warehouse

    Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite some benefits and has been advised by Kimball since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.

    When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table in multiple packages. I have thought about a design pattern with the benefits of a single modular package approach and still having all the flexibility and debugging functionalities developers need.

    If you have a little bit of programming knowledge you must have heard about classes and functions. Now think about your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be some nice functionality to have, but unfortunately this is not possible within SSIS by default.
    To realize this functionality in SSIS I thought about SSIS Sequence Containers as functions and SSIS packages as classes.
    I personally always use four Sequence Containers in my SSIS packages:
    - SEQ Extract (extract the necessary source tables to a staging database)
    - SEQ Transform (transform these source tables to a dimension or fact table)
    - SEQ Load (load this table into the data warehouse)
    - SEQ Process (process the data warehouse table to the cube)

    The technical trick that I performed - you can read about the inner working in my previous post - makes it possible to execute only a single Sequence Container within a package, just like with functions in classes when programming code.
    The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

    1 - [Execute Package Task] DimCustomer.Extract
    2 - [Execute Package Task] DimCustomer.Transform
    3 - [Execute Package Task] DimCustomer.Load
    4 - [Execute Package Task] DimCustomer.Process

    The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.

    If we look at this in a UML sequence diagram we see the following:
    ETL_Example1

    I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.


    How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

    One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.

    By coincidence I was reading the book “SQL Server 2008 Integration Services: Problem – Design - Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages has a very simple data flow from the source table to the destination table within the staging area.
    Of course this approach is more time consuming than using one big extract package with all table extracts in it, but fortunately it also gives you some benefits:
    - Debugging: sometimes a source has changed, e.g. a column’s name could have been changed or the column removed completely. The error that SSIS logs when this occurs points the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
    - Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).

    I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
    - one with the dimension and fact table packages
    - one with only the extract packages
    I have used the following naming convention for the extract packages: Source_Table.dtsx, and deployed them to a separate SSIS folder. This way the packages won’t clutter the overview during development.
    A tip here is to use BIDS Helper; it has great functionality to deploy one or more packages from BIDS.

    Merging this approach in the design pattern will give the following result:
    - The dimension and fact table extract Sequence Containers will no longer have data flow tasks in it but execute package tasks which point to the extract packages.
    - The Extract Sequence Container of the master package will execute all the necessary extract packages at once.

    This way a single source table will always be extracted only once when executing your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
    Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
    ETL_Example2

    And like this with a run of a single Dimension package:
    ETL_Example2_Customer

    Overall, the design pattern will now always look like this when executed from a master package:
    ETL_Overview


    Conclusion
    I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

  • SSIS – Package design pattern for loading a data warehouse

    I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

    These developers even created multiple packages per single dimension/fact table:

    • One extract package where the extract(E) logic of all dim/fact tables is stored
    • One dim/fact package with the transform(T) logic of a single dim/fact table
    • One dim/fact package with the load(L) logic of a single dim/fact table

    I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
    I asked them why they chose this solution and there were multiple reasons:

    • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
    • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

    To me these are good reasons; running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
    Keeping the loading window on the source system as short as possible is critical in some projects.

    Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

    • All the logic is in one place
    • Increase understandability
    • Perform unit testing
    • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
    • You can see your packages as separate ETL “puzzle pieces” that are reusable
    • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
    • The overview in BIDS, having an enormous amount of packages does not make it clearer ;-)
    • Simplifies deployment after changes have been made
    • Changes are easier to track in source control systems
    • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

    So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.

    Solution:

    The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 
     ChildControlFlow 

    Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

    To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:

    clip_image003

    clip_image005

    Each of these configurations is set on the Disable property of one of the Sequence Containers:
    clip_image007

    Using this technique makes it possible to run separate Sequence Containers of a child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
    Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package without the need to change anything.

    Ok, so far, so good. But, how do I execute only one phase of all the dimension and fact packages simultaneously? Well quite simple:


    First add 4 Sequence Containers to the Master package. One for each phase of the ETL, just like in the child packages


    Add Execute Package Tasks for all your packages in every Sequence Container


    clip_image008


    If you executed this master package now, every child package would run 4 times, as there are 4 Execute Package Tasks that run the same package, one in each Sequence Container.
    To get the required functionality I have created 4 variables inside each Sequence Container (scope). These are used as parent variables to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

    Variables for the EXTRACT Sequence Container (vDisableExtract False):
    clip_image009

    Variables for the TRANSFORM Sequence Container (vDisableTransform False):
    clip_image010

    The LOAD and PROCESS Sequence Containers contain variables that are based on the same technique.

    Results:

    Run all phases of a standalone package: Just execute the package:
    clip_image011

    Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:

    RunAllTransforms 

    Run a single phase of a single package from the master package:
    RunSinglePhaseOfOnePackage


    Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

    RunMultiplePhasesAndExcludeOthers

    Run all the child packages in the right order from the master package:
    When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
    RunCompleteMasterPackageBreakPoint


    When pressing Continue the package completes: 
    RunCompleteMasterPackageBreakPointCompleted


    Conclusion:

    This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

    With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

    Of course having a single package per dimension or fact table will not be the right choice in all cases but I think it is a good standard approach.
    The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases, that’s fine; you can still use the same technique.

    Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!


    Download for SSIS 2008

    Download for SSIS 2005


    If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!


    ATTENTION: See Part-2 on this subject for more background information!


    Backgrounds:

    How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx

  • SSIS - Blowing-out the grain of your fact table

    Recently I had to create a fact table with a lower grain than the source database. My source database contained order lines with a start and end date and monthly revenue amounts.

    To create reports that showed overall monthly revenue per year, lowering the grain was necessary. Because the lines contained revenue per month I decided to blow out the grain of my fact table to monthly records for all the order lines of the source database. For example, an order line with a start date of 1 January 2009 and an end date of 31 December 2009 should result in 12 order lines in the fact table, one line for each month.

    To achieve this result I exploded the source records against my DimDate. I used a standard DimDate:
    clip_image001[4]

    The query below did the job; use it in an SSIS source component and it will explode the order lines to a monthly grain:

    Code Snippet
    SELECT OL.LineId
          ,DD.ActualDate
          ,OL.StartDate
          ,OL.EndDate
      FROM OrderLine OL
      INNER JOIN DimDate DD
         ON DD.Month
            BETWEEN (YEAR(OL.StartDate)*100+MONTH(OL.StartDate))
            AND (YEAR(OL.EndDate)*100+MONTH(OL.EndDate))
      WHERE DD.DayOfMonth = 1

     


    Some explanation about this query below:

    · I always want to connect a record to the first day of the month in DimDate; that’s why this WHERE clause is used:

    Code Snippet
    WHERE DD.DayOfMonth = 1


    · Because I want to do a join on the month (format: YYYYMM) of DimDate, I need to format the start and end date in the same way (YYYYMM):

    Code Snippet
    (YEAR(OL.StartDate)*100+MONTH(OL.StartDate))
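
    A quick worked example of that computation (the date is arbitrary):

    -- A start date of 14 May 2009 produces the month key 200905.
    SELECT YEAR('20090514') * 100 + MONTH('20090514') AS MonthKey;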

    The source, order lines with a start and end date:
    clip_image002[4]


    The Result, monthly order lines:
    clip_image003[4]
