THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Eric Johnson

My Thoughts on SQL Server and Database Related Technologies.

  • The Trouble with SSIS Sorting

    Many SSIS transformations, such as the Pivot and the Data Profiling Tasks, require that the data being fed into them be sorted. Without first sorting, some transformations will throw an error and not run, while others will run but the results will not be as expected. The answer is to sort the data before it is fed into the transformation; the problem is that this isn’t as simple as you may think. The obvious and simple way to accomplish this is to use the Sort Transformation in SSIS. The Sort Transformation will complete the objective of sorting the data, but the performance might not be exactly what you would expect. The Sort Transformation does all its work in memory; SSIS packages do things in individual sets. This means your data may need to be sorted and resorted again. If you try to work with large data sets, this sorting will take a long time and use a lot of memory. The solution is to sort the data from the source using an ORDER BY clause. That alone, however, won’t get it done. You also have to tell SSIS that the data is sorted and how it is sorted. This is done in the Advanced Editor of your data source. First, set the IsSorted property of the source’s output to true. Next, set the SortKeyPosition property of each column you specified in your ORDER BY clause; 1 for first, 2 for second, and so on. A SortKeyPosition value of 0 is the default and means that the column is not used in the sort. That’s all there is to it, using this method you will be able to sort your data so the SSIS transformations are happy, and can work without hours of waiting and memory hogging.


  • Understanding the SSIS Package Protection Level

    One property of all SSIS packages that you must understand is the ProtectionLevel. This property tells SSIS how to handle sensitive information stored within your packages. Most commonly this is a password stored in a connection string. Why is this information important? If you don’t set the ProtectionLevel correctly, the package may become unusable. Other developers may be unable to open the package or the package may fail when you go to execute it. Understanding these options lets you get out in front of possible problems and will help you to fix an issue if a problem crops up. In a perfect world, you would not need to store sensitive data, but each and every environment is different. Let’s look at each of the ProtectionLevel options.

    DontSaveSensitive

    When the package is saved, sensitive values will be removed. This will result in passwords needing to be supplied to the package, through a configuration file or by the user.

    EncryptSensitiveWithUserKey

    This will encrypt all sensitive data on the package with a key based on the current user profile. This sensitive data can only be opened by the user that saved it. It another user opens the package, all sensitive information will be replaced with blanks. This is often a problem when a package is sent to another user to work on.

    EncryptSensitiveWithPassword

    Sensitive data will be saved in the package and encrypted with a supplied password. Every time the package is opened in the designer, you will need to supply the password in order to retrieve the sensitive information. If you cancel the password prompt, you will be able to open the package but all sensitive data will be replaced with blanks. This works well if a package will be edited by multiple users.

    EncryptAllWithPassword

    This works the same as EncryptSensitiveWithPassword except that the whole package will be encrypted with the supplied password. When opening the package in the designer, you will need to specify the password or you won’t be able to view any part of the package.

    EncryptAllWithUserKey

    This works the same as EncryptSensitiveWithUserKey except that the whole package will be encrypted. Only the user that created the package will be allowed to open the package.

    ServerStorage

    This option will use SQL Server database roles to encrypt information. This will only work if the package is saved to an SSIS server for execution.

    So that’s it. This option is pretty basic but it is important to understand so that you can be spared unnecessary frustration.


  • Troubleshooting a Failed Maintenance Plan

    I recently ran into an odd little problem with a Maintenance Plan that I wanted to share. I had a plan that was running the Check Database Integrity Task that suddenly started failing. Nothing about the databases had changed and if you ran a manually DBCC CHECKDB all the databases came back clean. The job history showed a meaningless, truncated message:

    Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  10:14:30 AM  Finished: 10:14:31 AM  Elapsed:  1.185 seconds.  The package execution failed.  The step failed.

    When looking at the Maintenance Plan history directly, I had an error that stated: Alter failed for Server 'ejohnsonmobile'. Really nothing at this point has provided any details into a possible fix. Next step was to run a Profiler trace to see what was really going on. When setting up a trace to look for errors, be sure to include the User Error Message and Exception events. Since this is a Maintenance Plan, which is just an SSIS Package, the trace will show you that there is a lot more going on than just the CHECKDB. When looking for errors in the trace I noticed an exception just after the following query was run:

    EXEC sys.sp_configure N'user options', 0 RECONFIGURE

    The exception this caused was:

    Error: 5808, Severity: 16, State: 1
    Ad hoc update to system catalogs is not supported.

    That’s a little odd, the query was modifying the User Options configuration value, why would we see an error about Ad Hocs updates? Here is where the problem becomes clear. At some point, the Allow Updates configuration option was set to 1. In SQL Server 2000 and older, this option allowed users to make updates directly to the system tables. In SQL Server 2005 and newer, the system tables are gone, replaced with the Resource Database and system views. This option is no longer supported in SQL Server 2005 and beyond, and though you can set Allow Updates to 1 with no error, as soon as you run RECONFIGURE, you will receive the error that ad hoc updates are not supported. Because Allow Updates had been set to 1 and my Maintenance Plan runs the RECONGIFURE statement, this error was thrown and the Maintenance Plan failed.

    All I had to do was run “sp_configure 'Allow Updates', 0” to set Allow Updates back to 0 and then everything started working again.


  • Check Out My SSIS Screen Casts

    Two I did for TechTarget:

    Using the Pivot Transformation in SQL Server Integration Services:
    http://searchsqlserver.techtarget.com/video/0,297151,sid87_gci1374920,00.html

    Using package configurations in SQL Server Integration Services:
    http://searchsqlserver.techtarget.com/video/0,297151,sid87_gci1369995,00.html


  • SSIS Package Builds

    So you have written an SSIS package and now its time to package it up and send it off to your QA department. Excellent; you need to use the build function in Business Intelligence Development Studio. On the surface this is pretty simply but there are a few things to understand. You can’t just right click your project and select Build, or even use the Build Solution menu option. Before you can do either of these things, you need to adjust your project’s build settings. To do that, right-click the project in Solution Explorer, select Properties, and then select Deployment Utility. This should have you looking at the follow dialog.

    clip_image002

    Notice that the CreateDeploymentUtility option is set to False. This must be set to true before you can build your project. The other setting of interest is DeploymentOutputPath which controls where the build will be written. Set CreateDeploymentUtility to True and click OK. Now you can right-click the project in Solution Explorer and select Build. This will write several files out to your deployment directory. The number of files will vary based on your project; in the case of my test project three files were written.

    1. Package.dtsx – The actual SSIS package that was developed. You should have one DTSX file for each package in your solution.

    2. Config.dtsConfig – This is the XML configuration file for my package. You could have several of these, or none, depending on whether your packages uses them

    3. Test.SSISDeploymentManifest – This is the Deployment Manifest file. You will have one of these per project. This file defines all the pieces of your project and is what you use to deploy the SSIS package to a server.

    Now that you have built your package, you can send all the files in your deployment directory to another person or group so that your package can be installed on an SSIS server. That process is simple; just double-click the SSISDeploymentManifest file on a machine that has the SQL Server client tools installed and this will launch the Package Installation Wizard, shown below.

    clip_image004

    This is a pretty simple wizard, but let’s quickly run through it. First you specify whether you want a File System Deployment or a SQL Server deployment. For this example I will choose SQL Server. Next you choose the location for the package, in our case the SQL Server and the path in MSDB. You can also choose to Rely on Server Storage for Encryption if you want sensitive information that was encrypted in the package to remain after the package has been deployed. Last, you must specify the directory on the server where you want dependencies files, such as configuration files, to be stored. If you have configurations, you will also be allowed to modify the configuration values during the wizard. Keep in mind these can also be modified later.

    That’s it, click Finish and the package will be installed on your SQL Server where it can be scheduled and run.


  • SSIS 2008: Looping Through Rows in a Table

    Inevitably when writing code, you have a need to write a loop in order to iterate over multiple objects. When writing code against a SQL Server, as we usually are doing in SSIS Packages, you often need to iterate over all the rows in a table. This can be done using an SSIS Foreach Loop Container, but the how is not obvious. Making this work is a two step process. You have to create a data flow to populate an SSIS variable with your table data, and then you have to configure your loop.

    First, let’s take a look at getting your table of information into an SSIS variable. You will need to add a new variable to the package with a Data Type of Object. This variable will hold a recordset that will represent your table. Next, you need to create a Data Flow, like the one shown here.

    clip_image002[1]

    The OLE DB Source will pull the data from your table or query and the recordset destination is used populate you variable. At this point the data from your table is stored in an ADO Recordset in your variable.

    Step two will be creating the Foreach Loop Container, shown below. Add the container to you package and make sure it is after the data flow you used to populate your variable. On the Collection page of the Foreach Loop Container’s properties, select Foreach ADO Enumerator as your enumerator and set your variable as the ADO Object Source. Next, make sure you choose Rows in the First Table under enumeration mode. Finally, you can move on to the Variable Mapping page and define where the values in each column are stored. This is done with a 0 based index where 0 is your first column, 1 is second column and so on. Define a variable for each column you need to read and each time the loop is executed the variables will be updated.

    clip_image004[1]

    That’s it, you can put any code you want in the Loop Container to execute it for each row in a table.


  • SQL Server 2008’s New Import and Export Wizard

    I have written a lot lately about SSIS package development, and that doesn’t apply to everyone that works with SQL Server 2008. So you might be asking, what’s does SSIS have to offer me, the production support DBA? Well, I am glad you asked. In SQL Server 2008, the Import and Export Wizard is built on the SSIS framework. In fact, at the end of the Wizard, you can save your settings into an SSIS package. If you have used the Wizard before, it will feel much the same, choose a source, choose a destination, specify your data, and go. In fact, if you don’t look close you may miss some of the new features.

    First, when you choose a destination or source, you will see the .NET Framework Data Providers in the drop downs. This means you can run the import and export wizard directly against Oracle as well as ODBC data sources.

    Next, there is a whole set of mapping files that define how data from one data source should map into destinations that are of a different type, Oracle to SQL Server or SQL Server to DB2 for example. These XML files can be found here (Assuming you installed to C:\): C:\Program Files\Microsoft SQL Server\100\DTS\MappingFiles. During the wizard, you can accept the decisions made by using these files or you can edit the data type mappings. For example, below is a snippet of the OracleToMSSql file which maps Oracle data to SQL Server data:

    <dtm:DataTypeMapping >
            <dtm:SourceDataType>
                <dtm:DataTypeName>NUMBER</dtm:DataTypeName>
            </dtm:SourceDataType>
            <dtm:DestinationDataType>
                <dtm:NumericType>
                    <dtm:DataTypeName>NUMERIC</dtm:DataTypeName>
                    <dtm:UseSourcePrecision/>
                    <dtm:UseSourceScale/>
                </dtm:NumericType>
            </dtm:DestinationDataType>
        </dtm:DataTypeMapping>   

    As you can see, NUMBER in Oracle maps to NUMERIC in SQL Server. Lastly, the wizard has been tuned to scale better. For example, if you import a large number of tables, they will be spilt across multiple data flows to reduce concurrent processing.

    In the past, the import and export wizard has been a bit of a last resort for DBAs, but now I think its time to give this old dog a second chance.


  • Working with SSIS Expressions

    In SQL Server Integration Services (SSIS) Packages, expressions are everywhere. You may have worked with Expressions in some of the transformation tasks, such as the Derive Column Transformation, but did you know they can also be used to set the properties of many of the tasks and transformations, or even used to modify the values in your SSIS Variables? Let’s look at an example. What if you have to write a lookup query that pulls data from two databases on the same SQL Server but you’re not sure what the databases will be named. Your query will run in the context of your lookup transformation, so the connection manager tied to your lookup will handle the first database, but you need to get a bit tricky for the second database. Take the following query as an example:

    SELECT ContactID, EmailAddress
    FROM Person.Employee
    JOIN GlobalContacts.dbo.Email
        ON Employee.GlobalID  = Email.GlobalID
    WHERE Email.EmailAddress IS NOT NULL
    AND Email.EmailPromotion <> 0

    The Person.Employee table will be found in the database defined by your Connection Manager, the dbo.Email table needs to be in the GlobalContacts database in order to be found; but in our case, the GlobalContacts database is often named differently, for example GlobalContactsQA and GlobalContactsVer2. So how do we dynamically build our query to handle the fluidity of this database name? Here is one possible solution: We are going to add a couple variables to our package, RefDatabase and LookupQuery.

    clip_image002

    As you can see, we have set the RefDatabase variable to “GlobalContactsQA”. This variable can be passed in from a parent package or set with an SSIS configuration. For LookupQuery, set the EvaluateAsExpression property of the LookupQuery variable to True.

    clip_image004

    Next, add the following code to the Expression property:

    "SELECT ContactID, EmailAddress
    FROM Person.Employee
    JOIN " + @[User::RefDatabase]  + ".dbo.Email
        ON Employee.GlobalID  = Email.GlobalID
    WHERE Email.EmailAddress IS NOT NULL
    AND Email.EmailPromotion <> 0"

    Be sure to include all the quotes shown in the example. This expression will use the value of the RefDatabase variable to dynamically modify the SQL query. Last but not least, you will need to change your lookup to use the SQL Command From a Variable option and point it to the LookupQuery variable. That’s all there is to it. The biggest mistake I have seen made, and I have done this myself, is setting the Expression property of the variable and forgetting to set the EvaluateAsExpression property to true. This will result in your variable having an empty value or your variable retaining the last value you hard coded.

    Remember, expressions are everywhere in SSIS and can be used to save you a lot of time. Just make sure to do a little research before you use them as there is often a way to accomplish your objective without an expression and too many expressions can make a package difficult to maintain.


  • Managing Lookup Cache in SQL Server 2008

    The Lookup transformation in SSIS has changed a lot in SQL Server 2008. One of the best new features is the ability to pre-build your cache which gives you a lot of control over what is cached and how the cached data is managed. The basic lookup offers additional cache features such as Full, Partial, or even No Cache, but the real power comes in the new Cache Transformation. The Cache Transformation uses the new Cache Connection manager to allow you to build your cache before it is needed. The connection manager and the Cache Transformation are pretty simple. For the Cache Manager Connection, you just need to specify the columns that will be stored and which columns will be indexed, as shown below. The indexed columns will be the only columns you can use as a lookup column in a Lookup Transformation. 

    clip_image002

    Once you have the connection manager configured, you just use a Cache Transformation to pump data into the cache. The Cache Transformation is very simple; you specify a Cache Connection manager and set up the mapping of the incoming data to the columns in the cache. Last but not least, you pull data from your source and send it into the Cache Transformation. I have a very simple example data flow below which is populating a lookup cache; and the source in this example uses the T-SQL shown.

    SELECT ContactID, EmailAddress
    FROM Person.contact
    WHERE EmailAddress IS NOT NULL
    AND EmailPromotion <> 0

    clip_image004

    This allows me to populate my cache with all email addresses that are not blank, where the person has not opted out of email promotion. Remember, the data flow to populate your cache can be as complex as you want. This can really allow you to pare down large sets of data to a smaller, easier to use data set that is appropriate for caching.

    Once you have populated your cache, all you have left to do is use it in a Lookup Transformation. On the first page of your Lookup Transformation properties, specify a Connection Type of Cache Manager and on the connection page, specify your Cache Connection Manager object. Everything else is just like using the Lookup with a regular data source. You will need to set up your mappings and the column (or columns) that you want to return. One other note: when using a Cache Connection Manager with a lookup, you have to select Full Cache as the cache mode. The other two are disabled and for good reason. You have taken control of your cache at this point and don’t want SSIS doing anything further.

    So there you have it, managing your own lookup cache should allow you to make your SSIS packages a little more streamlined and allow them to run a bit faster. Not to mention, you should be able to minimize the memory hogging packages that occurred when older lookups got a little out of hand.


  • Using SSIS Package Configurations

    SQL Server Integration Services (SSIS) is a very powerful tool for creating ETL Packages. Part of what makes it so powerful is its ability to use package configurations. Package configurations allow you to externally store information that the package needs so that information can be easily changed without re-writing the package itself. You can store connection strings, variable values, package passwords, isolation levels, and much more. Basically, if you can set the value in SSIS, it can probably be stored in a package configuration. This makes modifying a package, such as when it moves from QA to production, or when a production server’s name changes, much easier. Package configurations come in several flavors and which one you use will depend on your individual needs. Also, each type stores data differently, so in some cases multiple values can be saved and in others you can store only a single value. Here is a quick breakdown of the types of package configurations.

    · XML Configuration File – Multiple values can be stored in an XML file on disk

    · Environment Variable – A single value can be stored in a Windows Environment variable

    · Registry Entry – A single value can be stored in a registry key

    · Parent Package Variable – The parent package can pass a variable to the package which contains the configuration value

    · SQL Server – Multiple values can be stored in a table on an SQL Server

    Each package can have more than one configuration and I personally like the environment variable/ SQL Server combo. This allows me to store most of my values in one place, the SQL Server, and still be able to dynamically change the location of my configuration server without needing to change the package. Here is how I set up my package configurations.

    First, make sure that you have a connection set up for your SQL Server where you want to store package configuration details. Then, from your SSIS control flow, click the SSIS menu and choose Package Configurations. This opens the package configuration dialog box. From here, you must check Enable Package Configurations in order to do anything. Once you have enabled configurations, click Add and then choose Environment Variable from the Configuration Type drop down. Name it something like ConfigServer and click next. Browse to the Connection Managers folder, find your configuration server connection and select its connection string property, as shown below.

    clip_image002

    Click next, name the configuration and click finish. Next, go and create a Windows Environment variable with the same name you used in the package configuration and set its value to a valid SQL Server connection sting for your configuration server, something like:
    Data Source=CSSRV04;Initial Catalog=msdb;Provider=SQLNCLI10.1;Integrated Security=SSPI;

    This package configuration will allow your SSIS package to find the SQL Server where the rest of your configuration will be stored.

    Now add another package configuration, this time with a type of SQL Server. Specify all the connection details to the SQL Server on which you want to store your SSIS configuration, including the table to which you want the values written and click next. Now pick any and all values you want to store in SQL Server, name the configuration and click finish.

    The next time your package runs, it will use the Environment Variable to find your SQL Server and the SQL Server to find the rest of your settings. These values can be changed at any time without the package requiring any modification.


  • SSIS 2008: Data Profiling Task

    SQL Server 2008 Integration Services shipped with the wonderful new Data Profiling Task. This task, much as the name implies, will profile the data in a given table and return a ton of useful information. The task gathers statistics for candidate keys, column length, NULL data ratio, data patterns, and much more. All of the results are written out to XML. The task makes more sense when you look at it, so let’s jump in. The configuration of the task is pretty simple, you select the types of profiles that you want to run and then set the specific options for that profile. In the figure below, I have selected Column Length Distribution for all columns (*) in the HumanResources.Employee table. This profile will gather length statistics on the columns such as Minimum and Maximum Length and provide a distribution of all of the lengths found.  On the general page of the task, you also need to provide a destination, which can be a file connection or a variable.

    image

    Once you set up the task with all the data profiles you want to run, you simply execute the SSIS package and let it work it’s magic. If you wrote to a variable, you’re on you own to do something with the data within the SSIS package. On the other hand, if your wrote to a file, the SQL Server 2008 Client Tools come complete with the Data Profile Viewer which can be found at in your program menu under Microsoft SQL Server 2008 | Integration Services. This is a pretty simple application that allows you to browse the XML file created by the Data Profiling Task. The figure below shows the results for the Column Length Distribution profile I ran earlier. As you can see, there is quite a bit of information returned, and this is just one of the profiles. If you take some time and dig into all the profile types, I think you will be very impressed with the amount of information this task provides. If you’re the creative type, you can run this task and write the data to a variable which can be further queried by your SSIS package. With a little work, you could have a great tool at your disposal the next time your presented with a new database that you nothing about.

    image


  • File and Filegroup Space Details

    Here is a little SQL Script I wrote that gives you file and filegroup size details for all databases on SQL Server 2005 or 2008. This returns all the vital information I am looking for when I look at database sizes and growth such as file size, space used, free space, and the physical filename. There is also a second result set with details rolled up to the filegroup. If you were so inclined, you could capture this information on a regular basis and do some trending. Script attached.

    DECLARE @database_id int
    DECLARE @database_name sysname
    DECLARE @sql_string nvarchar(2000)
    DECLARE @file_size TABLE
        (
        [database_name] [sysname] NULL,
        [groupid] [smallint] NULL,
        [groupname] sysname NULL,
        [fileid] [smallint] NULL,
        [file_size] [decimal](12, 2) NULL,
        [space_used] [decimal](12, 2) NULL,
        [free_space] [decimal](12, 2) NULL,
        [name] [sysname] NOT NULL,
        [filename] [nvarchar](260) NOT NULL
        )

    SELECT TOP 1 @database_id = database_id
        ,@database_name = name
    FROM sys.databases
    WHERE database_id > 0
    ORDER BY database_id

    WHILE @database_name IS NOT NULL
    BEGIN

        SET @sql_string = 'USE ' + QUOTENAME(@database_name) + CHAR(10)
        SET @sql_string = @sql_string + 'SELECT
                                            DB_NAME()
                                            ,sysfilegroups.groupid
                                            ,sysfilegroups.groupname
                                            ,fileid
                                            ,convert(decimal(12,2),round(sysfiles.size/128.000,2)) as file_size
                                            ,convert(decimal(12,2),round(fileproperty(sysfiles.name,''SpaceUsed'')/128.000,2)) as space_used
                                            ,convert(decimal(12,2),round((sysfiles.size-fileproperty(sysfiles.name,''SpaceUsed''))/128.000,2)) as free_space
                                            ,sysfiles.name
                                            ,sysfiles.filename
                                        FROM sys.sysfiles
                                        LEFT OUTER JOIN sys.sysfilegroups
                                            ON sysfiles.groupid = sysfilegroups.groupid'

        INSERT INTO @file_size
            EXEC sp_executesql @sql_string   

        --Grab next database
        SET @database_name = NULL
        SELECT TOP 1 @database_id = database_id
            ,@database_name = name
        FROM sys.databases
        WHERE database_id > @database_id
        ORDER BY database_id
    END

    --File Sizes
    SELECT database_name, groupid, ISNULL(groupname,'TLOG') groupname, fileid, name, file_size, space_used, free_space, filename
    FROM @file_size

    --File Group Sizes
    SELECT database_name, groupid, ISNULL(groupname,'TLOG') groupname, SUM(file_size) as file_size, SUM(space_used) as space_used, SUM(free_space) as free_space
    FROM @file_size
    GROUP BY database_name, groupid, groupname

    Eric Johnson
    SQL Server MVP
    Co-Host CS Techcast

  • SSIS 2008 and the New Lookup

      SSIS is a pretty useful tool for designing ETL processes. One of the transformations I was disappointed with in 2005 was the lookup. I found it a little better than the lookup functionality in DTS, but not too much better. Well, Microsoft must have read my mind because SQL Server 2008 has a new shiny Lookup transformation that is, in my opinion, much improved.

      In 2005, the lookup had cache options, but they were really heavy handed. And you could deal with a lookup not finding a match by using the error output, but then how would you know the difference between a non-match and a real error? Let's start with the cache. There are now three cache options and two choices for your cache source.

      clip_image001

      • Full Cache - This will load the entire reference dataset into memory before the first lookup is performed. This can be very efficient for small tables, but think what would happen if your lookup table was a few GB in size….that's a lot to load into cache.
      • Partial Cache - This is probably what most people think of when they conceptualize lookup cache in their head. At first, the cache is empty and each time a row matches or doesn't match for that matter, the row, or the fact that the lookup didn't find a row, is cached. Subsequent rows with the same lookup fields will find their data, or lack thereof, in cache.
      • No Cache - The lookup will generate the reference dataset each time the transformation runs.

      As for data sources for your cache, you can still go to the database, but now you can also set up your own custom cache connection. This gives you the ability to build your cache in a separate step and then reference it with your lookup. This is a little involved so I won't go into more detail here, but keep your eyes open for another blog entry... ;) Furthermore, if you use the partial cache, you can manage your cache size on the Advanced page of the Lookup's properties.

      So I like the changes to caching, but I like the output options even more. You can now specify that rows with no match be redirected to their own output, or be ignored all together. This now gives you three outputs to work with; one for a match, one for no match, and another for errors. I put together a small sample package and added a screen shot below, you can see that I am using all three outputs. In this case, when an e-mail address is not found in the lookup, I use a derived column to put "NA" into the email address column before I load. I then union this data with the rows that found a match and do an insert into my destination table.

      In the end, this new lookup is much cleaner and easier to use than its 2005 predecessor. Plus the addition of the Cache Connection Manager is a big win…..more on that coming soon.

      clip_image002

       

    Eric Johnson
    SQL Server MVP
    Co-Host CS Techcast

  • SSIS Connection Wizard

    Just found a little something new in SSIS 2008, The Integration Services Connections Project Wizard. This thing starts up when you first begin a new SSIS project. The whole point is to walk you though your initial connection configurations. Once the connections are set up, it even asks you if each is a source, destination, or both and then it creates your first Data Flow task complete with the correct sources and destinations. This probably isn't all that useful to someone that has spent any measurable amount of time with SSIS, but for new comers I think its great. I remember the first time I opened SSIS, it took me a minute to figure out how to get a connection , a source, and a destination all set up. Granted I tend not to read documentation, so I am probably a special case.

    In any event, if this pops up in your face the next time you start a project. Have a look, it’s a neat little addition.

    image


  • Slowly Changing Dimensions - Duplicate Data Issues

    The Slowly Changing Dimension (SCD) transformations are pretty handy in SSIS packages. They let you compare a new set of data to an existing table and insert or update as needed. Alternatively, you can have them insert new rows and mark old rows as "Expired" instead of updating rows when a change occurs. This is great if you want to maintain history. Now these are meant for dimensions in a data warehouse, but you can use them against any table in your database. The reason for this post is to look at an issue that can occur with SCDs when you have duplicate data in your incoming data set.

    When an SCD analyzes a set of data it looks at each row and compares it to the destination. It decides whether the row will be inserted or updated (or inserted with the old row being marked inactive). Notice I said this comparison is with the incoming data and the destination. The SCD does check for duplicates that may exist in the incoming data. So if you have duplicates, that also happen to be new rows when compared to the destination, all the duplicate rows get "flagged" for insertion. You can probably guess what happens next. The first row gets inserted but the second and subsequent duplicates cause a Primary Key violation when they attempt to insert the now duplicate record.

    Now this might seem like a glitch to some, but this functionality is "by design ". The button line is that the data being inserted should be de-duplicated prior to being inserted into the destination. In a perfect world, the data would come from a source where it lived in a normalized, and therefore de-duplicated, state. If you don't have the luxury of living in a perfect world, you can build a manual data de-dup process or use a Sort transformation. One of the options of the Sort transformation is to "Remove Rows with Duplicate Sort Values". Just drop one of these in, check the box, and you have magically de-dup your values.


More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement