My Thoughts on SQL Server and Database Related Technologies.
So you have written an SSIS package and now it's time to package it up and send it off to your QA department. Excellent; you need to use the build function in Business Intelligence Development Studio. On the surface this is pretty simple, but there are a few things to understand. You can't just right-click your project and select Build, or even use the Build Solution menu option. Before you can do either of these things, you need to adjust your project's build settings. To do that, right-click the project in Solution Explorer, select Properties, and then select Deployment Utility. This should have you looking at the following dialog.
Notice that the CreateDeploymentUtility option is set to False. This must be set to True before you can build your project. The other setting of interest is DeploymentOutputPath, which controls where the build will be written. Set CreateDeploymentUtility to True and click OK. Now you can right-click the project in Solution Explorer and select Build. This will write several files out to your deployment directory. The number of files will vary based on your project; in the case of my test project three files were written.
1. Package.dtsx – The actual SSIS package that was developed. You should have one DTSX file for each package in your solution.
2. Config.dtsConfig – This is the XML configuration file for my package. You could have several of these, or none, depending on whether your packages use them.
3. Test.SSISDeploymentManifest – This is the Deployment Manifest file. You will have one of these per project. This file defines all the pieces of your project and is what you use to deploy the SSIS package to a server.
Now that you have built your package, you can send all the files in your deployment directory to another person or group so that your package can be installed on an SSIS server. That process is simple; just double-click the SSISDeploymentManifest file on a machine that has the SQL Server client tools installed and this will launch the Package Installation Wizard, shown below.
This is a pretty simple wizard, but let’s quickly run through it. First you specify whether you want a File System Deployment or a SQL Server deployment. For this example I will choose SQL Server. Next you choose the location for the package, in our case the SQL Server and the path in MSDB. You can also choose to Rely on Server Storage for Encryption if you want sensitive information that was encrypted in the package to remain after the package has been deployed. Last, you must specify the directory on the server where you want dependency files, such as configuration files, to be stored. If you have configurations, you will also be allowed to modify the configuration values during the wizard. Keep in mind these can also be modified later.
That’s it, click Finish and the package will be installed on your SQL Server where it can be scheduled and run.
Inevitably when writing code, you need to write a loop in order to iterate over multiple objects. When writing code against a SQL Server, as we usually are doing in SSIS Packages, you often need to iterate over all the rows in a table. This can be done using an SSIS Foreach Loop Container, but the how is not obvious. Making this work is a two-step process. You have to create a data flow to populate an SSIS variable with your table data, and then you have to configure your loop.
First, let’s take a look at getting your table of information into an SSIS variable. You will need to add a new variable to the package with a Data Type of Object. This variable will hold a recordset that will represent your table. Next, you need to create a Data Flow, like the one shown here.
The OLE DB Source will pull the data from your table or query and the Recordset Destination is used to populate your variable. At this point the data from your table is stored in an ADO Recordset in your variable.
Step two will be creating the Foreach Loop Container, shown below. Add the container to your package and make sure it is after the data flow you used to populate your variable. On the Collection page of the Foreach Loop Container’s properties, select Foreach ADO Enumerator as your enumerator and set your variable as the ADO Object Source. Next, make sure you choose Rows in the First Table under enumeration mode. Finally, you can move on to the Variable Mapping page and define where the values in each column are stored. This is done with a zero-based index where 0 is your first column, 1 is your second column and so on. Define a variable for each column you need to read, and each time the loop is executed the variables will be updated.
That’s it; you can put any code you want in the Loop Container and it will execute once for each row in the table.
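As a sketch of what that might look like, inside the loop you could use an Execute SQL Task whose statement consumes the mapped variables. The procedure name below is hypothetical; each ? placeholder would be bound, on the task's Parameter Mapping page, to one of the variables you defined in the Foreach Loop's Variable Mappings:

```sql
-- Hypothetical statement for an Execute SQL Task inside the Foreach Loop.
-- Each ? is bound on the Parameter Mapping page to one of the SSIS
-- variables that the loop updates on every iteration.
EXEC dbo.ProcessContact ?, ?
```

The task runs once per loop iteration, so the procedure is called once per row in the original table.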
I have written a lot lately about SSIS package development, and that doesn’t apply to everyone that works with SQL Server 2008. So you might be asking, what does SSIS have to offer me, the production support DBA? Well, I am glad you asked. In SQL Server 2008, the Import and Export Wizard is built on the SSIS framework. In fact, at the end of the Wizard, you can save your settings into an SSIS package. If you have used the Wizard before, it will feel much the same: choose a source, choose a destination, specify your data, and go. In fact, if you don’t look closely you may miss some of the new features.
First, when you choose a destination or source, you will see the .NET Framework Data Providers in the drop downs. This means you can run the import and export wizard directly against Oracle as well as ODBC data sources.
Next, there is a whole set of mapping files that define how data from one data source should map into destinations of a different type, Oracle to SQL Server or SQL Server to DB2 for example. These XML files can be found here (assuming you installed to C:\): C:\Program Files\Microsoft SQL Server\100\DTS\MappingFiles. During the wizard, you can accept the mappings these files define or you can edit the data type mappings yourself. For example, in the OracleToMSSql file, which maps Oracle data to SQL Server data, you can see that NUMBER in Oracle maps to NUMERIC in SQL Server. Lastly, the wizard has been tuned to scale better. For example, if you import a large number of tables, they will be split across multiple data flows to reduce concurrent processing.
In the past, the import and export wizard has been a bit of a last resort for DBAs, but now I think it’s time to give this old dog a second chance.
In SQL Server Integration Services (SSIS) Packages, expressions are everywhere. You may have worked with expressions in some of the transformation tasks, such as the Derived Column Transformation, but did you know they can also be used to set the properties of many of the tasks and transformations, or even to modify the values in your SSIS Variables? Let’s look at an example. What if you have to write a lookup query that pulls data from two databases on the same SQL Server, but you’re not sure what the databases will be named? Your query will run in the context of your lookup transformation, so the connection manager tied to your lookup will handle the first database, but you need to get a bit tricky for the second database. Take the following query as an example:
SELECT ContactID, EmailAddress
FROM Person.Employee
JOIN GlobalContacts.dbo.Email
ON Employee.GlobalID = Email.GlobalID
WHERE Email.EmailAddress IS NOT NULL
AND Email.EmailPromotion <> 0
The Person.Employee table will be found in the database defined by your Connection Manager, but the dbo.Email table needs to be in the GlobalContacts database in order to be found; in our case, the GlobalContacts database is often named differently, for example GlobalContactsQA or GlobalContactsVer2. So how do we dynamically build our query to handle the fluidity of this database name? Here is one possible solution: we are going to add a couple of variables to our package, RefDatabase and LookupQuery.
As you can see, we have set the RefDatabase variable to “GlobalContactsQA”. This variable can be passed in from a parent package or set with an SSIS configuration. For LookupQuery, set its EvaluateAsExpression property to True.
Next, add the following code to the Expression property:
"SELECT ContactID, EmailAddress
FROM Person.Employee
JOIN " + @[User::RefDatabase] + ".dbo.Email
ON Employee.GlobalID = Email.GlobalID
WHERE Email.EmailAddress IS NOT NULL
AND Email.EmailPromotion <> 0"
Be sure to include all the quotes shown in the example. This expression will use the value of the RefDatabase variable to dynamically modify the SQL query. Last but not least, you will need to change your lookup to use the SQL Command From a Variable option and point it to the LookupQuery variable. That’s all there is to it. The biggest mistake I have seen made, and I have done this myself, is setting the Expression property of the variable and forgetting to set the EvaluateAsExpression property to true. This will result in your variable having an empty value or your variable retaining the last value you hard coded.
Remember, expressions are everywhere in SSIS and can be used to save you a lot of time. Just make sure to do a little research before you use them as there is often a way to accomplish your objective without an expression and too many expressions can make a package difficult to maintain.
The Lookup transformation in SSIS has changed a lot in SQL Server 2008. One of the best new features is the ability to pre-build your cache which gives you a lot of control over what is cached and how the cached data is managed. The basic lookup offers additional cache features such as Full, Partial, or even No Cache, but the real power comes in the new Cache Transformation. The Cache Transformation uses the new Cache Connection manager to allow you to build your cache before it is needed. The connection manager and the Cache Transformation are pretty simple. For the Cache Manager Connection, you just need to specify the columns that will be stored and which columns will be indexed, as shown below. The indexed columns will be the only columns you can use as a lookup column in a Lookup Transformation.
Once you have the connection manager configured, you just use a Cache Transformation to pump data into the cache. The Cache Transformation is very simple; you specify a Cache Connection manager and set up the mapping of the incoming data to the columns in the cache. Last but not least, you pull data from your source and send it into the Cache Transformation. I have a very simple example data flow below which is populating a lookup cache; and the source in this example uses the T-SQL shown.
SELECT ContactID, EmailAddress
FROM Person.Contact
WHERE EmailAddress IS NOT NULL
AND EmailPromotion <> 0
This allows me to populate my cache with all email addresses that are not blank, where the person has not opted out of email promotion. Remember, the data flow to populate your cache can be as complex as you want. This can really allow you to pare down large sets of data to a smaller, easier to use data set that is appropriate for caching.
Once you have populated your cache, all you have left to do is use it in a Lookup Transformation. On the first page of your Lookup Transformation properties, specify a Connection Type of Cache Manager and on the connection page, specify your Cache Connection Manager object. Everything else is just like using the Lookup with a regular data source. You will need to set up your mappings and the column (or columns) that you want to return. One other note: when using a Cache Connection Manager with a lookup, you have to select Full Cache as the cache mode. The other two are disabled and for good reason. You have taken control of your cache at this point and don’t want SSIS doing anything further.
So there you have it; managing your own lookup cache should allow you to make your SSIS packages a little more streamlined and let them run a bit faster. Not to mention, you should be able to minimize the memory hogging that occurred when older lookups got a little out of hand.
SQL Server Integration Services (SSIS) is a very powerful tool for creating ETL Packages. Part of what makes it so powerful is its ability to use package configurations. Package configurations allow you to externally store information that the package needs so that information can be easily changed without re-writing the package itself. You can store connection strings, variable values, package passwords, isolation levels, and much more. Basically, if you can set the value in SSIS, it can probably be stored in a package configuration. This makes modifying a package, such as when it moves from QA to production, or when a production server’s name changes, much easier. Package configurations come in several flavors and which one you use will depend on your individual needs. Also, each type stores data differently, so in some cases multiple values can be saved and in others you can store only a single value. Here is a quick breakdown of the types of package configurations.
· XML Configuration File – Multiple values can be stored in an XML file on disk
· Environment Variable – A single value can be stored in a Windows Environment variable
· Registry Entry – A single value can be stored in a registry key
· Parent Package Variable – The parent package can pass a variable to the package which contains the configuration value
· SQL Server – Multiple values can be stored in a table on an SQL Server
Each package can have more than one configuration, and I personally like the Environment Variable / SQL Server combo. This allows me to store most of my values in one place, the SQL Server, and still be able to dynamically change the location of my configuration server without needing to change the package. Here is how I set up my package configurations.
First, make sure that you have a connection set up for your SQL Server where you want to store package configuration details. Then, from your SSIS control flow, click the SSIS menu and choose Package Configurations. This opens the package configuration dialog box. From here, you must check Enable Package Configurations in order to do anything. Once you have enabled configurations, click Add and then choose Environment Variable from the Configuration Type drop down. Name it something like ConfigServer and click Next. Browse to the Connection Managers folder, find your configuration server connection and select its connection string property, as shown below.
Click Next, name the configuration, and click Finish. Next, go and create a Windows Environment variable with the same name you used in the package configuration and set its value to a valid SQL Server connection string for your configuration server, something like:
Data Source=CSSRV04;Initial Catalog=msdb;Provider=SQLNCLI10.1;Integrated Security=SSPI;
This package configuration will allow your SSIS package to find the SQL Server where the rest of your configuration will be stored.
Now add another package configuration, this time with a type of SQL Server. Specify all the connection details for the SQL Server on which you want to store your SSIS configuration, including the table to which you want the values written, and click Next. Now pick any and all values you want to store in SQL Server, name the configuration, and click Finish.
The next time your package runs, it will use the Environment Variable to find your SQL Server and the SQL Server to find the rest of your settings. These values can be changed at any time without the package requiring any modification.
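If you are curious what the SQL Server configuration type actually stores, the wizard creates a table (named [SSIS Configurations] by default) with one row per configured property. Here is a sketch of that table and of the kind of update you might run when a server name changes; the filter name and connection string below are just examples:

```sql
-- Default table created by the SQL Server package configuration type
CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL,  -- the name you gave the configuration
    ConfiguredValue     NVARCHAR(255) NULL,      -- the stored value
    PackagePath         NVARCHAR(255) NOT NULL,  -- the package property the value is applied to
    ConfiguredValueType NVARCHAR(20)  NOT NULL   -- the data type of the value
)

-- Example: repoint a connection string without touching the package
UPDATE [dbo].[SSIS Configurations]
SET ConfiguredValue = 'Data Source=NEWSRV01;Initial Catalog=Sales;Provider=SQLNCLI10.1;Integrated Security=SSPI;'
WHERE ConfigurationFilter = 'MainConfig'
```

Because the package reads this table at run time, an update here takes effect on the next execution with no redeployment.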
SQL Server 2008 Integration Services shipped with the wonderful new Data Profiling Task. This task, much as the name implies, will profile the data in a given table and return a ton of useful information. The task gathers statistics for candidate keys, column length, NULL data ratio, data patterns, and much more. All of the results are written out to XML. The task makes more sense when you look at it, so let’s jump in. The configuration of the task is pretty simple: you select the types of profiles that you want to run and then set the specific options for each profile. In the figure below, I have selected Column Length Distribution for all columns (*) in the HumanResources.Employee table. This profile will gather length statistics on the columns, such as Minimum and Maximum Length, and provide a distribution of all of the lengths found. On the general page of the task, you also need to provide a destination, which can be a file connection or a variable.
Once you set up the task with all the data profiles you want to run, you simply execute the SSIS package and let it work its magic. If you wrote to a variable, you’re on your own to do something with the data within the SSIS package. On the other hand, if you wrote to a file, the SQL Server 2008 Client Tools come complete with the Data Profile Viewer, which can be found in your program menu under Microsoft SQL Server 2008 | Integration Services. This is a pretty simple application that allows you to browse the XML file created by the Data Profiling Task. The figure below shows the results for the Column Length Distribution profile I ran earlier. As you can see, there is quite a bit of information returned, and this is just one of the profiles. If you take some time and dig into all the profile types, I think you will be very impressed with the amount of information this task provides. If you’re the creative type, you can run this task and write the data to a variable which can be further queried by your SSIS package. With a little work, you could have a great tool at your disposal the next time you’re presented with a new database that you know nothing about.
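If you do want to work with the output outside the viewer, remember the results are just XML, so you can pull a saved profile file into T-SQL and dig into it from there. A minimal sketch, assuming a hypothetical file path:

```sql
-- Load the Data Profiling Task's XML output into an XML variable
-- (the file path here is just an example)
DECLARE @profile XML
SELECT @profile = BulkColumn
FROM OPENROWSET(BULK 'C:\Profiles\Employee.xml', SINGLE_BLOB) AS f

-- From here you can shred the document with the xml type's
-- query() and value() methods to pull out individual statistics
SELECT @profile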
Here is a little SQL Script I wrote that gives you file and filegroup size details for all databases on SQL Server 2005 or 2008. This returns all the vital information I am looking for when I look at database sizes and growth such as file size, space used, free space, and the physical filename. There is also a second result set with details rolled up to the filegroup. If you were so inclined, you could capture this information on a regular basis and do some trending. Script attached.
DECLARE @database_id int
DECLARE @database_name sysname
DECLARE @sql_string nvarchar(2000)

DECLARE @file_size TABLE
(
	[database_name] [sysname] NULL,
	[groupid] [smallint] NULL,
	[groupname] sysname NULL,
	[fileid] [smallint] NULL,
	[file_size] [decimal](12, 2) NULL,
	[space_used] [decimal](12, 2) NULL,
	[free_space] [decimal](12, 2) NULL,
	[name] [sysname] NOT NULL,
	[filename] [nvarchar](260) NOT NULL
)

--Grab the first database
SELECT TOP 1 @database_id = database_id
	,@database_name = name
FROM sys.databases
WHERE database_id > 0
ORDER BY database_id

WHILE @database_name IS NOT NULL
BEGIN
	SET @sql_string = 'USE ' + QUOTENAME(@database_name) + CHAR(10)
	SET @sql_string = @sql_string + 'SELECT DB_NAME() as database_name
	,sysfiles.groupid
	,sysfilegroups.groupname
	,sysfiles.fileid
	,convert(decimal(12,2),round(sysfiles.size/128.000,2)) as file_size
	,convert(decimal(12,2),round(fileproperty(sysfiles.name,''SpaceUsed'')/128.000,2)) as space_used
	,convert(decimal(12,2),round((sysfiles.size-fileproperty(sysfiles.name,''SpaceUsed''))/128.000,2)) as free_space
	,sysfiles.name
	,sysfiles.filename
	FROM sys.sysfiles
	LEFT OUTER JOIN sys.sysfilegroups
	ON sysfiles.groupid = sysfilegroups.groupid'

	INSERT INTO @file_size
	EXEC sp_executesql @sql_string

	--Grab next database
	SET @database_name = NULL
	SELECT TOP 1 @database_id = database_id
		,@database_name = name
	FROM sys.databases
	WHERE database_id > @database_id
	ORDER BY database_id
END

--File Sizes
SELECT database_name, groupid, ISNULL(groupname,'TLOG') groupname, fileid, name, file_size, space_used, free_space, filename
FROM @file_size
ORDER BY database_name, groupid, fileid

--File Group Sizes
SELECT database_name, groupid, ISNULL(groupname,'TLOG') groupname, SUM(file_size) as file_size, SUM(space_used) as space_used, SUM(free_space) as free_space
FROM @file_size
GROUP BY database_name, groupid, groupname
ORDER BY database_name, groupid
SSIS is a pretty useful tool for designing ETL processes. One of the transformations I was disappointed with in 2005 was the lookup. I found it a little better than the lookup functionality in DTS, but not too much better. Well, Microsoft must have read my mind because SQL Server 2008 has a new shiny Lookup transformation that is, in my opinion, much improved.
In 2005, the lookup had cache options, but they were really heavy handed. And you could deal with a lookup not finding a match by using the error output, but then how would you know the difference between a non-match and a real error? Let's start with the cache. There are now three cache options and two choices for your cache source.
- Full Cache - This will load the entire reference dataset into memory before the first lookup is performed. This can be very efficient for small tables, but think what would happen if your lookup table was a few GB in size….that's a lot to load into cache.
- Partial Cache - This is probably what most people think of when they conceptualize lookup cache in their head. At first, the cache is empty and each time a row matches or doesn't match for that matter, the row, or the fact that the lookup didn't find a row, is cached. Subsequent rows with the same lookup fields will find their data, or lack thereof, in cache.
- No Cache - The lookup will generate the reference dataset each time the transformation runs.
As for data sources for your cache, you can still go to the database, but now you can also set up your own custom cache connection. This gives you the ability to build your cache in a separate step and then reference it with your lookup. This is a little involved so I won't go into more detail here, but keep your eyes open for another blog entry... ;) Furthermore, if you use the partial cache, you can manage your cache size on the Advanced page of the Lookup's properties.
So I like the changes to caching, but I like the output options even more. You can now specify that rows with no match be redirected to their own output, or be ignored all together. This now gives you three outputs to work with; one for a match, one for no match, and another for errors. I put together a small sample package and added a screen shot below, you can see that I am using all three outputs. In this case, when an e-mail address is not found in the lookup, I use a derived column to put "NA" into the email address column before I load. I then union this data with the rows that found a match and do an insert into my destination table.
In the end, this new lookup is much cleaner and easier to use than its 2005 predecessor. Plus the addition of the Cache Connection Manager is a big win…..more on that coming soon.
Just found a little something new in SSIS 2008: the Integration Services Connections Project Wizard. This thing starts up when you first begin a new SSIS project. The whole point is to walk you through your initial connection configurations. Once the connections are set up, it even asks you if each is a source, destination, or both, and then it creates your first Data Flow task complete with the correct sources and destinations. This probably isn't all that useful to someone who has spent any measurable amount of time with SSIS, but for newcomers I think it's great. I remember the first time I opened SSIS; it took me a minute to figure out how to get a connection, a source, and a destination all set up. Granted, I tend not to read documentation, so I am probably a special case.
In any event, if this pops up in your face the next time you start a project, have a look; it’s a neat little addition.
The Slowly Changing Dimension (SCD) transformations are pretty handy in SSIS packages. They let you compare a new set of data to an existing table and insert or update as needed. Alternatively, you can have them insert new rows and mark old rows as "Expired" instead of updating rows when a change occurs. This is great if you want to maintain history. Now these are meant for dimensions in a data warehouse, but you can use them against any table in your database. The reason for this post is to look at an issue that can occur with SCDs when you have duplicate data in your incoming data set.
When an SCD analyzes a set of data, it looks at each row and compares it to the destination. It decides whether the row will be inserted or updated (or inserted with the old row being marked inactive). Notice I said this comparison is between the incoming data and the destination. The SCD does not check for duplicates that may exist in the incoming data. So if you have duplicates that also happen to be new rows when compared to the destination, all the duplicate rows get "flagged" for insertion. You can probably guess what happens next. The first row gets inserted, but the second and subsequent duplicates cause a Primary Key violation when they attempt to insert the now duplicate record.
Now this might seem like a glitch to some, but this functionality is "by design". The bottom line is that the data should be de-duplicated prior to being inserted into the destination. In a perfect world, the data would come from a source where it lived in a normalized, and therefore de-duplicated, state. If you don't have the luxury of living in a perfect world, you can build a manual data de-dup process or use a Sort transformation. One of the options of the Sort transformation is to "Remove Rows with Duplicate Sort Values". Just drop one of these in, check the box, and you have magically de-duplicated your values.
Good question! This is one of those error messages that you will eventually encounter in SQL Server if you work with it enough. A login’s default database is a delicate thing; a problem with it can prevent the login from connecting at all. The default database defines which database the user will be automatically logged into if they connect and don’t specify a database. Several things can cause the above error message. The most common is a login that does not have access to the database that is configured as its default. This is easily remedied; either set up a corresponding user for the login in their default database or change the default to a database to which the login has access. This error will also rear its ugly head if you drop a database that was configured as a user’s default database. There is no check when a database is dropped to make sure it won’t affect users, so dropping someone’s default database will leave them in the dark when it comes to logging in. As I said, this only happens if the user’s connection method does not specify a database. Many applications will provide the database name and will be immune to the issue. The fix is simple: make sure the login has access to their default database and make sure the default database exists.
This problem can get a little worse if you break the default database of your system administrator accounts, such as your Windows account that has ‘sa’ level rights or the ‘sa’ login itself (and no other logins have the permissions to fix the problem). In this case, the fix is still simple, but you need to use an application that allows you to specify a database, such as SQLCMD. SQLCMD is a command-line tool that comes with SQL Server 2005 and 2008. When you login, you can specify a database to use that will override the default. So to fix your problem, login with SQLCMD as follows:
sqlcmd -S sql2008 -U sa -P password -d master
This will let you login and use the master database despite the incorrect default database. To fix your login simply run the following query from the SQLCMD command prompt (you can use any database that exists and to which the login has access, master is used as an example):
ALTER LOGIN sa WITH DEFAULT_DATABASE=master
That’s it, the next time you login your new default database of master, or whatever you specified, will be used.
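If you want to catch this ahead of time, the default database for every login is visible in the server catalog, so a quick check like this (works on SQL Server 2005 and 2008) will flag any logins pointing at a database that no longer exists:

```sql
-- List each login's default database and flag ones that don't resolve
SELECT sp.name,
       sp.default_database_name,
       CASE WHEN DB_ID(sp.default_database_name) IS NULL
            THEN 'MISSING' ELSE 'OK' END AS default_db_status
FROM sys.server_principals sp
WHERE sp.type IN ('S', 'U', 'G')   -- SQL logins, Windows logins, Windows groups
```

DB_ID returns NULL for a database name that doesn't exist on the server, which is exactly the broken-default situation described above.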
This is a problem that plagues DBAs everywhere. When you restore a database, you run the risk of orphaning the users in the database. All users are linked via a SID to a login, and if you have SQL Server logins, whose SIDs are managed by SQL Server, you are at risk. Typically a restore to the same server from which the backup was taken won’t cause an issue unless you dropped and recreated the login. Generally the problem rears its ugly head when you restore a backup to a server that was not the original location. You planned ahead and created the same logins on the new server as existed on the old server, so why do the users end up orphaned? As I mentioned earlier, SQL Server manages the SIDs for SQL Server logins, so there is no guarantee that the new login has the same SID as the original login did. Then when you restore your database, the users in that database are expecting SIDs that are not there, and the next thing you know you have orphaned users. Just a note, this does not occur with Windows logins because the SID is controlled by Windows or Active Directory. Unless you drop and re-create the user in Windows, the SID of an Active Directory user will be the same on all SQL Servers and hence your user accounts see the SID they are looking for.

So, the million dollar question is, how do you fix the problem without dropping and re-creating the user and messing up the permissions in the process? Microsoft provides us with a handy little stored procedure called sp_change_users_login that you can use to fix orphaned users. This procedure can do several things; it can tell you which users are orphaned, it lets you fix an orphaned user manually, and it can attempt to automatically fix your issues. So let’s look at an example. I have deliberately orphaned a user called Annie in the AdventureWorks2008 database. When I run sp_change_users_login with the REPORT option, I can see that I indeed have an orphaned user.
EXEC sp_change_users_login 'REPORT'
To fix this orphaned user all I have to do is run sp_change_users_login with the UPDATE_ONE action and tell SQL Server the name of my orphaned user and the name of the appropriate login.
EXEC sp_change_users_login 'UPDATE_ONE','Annie','Annie'
There you have it, a simple quick fix to orphaned users that you can use next time you have an issue. I just want to add one more thing regarding the AUTO_FIX action of sp_change_users_login. If you use this option, the procedure tries to automatically fix your orphaned users by matching user name to login name. If no match is found, it will create the appropriate login for you. The only reason I don’t like it is that it has the potential to create logins you don’t want, especially if your login names happen to deliberately differ from your user names.
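As a side note, you can also hunt for orphans yourself by comparing SIDs between the database and the server, and on SQL Server 2005 and later the ALTER USER statement gives you a more modern way to remap a user than sp_change_users_login:

```sql
-- Find database users whose SID has no matching server login
SELECT dp.name
FROM sys.database_principals dp
LEFT JOIN sys.server_principals sp
    ON dp.sid = sp.sid
WHERE dp.type = 'S'          -- SQL Server users
  AND dp.principal_id > 4    -- skip dbo, guest, sys, INFORMATION_SCHEMA
  AND sp.sid IS NULL

-- Remap the orphaned user Annie to the login Annie
ALTER USER Annie WITH LOGIN = Annie
```

The ALTER USER form fixes the SID mismatch in place, just like UPDATE_ONE, without touching the user's permissions.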
They have been around forever, but have you really ever implemented Application Roles in SQL Server? For that matter, do you really know what they are and how to use them? In this short tip, I hope to explain just that.
First off, we need to talk briefly about how application security can be implemented in SQL Server. There is some debate over which model is better and I am not endorsing any specific one here. Regardless of whether you use SQL Server Logins or Windows Authentication, you still have to decide whether an application will use a single login to access SQL Server (and all appropriate database objects), or allow each individual user to have their own login. They each have their very own sets of pros and cons, but I want to focus on a specific con of each user having their own login. The biggest issue is that each user login has access to your server and to one or more databases. Does the user need to delete data as part of their job? If so, they will have this right whether they log in via an application or directly to the server. Often, the application controls what can and cannot be deleted based on a set of business rules; these rules usually don’t exist on the SQL Server itself. In short, if each user has their own login, they can access SQL Server directly and potentially cause some damage.
This brings us to application roles. You create them and assign permissions to them just like regular database roles, but you can’t put users in them. Instead, the goal of an application role is to provide a best-of-both-worlds scenario for application and user security. Here’s how they work. You set up each user to have an account on the SQL Server with practically no rights. All they should be able to do is login to the server and run a system stored procedure called sp_setapprole. This procedure accepts a couple of parameters, including the name and password for the application role. Running sp_setapprole will immediately endow the user with all the permissions that you set up on the application role, for the current session only.
So what does this mean for security? As long as the password for activating the application role is only known to the application, your users will not have any rights when they login to the SQL Server directly. In order to have the permissions they need, they will be required to use the application that knows the password and can unlock the permissions of the application role. Now you can have SQL Server manage individual logins and still have a secure environment that uses the rules and filters in place within your applications.
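To make this concrete, here is a minimal sketch of the whole cycle; the role name, password, and table are all made up for illustration:

```sql
-- Create the application role and give it the rights the application needs
CREATE APPLICATION ROLE OrderEntryApp
    WITH PASSWORD = 'Str0ng!AppR0le#Pwd'

GRANT SELECT, INSERT, UPDATE ON dbo.Orders TO OrderEntryApp

-- After connecting, the application (the only thing that knows the
-- password) activates the role; for the rest of the session the user's
-- own permissions are replaced by the role's permissions
EXEC sp_setapprole 'OrderEntryApp', 'Str0ng!AppR0le#Pwd'
```

A user who connects with Management Studio and doesn't know the password never gets past their own nearly empty permission set.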