This is the blog of Jamie Thomson, a freelance data mangler in London
SQL Server Data Tools (SSDT) was released in Spring 2012 and I have been using it to build SQL Server databases nearly every day since then. In that time I’ve learnt a lot about how to make best use of the tool and I want to share that experience in a new series of blog posts called “10 days of SSDT”. I shall be publishing a different blog post every day for 10 days, each day revealing something that you may not know about SSDT. I hope you enjoy the series and contribute via the comments!
It’s probably only fair to point out that this was intended to be “30 days of SSDT”, but I laboured over writing the first 10 and then my second daughter turned up, since when time has been somewhat at a premium. Hence you’re getting just the first 10. Hopefully one day I’ll find time to write the other 20!
SSDT is a tool to aid with developing databases. It is not intended for administering databases and hence is not intended as a replacement for SQL Server Management Studio (SSMS). Within that developer remit it supports two distinct ways of working, online and offline. Those two distinct usage scenarios are most obvious with SQL Server Object Explorer (or SSOX for short). SSOX is a pane within Visual Studio that is made visible from the View menu or by using the keyboard shortcut CTRL+\, CTRL+S and it looks something like this:
There are two top-level nodes:
- SQL Server – For the online usage scenario. This node is a collection of SQL Server instances that can be expanded so you can examine what’s in them. I think of this node as a hybrid of the Object Explorer and Registered Servers panes in SSMS.
- Projects – For the offline usage scenario. This node displays the SSDT projects in the currently open solution.
The online usage scenario is quite simply for working with SQL Server databases that already exist. In the screenshot above notice that I have two SQL Server instances listed, (localdb)\Projects (more about that later in the series) and mhknbn2kdz.database.windows.net which just happens to be the server that hosts AdventureWorks on Azure. From here I can browse through the objects on those servers and interrogate them just as I would in SSMS. Within SSDT/SSOX however there is one important difference: when I make changes to the database I am not affecting the database directly, I am affecting an in-memory model of the database. This approach has two important advantages:
- I can view the impact of a change (e.g. if I change a column name will it break any views that reference it) without actually making the change
- I can make many changes to the model and then push them into my actual database en masse and SSDT will work out the best way of doing that. I don’t have to write the code that deploys my code.
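To make that second point concrete: the same model-comparison machinery that SSDT uses is also exposed from the command line via SqlPackage.exe, which computes and runs the deployment script for you. A sketch (the install path, server and database names here are placeholders for your own):

```powershell
# Publish a compiled model (a .dacpac) to a target database.
# SqlPackage compares the model to the target and generates the
# necessary CREATE/ALTER statements itself - you don't write them.
& "C:\Program Files (x86)\Microsoft SQL Server\110\DAC\bin\SqlPackage.exe" `
    /Action:Publish `
    /SourceFile:"C:\MyDb\bin\Debug\MyDb.dacpac" `
    /TargetServerName:"(localdb)\Projects" `
    /TargetDatabaseName:"MyDb"
```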
Working online within Visual Studio is often referred to as connected development.
The offline usage scenario is all about writing new database “stuff” that you want to deploy. Ostensibly that doesn’t sound too different from the online scenario but the key here is that there is no requirement for a full-blown SQL Server instance to exist. In fact, you can download and install SSDT today (it’s free) and start building SQL Server databases without having to install SQL Server. Working offline within Visual Studio is often referred to as disconnected development.
When working offline you interact with SSOX as I previously mentioned but you will also find yourself using a pane inside Visual Studio called “Solution Explorer”. As anyone who has used Visual Studio will tell you Solution Explorer is where your code lives and that’s no different for SSDT. A solution is a collection of projects and each SSDT project pertains to a SQL Server database.
Any questions so far? Please put them in the comments section below.
If you want to learn more about SSDT then come along to my training course Introduction to SQL Server Data Tools in association with Technitrain on 23rd/24th February 2015 in London. There is a discount if you register before the end of 2014.
Have you ever wanted to use Powershell to call the Azure BLOB Storage API in order to create a new BLOB? You haven’t? What have you been doing all your life? (!!!)
Well, if your life ever does reach the same giddy heights as mine and you find that you need this ability added to your personal armoury then this blog post should be a massive help. Figuring this out involves a battle with one particularly gnarly MSDN article, Authentication for the Azure Storage Services – or you can just adapt the script linked to below!
Behold, here is the “Hello World” of using Powershell to create a BLOB in Azure BLOB Storage using the REST API. Take the script, add your storage account name and storage key where required, and you’re good to go! I’ve made the script available as a Github Gist: https://gist.github.com/jamiekt/2efdd38c1b26d78cc6cd. For those that like to see pretty colour coding, here’s a screenshot:
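In essence the script issues a Put Blob request over HTTPS with an x-ms-blob-type header; the hard part, which the script handles for you, is computing the SharedKey authorization signature. As a simpler hedged sketch, here is the same PUT using a SAS token instead of a SharedKey signature (the storage account, container and token values are placeholders for your own):

```powershell
# Create a block blob with a single PUT against the Azure BLOB Storage REST API.
# Using a SAS token sidesteps the gnarly SharedKey signature computation;
# the gist linked above shows the full SharedKey authentication.
$storageAccount = "mystorageaccount"              # placeholder
$container      = "mycontainer"                   # placeholder
$blobName       = "hello.txt"
$sasToken       = "sv=2014-02-14&sig=..."         # a Shared Access Signature you supply

$uri = "https://$storageAccount.blob.core.windows.net/$container/${blobName}?$sasToken"

Invoke-RestMethod -Uri $uri -Method Put -Body "Hello World" -Headers @{
    "x-ms-blob-type" = "BlockBlob"   # required header for Put Blob
}
```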
One tip – I encountered an issue where the script file had been stored with a Byte Order Mark (BOM), which was causing an error due to an incorrect Content-Length header. If you have this problem then Notepad++ can remove the BOM for you:
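If you’d rather not leave Powershell to do it, stripping a UTF-8 BOM is just a matter of rewriting the file without its first three bytes (a sketch; the path is a placeholder and you should back the file up first):

```powershell
# Remove a UTF-8 byte order mark (EF BB BF) from the start of a file.
$path  = "C:\scripts\CreateBlob.ps1"   # placeholder path
$bytes = [System.IO.File]::ReadAllBytes($path)
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    [System.IO.File]::WriteAllBytes($path, $bytes[3..($bytes.Length - 1)])
}
```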
Thank you to Gaurav Mantri for his massive massive help with this.
I’ve been doing a lot of work with Hadoop recently, specifically Microsoft’s Hadoop distribution, HDInsight. Microsoft have an HDInsight Emulator available that makes it possible to run HDInsight on your Windows workstation and on the 29th August 2014 they released a new version of the Emulator that includes the recently released HBase feature (get the emulator from http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT).
After I installed it I noticed that none of the services associated with the emulator were started and this was due to a logon failure:
All of those services are configured to start as user .\hadoop:
I happen to be using Windows 8 and I’ll freely admit that I don’t have the foggiest idea how Windows 8 messes with users and groups because user “hadoop” is not visible in “Control Panel\User Accounts and Family Safety\User Accounts\Manage Accounts”. However, it is possible to discover information from the command line using “net user hadoop”:
Today is 3rd September 2014, notice that the password expired nearly five months ago (which I suspect is shortly after I first installed the emulator). Personally I think that given this is only for use in an emulator there is no need for the password to ever expire and I’ve fed that back to Microsoft.
Changing the password is easy when you know how (though it took me a while to figure it out – hence this blog post), run a command prompt as administrator and type
“net user hadoop *”
You will be prompted to enter a new password:
After which time the services will start successfully.
Something to be aware of!
When using Powershell I’ve always found it useful to be able to time how long commands/scripts are running for, invaluable when running long-running scripts. I used to do something similar to this in all my scripts:
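Something along these lines (a reconstruction of the sort of crude timing code I mean):

```powershell
# Capture the start time, do the work, then report the elapsed time.
$startTime = Get-Date
# ...long-running work goes here...
$elapsed = (Get-Date) - $startTime
Write-Host ("Elapsed: {0:hh\:mm\:ss}" -f $elapsed)
```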
which is rather crude. It also has the disadvantage that I have to remember to actually put this into my scripts. What I’d prefer would be a way for Powershell to always tell me how long it takes my commands/scripts to run and thanks to my old Conchango colleague Stuart Preston I now have a way. Using Stuart’s code that he has put up in a gist one can have Powershell output this information every time a command is executed:
Note that the current location is still present in the title bar (Stuart’s code does that too) and time elapsed since the last command is displayed. Perfect! Simply put Stuart’s code into your profile and you’re done.
A simple little trick but one which I think is worth calling out in a blog post!
If one spends long enough in my company one will likely eventually have to listen to me bark on about subscribable calendars. I was banging on about them way back in 2009, I’ve cajoled SQLBits into providing one, provided one myself for the World Cup, and opined that they could be transformative for the delivery of BI. I believe subscribable calendars can change the world but I have never been good at elucidating why I think so; for that reason I always direct people to a blog post by Scott Adams (yes, the guy who draws Dilbert) entitled Calendar as Filter. In that blog post Scott writes:
I think the family calendar is the organizing principle into which all external information should flow. I want the kids' school schedules for sports and plays and even lunch choices to automatically flow into the home calendar.
Everything you do has a time dimension. If you are looking for a new home, the open houses are on certain dates, and certain houses that fit your needs are open at certain times. If you are shopping for some particular good, you often need to know the store hours. Your calendar needs to know your shopping list and preferences so it can suggest good times to do certain things.
I think the biggest software revolution of the future is that the calendar will be the organizing filter for most of the information flowing into your life. You think you are bombarded with too much information every day, but in reality it is just the timing of the information that is wrong. Once the calendar becomes the organizing paradigm and filter, it won't seem as if there is so much.
I wholly agree and hence was delighted to discover (via the Hanselminutes podcast) that Scott has a startup called CalendarTree.com whose raison d’etre is to solve this very problem. What better way to describe a Scott Adams startup than with a Scott Adams comic:
I implore you to check out Calendar Tree and make the world a tiny bit better by using it to share any information that has a time dimension to it. Don’t be a dinosaur, use Calendar Tree!
Like many folks I’m a keen podcast listener and yesterday my commute was filled by listening to Scott Hunter being interviewed on .Net Rocks about the next version of ASP.Net. One thing Scott said really struck a chord with me. I don’t remember the full quote but he was talking about how the ASP.Net project file (i.e. the .csproj file) is going away. The rationale being that the main purpose of that file is to list all the other files in the project, and that’s something that the file system is pretty good at. In Scott’s own words (that someone helpfully put in the comments):
A file that lists files is really redundant when the OS already does this
Romeliz Valenciano correctly pointed out on Twitter that there will still be a project.json file however no longer will there be a need to keep a list of files in a project file. I suspect project.json will simply contain a list of exclusions where necessary rather than the current approach where the project file is a list of inclusions.
On the face of it this seems like a pretty good idea. I’ve long been a fan of convention over configuration and this is a great example of that. Instead of listing all the files in a separate file, just treat all the files in the directory as being part of the project. Ostensibly the approach is: if it’s in the directory, it’s part of the project. Simple.
Now I’m not an ASP.net developer, far from it, but it did occur to me that the same approach could be applied to the two Visual Studio project types that I am most familiar with, SSIS & SSDT. Like many people I’ve long been irritated by SSIS projects that display a faux file system inside Solution Explorer. As you can see in the screenshot below the project has Miscellaneous and Connection Managers folders but no such folders exist on the file system:
This may seem like a minor thing but it means useful Solution Explorer features like Show All Files and Open Folder in Windows Explorer don’t work and quite frankly it makes me feel like a second class citizen in the Microsoft ecosystem. I’m a developer, treat me like one. Don’t try and hide the detail of how a project works under the covers, show it to me. I’m a big boy, I can handle it!
Would it not be preferable to simply treat all the .dtsx files in a directory as being part of a project? I think it would; that’s pretty much all the .dtproj file does anyway (that, and present things in a non-alphabetic order – something else that wildly irritates me), so why not just get rid of the .dtproj file?
In the case of SSDT the .sqlproj actually does a whole lot more than simply list files because it also states the BuildAction of each file (Build, NotInBuild, Post-Deployment, etc…) but I see no reason why the convention over configuration approach can’t help us there either. Want to know which is the Post-deployment script? Well, it’s the one called Post-DeploymentScript.sql! Simple!
So that’s my new crusade. Let’s kill all the project files (well, the .dtproj & .sqlproj ones anyway). Are you with me?
2015-06-03 A Twitter conversation with Kevin Cunnane suggests the problems outlined in this blog post may soon be alleviated: https://twitter.com/jamiet/timelines/606034876681355264
An issue that regularly seems to rear its head on my travels is that of headless build servers for SSDT. What does that mean exactly? Let me give you my interpretation of it.
A SQL Server Data Tools (SSDT) project incorporates a build process that will basically parse all of the files within the project and spit out a .dacpac file. Where an organisation employs a Continuous Integration process they will likely want to automate the building of that dacpac whenever someone commits a change to the source control repository. In order to do that the organisation will use a build server (e.g. TFS, TeamCity, Jenkins) and hence that build server requires all the pre-requisite software that understands how to build an SSDT project.
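For the record, the build step itself is nothing more than an MSBuild invocation over the .sqlproj; something like this (the paths are placeholders for your own) is what the build server ends up running:

```powershell
# Build an SSDT project from the command line. The output is a .dacpac
# in the project's bin\<Configuration> folder.
& "C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe" `
    "C:\src\MyDb\MyDb.sqlproj" `
    /t:Build /p:Configuration=Release
```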
The simplest way to install all of those pre-requisites is to install SSDT itself, however a lot of folks don’t like that approach because it installs a lot of unnecessary components, not least Visual Studio itself. Those folks (of which I am one) are of the opinion that it should be unnecessary to install a heavyweight GUI in order to simply get a few software components required to do something that inherently doesn’t even need a GUI. The phrase “headless build server” is often used to describe a build server that doesn’t contain any heavyweight GUI tools such as Visual Studio and is a desirable state for a build server.
In his blog post Headless MSBuild Support for SSDT (*.sqlproj) Projects Gert Drapers outlines the steps necessary to obtain a headless build server for SSDT:
This article describes how to install the required components to build and publish SQL Server Data Tools projects (*.sqlproj) using MSBuild without installing the full SQL Server Data Tool hosted inside the Visual Studio IDE.
Frankly however going through these steps is a royal PITA and folks like myself have longed for Microsoft to support headless build support for SSDT by providing a distributable installer that installs only the pre-requisites for building SSDT projects. Yesterday in MSDN forum thread Building a VS2013 headless build server - it's sooo hard Mike Hingley complained about this very thing and it prompted a response from Kevin Cunnane from the SSDT product team:
The official recommendation from the TFS / Visual Studio team is to install the version of Visual Studio you use on the build machine.
I, like many others, would rather not have to install full blown Visual Studio and so I asked:
Is there any chance you'll ever support any of these scenarios:
- Installation of all build/deploy pre-requisites without installing the VS shell?
- TFS shipping with all of the pre-requisites for doing SSDT project build/deploys
- 3rd party build servers (e.g. TeamCity) shipping with all of the requisites for doing SSDT project build/deploys
I have to say that the lack of a single installer containing all the pre-requisites for SSDT build/deploy puzzles me. Surely the DacFX installer would be a perfect vehicle for that?
Kevin replied again:
The answer is no for all 3 scenarios. We looked into this issue, discussed it with the Visual Studio / TFS team, and in the end agreed to go with their latest guidance which is to install Visual Studio (e.g. VS2013 Express for Web) on the build machine. This is how Visual Studio Online is doing it and it's the approach recommended for customers setting up their own TFS build servers. I would hope this is compatible with 3rd party build servers but have not verified whether this works with TeamCity etc.
Note that DacFx MSI isn't a suitable release vehicle for this as we don't want to include Visual Studio/MSBuild dependencies in that package. It's meant to just include the core DacFx DLLs used by SSMS, SqlPackage.exe on the command line, etc.
What this means is we won't be providing a separate MSI installer or nuget package with just the necessary build DLLs you need to run your build and tests. If someone wanted to create a script that generated a nuget package based on our DLLs and targets files, then release that somewhere on the web for easier integration with 3rd party build servers we've no problem with that.
Again, here’s the link to the thread and it’s worth reading in its entirety if this is something that interests you.
So there you have it. Microsoft will not be providing support for headless build servers for SSDT but if someone in the community wants to go ahead and roll their own, go right ahead.
I’ve been very quiet on the blogging front of late. There are a few reasons for that but one of the main ones is that I’ve spent the past three months on a new gig immersing myself in Hadoop, primarily in Microsoft’s Hadoop offering called HDInsight. I’ve got a tonne of learnings that I want to share at some point but in this blog post I’ll start with a handy little script that I put together yesterday.
I’m using a tool in the Hadoop ecosystem called HBase which was made available on HDInsight in preview form about a month ago. HBase is a NoSQL solution intended to provide very very fast access to data and my colleagues and I think it might be well suited for a problem we’re currently architecting a solution for. In order to evaluate HBase we wanted to shove lots of meaningless data into it and in the world of HDInsight the means of communicating with your HDInsight cluster is Powershell. Hence I’ve written a Powershell script that will use HBase’s REST API to create a table and insert random data into it. If you’ve googled your way to this post then likely you’re already familiar with Hadoop, HDInsight, REST, Powershell, HBase, column families, cells, rowkeys and other associated jargon so I won’t cover any of those; what is important is the format of the XML payload that has to get POSTed/PUTted up to the REST API. That payload looks like this:
<?xml version="1.0" encoding="UTF-8"?>
The payload can contain as many cells as you like. When the payload gets POSTed/PUTted the values therein need to be base64 encoded but don’t worry, the script I’m sharing herein takes care of all that for you. The script will also create the table for you. The data that gets inserted is totally meaningless and is also identical for each row, modifying the script to insert something meaningful is an exercise for the reader.
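To sketch what that looks like in practice: the payload is HBase’s CellSet format and, as mentioned, every row key, column name and cell value must be base64 encoded. Here is a minimal example of building one such payload in Powershell (the column family cf and the values are my own placeholders; element names are from my recollection of the HBase REST API):

```powershell
# Base64-encode a string, as the HBase REST API requires
function ConvertTo-Base64([string]$s) {
    [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($s))
}

# A single-row, single-cell CellSet payload
$payload = @"
<?xml version="1.0" encoding="UTF-8"?>
<CellSet>
  <Row key="$(ConvertTo-Base64 'rowkey1')">
    <Cell column="$(ConvertTo-Base64 'cf:col1')">$(ConvertTo-Base64 'value1')</Cell>
  </Row>
</CellSet>
"@
```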
Another nicety of this script is that it uses Invoke-RestMethod which has been built into Powershell since version 3. You don’t need to install other Powershell modules, nothing Azure specific. If you have Powershell 4 you’re good to go!
Embedding code on this blog site is ugly so I’ve made the script available on my OneDrive: CreateHBaseTableAndPopulateWithData.ps1. The screenshot below gives you an idea of what’s going on here.
Hope this helps!
UPDATE. I’ve posted a newer script CreateHBaseTableAndPopulateWithDataQuickly.ps1 which loads data much more quickly. This one sends multiple rows in each POST and hence I was able to insert 13.97m rows in 3 hours and 37 minutes which, given latency to the datacentre and that this was over a RESTful API, isn’t too bad. The previous version of the script did singleton inserts and hence would have taken weeks to insert that much data.
The number of POSTs and the number of rows in each POST are configurable.
I am still working apace on updates to my open source project SSISReportingPack, specifically I am working on improvements to usp_ssiscatalog which is a stored procedure that eases the querying and exploration of the data in the SSIS Catalog.
In this blog post I want to share a titbit of information about usp_ssiscatalog: all the actions that you can take when you execute usp_ssiscatalog are documented within the stored procedure itself. For example, if you simply execute
EXEC usp_ssiscatalog @action='exec'
in SSMS then switch over to the messages tab you will see some information about the action:
OK, that’s kinda cool. But what if you only want to see the documentation and don’t actually want any action to take place? Well, you can do that too using the @show_docs_only parameter like so:
EXEC dbo.usp_ssiscatalog @a='exec',@show_docs_only=1;
That will only show the documentation. Wanna read all of the documentation? That’s simply:
EXEC dbo.usp_ssiscatalog @a='exec',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='execs',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='configure',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_created',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_running',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_canceled',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_failed',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_pending',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_ended_unexpectedly',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_succeeded',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_stopping',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_completed',@show_docs_only=1;
I hope that comes in useful for you sometime. Have fun exploring the documentation on usp_ssiscatalog. If you think the documentation can be improved please do let me know.
- Cortana – Microsoft’s digital personal assistant, competing with Apple’s Siri and Google Now.
“She gets to know you by learning your interests over time. She looks out for you, providing proactive, useful recommendations. And Cortana keeps you closer to the people and things you care about most, by keeping track of all that matters.” - http://www.windowsphone.com/en-US/features-8-1#Cortana
- Schema.org – “a collection of schemas that [can be] used to markup HTML pages, and that can also be used for structured data interoperability. Search engines rely on this markup to improve the display of search results” – https://schema.org/.
I think of schema.org as machine-readable metadata on a web page. This has existed for years in various guises (e.g. microformats), schema.org is a company-independent initiative to standardise those schemas.
One of the interesting features that Cortana provides is the ability to parse your email in order to discover flight information and then keep you up to date with information about that flight. Upon reading MSDN article Sending flight information to Microsoft Cortana with contextual awareness it transpires that this feature is entirely dependent upon the email containing schema.org markup:
Microsoft Cortana interprets schema.org markup in e-mails to extract airline flight reservation data
We all unknowingly make use of http://schema.org markup all the time (every time you use Google for example; the share contract in Windows 8 & Windows Phone 8.1 can also leverage schema.org), however this is the first time I’ve been made aware of a user experience that is totally dependent on it.
I find this fascinating. I’ve long had an interest in structured metadata on the web (I’ve been banging on about it in blog posts for the last 10 years) and now my mind is in overdrive thinking of other scenarios that could leverage this. Taking a look at the documentation shows us that schema.org provides definitions of a multitude of things (in fact Thing is the base class from which everything else is derived); Flight, MusicRecording and Person are just a few examples.
It’s not hard to envisage that Cortana might one day see mention of a music track in an email and offer the ability to play it for you, or see contact information and offer to add that to your contact list. Where this gets potentially even more interesting (to me, anyway) is when we consider that Cortana is already extensible by 3rd party apps – I foresee that in the future 3rd party apps will have the ability to subscribe to a particular schema and have Cortana notify them when it comes across an instance of that schema (imagine a Spotify app being informed that an email just arrived with information about a new music release and then offering to play that for you).
Perhaps this 3rd party extensibility could work the other way too by presenting other data sources to Cortana. For example, if Twitter ever get round to implementing annotations then Cortana could potentially read metadata hidden within tweets as well as combing through emails.
Expect that in the future Cortana will parse new data sources, find new types of information and allow more 3rd parties to act upon it. The potential for a degraded user experience with notifications flashing in your face all the time is a worry but frankly I don’t care, this is all fascinating stuff to a data geek like me.
P.S. Thank you to Savas Parastatidis from the Cortana dev team for highlighting Cortana’s use of schema.org.
Here’s some code that I absolutely know I’m going to need again in the future, what better place to put it than on my blog!
If you need to prompt the user for a password when using Powershell then you want to make sure that the value typed in isn’t visible on the screen. That’s quite easy using the –AsSecureString parameter of the Read-Host cmdlet, however it’s not quite so easy to retrieve the supplied value. The following code shows how to do it:
$response = Read-host "What's your password?" -AsSecureString
$password = [Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($response))
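As an aside, there is a tidier way that I believe works on .NET 4 and later, letting NetworkCredential do the SecureString unwrapping (treat this as an alternative sketch rather than gospel):

```powershell
# Alternative: NetworkCredential converts the SecureString for us
$response = Read-Host "What's your password?" -AsSecureString
$password = (New-Object System.Net.NetworkCredential("", $response)).Password
```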
I don’t know of a quick and easy way to format Powershell code for a blog post so here’s a screenshot instead:
I’ve also put this on pastebin: http://pastebin.com/2D6xaz0U
All credit goes to Paul Williams for his post Converting System.Security.SecureString to String (in PowerShell)
I have maintained a watching brief on what I refer to as “cloud ETL”, that is the ability to build ETL routines in a cloud environment and therefore leverage all the benefits that the cloud model brings*. Thus far my main opinion piece in this area is What would a cloud-based ETL tool look like? in which I laid out what features I thought a cloud ETL tool should have:
- Data transformation would be done “in the cloud” i.e. I wouldn’t need to own my own hardware in order to run it
- Ability to consume data from/push data to <many different data protocols>
- Adapters (possibly with a plug-in model) for cloud storage and API providers
- Job scheduler
- Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)
- An IDE (open to debate whether the IDE should be “in the cloud” as well)
- Ability to carry out common transformations (join, aggregate, sort, projection) on those heterogeneous data sources
- Ability to authenticate using different authentication mechanisms
- Configurable logging
- Ability to publish transformed data in a manner that makes it consumable rather than insert it into another data store
Given that I have spent the majority of my career working with Microsoft technologies (in particular their ETL tool, SSIS) I am interested to know whether Microsoft will offer a cloud ETL tool. With that in mind I was interested to discover a new service on Azure that is currently in preview called Azure Automation (read Announcing Microsoft Azure Automation Preview). Azure Automation is essentially a cloud-based workflow tool and, as I said above, workflow is a feature that I believe a cloud-based ETL tool should encompass:
- Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)
SSIS developers will of course be aware that SSIS has its own workflow tool (termed the Control Flow). It always kind of bugged me that different Microsoft tools had their own workflow technology. SSIS had one, I believe BizTalk had one, there was another called Windows Workflow Foundation (WWF) and in fact there was a possibility within the SQL Server 2008 timeframe that SSIS would replace its Control Flow with WWF (that never happened and the Program Manager that wanted to do it has since left the SSIS product team).
Azure Automation is built upon Powershell Workflow which in turn is built upon WWF (now simply called Workflow Foundation – WF). It certainly seems as though WF is becoming the foundational workflow technology to rule them all within Microsoft and that is no bad thing in my opinion – it seems foolish to reinvent the wheel every time. Powershell Workflow has the following cmdlets for building workflows:
- Parallel
- Sequence
- InlineScript
- Foreach –parallel
- Checkpoint-Workflow
- Suspend-Workflow
Those are all fairly self-explanatory. Of particular interest to me is Foreach –parallel (we’ve been asking for a native Parallel ForEach Loop in SSIS for years) and that might be even more useful in a scale-out infrastructure such as can be offered by the cloud (imagine firing off multiple FTP tasks in parallel, all working on different Azure nodes). Checkpoint-Workflow also sounds very interesting:
A checkpoint is a snapshot of the current state of the workflow, including the current values of variables, and any output generated up to that point, and it saves it to disk. You can add multiple checkpoints to a workflow by using different checkpoint techniques. Windows PowerShell automatically uses the data in newest checkpoint for the workflow to recover and resume the workflow if the workflow is interrupted, intentionally or unintentionally.
Stateful restartability that you can control, all out-of-the-box. How cool is that? So much better than the awful checkpointing feature within SSIS.
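To make those two features concrete, here is a minimal sketch of a Powershell Workflow that uses both (the workflow name and server names are my own inventions):

```powershell
workflow Invoke-StagingLoad {
    # Process the servers in parallel rather than one at a time
    foreach -parallel ($server in @("server1", "server2", "server3")) {
        InlineScript { Write-Output "Fetching from $using:server" }
    }
    # Snapshot workflow state to disk; if interrupted after this point
    # the workflow resumes from here rather than from the start
    Checkpoint-Workflow
    InlineScript { Write-Output "Post-fetch processing" }
}
```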
It certainly appears to me that Azure Automation could satisfy my desire for a workflow engine for the purposes of cloud-ETL. Now if only Microsoft were working on cloud-based dataflows too we’d have something akin to SSIS-in-the-cloud.
*My own personal opinion is that the benefits of the cloud model can be summed up simply as “OPEX not CAPEX”. You may have your own definition, and that’s OK.
On 31st March 2014 I released a new version of SSIS Reporting Pack, my open source project that aims to enhance the SSIS Catalog that was introduced in SSIS 2012. This is a big release because it includes an entirely new feature – the Restart Framework.
The Restart Framework exists to cater for a deficiency within SSIS, that being the poor support for restartability. Let's define what I mean by restartability:
A SSIS execution that fails should, when re-executed, have the ability to start from the previous point of failure.
SSIS provides a feature called checkpoint files that is intended to help in this scenario but I am of the opinion that checkpoint files are an inadequate solution to the problem; I explain why in my blog post Why I don't use SSIS checkpoint files.
The Restart Framework was designed to overcome the many shortcomings of checkpoint files.
One of the fundamental tenets of the Restart Framework is that the packages that you, the developer, build for your solution should not be required to contain any variables, parameters, tasks, or event handlers in order to make them work with the Restart Framework. In fact your packages should be agnostic of the fact that they are being executed by the Restart Framework.
TL;DR: A video that demonstrates the installation and base functionality of the Restart Framework can be viewed at https://www.youtube.com/watch?v=syV0Wpwhlnk
Let's define some important terms that you will need to become familiar with if you are going to use the Restart Framework.
An ETLJob is the definition of some work that an end-to-end ETL process needs to perform. An ETLJob would typically incorporate many SSIS packages. Each ETLJob has a name (termed ETLJobName) which can be any value you want, some example ETLJobNames might include:
- Nightly Data Warehouse Load
- Monthly Reconciliation
- All backups
Each ETLJob contains one or more ETLJobStages. These are the "building blocks" of your solution and for each ETLJobStage there must exist a package in your SSIS project with a matching name. For example, an ETLJobStage with the name "FactInternetSales" will require an SSIS package called "FactInternetSales.dtsx".
The Restart Framework allows the declaration of dependencies between ETLJobStages - an ETLJobStage cannot start until all ETLJobStages with a lower ETLJobStageOrder have completed successfully. This is a fundamental tenet of the Restart Framework as it needs to know the order in which ETLJobStages need to occur in order that it can restart execution from the previous point of failure.
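The restart logic this enables can be sketched as follows. This is a hypothetical Python illustration of the idea, not the framework's actual implementation (which derives this from its metadata tables): given each stage's ETLJobStageOrder and whether it has already succeeded, work out what a restarted ETLJob still needs to run, grouped by order.

```python
def stages_to_run(stages):
    """Given (stage_name, stage_order, succeeded) tuples, return the stages a
    restarted ETLJob still needs to execute, grouped by ETLJobStageOrder.
    Stages sharing an order may run concurrently; orders run sequentially.
    Illustrative sketch only - the real framework works from metadata tables."""
    pending = [s for s in stages if not s[2]]   # skip anything that already succeeded
    plan = {}
    for name, order, _ in pending:
        plan.setdefault(order, []).append(name)
    return [sorted(plan[o]) for o in sorted(plan)]
```

On a restart, stages that already completed successfully are simply not in the plan, so execution effectively resumes from the previous point of failure.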
The Restart Framework provides some stored procedures that should be used to define ETLJobs, ETLJobStages and the dependencies between them.
One important point to make about ETLJobStages is that the Restart Framework only supports restartability of a failed ETLJobStage; the Restart Framework has no control over (and, indeed, does not care) what occurs within that ETLJobStage. The implication therefore is that the onus is on the package developer to ensure that each ETLJobStage is re-runnable from the start of that package in the event of failure; in other words an ETLJobStage must be idempotent.
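One common way to make a load idempotent is to key every row on the current run's identifier and delete-then-insert. Here is a minimal Python sketch of that pattern (in a real package you would do this in T-SQL; the function name and in-memory "table" are my own invention):

```python
def load_stage(target, rows, etl_job_history_id):
    """Idempotent load step: first wipe anything this ETLJobHistoryId already
    wrote, then insert. Re-running the stage after a mid-flight failure
    therefore yields the same end state as a single clean run.
    Sketch only - `target` stands in for a database table."""
    # Remove any rows left behind by a previous, failed attempt of this run.
    target[:] = [r for r in target if r["ETLJobHistoryId"] != etl_job_history_id]
    for r in rows:
        target.append({**r, "ETLJobHistoryId": etl_job_history_id})
    return target
```

Because the first action undoes any partial work from the failed attempt, the stage can safely be re-run from the start, which is exactly what the Restart Framework requires.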
Each time an ETLJob is executed a record is inserted into a table called ETLJobHistory and a unique ETLJobHistoryId is assigned. Crucially, when a previously-failed ETLJob is restarted it retains the same ETLJobHistoryId; compare this to SSIS' own execution_id, which will be different whenever an ETLJob is restarted.
The ETLJobHistoryId can be particularly useful when used for lineage purposes in a data warehouse loading routine. Every inserted or updated record can have the ETLJobHistoryId stored against it which is useful for providing lineage information such as when the record was inserted/updated.
All of the database objects that support the Restart Framework live in the SsisReportingPack database, the same database that houses usp_ssiscatalog and all of its supporting code modules, in a schema called RestartFramework.
The Restart Framework consists of two packages that must be added to every SSIS project that intends to use the Restart Framework, hence they will need to be included in your SSIS project within Visual Studio.
Root.dtsx is the package that must be executed in order to have any execution be managed by the Restart Framework. It takes a single parameter, ETLJobName, to indicate which ETLJob it should execute. Root.dtsx will interrogate the Restart Framework metadata in the SsisReportingPack database to determine which ETLJobStages are included.
For each ETLJobStage Root.dtsx will fire off a new instance of ThreadController.dtsx, passing it a ThreadId and an ETLJobStageOrder.
Root.dtsx can fire off up to eight concurrent instances of ThreadController.dtsx. This number is configurable, however eight is the maximum out of the box; you could easily extend Root.dtsx to fire off more than eight if you so desired.
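The distribution of stages across those ThreadController instances can be pictured with a small sketch. This is a hypothetical Python illustration of one way Root.dtsx's fan-out could work (a round-robin over ThreadIds); the function name and the exact distribution scheme are assumptions, not the framework's actual code:

```python
def assign_threads(stage_names, max_threads=8):
    """Distribute the stages of one ETLJobStageOrder across up to eight
    ThreadIds, round-robin, as Root.dtsx does for its concurrent
    ThreadController.dtsx instances. Sketch of the idea only."""
    n = min(max_threads, len(stage_names)) or 1   # never spin up idle threads
    buckets = {tid: [] for tid in range(1, n + 1)}
    for i, name in enumerate(stage_names):
        buckets[(i % n) + 1].append(name)
    return buckets
```

Each ThreadId's bucket then corresponds to the list of ETLJobStageNames that one ThreadController.dtsx instance will work through.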
Here is a screenshot of Root.dtsx control flow:
ThreadController.dtsx is responsible for calling your packages that actually do some work. It receives a ThreadId and ETLJobStageOrder from Root.dtsx, which it uses to interrogate the database to get a list of ETLJobStageNames that it needs to execute. It loops over that list and executes a package of the same name from the current project.
When an ETLJobStage completes successfully it is the job of ThreadController.dtsx to update the database to indicate that this has occurred.
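That loop boils down to something like the following Python sketch, where `execute_package` and `mark_succeeded` are stand-ins for the package-execution task and the database update respectively (both names are my own, purely for illustration):

```python
def run_thread(stage_names, execute_package, mark_succeeded):
    """What ThreadController.dtsx does per thread: loop over its assigned
    stages, execute the matching package, and record each success in the
    database. A failure raises before mark_succeeded runs, so the stage
    stays pending and a restart will pick it up again. Sketch only."""
    for name in stage_names:
        execute_package(f"{name}.dtsx")   # package name must match the stage name
        mark_succeeded(name)
```

Because success is only recorded after the package completes, a stage that fails mid-flight is still marked as pending, which is what lets a restarted ETLJob resume from the point of failure.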
Here is a screenshot of ThreadController.dtsx control flow: