This is the blog of Jamie Thomson, a freelance data mangler in London
I’ve been very quiet on the blogging front of late. There are a few reasons for that but one of the main ones is that I’ve spent the past three months on a new gig immersing myself in Hadoop, primarily in Microsoft’s Hadoop offering called HDInsight. I’ve got a tonne of learnings that I want to share at some point but in this blog post I’ll start with a handy little script that I put together yesterday.
I’m using a tool in the Hadoop ecosystem called HBase which was made available on HDInsight in preview form about a month ago. HBase is a NoSQL solution intended to provide very fast access to data and my colleagues and I think it might be well suited to a problem we’re currently architecting a solution for. In order to evaluate HBase we wanted to shove lots of meaningless data into it, and in the world of HDInsight the means of communicating with your HDInsight cluster is Powershell. Hence I’ve written a Powershell script that uses HBase’s REST API to create a table and insert random data into it. If you’ve googled your way to this post then you’re likely already familiar with Hadoop, HDInsight, REST, Powershell, HBase, column families, cells, rowkeys and other associated jargon so I won’t cover any of those. What is important is the format of the XML payload that has to get POSTed/PUTted up to the REST API. That payload looks like this:
<?xml version="1.0" encoding="UTF-8"?>
The payload can contain as many cells as you like. When the payload gets POSTed/PUTted the values therein need to be base64 encoded but don’t worry, the script I’m sharing herein takes care of all that for you. The script will also create the table for you. The data that gets inserted is totally meaningless and is also identical for each row; modifying the script to insert something meaningful is an exercise for the reader.
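To make the base64 requirement concrete, here is a minimal sketch in Python (not the Powershell script itself). It assumes the payload follows the CellSet/Row/Cell layout of the HBase REST API's XML format, with the row key, the column name and the cell value all base64 encoded. The row key `row1`, the column `cf1:col1` and the helper names `b64` and `build_cellset` are hypothetical, chosen only for illustration.

```python
import base64
from typing import Mapping


def b64(text: str) -> str:
    """Base64-encode a UTF-8 string and return it as ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_cellset(rows: Mapping[str, Mapping[str, str]]) -> str:
    """Build a CellSet XML payload (assumed layout, see lead-in).

    rows maps a row key to a mapping of "family:qualifier" -> value.
    """
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<CellSet>"]
    for row_key, cells in rows.items():
        parts.append(f'<Row key="{b64(row_key)}">')
        for column, value in cells.items():
            # Column name and value are both encoded; base64 output is XML-safe.
            parts.append(f'<Cell column="{b64(column)}">{b64(value)}</Cell>')
        parts.append("</Row>")
    parts.append("</CellSet>")
    return "".join(parts)


# Hypothetical row key, column and value.
payload = build_cellset({"row1": {"cf1:col1": "hello"}})
print(payload)
```

Adding more entries to the inner mapping produces more Cell elements in the same Row, which is the sense in which a payload can carry as many cells as you like.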
Another nicety of this script is that it uses Invoke-RestMethod which is built into Powershell 4. You don’t need to install other Powershell modules, nothing Azure specific. If you have Powershell 4 you’re good to go!
Embedding code on this blog site is ugly so I’ve made it available on my OneDrive: CreateHBaseTableAndPopulateWithData.ps1. The screenshot below gives you an idea of what’s going on here.
Hope this helps!
UPDATE. I’ve posted a newer script CreateHBaseTableAndPopulateWithDataQuickly.ps1 which loads data in much quicker. This one sends multiple rows in each POST and hence I was able to insert 13.97m rows in 3 hours and 37 minutes which, given latency to the datacentre and that this was over a RESTful API, isn’t too bad. The previous version of the script did singleton inserts and hence would have taken weeks to insert that much data.
The number of POSTs and the number of rows in each POST are configurable.
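As a rough illustration of the batching idea (again in Python rather than Powershell), the sketch below splits a list of rows into groups of a configurable size, one group per POST, and then works out the throughput implied by the figures quoted above. The function name `batches` and the sample sizes of 10 rows and 4 rows per POST are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(rows: Iterable[str], rows_per_post: int) -> Iterator[List[str]]:
    """Yield successive lists of at most rows_per_post rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, rows_per_post))
        if not chunk:
            return
        yield chunk


# Hypothetical settings: 10 rows split into POSTs of 4 rows each.
row_keys = [f"row{i}" for i in range(10)]
posts = list(batches(row_keys, rows_per_post=4))
print([len(p) for p in posts])

# Throughput implied by the post: 13.97m rows in 3 hours 37 minutes.
elapsed_seconds = (3 * 60 + 37) * 60
rows_per_second = 13_970_000 / elapsed_seconds
print(round(rows_per_second))
```

That works out at a little over a thousand rows per second, which is only reachable because each HTTP round trip carries many rows rather than one.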
I am still working apace on updates to my open source project SSISReportingPack, specifically I am working on improvements to usp_ssiscatalog which is a stored procedure that eases the querying and exploration of the data in the SSIS Catalog.
In this blog post I want to share a titbit of information about usp_ssiscatalog: all the actions that you can take when you execute usp_ssiscatalog are documented within the stored procedure itself. For example, if you simply execute
EXEC usp_ssiscatalog @action='exec'
in SSMS then switch over to the messages tab you will see some information about the action:
OK, that’s kinda cool. But what if you only want to see the documentation and don’t actually want any action to take place? Well, you can do that too using the @show_docs_only parameter like so:
EXEC dbo.usp_ssiscatalog @a='exec',@show_docs_only=1;
That will only show the documentation. Wanna read all of the documentation? That’s simply:
EXEC dbo.usp_ssiscatalog @a='exec',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='execs',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='configure',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_created',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_running',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_canceled',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_failed',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_pending',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_ended_unexpectedly',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_succeeded',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_stopping',@show_docs_only=1;
EXEC dbo.usp_ssiscatalog @a='exec_completed',@show_docs_only=1;
I hope that comes in useful for you sometime. Have fun exploring the documentation on usp_ssiscatalog. If you think the documentation can be improved please do let me know.
- Cortana – Microsoft’s digital personal assistant, competing with Apple’s Siri and Google Now.
“She gets to know you by learning your interests over time. She looks out for you, providing proactive, useful recommendations. And Cortana keeps you closer to the people and things you care about most, by keeping track of all that matters.” - http://www.windowsphone.com/en-US/features-8-1#Cortana
- Schema.org – “a collection of schemas that used to markup HTML pages, and that can also be used for structured data interoperability Search engines rely on this markup to improve the display of search results” https://schema.org/.
I think of schema.org as machine-readable metadata on a web page. This has existed for years in various guises (e.g. microformats); schema.org is a company-independent initiative to standardise those schemas.
One of the interesting features that Cortana provides is the ability to parse your email in order to discover flight information and then keep you up to date with information about that flight. Upon reading MSDN article Sending flight information to Microsoft Cortana with contextual awareness it transpires that this feature is entirely dependent upon the email containing schema.org markup:
Microsoft Cortana interprets schema.org markup in e-mails to extract airline flight reservation data
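For a feel of what such markup can look like, here is a small Python sketch that builds a schema.org FlightReservation description as JSON-LD and wraps it in a script tag suitable for embedding in an HTML email. The type and property names come from the schema.org vocabulary as I understand it; which properties and which syntax (JSON-LD or microdata) Cortana actually requires is not stated here, so treat that as an assumption. The passenger, airline, flight number and times are all made up.

```python
import json

# All values below are hypothetical, for illustration only.
reservation = {
    "@context": "http://schema.org",
    "@type": "FlightReservation",
    "reservationId": "ABC123",
    "underName": {"@type": "Person", "name": "Jane Doe"},
    "reservationFor": {
        "@type": "Flight",
        "flightNumber": "110",
        "provider": {"@type": "Airline", "name": "Example Air", "iataCode": "EX"},
        "departureAirport": {"@type": "Airport", "iataCode": "LHR"},
        "arrivalAirport": {"@type": "Airport", "iataCode": "SEA"},
        "departureTime": "2014-06-01T10:30:00+01:00",
        "arrivalTime": "2014-06-01T12:45:00-07:00",
    },
}

# Serialise to JSON-LD and wrap it so it can sit inside an HTML email body.
json_ld = json.dumps(reservation, indent=2)
html_snippet = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(html_snippet)
```

The point is that the flight details are stated as typed, named fields rather than left for a parser to guess at from free text.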
We all unknowingly make use of http://schema.org markup all the time (every time you use Google for example, the share contract in Windows 8 & Windows Phone 8.1 can also leverage schema.org) however this is the first time I’ve been made aware of a user experience that is totally dependent on it.
I find this fascinating, I’ve long had an interest in structured metadata on the web (I’ve been banging on about it in blog posts for the last 10 years) and now my mind is in overdrive thinking of other scenarios that could leverage this. Taking a look at the documentation shows us that schema.org provides definitions of a multitude of things (in fact Thing is the base class from which everything else is derived), some examples:
It’s not hard to envisage that Cortana might one day see mention of a music track in an email and offer the ability to play it for you, or see contact information and offer to add that to your contact list. Where this gets potentially even more interesting (to me, anyway) is when we consider that Cortana is already extensible by 3rd party apps – I foresee that in the future 3rd party apps will have the ability to subscribe to a particular schema and have Cortana notify them when it comes across an instance of that schema (imagine a Spotify app being informed that an email just arrived with information about a new music release and then offering to play that for you).
Perhaps this 3rd party extensibility could work the other way too by presenting other data sources to Cortana. For example, if Twitter ever get round to implementing annotations then Cortana could potentially read metadata hidden within tweets as well as combing through emails.
Expect that in the future Cortana will parse new data sources, find new types of information and allow more 3rd parties to act upon it. The potential for a degraded user experience with notifications flashing in your face all the time is a worry but frankly I don’t care, this is all fascinating stuff to a data geek like me.
P.S. Thank you to Savas Parastatidis from the Cortana dev team for highlighting Cortana’s use of schema.org.
Here’s some code that I absolutely know I’m going to need again in the future, what better place to put it than on my blog!
If you need to prompt the user for a password when using Powershell then you want to make sure that the value typed in isn’t visible on the screen. That’s quite easy using the -AsSecureString parameter of the Read-Host cmdlet, however it’s not quite so easy to retrieve the supplied value. The following code shows how to do it:
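# Prompt for the password without echoing the typed characters to the screen
$response = Read-Host "What's your password?" -AsSecureString
# Convert the SecureString back to plain text by way of an unmanaged BSTR
$password = [Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($response))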
I don’t know of a quick and easy way to format Powershell code for a blog post so here’s a screenshot instead:
I’ve also put this on pastebin: http://pastebin.com/2D6xaz0U
All credit goes to Paul Williams for his post Converting System.Security.SecureString to String (in PowerShell)
I have maintained a watching brief on what I refer to as “cloud ETL”, that is the ability to build ETL routines in a cloud environment and therefore leverage all the benefits that the cloud model brings*. Thus far my main opinion piece in this area is What would a cloud-based ETL tool look like? in which I laid out what features I thought a cloud ETL tool should have:
- Data transformation would be done “in the cloud” i.e. I wouldn’t need to own my own hardware in order to run it
- Ability to consume data from/push data to <many different data protocols>
- Adapters (possibly with a plug-in model) for cloud storage and API providers
- Job scheduler
- Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)
- An IDE (open to debate whether the IDE should be “in the cloud” as well)
- Ability to carry out common transformations (join, aggregate, sort, projection) on those heterogeneous data sources
- Ability to authenticate using different authentication mechanisms
- Configurable logging
- Ability to publish transformed data in a manner that makes it consumable rather than insert it into another data store
Given that I have spent the majority of my career working with Microsoft technologies (in particular their ETL tool, SSIS) I am interested to know whether Microsoft will offer a cloud ETL tool. With that in mind I was interested to discover a new service on Azure that is currently in preview called Azure Automation (read Announcing Microsoft Azure Automation Preview). Azure Automation is essentially a cloud-based workflow tool and, as I said above, workflow is a feature that I believe a cloud-based ETL tool should encompass:
- Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)
SSIS developers will of course be aware that SSIS has its own workflow tool (termed the Control Flow). It always kind of bugged me that different Microsoft tools had their own workflow technology. SSIS had one, I believe BizTalk had one, there was another called Windows Workflow Foundation (WWF) and in fact there was a possibility within the SQL Server 2008 timeframe that SSIS would replace its Control Flow with WWF (that never happened and the Program Manager that wanted to do it has since left the SSIS product team).
Azure Automation is built upon Powershell Workflow which in turn is built upon WWF (now simply called Workflow Foundation – WF). It certainly seems as though WF is becoming the foundational workflow technology to rule them all within Microsoft and that is no bad thing in my opinion – it seems foolish to reinvent the wheel every time. Powershell Workflow has the following cmdlets for building workflows:
- Foreach -parallel
Those are all fairly self-explanatory. Of particular interest to me is Foreach -parallel (we’ve been asking for a native Parallel ForEach Loop in SSIS for years) and that might be even more useful in a scale-out infrastructure such as can be offered by the cloud (imagine firing off multiple FTP tasks in parallel, all working on different Azure nodes). Checkpoint-Workflow also sounds very interesting:
A checkpoint is a snapshot of the current state of the workflow, including the current values of variables, and any output generated up to that point, and it saves it to disk. You can add multiple checkpoints to a workflow by using different checkpoint techniques. Windows PowerShell automatically uses the data in newest checkpoint for the workflow to recover and resume the workflow if the workflow is interrupted, intentionally or unintentionally.
Stateful restartability that you can control, all out-of-the-box. How cool is that? So much better than the awful checkpointing feature within SSIS.
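To show the general idea of checkpoint-and-resume, here is a minimal Python sketch. It is an analogy only, not how Powershell Workflow implements Checkpoint-Workflow: the JSON checkpoint file, the `run_workflow` function and the `fail_at` parameter (used to simulate an interruption) are all hypothetical.

```python
import json
import os
import tempfile


def run_workflow(steps, checkpoint_path, fail_at=None):
    """Run steps in order, saving state after each one so a rerun can resume."""
    state = {"next_step": 0, "output": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the newest checkpoint
    for index in range(state["next_step"], len(steps)):
        if index == fail_at:
            raise RuntimeError(f"simulated interruption at step {index}")
        state["output"].append(steps[index]())
        state["next_step"] = index + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # checkpoint: persist progress and output so far
    os.remove(checkpoint_path)  # finished, nothing left to resume
    return state["output"]


steps = [lambda: "extract", lambda: "transform", lambda: "load"]
path = os.path.join(tempfile.mkdtemp(), "workflow.checkpoint.json")

try:
    run_workflow(steps, path, fail_at=2)  # first run is interrupted before "load"
except RuntimeError:
    pass

result = run_workflow(steps, path)  # second run resumes at "load"
print(result)
```

On the second call the first two steps are not executed again; their output is read back from the checkpoint and only the remaining step runs.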
It certainly appears to me that Azure Automation could satisfy my desire for a workflow engine for the purposes of cloud-ETL. Now if only Microsoft were working on cloud-based dataflows too we’d have something akin to SSIS-in-the-cloud.
*My own personal opinion is that the benefits of the cloud model can be summed up simply as “OPEX not CAPEX”. You may have your own definition, and that’s OK.
On 31st March 2014 I released version 126.96.36.199 of SSIS Reporting Pack, my open source project that aims to enhance the SSIS Catalog that was introduced in SSIS 2012. This is a big release because it includes an entirely new feature: the Restart Framework.
The Restart Framework exists to cater for a deficiency within SSIS, that being the poor support for restartability. Let's define what I mean by restartability:
A SSIS execution that fails should, when re-executed, have the ability to start from the previous point of failure.
SSIS provides a feature called checkpoint files that is intended to help in this scenario but I am of the opinion that checkpoint files are an inadequate solution to the problem. I explain why in my blog post Why I don't use SSIS checkpoint files.
The Restart Framework was designed to overcome the many shortcomings of checkpoint files.
One of the fundamental tenets of the Restart Framework is that the packages that you, the developer, build for your solution should not be required to contain any variables, parameters, tasks, or event handlers in order to make them work with the Restart Framework. In fact your packages should be agnostic of the fact that they are being executed by the Restart Framework.
TL;DR: A video that demonstrates the installation and base functionality of the Restart Framework can be viewed at https://www.youtube.com/watch?v=syV0Wpwhlnk
Let's define some important terms that you will need to become familiar with if you are going to use the Restart Framework.
An ETLJob is the definition of some work that an end-to-end ETL process needs to perform. An ETLJob would typically incorporate many SSIS packages. Each ETLJob has a name (termed ETLJobName) which can be any value you want, some example ETLJobNames might include:
- Nightly Data Warehouse Load
- Monthly Reconciliation
- All backups
Each ETLJob contains one or more ETLJobStages. These are the "building blocks" of your solution and for each ETLJobStage there must exist a package in your SSIS project with a matching name. For example, an ETLJobStage with the name "FactInternetSales" will require a SSIS package called "FactInternetSales.dtsx".
The Restart Framework allows the declaration of dependencies between ETLJobStages - an ETLJobStage cannot start until all ETLJobStages with a lower ETLJobStageOrder have completed successfully. This is a fundamental tenet of the Restart Framework as it needs to know the order in which ETLJobStages need to occur in order that it can restart execution from the previous point of failure.
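The ordering rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not the Restart Framework's actual code (which lives in SSIS packages and stored procedures). The function name `stages_to_run` is hypothetical, as are the DimCustomer and DimProduct stage names.

```python
from collections import defaultdict


def stages_to_run(stages, completed):
    """Group the outstanding stages by ETLJobStageOrder, lowest order first.

    stages:    list of (stage_name, stage_order) tuples.
    completed: set of stage names that already succeeded in the failed run.
    """
    by_order = defaultdict(list)
    for name, order in stages:
        if name not in completed:
            by_order[order].append(name)
    # Every stage in one group must succeed before the next group may start.
    return [sorted(by_order[order]) for order in sorted(by_order)]


# Hypothetical job: two dimension loads at order 1, one fact load at order 2.
stages = [("DimCustomer", 1), ("DimProduct", 1), ("FactInternetSales", 2)]

print(stages_to_run(stages, completed=set()))
print(stages_to_run(stages, completed={"DimCustomer"}))
```

In the second call DimCustomer is skipped because it already succeeded, so a restart begins with DimProduct and only then moves on to FactInternetSales.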
The Restart Framework provides some stored procedures that should be used to define ETLJobs, ETLJobStages and the dependencies between them.
One important point to make about ETLJobStages is that the Restart Framework only supports restartability of a failed ETLJobStage; it has no control over (and, indeed, does not care) what occurs within that ETLJobStage. The implication therefore is that the onus is on the package developer to ensure that each ETLJobStage is re-runnable from the start of that package in the event of failure; in other words an ETLJobStage must be idempotent.
Each time an ETLJob is executed a record is inserted into a table called ETLJobHistory and a unique ETLJobHistoryId is assigned. Crucially, when a previously-failed ETLJob is restarted it retains the same ETLJobHistoryId. Compare this to SSIS' own execution_id which will be different whenever an ETLJob is restarted.
The ETLJobHistoryId can be particularly useful when used for lineage purposes in a data warehouse loading routine. Every inserted or updated record can have the ETLJobHistoryId stored against it which is useful for providing lineage information such as when the record was inserted/updated.
The Restart Framework's metadata lives in the SsisReportingPack database. This is the same database that houses usp_ssiscatalog and all of its supporting code modules. All of the database objects that support the Restart Framework are in a schema called RestartFramework.
The Restart Framework consists of two packages that must be included in every SSIS project that is intending to use the Restart Framework hence they will need to be added into your SSIS project within Visual Studio.
Root.dtsx must be executed in order to have any execution be managed by the Restart Framework. It takes a single parameter, ETLJobName, to indicate which ETLJob it should execute. Root.dtsx will interrogate the Restart Framework metadata in the SsisReportingPack database to determine which ETLJobStages are included.
For each ETLJobStage Root.dtsx will fire off a new instance of ThreadController.dtsx, passing it a ThreadID and an ETLJobStageOrder.
Root.dtsx can fire off up to eight concurrent instances of ThreadController.dtsx. This number is configurable, however eight is the maximum. You could easily extend Root.dtsx to fire off more than eight if you so desired.
Here is a screenshot of Root.dtsx control flow:
ThreadController.dtsx is responsible for calling your packages that actually do some work. It receives a ThreadId and ETLJobStageOrder from Root.dtsx which it uses to interrogate the database to get a list of ETLJobStageNames that it needs to execute. It loops over that list and executes a package of the same name from the current project.
When an ETLJobStage completes successfully it is the job of ThreadController.dtsx to update the database to indicate that this has occurred.
Here is a screenshot of ThreadController.dtsx control flow:
I’m a frequent user of OneNote and so was delighted with today’s news that there is now a public API available so that third party apps and services can put stuff into your OneNote notebooks (an API? welcome to the modern web, OneNote). One of those third party services is ifttt so I’ve set up a few ifttt recipes to dump stuff into OneNote:
All very nice thanks very much.
I do have a few quibbles though (otherwise why would I be writing a blog post, right?). Firstly, the API only allows you to create pages, it cannot append to existing ones. Second, and more importantly, you can’t choose which notebook section to create the page in. I find this really annoying. Take the example of my ifttt recipe above that bungs all my blog posts into OneNote: how much more useful would it be if we could choose which section to put them into? As it stands right now I would have to go and move them all after the event. Still, credit where credit is due, the API exists and I harbour hopes that it will improve over time.
A OneNote API is nice and all but one thing I’ve been craving for years is an API that allows me to insert data into an Excel spreadsheet residing on OneDrive. I’ve written in the past about the Excel Services REST API where I lamented:
Although I haven't demonstrated it here Excel Services' REST API does provide a makeshift way of altering the data by changing the value of specific cells however what it does not allow you to do is add new data into the workbook. Google Docs allows this.
Exploring the Excel Services REST API
Chris Webb (who has joined me in this crusade) raised a forum thread in June 2010 entitled Excel Web App API? where he requested such a thing, nearly four years later and we’re still waiting.
Ifttt allows recipes that trigger every time you tweet, so how cool would it be if this could be used to insert a new row into an Excel spreadsheet on OneDrive for each of my tweets*? Well I would like that anyway and the existence of this new OneNote API rekindles my hope that one day such an API for Excel might exist – please don’t let me wait another 4 years though, Microsoft!
* Before anyone leaves a comment telling me so, I’m already aware that I can use ifttt to insert all my tweets into a Google Docs spreadsheet and indeed I’m already doing so. I’d just prefer it for Excel, that’s all.
This week a SharePoint conference took place somewhere and I took more than a passing interest because it clearly wasn't a SharePoint conference, it was an Office365/Yammer conference and as far as I can discern the big takeaways were:
It was interesting to me because Power BI is something that is on my radar and which is delivered via Office 365. This got me thinking about scenarios where Power BI & Yammer could play together more effectively.
The BI delivery team that I currently work for is trying to find ways to make the information that we produce more discoverable, more accessible and to promote the use of the information that we provide throughout the company. The company is an Office365 customer however they pretty much use it only as an email & IM provider - none of the SharePoint-y stuff is used. The company is also a Yammer customer.
The confluence of Yammer and Power BI might make an interesting story here. Imagine, for example, the ability to build a Power View report using Power BI and then share that throughout the organisation using Yammer, perhaps via a Yammer group. Anyone viewing their Yammer feed would be able to view and interact with that Power View report without leaving Yammer. I’m not talking about simply viewing an image of a report either – I’d want to be able to slice’n’dice that report right within my Yammer feed.
I’ve long thought that we need to think of new ways of delivering BI to the masses and I believe social collaboration tools present a great opportunity to do that. I’m excited about what Yammer + Power BI could bring, let’s hope Microsoft don’t royally screw it up.
I still believe that Microsoft’s Master Data Services (MDS) should be offered through Power BI and again the opportunity to collaboratively compile and discuss data that resides in MDS is compelling. I see no reason why people wouldn’t want to change MDS data from within their Yammer feed – why would we force them to go elsewhere? Again I opine, bring the data to wherever your users are, don’t make them go somewhere else.
Hidden away behind all of the announcements was the implicit assertion that Windows Azure Active Directory is critical to Microsoft’s cloud efforts. Office 365 sits on top of Windows Azure Active Directory and I don’t think many people realise the significance of that. Whoever manages your company’s employees’ identities has a huge opportunity for selling new stuff to you and that’s why Windows Azure Active Directory is free. This is not a new play for Microsoft, over the past 20 years or so they’ve become a huge player in the corporate landscape and that’s in no small way down to Active Directory – own the identity and you can sell them other stuff like SharePoint, Windows, SQL Server etc… By allowing you to extend your Active Directory into the cloud and have pervasive groups it’s not far off being a no-brainer for companies to use Windows Azure & Office 365.
Active Directory in the cloud, public and private groups, identity management, developer APIs … those are the big plays here and they are very much like what I described in my blog post Windows Live Groups predictions and “Active directory in the cloud”. The names and players have changed but the concepts I outlined there are now happening. Back then I said:
[This] gives rise to the idea of Groups becoming something analogous to an "active directory in the cloud". This is a disruptive idea partly because it could become the mechanism by which Microsoft grant access to their online properties in the future.
Even more powerful is the idea that 3rd party websites that authenticate visitors … could use Groups to determine what each user can do on that site. Groups will become part of an authentication infrastructure that anyone in the world can leverage.
This "active directory in the cloud" idea relies on a robust API that allows a 3rd party site to add and remove people from groups.
Believe it or not that was six years ago. Don’t want to say I told you so, but…
SET STATISTICS TIME ON
SET STATISTICS IO ON
return information about query executions and are very useful when doing performance tuning work as they report how long a query took to execute and the amount of IO activity that occurred as a result of that query.
These are very effective features however to my mind they do have a drawback in that the information they provide is not accessible in the actual query window from which the query was executed. This means the results cannot be collected, stored in a table, and then queried – such information would have to be manually copied and pasted from the messages pane into (say) a spreadsheet for further analysis.
This is dumb. I’m a SQL Server developer, I want my data available so that I can bung it into a table in SQL Server and issue queries against it. That is why, a couple of weeks ago, I submitted a request to Microsoft Connect entitled Access to STATS TIME & STATS IO from my query in which I said:
I recently was doing some performance testing work where I was evaluating the affect of changing various settings on a particular query. I would have liked to simply run my query inside a couple of nested loops in order to test all permutations but I could not do that because every time I executed the query I had to pause so I could retrieve the stats returned from STATISTICS IO & STATISTCS TIME and manually copy and paste (yes, copy and paste) the information into a spreadsheet.
This feels pretty dumb in this day and age. Why can we not simply have access to that same information within my query? After all, we have @@ROWCOUNT, ERROR_MESSAGE(), ERROR_NUMBER() etc... that provide very useful information about the previously executed statement, how about @@STATISTICS for returning all the IO & timing info? We can parse the text returned by that function to get all the info we need.
Better still, provide individual functions e.g.:
Ralph Kemperdick noticed my submission and correctly suggested that the same information could be accessed using Extended Events. Based on this I’ve written a script (below) that issues a series of queries against the AdventureWorks2012 sample database, captures similar stats that would be captured by SET STATISTICS then presents them back at the end of the query. Here are those results:
The information is not as comprehensive as what you would get from SET STATISTICS (no Read-Ahead Reads for example, and no breakdown of IO per table) but should be sufficient for most purposes.
You can adapt the script accordingly for whatever information you want to capture. The important parts of the script are the creation of the XEvents session for capturing the queries, then reading and shredding the XML results thereafter.
Hope this is useful!
UPDATE: Turns out you don't need all of this. I've just been informed that Richie Rump has written a parser at http://statisticsioparser.com/ that does all of this for you. Simply paste in your STATISTICS IO output and press the button - it will do all the hard work for you and give you the results back in a nice readable graph. You can paste in multiple results at once too.
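If you would rather do that kind of parsing yourself, the idea is simple enough to sketch in Python. This is not how the parser mentioned above is implemented. It assumes the usual shape of a STATISTICS IO message line, "Table 'name'. counter value, counter value, ...", and the sample text and its numbers are invented for illustration.

```python
import re

# Matches lines such as: Table 'SalesOrderDetail'. Scan count 1, logical reads 1246, ...
TABLE_RE = re.compile(r"Table '(?P<table>[^']+)'\.\s*(?P<counters>.*)")


def parse_statistics_io(text):
    """Turn STATISTICS IO message text into a list of dicts, one per table line."""
    rows = []
    for line in text.splitlines():
        match = TABLE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a "Table '...'" line
        row = {"table": match.group("table")}
        for part in match.group("counters").rstrip(".").split(","):
            part = part.strip()
            if not part:
                continue
            # The counter name is everything before the final space, the value after it.
            name, _, value = part.rpartition(" ")
            row[name] = int(value)
        rows.append(row)
    return rows


# Invented sample output, for illustration only.
sample = """
Table 'SalesOrderDetail'. Scan count 1, logical reads 1246, physical reads 3, read-ahead reads 1277, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 1, logical reads 689, physical reads 2, read-ahead reads 685, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
"""

rows = parse_statistics_io(sample)
for row in rows:
    print(row["table"], row["logical reads"], row["physical reads"])
```

Once the counters are in a list of dictionaries they can be written to a table or a spreadsheet without any copying and pasting.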
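--Create the event session
CREATE EVENT SESSION [queryperf] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
ADD TARGET package0.event_file(SET filename=N'C:\temp\queryperf.xel',max_file_size=(2),max_rollover_files=(100))
WITH ( MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,
       MAX_DISPATCH_LATENCY=120 SECONDS,MAX_EVENT_SIZE=0 KB);
GO
--Set up some demo queries against AdventureWorks2012 in order to evaluate query time & IO
USE AdventureWorks2012;
DECLARE @SalesPersonID INT;
DECLARE @salesTally INT;
--One iteration per sales person
DECLARE mycursor CURSOR FOR
SELECT soh.SalesPersonID
FROM Sales.SalesOrderHeader soh
GROUP BY soh.SalesPersonID;
OPEN mycursor;
FETCH NEXT FROM mycursor INTO @SalesPersonID;
ALTER EVENT SESSION [queryperf] ON SERVER STATE = START;
WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @salesTally = COUNT(*)
    FROM Sales.SalesOrderHeader soh
    INNER JOIN Sales.[SalesOrderDetail] sod ON soh.[SalesOrderID] = sod.[SalesOrderID]
    WHERE SalesPersonID = @SalesPersonID;
    FETCH NEXT FROM mycursor INTO @SalesPersonID;
END
CLOSE mycursor;
DEALLOCATE mycursor;
DROP EVENT SESSION [queryperf] ON SERVER;
--Extract query information from the XEvents target
SELECT q.duration, q.cpu_time, q.physical_reads, q.logical_reads, q.writes
FROM (
    --Shred each event's XML into columns
    SELECT e.event_data_XML.value('(/event/@timestamp)[1]', 'datetime2') AS [timestamp]
    ,      e.event_data_XML.value('(/event/data[@name="statement"]/value)[1]', 'nvarchar(max)') AS [statement]
    ,      e.event_data_XML.value('(/event/data[@name="duration"]/value)[1]', 'bigint') AS duration
    ,      e.event_data_XML.value('(/event/data[@name="cpu_time"]/value)[1]', 'bigint') AS cpu_time
    ,      e.event_data_XML.value('(/event/data[@name="physical_reads"]/value)[1]', 'bigint') AS physical_reads
    ,      e.event_data_XML.value('(/event/data[@name="logical_reads"]/value)[1]', 'bigint') AS logical_reads
    ,      e.event_data_XML.value('(/event/data[@name="writes"]/value)[1]', 'bigint') AS writes
    FROM (
        SELECT CAST(event_data AS XML) AS event_data_XML
        FROM sys.fn_xe_file_target_read_file('C:\temp\queryperf*.xel', NULL, NULL, NULL)
    ) e
) q
WHERE q.[statement] LIKE 'select @salesTally = count(*)%' --Filters out all the detritus that we're not interested in!
ORDER BY q.[timestamp] ASC;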
Accepted wisdom when one purchases an app from a business store is that one gets free updates for life. This is, quite obviously, an unsustainable business model and I suspect is the main reason why so many apps use advertising to generate income.
There is though, in the enterprise world at least, a move to a subscription-based business model (i.e. renting software) the most obvious examples of which are Office 365 and Adobe Creative Cloud and I’m left wondering why app stores don’t offer a similar option.
Today I installed an app called Tweetium that offers a (paid for) premium option, here is why the premium option exists:
Again this strikes me as unsustainable. The customer pays once yet Tweetium has to pay TweetMarker every month. Forever. It doesn’t take an expert mathematician to realise that eventually Tweetium’s monthly outlay could exceed the income they have saved up from purchases.
It seems to me there is a simple solution to all this. App stores could offer an option for customers to rent apps rather than buy them. It’s more sustainable for the app vendor and the app store provider gets a more predictable income stream (which CFOs seem to like). Why don’t app stores do this? Seems like a no-brainer to me.
Just a random thought for a Sunday morning.
UPDATE: Apparently iOS & Android app stores *do* offer subscription models, I just wasn't aware of it.
On 17th February 2014 (3 days ago) I visited an event called SQL Supper held at Microsoft’s central London office, Cardinal Place. The event was basically a QnA session with Mark Souza, Conor Cunningham, Nigel Ellis, Hatay Tuna & Ewan Fairweather and one part of the evening was loosely termed the gripe session where the attendees were invited to stick their hand in the air and when asked have a good old whinge about something in SQL Server that, well, frankly pissed them off. Given the members of the panel this was inevitably focused on the database platform in SQL Server rather than the BI stuff and this is what I was only too happy to gripe about:
Microsoft seem to have dropped the ball on database developer productivity, both in the language and the tooling. A decade ago this is something that SQL Server was renowned for, I put it to you that this is no longer the case. SSDT came out with SQL Server 2012 and it’s a great tool, I love it, but in the two years since there have been various maintenance releases but hardly any new features. SSMS has hardly changed for years, extensibility is still not truly supported. Intellisense does not work properly 100% of the time. As far as I can recall T-SQL has had only two major features (TRY/CATCH & windowing functions) in the last ten years.
Please fix this. Show database developers some love again.
I could write pages and pages of gripes just under the banner of developer productivity but I’ll leave you with that concise summary. It is of course a matter of opinion, feel free to agree or disagree.
In this week’s earlier blog post First release of my own personal T-SQL code library on Github I talked of how one could use a dacpac to distribute a bunch of code to different servers. Upon reading the blog post Jonathan Allen (of SQL Saturday Exeter fame), with whom I’ve been discussing dacpacs on and off recently, sent me this email:
The DacPac thing I emailed about in December hasnt taken off yet but I have just downloaded your code library to take a look and I like the way the dacpac works. Should I be able to open that in VS or is the dacpac compiled/built in VS? The video you linked to didnt cover dapac at all so I am in the dark on how to create one/them.
If I can build a database and create a dacpac simply then this could be really useful.
Jonathan’s email made me realise that there is perhaps a lot of confusion about what dacpacs are, what they can be used for and how they can be used, so I figured a braindump of what I know about them might be useful. That’s what you’re getting in this blog post.
What is a dacpac?
A dacpac is a file with a .dacpac extension.
In that single file is a collection of definitions of objects that one could find in a SQL Server database, such as tables, stored procedures and views, plus some instance level objects such as logins (the complete list of supported objects for SQL Server 2012 can be found at DAC Support For SQL Server Objects and Versions). The fact that a dacpac is a file means you can do anything you could do with any other file: store it, email it, share it on a file server etc. This means that dacpacs are a great way of distributing the definition of many objects (perhaps even an entire database) in a single file. Or, as Microsoft puts it, a self-contained unit of SQL Server database deployment that enables data-tier developers and database administrators to package SQL Server objects into a portable artifact called a DAC package, also known as a DACPAC. That in itself is, I think, very powerful.
Ostensibly a dacpac is a binary file so you can’t just open it up in your favourite text editor and look at the contents of it. However, what many people do not know is that the format of a dacpac is simply the common ZIP compression format and hence we can add .zip to the end of a dacpac filename:
and open it up like you would any other .zip file to have a look inside. If you do so you will see this:
The contents of that zip file conform to something called the Open Packaging Conventions (OPC). OPC is a standard defined by Microsoft for, well, for zipping up files basically. You have likely used files conforming to OPC before without knowing it: .docx, .xlsx and .pptx are the most common ones that you might recognise if you use Microsoft Office, and there are some more obscure ones such as .ispac (SSIS2012 developers should recognise that). (For a more complete list of OPC-compliant file types see the wikipedia page).
Notice in the screenshot above showing the innards of TSQLCodeLibrary.dacpac the biggest file is model.xml. This is the file that contains the definition of all our SQL Server objects. I won’t screenshot that here but I encourage you to get hold of a .dacpac file (here’s one) and have a poke around to see what’s in that model.xml file.
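If you would rather not rename anything, the same poke-around can be done in a few lines of code. What follows is a minimal Python sketch using only the standard library's zipfile module; the function name list_dacpac_parts is my own invention and has nothing to do with DacFx.

```python
import zipfile


def list_dacpac_parts(dacpac_path):
    """Return (part name, uncompressed size in bytes) pairs, largest part first."""
    # zipfile looks at the file's contents rather than its extension,
    # so the .dacpac does not need renaming to .zip first.
    with zipfile.ZipFile(dacpac_path) as package:
        parts = [(info.filename, info.file_size) for info in package.infolist()]
    return sorted(parts, key=lambda part: part[1], reverse=True)


# Usage, assuming a dacpac sits in the current folder:
#   for name, size in list_dacpac_parts("TSQLCodeLibrary.dacpac"):
#       print(size, name)
```

Sorting by size puts the biggest part at the top of the list, which for the dacpac in the screenshot would be model.xml.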
What are dacpacs for?
Dacpacs are used for deploying SQL Server objects to an instance of SQL Server. That’s it. If your job does not ever involve doing that then you probably don’t need to read any further.
A .docx file (i.e. a Microsoft Word document) isn’t much use to someone if they don’t have the software (i.e. Microsoft Word) to make use of it and so the analogy holds for dacpacs; in order to use them you need to have some software installed and that software is called the Data-tier Application Framework (or DAC Framework for short, or DacFx for even shorter).
Incidentally, you may be wondering what DAC stands for at this point. I think it’s “Data-Tier Application”, in which case you may be thinking that the acronym DAC is a stupid one, especially as DAC also stands for something else in SQL Server (the Dedicated Administrator Connection). I would agree!
DacFx is available to download for free, however you’ll probably never need to do that as installation of DacFx occurs whenever you install SQL Server, SQL Server client tools or SQL Server Data Tools (SSDT). If DacFx is installed you should be able to see it in Programs and Features:
How does one deploy a dacpac?
In dacpac nomenclature the correct term for deploying a dacpac is publishing however the two generally get used interchangeably. There are two methods of publishing a dacpac which I’ll cover below.
Publish via SSMS
In SSMS’s Object Explorer right-click on the databases node and select “Deploy Data-tier Application…” (told you they used those terms interchangeably):
This launches a wizard that prompts you to choose a dacpac and fill in some particulars (e.g. database name), then deploys it for you by calling out to DacFx. Unfortunately this wizard is not very good because it doesn’t (currently) support all features of dacpacs. Namely, if your dacpac contains any sqlcmd variables (I won’t cover those here but they are commonly used within dacpacs) a value needs to be supplied for them; the wizard doesn’t prompt you for a value and hence the deployment fails.
This. Is. Stupid. Microsoft should be suitably lambasted for not providing this basic functionality. Anyway, due to this limitation you’re most likely to be using the other method which is…
Publish via command-line
One component distributed in DacFx is a command-line tool called sqlpackage.exe which will quickly become your best friend if you use dacpacs a lot. sqlpackage.exe can do a lot of things and those “things” are referred to as actions; one of those actions is publishing a dacpac. Here’s the syntax for publishing a dacpac using sqlpackage.exe:
"%ProgramFiles(x86)%\Microsoft SQL Server\110\DAC\bin\SqlPackage.exe"
/Action:Publish
/SourceFile:<path to your dacpac>
/TargetServerName:<SQL instance you are deploying to>
/TargetDatabaseName:<Name of either (a) the database to create or (b) the existing database to deploy into>
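If you find yourself publishing from a script rather than typing at a command prompt, the same call is easy to wrap. Here is a rough Python sketch; the install path is an assumption (the usual location for the SQL Server 2012 flavour of DacFx), the function names are mine, and /Action:Publish is the switch that tells sqlpackage.exe which action to run.

```python
import subprocess

# Assumed install location of the SQL Server 2012 (110) DacFx; adjust for your machine.
SQLPACKAGE = r"C:\Program Files (x86)\Microsoft SQL Server\110\DAC\bin\SqlPackage.exe"


def build_publish_command(dacpac_path, server, database, sqlpackage=SQLPACKAGE):
    """Build the argument list for a sqlpackage.exe publish, one switch per element."""
    return [
        sqlpackage,
        "/Action:Publish",
        f"/SourceFile:{dacpac_path}",
        f"/TargetServerName:{server}",
        f"/TargetDatabaseName:{database}",
    ]


def publish(dacpac_path, server, database):
    """Run the publish; check=True raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_publish_command(dacpac_path, server, database), check=True)
```

Passing the switches as a list means paths containing spaces don't need any extra quoting, which is the thing that usually bites when building these commands by hand.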
Publishing is idempotent
Notice from my comment above for TargetDatabaseName that you can deploy to an existing database. You might ask why you would want to publish into an existing database; after all, the objects you are publishing might already exist. This segues nicely into what I see as the biggest benefit of dacpacs and DacFx: the software interrogates the target database to determine whether the objects already exist and, if they do not, it will create them. If they do already exist it will determine whether the definition has changed and, if it has, it will make those changes. DacFx protects your data by default, so if it determines that an operation would cause data destruction (e.g. removing a column from a table) then it will throw an error and fail unless you have told it otherwise. You never again need to write an ALTER statement or first check that an object exists in order to change an object definition, DacFx will do it for you. To put it another way, publishing using dacpacs and DacFx is idempotent.
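To make the idea of idempotent publishing a bit more concrete, here is a toy model of it in Python. To be clear, this is not how DacFx is implemented; it is my own simplification in which a "database" is just a dictionary of table names to sets of column names, and every name in it is made up for illustration.

```python
def plan_changes(desired, current, block_on_data_loss=True):
    """List the steps needed to make `current` match `desired`.

    Both arguments map table name -> set of column names.
    Tables that exist only in `current` are left alone in this sketch.
    """
    steps = []
    for table, columns in sorted(desired.items()):
        if table not in current:
            steps.append(("create", table, tuple(sorted(columns))))
            continue
        dropped = current[table] - columns
        if dropped and block_on_data_loss:
            # Refuse to plan anything that would throw data away.
            raise RuntimeError(f"dropping {sorted(dropped)} from {table} would destroy data")
        for column in sorted(columns - current[table]):
            steps.append(("add_column", table, column))
        for column in sorted(dropped):
            steps.append(("drop_column", table, column))
    return steps


def apply_steps(current, steps):
    """Return a copy of `current` with the planned steps applied."""
    state = {table: set(columns) for table, columns in current.items()}
    for action, table, detail in steps:
        if action == "create":
            state[table] = set(detail)
        elif action == "add_column":
            state[table].add(detail)
        elif action == "drop_column":
            state[table].discard(detail)
    return state
```

Plan against an empty target and you get a list of creates; plan again after applying those steps and you get an empty list back. That second, do-nothing run is what idempotent means in practice.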
How does one create a dacpac?
Of course in order to publish a dacpac you’re first going to have to create one and one of Jonathan’s questions above pertained to exactly this. There are two distinct ways to create a dacpac.
Use an SSDT Project
SQL Server Data Tools (SSDT) projects are basically a project type within Visual Studio that provide a way of building DDL for SQL Server databases. I’m not going to cover SSDT projects in any detail here except to say that when such a project is built the output is a dacpac. Note that SSDT can also publish the dacpac for you however I didn’t mention that above as the publish operation is essentially another wrapper around the same DacFx functionality used by sqlpackage.exe.
Create from an existing database
One can right-click on a database in SSMS and click on “Extract Data-tier Application…” to create a dacpac containing the definition of all objects in that database:
Should you be using dacpacs? I can’t answer that question for you but hopefully what I’ve done is given you enough information so that you can answer it for yourself. Some people might like the way dacpacs encapsulate many objects into a single file and their idempotent deployment, others may prefer good old simple, handcrafted T-SQL scripts which don’t have any pre-requisites other than SQL Server itself. The choice is yours.
David Atkinson from Redgate has been in touch to tell me about another dacpac feature that I didn’t know about. It is possible to right-click on a dacpac in Windows Explorer and choose to unpack it:
That essentially unzips it but what you also get is a file called Model.sql that will create all of the objects in the dacpac:
Very useful indeed! David tells me that Redgate use this functionality to enable comparison of a dacpac using their SQL Compare tool as you can read about at Using a DACPAC as a data source.
Like many (most???) T-SQL developers I keep a stash of useful code that I’ve garnered down the years because I know it’s all going to come in useful at some point in the future. It includes code I’ve written myself and also code that others have shared on their own blogs. For example my code library includes the following:
I’ve never seen the point of keeping one’s code library to one’s self, might as well share it in case anyone else might find it useful, so up to now I’ve kept my collection of scripts publicly available on SkyDrive (go see it if you like).
That’s all fine and dandy but I figured this could be improved. SkyDrive is a file sharing site and whilst it includes a nice code viewer/editor it is not an ideal solution for storing code; code should be stored in a version control system (e.g. Git, TFS, Subversion, etc.). I opted to make my code library available on Github at https://github.com/jamiekt/TSQLCodeLibrary/ because it provides:
- file version history
- ability for anyone else to fork my code library and build upon it to maintain their own code library
- lots of tools necessary for modern code development
and moreover all the cool kids seem to be using Github so I figured I’d give it a bash as well.
The code library exists as a collection of views, functions and stored procedures in an SSDT project. I’m a massive fan of SSDT so there were many reasons for my choosing to do this but the overriding reason was that SSDT provides a single binary (i.e. a dacpac file) containing the entire code library that can be distributed as easily as emailing the file to someone. Deploying a dacpac is pretty simple, which makes it a great method for sharing T-SQL code.
What’s in my T-SQL code library?
In this first release, not much. There are only nine objects though I hasten to add that this is only a first release and I have a backlog of stuff that I need to add in there. One of the many advantages of using SSDT is that it makes it easy to add extended properties to describe the objects and the code library includes a view that surfaces all of that extended property information:
How do you install the code library?
Download the two binaries:
and store them together in a folder. Open a command prompt at that folder and type:
"%ProgramFiles(x86)%\Microsoft SQL Server\110\DAC\bin\SqlPackage.exe"
(replacing <your_sql_instance> with the name of the SQL Server instance where you want to create the code library and <prefered_database_name> with whatever you want the database to be called. Get rid of the line feeds as well, they are just used here for clarity)
This will create a SQL Server database containing my code library:
If any of the code in my code library proves useful to you then that’s great however my wish here is that some of you other folk out there feel motivated to share your own code in a similar manner. If you do so please post a comment below and let me know.
Yesterday on Twitter Ryan Desmond asked Is there a good read for #SSDT regarding deploying changes via schema compare vs solution deployment?
I don’t know of any article that covers this so in this blog post I offer my opinion on the subject.
First some background. When building databases offline using the database project type (.sqlproj) in SSDT you have two options for deploying the DDL code in your project into a physical database:
- Publish
- Schema Compare
Under the covers both do the same basic operation: build a dacpac from your project, compare it to the target database, build a script that will make the requisite changes to the target database, execute that script.
Ryan was asking which of these one should use. I suggested that publishing was a better option and here are two reasons why:
- Publish will include your pre and post deployment scripts as well whereas Schema Compare will not. (And if your retort is that you cannot run those scripts more than once then you’re doing it wrong, rewrite them.)
- If the debug target for your project is configured correctly then a publish operation can be triggered simply by pressing F5. That’s massively more productive than the point-and-click nature of Schema Compare. It’s even better if you have multiple SSDT projects in your solution as you can publish all of them with a single key stroke.
Does anyone out there have a different opinion? Let me know in the comments.