THE SQL Server Blog Spot on the Web


Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

  • Test for a warning message using Pester

    Here’s a little ditty that’s worth throwing out (if nothing else so that I can find it later). If you’re not using Pester to test your PowerShell code then this post probably isn’t for you; if you’re not using PowerShell, stop reading now.

    I wanted to write a Pester test that tested whether or not a piece of code threw a warning. I discovered from Understanding Streams, Redirection, and Write-Host in PowerShell that it is possible to redirect stuff in the PowerShell Warning (or Verbose or Debug) stream to the output stream. Here’s an example of doing just that:

    "Here is a message that I will send to the warning stream then redirect back to the output stream" | Write-Warning 3>&1 | %{$_.message}


    The “3>&1” part says “take the contents of stream #3 (i.e. the Warning stream) and send it to stream #1 (i.e. the Output stream)”. Once it’s in the output stream you can operate upon it as normal.

    In my Pester test I wanted to know whether or not a particular piece of code returned a warning matching a given string. Using the above technique it’s simply:

    ((Some-Code-That-Is-Getting-Tested) 3>&1) -match "Desired warning message" | Should Be $true
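
    Folding all of that into a complete Pester test might look something like the following sketch (Invoke-Thing is a made-up stand-in for whatever function you are actually testing):

    ```powershell
    # A hypothetical function under test that emits a warning
    function Invoke-Thing {
        Write-Warning "Desired warning message"
    }

    Describe "Invoke-Thing" {
        It "emits the expected warning" {
            # Redirect the Warning stream (#3) to the Output stream (#1) so the
            # warning record can be captured and matched like normal output
            $warnings = Invoke-Thing 3>&1
            ($warnings -match "Desired warning message") | Should Be $true
        }
    }
    ```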

    Cool stuff! I love Pester.


  • New OneDrive API, possibly unified with OneDrive for Business

    Microsoft offers two similar, but separate, services containing the OneDrive moniker. There’s OneDrive (which is free) and OneDrive for Business (which isn’t). Strangely (as my ex-colleague Mark Wilson points out at OneDrive for Business: lots of cloud storage; terrible sync client) the free offering is much better than the paid-for one.

    Microsoft state in their blog post Taking the Next Step in Sync for OneDrive on 7th January 2015:

    It was clear that the right approach was to converge to a single sync engine and experience that would be able to provide all of the benefits of the consumer and business service to all customers faster. We decided to start with the consumer sync engine foundation from Windows 7 and Windows 8 and add the right capabilities from the other two engines.

    That’s Microsoft-speak for “Yes, OneDrive for Business is shit so we’re dumping it in favour of the OneDrive stuff”. Good news.

    Back in October 2014 Microsoft announced the availability of a Files API that allowed developers to access files stored in Office 365:

    we’re enhancing the opportunity for developers with Office 365, with new APIs for mail, files, calendar and contacts. These new robust REST-based APIs empower all developers to leverage the more than 400 petabytes of data (as of March 2014) and popular services across Office 365 in any application.

    New Office 365 extensibility for Windows, iOS, Android and web developers

    I asked a couple of the guys that worked on the API whether the files API would work against OneDrive and/or OneDrive for Business. The answer from Chris Johnson was “working on it”:


    That sounded like good news too.

    Yesterday Microsoft announced the availability of The new OneDrive API, which is a RESTful API. Given that they are attempting to unify OneDrive and OneDrive for Business, and also given Chris Johnson’s response to my question in October, I wondered whether this API would cover OneDrive for Business as well, so I decided to check out the new OneDrive API using the API console and found something quite promising. A request to the endpoint that GETs a user’s default drive returns:


    Note the mention of

    "driveType": "consumer"

    That says to me that there will be other driveTypes and so I assume that this API will be used to access OneDrive for Business stuff also. More good news if that’s the case. I tweeted that question but at the time of writing haven’t had an answer.
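
    For illustration, the drive resource that comes back is shaped roughly like this (driveType is the interesting bit; the other field names are as I remember them from the API documentation and the values here are entirely made up):

    ```json
    {
      "id": "0123456789abc",
      "driveType": "consumer",
      "owner": {
        "user": { "id": "0123456789abc", "displayName": "Jamie Thomson" }
      },
      "quota": { "total": 1099511627776, "used": 256938089, "state": "normal" }
    }
    ```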

    Some questions still remain in my mind:

    • What’s the difference between this new OneDrive API and the files API announced back in October? Do we have two teams building essentially the same thing? That sounds like the Microsoft of old so I do hope not.
    • Are we ever going to get a RESTful API that will enable us to push data into an Excel workbook that is stored on OneDrive? Chris Webb and I have been asking for this for five years now. Again I asked the question on Twitter and am still hoping for a response. I do have high hopes that the new Power BI REST API will sate my needs here (right now I wouldn’t know because if one isn’t in the US then one cannot access it).


  • Response to “SSDT limitations”

    A commenter on my blog post Considerations when starting a new SSDT database project asked me to comment on his blog post SSDT Limitations. I wrote a response but when I tried to post it I got told it was too long:


    so I’m posting my response here instead. I’d suggest taking a read of the commenter’s post before reading on.

    "SSDT does not handle multiple file groups"
    I haven't experienced this. I'm pretty sure I've used SSDT to deploy multiple filegroups. Can you clarify what is meant by "does not handle"?

    "SSDT wants to control everything"
    I'd restate that as "SSDT gives you the option to control everything". If you don't want to put all your database objects into your SSDT project, then don't.

    "Production will not look like UAT. For SSDT, everything looks the same."
    Yes, this is a problem. SSDT espouses a mantra of "build once, deploy anywhere" but this breaks down if you want something to be different on each environment - security is the obvious thing here. My approach is to specify roles in the database project that have permissions assigned to them and then later (perhaps in a post-deployment script) add users into those roles. I describe this more fully here: A strategy for managing security for different environments using the Database Development Tools in Visual Studio 2010

    "This means storing in version control production:
    -Logins & passwords
    -app role passwords"
    If you're storing passwords in source control then that must mean you're using SQL authentication instead of Windows authentication - that's your decision and I don't see how that is a shortcoming of SSDT. If you don't want to store them in source control - don't. SSDT has mechanisms that allow you to specify values at deployment time (search for SSDT sqlcmd variables). Bottom line, if you're storing passwords in source control then more fool you - nothing in SSDT forces you to do this.
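
    To sketch the sqlcmd-variable approach out: declare a variable in the project, refer to it in (say) a post-deployment script, and supply the actual value at deployment time. Using SqlPackage.exe that looks something like this (the paths, names and the AppUserPassword variable are all made up for illustration):

    ```powershell
    # Publish the dacpac, supplying the secret at deployment time via a sqlcmd
    # variable - the password never needs to be stored in source control
    & "C:\Program Files (x86)\Microsoft SQL Server\110\DAC\bin\SqlPackage.exe" `
        /Action:Publish `
        /SourceFile:"MyDatabase.dacpac" `
        /TargetServerName:"MYSERVER" `
        /TargetDatabaseName:"MyDatabase" `
        /v:AppUserPassword="$env:APP_USER_PASSWORD"
    ```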

    "It is generally preferred to at least have the option to inspect what will be deployed"
    Correct. And that is why (as you pointed out) SSDT has "the option of either publishing directly to the database or generating a script to run for the deployment."

    "Unfortunately for SSDT the upgrade generates a single file. In practice this means is that if the database upgrade fails for any reason recovery is rather complex. You'll need to work out where the upgrade script failed and then what action you will take from there."
    True. There are things you can do to mitigate this:
    -Test your deployments on a copy of your production database first
    -Do small, frequent releases rather than the big bang approach of infrequently deploying large, monolithic sets of lots of changes

    "The product feels rather like a beta release, bringing even a simple project into production will expose some pretty serious limitation."
    Beta release (to me) implies that it doesn't behave the way it's been designed to. I put it to you that it behaves exactly the way it's been designed to, it just so happens that you don't like that behaviour. Fair enough if that's the case, you're not alone in that regard. There are other options for doing deployments if you don't like the approach that SSDT espouses. Perhaps look at migrations, or the old-fashioned method of manually writing change scripts.

    "However the largest issue is more fundamental: the approach where you are expected to trust the magic box to generate a valid upgrade script."
    Yes, you do have to put your trust in SSDT. If you're not happy doing so, don't do it, or just use the option to generate the deployment script and not run it. I personally am happier putting my trust in a repeatable algorithm rather than human beings who tend to be error prone. That's just me, your mileage may vary.

    "It is very easy to generate scripts that will fail to upgrade the database successfully."
    There are certain changes that SSDT is not easily able to cater for (e.g. adding a non-nullable column to a table that contains data) however I'm not sure that constitutes "very easy to generate failing scripts". Personally I think SSDT is better at generating working deployment scripts than a human being is because it's quicker and less error prone. As I've alluded, it's imperative that you test your deployments before pushing to a production database.

    You asked me to comment and I have done. Clearly we have a difference of opinion about SSDT and that's OK. I hope you can find a way around your complaints because I honestly believe SSDT is the best tool out there for managing SQL Server deployments - I hope you come to the same conclusion one day.


  • Dumb querying, Impala vs Hive

    I alluded yesterday that I wouldn’t be posting much on this blog for the foreseeable future but I’ve come across something today that was quite interesting so I’m already back.

    I’m using Hadoop, with Hive (version 0.12) to query the contents of some tables. Here’s the output from issuing “SELECT COUNT(*) FROM tablename”:


    Your eyes don’t deceive you, that’s 223 seconds to determine that this table has 0 rows in it. That staggered me. A colleague suggested I try the same query using Impala instead:


    Same result, but this time it came back in under 1 second.

    What’s the moral here? I’m not really sure. I won’t go out on a limb and say “Impala is quicker” because all I’ve proved is that it’s quicker to count the number of rows in an empty table, which is a fairly pointless exercise; Hive is designed for running operations in a batch manner over massive data sets. Nonetheless I think it’s interesting and I suspect some of my buddies in the SQL Server community (whom I suspect will be the only ones reading this) will find it interesting also. Note that Hive v0.12 is old (it came out in October 2013) and also that the Stinger initiative is in full flow, which promises much, much faster response times to Hive queries like these:



  • New Year, new beginnings

    2014 was a year of great change in my personal life and 2015 has begun with a significant change in my work life: I have decided to end my 5-year sojourn into the world of freelancing by taking up a permanent role at Dunnhumby. This is a significant move for me and I wanted to explain why I’ve done it (a number of folks have been asking) and to note the implications.

    As I alluded above, 2014 was a very turbulent year and one that I am glad to put behind me. My experiences in that year have caused me to re-evaluate my priorities in life and I have taken the decision to “go permie” because it means I can spend more time with my daughters, Bonnie and Nicki; Dunnhumby’s office is not a long commute, which means I can be home by 1730 every night – anyone who works in and around London will tell you that that is virtually unheard of, and for me it more than justifies losing the obvious benefits of being a freelancer.

    Working at Dunnhumby is a great move for me too. For 15 years I’ve built a career in the ETL/Data Integration space where the systems I worked on were all about deriving value from data yet were secondary to the client’s core business. At Dunnhumby data integration is their business and that excites me; I’m going to be Lead Developer on a core system that feeds data to some of the world’s largest retailers (with a little research you can probably guess one of them) so there is a tangible opportunity to contribute to the bottom line. Moreover, it’s refreshing to go into a meeting where everyone is already talking your language; when shifting, aggregating and managing data is what the company does you invariably find that everyone you speak to has more than a passing appreciation of what you do – that’s a welcome change for me. Can you name another company where someone in HR knows what ETL stands for?

    There will be opportunities to push our solution into the cloud and anyone who knows me in a work capacity knows that that is an area that excites me. My first aim is to instil my beliefs about continuous integration, unit testing, failing fast, on-demand provisioning and other associated buzzwords into my new team and then lead them on a journey to a truly world class solution. Check back in a few years to see how I did!

    One negative aspect of the new role is that my 15 year association with SQL Server has reached a hiatus. The system I’m working on is built on Hadoop, there’s no SQL Server in sight and that saddens me. I’ve loved every minute of my involvement with SQL Server and have met some wonderful people on that journey, some of whom have become very close friends. On the positive side my SQL Server experience is not going to go to waste as I’ll still be writing SQL day-in, day-out (on Hive and Impala) and the principles that I stand by when I build ETL solutions apply equally in a non-SQL Server landscape. I’m looking forward to learning more about how the other half lives.

    Blogging about SQL Server and associated interests has been an intrinsic part of my life for the last ten years and in many ways has come to define who I am, so I wonder now how it will continue. I admit that my desire to blog has waned somewhat since the heady days of 2006 and 2007, when I was averaging almost one blog post every two days, but I’m hoping I still have that spark that persuades me to commit my learnings to writing from time to time. I’ll also keep answering questions on the SSDT forum as I’m still a massive fan of that toolset and believe that its differentiating value to SQL Server hasn’t been realised as it should be.

    Not much else to say on this subject really. See you on the other side. Or on Twitter, you can’t keep me quiet on there.


  • Parallel Foreach loops, one reason to use Powershell Workflow instead of SSIS

    Lately I’ve been using Azure Automation, a service on Azure for running Powershell Workflow scripts, and as such I’ve been diving deep into Powershell Workflow. What’s Powershell Workflow? I’m glad you asked:

    A workflow is a sequence of programmed, connected steps that perform long-running tasks or require the coordination of multiple steps across multiple devices or managed nodes. Windows PowerShell Workflow lets IT pros and developers author sequences of multi-device management activities, or single tasks within a workflow, as workflows. By design, workflows can be long-running, repeatable, frequent, parallelizable, interruptible, stoppable, and restartable. They can be suspended and resumed; they can also continue after an unexpected interruption, such as a network outage or computer restart.
    Getting Started with Windows PowerShell Workflow

    So Powershell Workflow does exactly what it says on the tin, it’s a workflow engine. Those of you that, like me, have been knee deep in SQL Server Integration Services (SSIS) for most of the past decade will know that SSIS too has its own workflow engine – it’s got a different name, Control Flow, but it’s still a workflow engine. One frequent ask of SSIS’s Control Flow is a Parallel ForEach Loop but there seems little hope that we’re going to get one (raised on Connect in March 2014, closed as Won’t Fix 5 months later) without going to 3rd party solutions like Cozyroc’s Parallel Loop Task.

    As the title of this post has already told you, Powershell Workflow has a big advantage over SSIS: it includes a Parallel Foreach Loop out-of-the-box and you can read all about it at about_Foreach-Parallel. Or type “help about_Foreach-Parallel” into your nearest Powershell window. From there:

    The Parallel parameter of the ForEach keyword runs the commands in a ForEach script block once for each item in a specified collection.
    The items in the collection, such as a disk in a collection of disks, are processed in parallel. The commands in the script block run sequentially on each item in the collection.

    That’s fairly self-explanatory. If you’re interested to know more I’ve put together a cool little Powershell Workflow demo that makes calls to a URI, first serially (using a conventional Foreach Loop) then parallelly (which isn’t, I’m pretty sure, a real word but I’m using it anyway) and hosted it as a GitHub gist: Parallel-vs-Serial.ps1. You can simply copy-paste the script into Powershell ISE, hit F5 and you’re away. On my machine the serial calls completed in 19 seconds, the parallel calls in 13 seconds. No brainer!
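
    If you just want the shape of the thing without clicking through, here’s a minimal sketch that uses Start-Sleep in place of the web calls (wrap the call in Measure-Command to see the timings for yourself):

    ```powershell
    workflow Invoke-ParallelDemo
    {
        # Serial: each sleep completes before the next starts, ~6 seconds total
        foreach ($i in 1..3) {
            Start-Sleep -Seconds 2
        }

        # Parallel: the three sleeps run concurrently, ~2 seconds total
        foreach -parallel ($i in 1..3) {
            Start-Sleep -Seconds 2
        }
    }

    Measure-Command { Invoke-ParallelDemo }
    ```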

    Take a look if you have 5 minutes. This is good stuff.



  • Learn from me about SSDT in London, February 2015

    Microsoft released SQL Server Data Tools (SSDT) along with SQL Server 2012 in the Spring of 2012. Since then I’ve noticed an upward tick in both the number of organisations that are using SSDT and the number of questions that are getting asked about it on forums. There is some confusion about what SSDT actually is (Microsoft hasn’t helped there), why people should be using SSDT and, most importantly, how to make best use of it. If you want to know more then a good place to start is my blog series 10 days of SSDT or Get to Know SQL Server 2012's SQL Server Data Tools on

    It’s clear that people want to learn more about SSDT, so if those articles don’t satiate you, know that I have joined forces with Technitrain to offer a 2-day training course, in London, in February 2015 called Introduction to SQL Server Data Tools.



    The course will cover:

    Day 1

    • SSDT Positioning
    • IDE Tour
    • Connected Database Development
    • Declarative, offline, database development
    • Publishing
    • Continuous Integration (CI) and Continuous Deployment

    Day 2

    • Data Publishing
    • Refactoring
    • Database unit testing
    • References and composite projects
    • Database Drift
    • Code analysis

    If this sounds like your bag then please sign up on the Technitrain website. I ran this course for Technitrain in March 2014, after which 100% of attendees rated the trainer and the course content as outstanding. 100% also rated the course overall as outstanding.


  • Archiving Azure Automation logs to Azure BLOB Storage

    For the past few months at work I’ve been diving headlong into a service on Microsoft Azure called Azure Automation which is, to all intents and purposes, a service for hosting and running Powershell scripts (properly termed “runbooks”). I’ve got a lot to say and a lot to share about Azure Automation and this is the first such post.

    Each time Azure Automation runs a runbook (a running instance of a runbook is termed a job) it stores a log of that job, however that log isn’t readily available as a log file on a file system as one might be used to. There are two ways to access the log: either via the Azure Management Portal or programmatically using Powershell cmdlets. Both are handy, however I wanted the logs available as a file in Azure BLOB Storage so that they could be easily viewed and archived (in the future we’ll be building Hive tables over the top of the logs – remind me to blog about that later once it’s done). I have written a runbook called Move-AzureAutomationRunbookLogs that will archive logs for a given runbook and for a given time period into a given Azure BLOB Storage account and I’ve made it available on the Azure Automation gallery:


    It does pretty much what it says on the tin so if you’re interested hit the link and check it out. For those that are simply interested in reading the code I’ve provided a screenshot of it below too (suggestions for how to copy-paste Powershell code into Live Writer so that it keeps its formatting are welcome!!). There’s a screenshot of the output too if that’s the sort of thing that floats your boat.
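
    For a flavour of the approach, the guts of the runbook are along these lines (a sketch only – the cmdlet parameters are as I recall them from the current Azure module, and the account, container and folder names are made up):

    ```powershell
    # Grab the jobs for a runbook, fetch each job's output, write it to BLOB storage
    $jobs = Get-AzureAutomationJob -AutomationAccountName "MyAutomationAccount" `
                                   -RunbookName "MyRunbook"
    $ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" `
                                   -StorageAccountKey $storageKey
    foreach ($job in $jobs)
    {
        $output = Get-AzureAutomationJobOutput -AutomationAccountName "MyAutomationAccount" `
                                               -Id $job.Id -Stream Any
        $file = Join-Path $env:TEMP "$($job.Id).txt"
        $output | Out-File $file
        Set-AzureStorageBlobContent -File $file -Container "runbooklogs" `
                                    -Blob "joboutput/MyRunbook/$($job.Id).txt" -Context $ctx
    }
    ```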

    After this archiving runbook has run you’ll find two folders under the root folder that you specified:


    and inside each of those is a folder for each runbook whose logs you are collecting (the folder contains a file per job). The “job” folder contains a short summary of each job however the “joboutput” folder contains the good stuff – the actual output from the run.

    One last point, if you’re going to use this runbook I highly recommend doing so in conjunction with a technique outlined by Eamon O’Reilly at Monitoring Azure Services and External Systems with Azure Automation. Eamon’s post describes a technique for running operations such as these on a schedule and it’s something that I’ve found to be very useful indeed.

    Hope this is useful!




  • An appeal for SSDT developers – Red Gate want to hear from you

    I've been contacted by Filippa Vajda from Red Gate Software who wants me to pass on the following:

    I’m a UX Specialist at Red Gate, working on a plugin for Visual Studio.

    I need some help in finding some people for some SSDT research. We really need to speak to some SSDT users to have a 30 minute – 1 hour max. chat on a few things we’ll be working on.

    I was wondering if you’d be able to either point me at some people to contact, or put some sort of appeal on your blog for any SSDT users interested in helping us out and getting a £25 or $/euro equivalent as a thank you?

    Hence this blog post. If you’d like to speak to Filippa then simply reply to this blog post with some contact details and I’m sure Filippa will be in touch. Hopefully the opportunity to provide feedback on a future product should be incentive enough; if not, the 25 notes isn’t to be sniffed at.


  • Visual Studio is free for developers. How about SQL Server?

    Big announcements for developers in the Microsoft ecosystem today, Visual Studio is now free to folks that like to write code outside of enterprises (i.e. students, open source devs, home tinkerers etc…)

    Visual Studio Community 2013 is a new, free and fully featured edition of Visual Studio, available today, with access to the full Visual Studio extensibility ecosystem and support for targeting any platform, from devices and desktop to web and cloud services.

    As usual though, the most maligned developers in the Microsoft ecosystem, SQL Server devs, have been forgotten. There’s no full-featured version of SQL Server available for them, is there? Add this to the ever-growing list of evidence that Microsoft doesn’t give a rat’s ass about the SQL Server dev community any more:

    • T-SQL is a decrepit, sad, excuse for a development language with only two* significant changes (TRY…CATCH, Windowing functions) this century
    • At Microsoft’s *developer* conference earlier this year there were 185 sessions. How many of those were devoted to SQL Server? One! (OK, two).
    • Archaic, buggy, IDEs for SSIS & SSRS
    • Ubiquitous “Won’t fix” responses on Connect

    Here’s a suggestion for you Microsoft. Throw us a bone and make SQL Server Developer Edition free too. Doing so isn’t going to save the community wads of cash (it can be bought today for only $59.95 on the Microsoft Store, I’m sure that doesn’t make up much of the double-digit annual growth that they’re currently enjoying) but that’s not the point, it would send out a message. It tells us in the SQL Server community that Microsoft still cares about us, that they still want us to use their stuff. That we’re not a forgotten breed. As a SQL Server developer I haven’t felt like that for quite some time.

    A pipe dream? Let’s see.


    *Just my opinion. Feel free to disagree in the comments below.

  • Hey, Windows 10, fix this!

    So I hear that in Windows 10 they’re making improvements to the DOS window. You’ll be able to resize it now. And use Shift plus the arrow keys to select text. That’s great. Stunning in fact. Now here are some more things they might want to tackle at the same time.

    Make the environment variable editor bigger

    Would it really be a great hardship to put a drag handle onto this thing?


    Make PATH editing work. All the time.

    A 1024 character limit? Really? Thanks for completely obliterating my %PATH% in the meantime!



    I don’t think I’m asking for the world. Am I?


    If anyone has installed Windows 10 feel free to ask them to fix this. Thank you kindly.


  • Passing credentials between Azure Automation runbooks

    Update: Turns out there's an even easier way of achieving this than the method I've described in this blog post. Joe Levy explains all in the comments below. 

    I’ve been doing a lot of work lately using Azure Automation to run (what are essentially) Powershell scripts against Azure. Recently the ability for those scripts to authenticate against your Azure subscription using a username and password was provided (see Authenticating to Azure using Azure Active Directory) and it basically involves a call to Get-AutomationPSCredential; here’s a handy screenshot (stolen from the aforementioned blog post) to illustrate:


    That’s all fine and dandy however you may find that you want to modularise your runbooks so that you have lots of smaller discrete code modules rather than one monolithic script, if you do so you’re probably not going to want to make a call to Get-AutomationPSCredential each time (for a start, such calls make it hard to unit test your runbooks) hence you may prefer to pass the credentials between your runbooks instead.

    Here is how I do this. In your calling runbook get the credentials, extract the username and password and pass them to the called runbook:

    workflow CallingRunbook
    {
        $AzureCred = Get-AutomationPSCredential -Name "MyCreds"
        $AzureUserName = $AzureCred.GetNetworkCredential().UserName
        $AzurePassword = $AzureCred.GetNetworkCredential().Password
        CalledRunbook -AzureUserName $AzureUserName -AzurePassword $AzurePassword
    }

    In the called runbook use the passed-in values to authenticate to Azure

    workflow CalledRunbook
    {
        param (
            [string]$AzureUserName,
            [string]$AzurePassword
        )
        $SecurePassword = $AzurePassword | ConvertTo-SecureString -AsPlainText -Force
        $AzureCred = New-Object System.Management.Automation.PSCredential `
                -ArgumentList $AzureUserName, $SecurePassword
        Add-AzureAccount -Credential $AzureCred
    }

    Job done!


    A quick and dirty blog post but one which folks should find useful if they’re using Azure Automation. Feedback welcome.


  • Code Navigation (10 days of SSDT – Day 10)

    SQL Server Data Tools (SSDT) was released in Spring 2012 and I have been using it to build SQL Server databases nearly every day since then. In that time I’ve learnt a lot about how to make best use of the tool and I want to share that experience in a new series of blog posts called “10 days of SSDT”. I shall be publishing a different blog post every day for 10 days, each day revealing something that you may not know about SSDT. I hope you enjoy the series and contribute via the comments!

    One of the aspects of T-SQL development that really grates on me is that the features of both the language and the tooling are light years behind those of our brethren that write in “higher order” languages like C#, VB, Java etc… One of the tooling features that folks who live in Visual Studio take for granted is code navigation; happily that feature is now available in Visual Studio for T-SQL developers when one installs SSDT.

    Be aware that SSDT’s code navigation features only work when you are working offline using SSDT projects.

    There are only two keyboard shortcuts you need to learn to master code navigation (and please do use keyboard shortcuts, don’t wimp out and use the mouse):

    • F12
    • Shift+F12

    Go To Definition

    Go To Definition allows you to position your cursor inside a reference to an object, press F12, and SSDT will show you the definition of that object. Sounds simple, and it is, but it’s wonderfully useful when you get into the swing of using it. The following screenshots are an attempt to convey it.

    Notice here the cursor is positioned inside a reference to a table called [datawarehouse].[DimCurrency]:


    Simply pressing F12 at that point opens the definition of that table:


    Find All References

    Find All References does the opposite of Go To Definition, it shows everywhere that an object is referenced. If you’ve ever been afraid to remove an object from a database because you didn’t know what was using it (let’s face it, we’ve all been there) then this feature can be a lifesaver.

    Take the previous example of table [datawarehouse].[DimCurrency]. If I again position the cursor inside a reference to that object and this time press Shift+F12 SSDT will search through my database model and present a list of all the objects where it is referred to:


    In the screenshot immediately above we see that this table is referred to 51 times. Those references may be in stored procedures, views, functions, triggers, constraints or extended properties. Double-clicking on any of the entries in that list takes you directly to the definition of the referring object.


    That really is it. Two very simple features but ones which can have a massively positive effect on your productivity.

    Hopefully this goes without saying but code navigation does not work if you are referencing objects using dynamic SQL; this is because SSDT sees that dynamic SQL string as just that, a string, rather than a reference to another object in the model. Just bear that in mind.

    Code navigation is top of the list of features that drive my preference for developing in offline SSDT rather than online in SSMS.


    If you want to learn more about SSDT then come along to my training course Introduction to SQL Server Data Tools in association with Technitrain on 23rd/24th February 2015 in London. There is a discount if you register before the end of 2014.

    Earlier posts in this series:

  • A story of scale-out on HDInsight

    Lately I have been doing a lot of work using Microsoft’s Hadoop offering, HDInsight. I suspect a lot of people who read my blog are unfamiliar with what Hadoop actually is so I thought I’d recount a recent test I did that exhibits the scale-out nature of processing data on Hadoop.

    The test used a mapreduce job (written in Java) to process an incoming CSV file and load it into an HBase database on that same cluster (HBase is a NoSQL data store, for the purposes of this blog post that’s all you need to know right now). Mapreduce is a technique for doing data processing on Hadoop’s scale out architecture where incoming data is split according to some expression over the incoming rows and then rows with the same result of that expression are combined using some sort of aggregation. The splitting and aggregation can be done on multiple nodes in the Hadoop cluster and hence I like to refer to mapreduce as distributed GroupBy.
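
    To make the “distributed GroupBy” analogy concrete, here’s a toy single-machine sketch of the same shape in Powershell: the first ForEach-Object is the map (key each row), Group-Object is the shuffle (bring rows with the same key together) and the final ForEach-Object is the reduce (aggregate each group). The data is made up; a real mapreduce job spreads these stages across cluster nodes.

    ```powershell
    # Toy "mapreduce" over CSV-ish rows: sum values per customer
    $rows = "cust1,10", "cust2,5", "cust1,7", "cust3,2", "cust2,1"

    $rows |
        ForEach-Object {                  # map: split each row into key + value
            $id, $value = $_ -split ','
            [pscustomobject]@{ Id = $id; Value = [int]$value }
        } |
        Group-Object Id |                 # shuffle: group rows sharing a key
        ForEach-Object {                  # reduce: aggregate each group
            "{0}: {1}" -f $_.Name, ($_.Group | Measure-Object -Property Value -Sum).Sum
        }
    ```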

    The actual work that our mapreduce job is doing isn’t important, but for your information it’s combining many rows of data pertaining to a single customer into a JSON document and loading that document into HBase. Upon successful completion of the job HBase contained 3,125,000 JSON documents.


    I loaded 1.6GB of raw source data (62.5m rows) into HBase using HDInsight; the quickest I was able to complete this was 34m2s. The key to speeding up our throughput was to (a) use a bigger HDInsight cluster and (b) split our input data into multiple files, thus forcing the processing to be distributed over more map tasks. With more performance tuning (aka scaling out to more HDInsight nodes) I am confident we could get this much lower.

    Note that it is possible to specify the number of tasks that the map phase uses rather than letting Hadoop decide how many it needs; for this test I chose not to specify it. In other words, splitting the incoming data over multiple files is not a necessity, it was just a trick I used to influence the MapReduce job.

    The detail

    I generated a 1.6GB (1,677,562,500B) file containing 62,500,000 rows. On the first run I used an HDInsight cluster that had 2 worker nodes. The entire MapReduce job took ~1hr13m50s; the map phase took ~58m and the reduce phase ~1hr6m (so clearly they overlapped – that is because the reduce phase starts as soon as the first map task completes and, as you will see below, the map tasks completed at wildly different times).

    Even though the reduce phase took longer, it’s actually the map phase that caused the long execution time. To try to pinpoint why it took so long I dove into the logs that Hadoop produces. Unless you tell it otherwise Hadoop determines how many tasks it should spin up in the map phase, and in this case it determined it needed 4 map tasks:


    I’ve highlighted the elapsed times for each; note that the 4th is much lower. This explains why the reduce phase took so long: it started as soon as the first map task completed but then had to wait ~52 minutes until all the other map tasks were complete.

    Each one of those tasks has its own task log and from those task logs I found the following information:

    Processing split: wasb://
    Processing split: wasb://
    Processing split: wasb://
    Processing split: wasb://

    The numbers at the end of each of those lines represent the byte range that each task is processing (the first one starts from byte 0, as you would expect). Notice the last one (1610612736+66949764): it starts from byte 1610612736 and processes the next 66949764 bytes. Given that it is the 4th task of 4, it shouldn’t surprise you that adding those two numbers together gives 1677562500, exactly the size of the input file. In other words, the logs tell us exactly what we should expect: the input data has been split over the 4 tasks that Hadoop deemed necessary to process this file.
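The arithmetic checks out: three full splits of 536,870,912B each (that is 512MB, which happened to be the split size on this cluster; the split size depends on your Hadoop configuration) plus a remainder account for the whole file. A few lines of Python confirm it:

```python
split_size = 536_870_912      # bytes per full split (512 MB)
file_size = 1_677_562_500     # total input file size in bytes

# How many full splits fit in the file, and what is left over?
full_splits, remainder = divmod(file_size, split_size)
print(full_splits, remainder)  # 3 66949764

# The 4th task's log line read "1610612736+66949764":
assert full_splits * split_size == 1_610_612_736
assert remainder == 66_949_764
assert 1_610_612_736 + 66_949_764 == file_size
```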

    Notice that the first 3 tasks each processed 536,870,912B while the 4th processed only 66,949,764B, about 12% of that. This explains why the 4th task completed so much quicker than the others. The data has not been split evenly, and that’s a problem: because one map task finished far ahead of the rest, the reduce phase had to sit around waiting for the remaining data – the uneven split of the data caused inefficient use of our cluster.

    We can infer some things from this:

    • The less data a task has to process, the less time that task takes to complete (pretty obvious)
    • If we can increase the number of tasks, the data will be distributed more uniformly over those tasks and they should complete much quicker (and in roughly the same amount of time) due to each having less data to process and less competition for resources.

    Thus I ran the same test again, changing only one variable: the number of nodes in the HDInsight cluster, which I increased from 2 to 20. I hoped that this would increase the number of map tasks. Unfortunately the job failed (my fault – I left some output files lying around from a previous run and that caused a failure), however it got as far as completing the map phase, which is pretty much all I cared about:


    As you can see there were still only 4 tasks, so we didn’t achieve more tasks and thus we didn’t speed the job up. That’s not good. Oddly, the tasks actually took longer; you would think the same number of tasks (4) distributed over a greater number of nodes (20) would be slightly quicker due to less resource contention. It’s a bit of a weird one and I can’t explain it right now.

    I wondered if splitting the input file into lots of smaller files would make a difference, so I split it into 20 equally sized smaller files and ran the job again on the 2-node cluster. This time we got 20 tasks:
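If you want to try the same trick, the split has to respect row boundaries (a byte-exact split would cut a CSV row in half). Here is a Python sketch of that (the function name and output paths are my own invention for illustration; it reads the whole file into memory, which is fine for a demo but you would want to stream a 1.6GB file):

```python
import os

def split_csv(path, n_parts, out_dir="splits"):
    """Split a CSV into n roughly equal files on row boundaries."""
    os.makedirs(out_dir, exist_ok=True)
    with open(path) as f:
        rows = f.readlines()          # demo only: loads the whole file
    per_part = -(-len(rows) // n_parts)  # ceiling division
    out_paths = []
    for i in range(n_parts):
        chunk = rows[i * per_part:(i + 1) * per_part]
        out_path = os.path.join(out_dir, f"part_{i:02d}.csv")
        with open(out_path, "w") as out:
            out.writelines(chunk)     # each part keeps whole rows
        out_paths.append(out_path)
    return out_paths
```

Each resulting file becomes (at least) one input split, so a job over 20 files gets at least 20 map tasks.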


    Which is great, however the map phase failed due to out-of-memory issues:


    So I uploaded those same 20 files to the 20-node cluster and ran the job again. We got 20 tasks in the map phase and, thankfully, this time they all completed successfully. The entire job (map + reduce) completed in 34m2s – less than half the time taken on the 2-node cluster loading the single file; the map phase completed in 10m34s and the reduce phase took 24m46s. The overlap there is only 1m18s, because the durations of the map tasks were more uniformly distributed thanks to the data being separated over more tasks. Here are the 20 map tasks with durations:
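Those timings are self-consistent: had the two phases run strictly back to back the job would have taken 35m20s, so the difference from the observed 34m2s is the overlap. Worked through in Python:

```python
def to_seconds(m, s):
    return m * 60 + s

map_phase = to_seconds(10, 34)     # 10m34s
reduce_phase = to_seconds(24, 46)  # 24m46s
whole_job = to_seconds(34, 2)      # 34m2s

# Sum of phases minus observed total = how long the phases overlapped.
overlap = map_phase + reduce_phase - whole_job
print(overlap)  # 78 seconds, i.e. 1m18s
```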


    That has been a bit of a braindump but I figured it might be interesting to anyone starting out on the path of doing data processing on Hadoop. Please post questions in the comments.


    UPDATE: Thanks to Paul Lambert I’ve found this very useful primer on attempting to artificially set the number of mappers and reducers: HowManyMapsAndReduces

  • Declarative, Model Driven Development (10 days of SSDT – day 9)

    SQL Server Data Tools (SSDT) was released in Spring 2012 and I have been using it to build SQL Server databases nearly every day since then. In that time I’ve learnt a lot about how to make best use of the tool and I want to share that experience in a new series of blog posts called “10 days of SSDT”. I shall be publishing a different blog post every day for 10 days, each day revealing something that you may not know about SSDT. I hope you enjoy the series and contribute via the comments!

    I’ve used the word “model” a few times in this series already without expanding on it too much, but it’s worth doing because it’s actually fundamental to how SSDT works; a model underpins everything that is done in SSDT.

    In the context of SSDT a model is a representation of a database, pure and simple. That model ostensibly exists in two places: in memory, when one has an SSDT project open or is working in connected development mode; and in a .dacpac file, which gets produced when an SSDT project is built. It may help to think of a .dacpac as a representation of the model stored in a file. (If you want to learn more about dacpacs specifically, refer to my January 2014 blog post Dacpac braindump - What is a dacpac?)

    To illustrate the point, if one takes a look inside a .dacpac (which is actually just a zip file with a different extension) one will notice a file called model.xml:


    Open that up and you will see that it’s just a definition of lots of database objects represented as XML:
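Because a .dacpac really is just a zip archive, nothing dacpac-specific is needed to poke at one. This Python sketch builds a toy stand-in archive in memory and reads model.xml back out (the XML content here is made up for illustration; a real model.xml is generated by building an SSDT project):

```python
import io
import zipfile

# Build a stand-in "dacpac" in memory: just a zip containing a model.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "model.xml",
        '<DataSchemaModel><Element Type="SqlTable" '
        'Name="[dbo].[Customer]" /></DataSchemaModel>',
    )

# Reading it back needs only ordinary zip handling.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    model = z.read("model.xml").decode()

print(names)   # ['model.xml']
print(model)
```

Point the same two `with` blocks at a real .dacpac path and you will see its actual model.xml alongside the other metadata files in the package.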


    There is an important point to make here. The model only states the definition of an object as it has been declared; it does not state how SSDT should go about creating that object (there are no ALTER TABLE statements here, for example). This notion of only defining the structure of an object, not how to create it, is known as a declarative approach to database development. One is declaring the intended state of a database, nothing else. The beauty of SSDT is that the tool can take your model and from it work out what it needs to do to get a given physical database into the same state as declared in the model. That is the fundamental underpinning of SSDT and an important concept to grasp if one intends to use it.
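The "work out what it needs to do" step can be illustrated with a toy schema diff. To be clear, this is in no way SSDT's actual engine (which handles vastly more than columns); it is just a sketch of the declarative idea: compare declared state against actual state and derive the statements that close the gap:

```python
def diff_table(declared_cols, actual_cols, table="dbo.Customer"):
    """Emit the statements needed to move a table from its actual
    column set to its declared column set (illustrative only)."""
    stmts = []
    for col, col_type in declared_cols.items():
        if col not in actual_cols:
            stmts.append(f"ALTER TABLE {table} ADD {col} {col_type};")
        elif actual_cols[col] != col_type:
            stmts.append(f"ALTER TABLE {table} ALTER COLUMN {col} {col_type};")
    for col in actual_cols:
        if col not in declared_cols:
            stmts.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    return stmts

declared = {"Id": "INT NOT NULL", "Email": "NVARCHAR(256)"}
actual = {"Id": "INT NOT NULL", "Phone": "VARCHAR(20)"}
for stmt in diff_table(declared, actual):
    print(stmt)
```

Note that the declared model contains no ALTER statements at all; the ALTERs fall out of the comparison, which is exactly the declarative principle described above.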


    If you want to learn more about SSDT then come along to my training course Introduction to SQL Server Data Tools in association with Technitrain on 23rd/24th February 2015 in London. There is a discount if you register before the end of 2014.

    Earlier posts in this series:
