THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SSIS Junkie

This is the blog of Jamie Thomson, a freelance SQL Server developer in London

  • M lives!

    In 2007 Microsoft announced a program of work under the codename Oslo which was intended to be a modelling technology and if you believed the hype this was going to be a very big initiative within Microsoft. By September 2010 Oslo was effectively dead (read blog post Update on SQL Server Modeling CTP (Repository/Modeling Services, "Quadrant" and "M") and if my memory serves me correctly that was quite a lot of anger from customers that had invested time and effort into Oslo. Specifically the three parts of Oslo that were killed were:

    • The Oslo Repository
    • Quadrant – Effectively a query tool for SQL Server that featured an infinite canvas (which I actually really liked)
    • M – a modelling language

    M in particular had a lot of hype behind it and many people were mystified when it was seemingly canned. In the aforementioned blog post the lead architect for Oslo, Don Box, had this to say:

    we created a language codenamed “M” for defining schema, constraints, queries, and transformations. While we used “M” to build the “Oslo” repository and “Quadrant,” there has been significant interest both inside and outside of Microsoft in using “M” for other applications. We are continuing our investment in this technology and will share our plans for productization once they are concrete.

    So there was a promise to keep M alive but since then nothing has been heard of it and I suspect many people (including myself) assumed it had gone the way of the dodo. As it turns out that isn’t the case, M does indeed live on as the query language that underpins the recently released Microsoft Data Explorer for Excel Preview (see my blog post Data Explorer hits full preview for more info). An updated version of Data Explorer was released today (read: A new build of “Data Explorer”, and an Auto Update feature) and one of the new features is something they call Advanced Query Editing, that is the ability to see the underlying query that gets built by the UI. Here’s an example I lifted from my previous post Traversing the Facebook Graph using Data Explorer:

    SNAGHTML2a9e666

    That query language…is M. Here it is in raw copyable form:

    let
        Source = Facebook.Graph("
    https://graph.facebook.com/me/statuses"),
        #"Expand likes" = Table.ExpandTableColumn(Source, "likes", {"name"}, {"likes.name"}),
        #"Expand comments" = Table.ExpandTableColumn(#"Expand likes", "comments", {"message"}, {"comments.message"}),
        HiddenOtherColumns = Table.SelectColumns(#"Expand comments",{"message", "likes.name", "comments.message"})
    in
        HiddenOtherColumns

    At this stage its looking very likely that Data Explorer will come to market which means that Don Box wasn’t lying and one of the most lamented parts of Oslo will do too. Data Explorer is effectively a free query builder for M queries – go and download and have a play.

    @Jamiet

  • sp_ssiscatalog v1.0.2.0 now available for download [SSIS]

    v1.0.2.0 – what’s in it?

    Things have been a bit quiet on the sp_ssiscatalog front since I last blogged about it three months ago in December 2012. Rest-assured development continues apace however and today I’m making available a minor update, v1.0.2.0 which is now available for download from Codeplex. For those that don’t know I describe sp_ssiscatalog as:

    sp_ssiscatalog is a stored procedure that makes it easy to query for information that is strewn around the SSIS Catalog.

    There aren’t too many functional changes in this release, it is more focused on making sp_ssiscatalog easier to use. Back in Documenting sp_ssiscatalog I explained how I was adding documentation to the messages tab of SSMS. Hence as of this new release when you execute sp_ssiscatalog you will see information such as this:

    image

    image

    which I think should be very useful for anyone that wants to use sp_ssiscatalog to its fullest. Even I who wrote the thing and has been using it day-in, day-out for quite some time now can’t remember the names of all the parameters – now I no longer have to!

    Note that you can turn off the display of the documentation using the @show_docs parameter:

    exec sp_ssiscatalog @show_docs=0
    

    If you want to display only the documentation and not actually have sp_ssiscatalog do any querying of the SSIS Catalog its @show_docs_only:

    exec sp_ssiscatalog @show_docs_only=1
    

    If you have any suggestions for future enhancements please put them in the comments below or submit them to the discussions page on the Codeplex site.

    As a reminder, here is the sort of thing you can do with sp_ssiscatalog:

    --Return all failed executions  
    EXEC [dbo].[sp_ssiscatalog] @operation_type='execs',@execs_status_desc='failed'
    
    --Return all executions for a specified folder
    EXEC [dbo].[sp_ssiscatalog] @operation_type='execs',@execs_folder_name='My folder'  
    
    --Return all executions of a specified package in a specified project
    EXEC [dbo].[sp_ssiscatalog] @operation_type='execs',@execs_project_name='My project', @execs_package_name='Pkg.dtsx'

    --Return information about the most recent execution
    EXEC [dbo].[sp_ssiscatalog]

    One last thing, if sp_ssiscatalog is useful to you and you’d like to support future development feel free to donate to my personal beer fund at http://www.paypal.com/cgi-bin/webscr?cmd=_xclick&business=jamie@jamie-thomson.net&item_name=Supporting%20sp_ssiscatalog.


    Installation Instructions

    1. Download the zip file at DB v1.0.2.0. It contains two files, SsisReportingPack.dacpac & SSISDB.dacpac
    2. Unzip to a folder of your choosing
    3. Open a command prompt and change to the directory into which you unzipped the files
    4. Execute:
      • "%PROGRAMFILES(x86)%\Microsoft SQL Server\110\DAC\bin\sqlpackage.exe" /a:Publish /tdn:SsisReportingPack /sf:SSISReportingPack.dacpac /v:SSISDB=SSISDB /tsn:(local)
        (/tsn specifies the target server, change as appropriate. /tdn specifies the database name, you can call it whatever you like.)

    If everything works OK you’ll see something like the following:

    image

    This will (if it doesn’t already exist) create a database called [SsisReportingPack] (or whatever you chose to call it) which contains [dbo].[sp_ssiscatalog].

    image

  • Deployment of client-specific database code using SSDT

    Yesterday I attended SQL Saturday 194 in Exeter for which many people deserve credit but especially the organisers Jonathan Allen & Annette Allen and the sponsors Fusion-IO, Confio Software, Nexus, Big Bang Data Company, Purple Frog Systems, Redgate, idera, Pluralsight, Jurys Inn Exeter & Wrox.

    image image

     


    I gave a talk entitled Declarative Database Development with SSDT; both during the session and then afterwards on Twitter Leo Pasta asked me

     

    image

    would you have any extra references on how to handle client-specific changes to the DB schema

    I promised Leo an answer, hence this blog post.

    If I understand Leo’s question correctly he has an application that is used by multiple clients, he has a separate instance of the application for every client, and the database requires client-specific customisations; luckily for Leo there is a new feature in SSDT that is perfect for this situation. First let’s set up our solution in SSDT:

    image

    We have three database projects:

    • _core - This contains all the objects that are common to all clients. For demonstration purposes it simply contains a table called [Sales]
    • clientFoo – All the objects that are required only by client “Foo”. In this case there is view which aggregates the data in [Sales] by Customer
      • clientBar – All the objects that are required only by client “Bar”. In this case there is view which aggregates the data in [Sales] by Location

      In order that the reference to table [Sales] in the two views can be resolved both of the client-specific projects have a database reference to _core:

      image

      It is those database references that allows objects in clientFoo/clientBar to refer to objects in _core. Now here’s the important bit. When we set up those two database references we must specify that the objects in the referenced project are intended to be in the same database:

      SNAGHTMLa9aab43

      Having objects from one database split over multiple projects is called composite projects. The effect of this is that (by default) whenever clientFoo or clientBar are deployed the objects in _core will get deployed as well. We can see this in the output when we deploy clientFoo:

      image

      Notice how even though we chose not to deploy _core two objects have been created; table [Sales] (from the _core project) and view [vSalesPerCustomer]. This is the new feature in SSDT that I spoke of above – a deployment of a project will (by default) also deploy all the objects in referenced projects where objects in the referenced database are intended to be in the same database (incidentally this functionality replaces the “partial projects” feature from previous incarnations of SSDT).

      Note that you can change this default behaviour in the advanced publish settings by unchecking the “Include composite objects” box:

      SNAGHTMLaac90d1

      Hence, with the “Include Composite Objects” setting turned on, we can deploy both projects clientFoo & clientBar and both will contain table [Sales] plus their own client-specific view:

      image

      which (I hope) is exactly what Leo was after!

      That’s all there is to it. A very very nice new feature of SSDT!

      @Jamiet

    • Weekend reading – Data Explorer, Quandl, Flatmerge and a SQL Saturday app

      Here are some things that have piqued my interest on the interwebs over the last few days.

      Parameterized queries and Security in Data Explorer

      Chris Webb put a post up on the Data Explorer forum asking about parameterizing queries in Data Explorer and Miguel Llopis from the Data Explorer product team replied with some useful information:

      …there is some risk for users to leak information to external sources when doing this, and so we try to prevent this from being done "by default". You can disable this level of protection by clicking the "Fast Combine" button in the Data Explorer ribbon tab. More information about Fast Combine and Privacy Levels can be found in our Help contents: http://office.microsoft.com/en-us/excel-help/privacy-levels-HA104009800.aspx

      Following Miguel’s link shows this information:

      image

      So, default behaviour in Data Explorer is that the user is protected from inadvertently leaking information to 3rd parties.  Its good to know that security has been prevalent thinking within the Data Explorer team however users do need to be aware that this behaviour exists, hence my mentioning it here.

      Quandl – a search engine for datasets

      I have stumbled across a site called Quandl that looks interesting, it bills itself as “Intelligent search for numerical data”. Essentially this is a search engine for finding datasets on the web which should be a useful resource in the emerging world of self service BI.

      I’m writing this on a train so as an example I used Quandl to search for data on UK train journeys and first result was Train movements : 1000 Train-kilometre : Passenger trains : United Kingdom:

      image

      Here we have some raw data pertaining to train movements in the UK from 2004 to 2011. Quandl provides a chart of the data, a link to the source and an indication of the age of the data. It also enables us to download the data and provides Excel, CSV, JSON & XML as choices of data format.

      An interesting idea indeed, Quandl is in its infancy though I shall be keeping a watching brief to see if it turns out to be a success or not.

      Publish your own datasets with Flatmerge

      The aforementioned Chris Webb put me onto this one. Flatmerge is a startup from Michigan, US that enables one to publish their data for public consumption:

      image

      With the FlatMerge data storage platform it's easy to share data in the cloud and use it in other applications. Just upload data and let FlatMerge discover it's actual data types and make the data and metadata available in JSON or XML format through (OData) URL queries.

      Flatmerge are using OData-compliant URI query formats and they tell me that OData output is coming soon:

      We currently support some OData queries. Data/Meta is returned in plain JSON or XML. OData output is coming soon!

      That Flatmerge chose to use OData to publish their data is interesting – I’ve long suspected that greater OData adoption wouldn’t be far away once Excel natively supported it as an external data source and Flatmerge have realised the value in doing this. Flatmerge enables one to publish data to the web, Quandl helps people find data on the web – perhaps these two should go out for coffee sometime Smile

      SQL Saturday app for Windows Phone

      Michael Wells has built a Windows Phone app for SQL Saturday (particularly pertinent for me at the moment as I am on my way home form SQL Saturday 194).

      image

      It provides data about each event, including the all important schedule information.

      I have slightly mixed feelings about this. On the one hand its fantastic to see a community member voluntarily build a great FREE resource for the SQL community – massive credit to Michael for doing this. On the other hand it highlights one of my pet peeves about the current app culture that is prevalent on smartphones – this is an app that you can only use if you have a certain type of phone. The information presented here is valuable and given away for free, why is it hidden behind a gated app store? Should there not be a SQL Saturday website that is optimised for and viewable on any mobile web browser? Better still, its the schedule data here that is most valuable so why not publish that data in a format that allows one to view that schedule in one’s phone/PC/tablet calendar regardless of the type of device they are using? That data format, by the way, is iCalendar which is something that regular readers are probably fed up of me banging on about.

      I hope that doesn’t detract from Michael’s great efforts here; his app is fulfilling an important need, I just happen to think its a shame that that need even exists when there are mechanisms already in place for delivering this data to us in a more efficient matter. On the other hand its hard to argue with the ease at which apps deliver information to us so perhaps I should just quietly climb down off of my soapbox! Comments are welcome!

      @Jamiet

    • New job, new computer, new software installs

      Three days ago I started a new job and put a shout out on Twitter looking for suggestions as to what software bits-n-pieces I as a SQL Server developer should install.

      image

      I got some great replies (click here to see them all) and here is the list of all “stuff” that I then installed (N.B. SQL Server and Visual Studio came pre-installed):

      A lot of those aren’t strictly speaking SQL Server related but, for me, they’re still invaluable.

      Do you have a similar list? Don’t be afraid to share ‘em and put a link below! And let me know if you think I’ve missed anything important.

      @Jamiet

    • Bootstrapping SQL Server bloggers and blog readers with Twitter!

      On 17th December 2009 Aaron Nelson (you may know him as @sqlvariant) had a great idea – he invented the #SQLHelp hashtag; with a little kickstart from Brent Ozar the idea grew and #SQLHelp became a successful QnA channel in the SQL Server community and is today going from strength to strength.

      I’m a great advocate of SQLHelp and not just because it builds bridges between those needing help with the people that are able to provide that help. It is also a great exemplar of the power of Twitter and, more specifically, the power of coalescing open data around a shared interest. As I thought more about this I figured there must be a way that the SQL Server community could further leverage what I think is a nascent opportunity around hashtags and as my mind wandered I thought about Steve Gillmor’s post from 5th May 2009 Rest in Peace, RSS in which he opined that RSS (the syndication technology that bootstrapped the blogging craze in the first decade of this century) should be replaced by Twitter feeds. Here’s a choice quote:

      It’s time to get completely off RSS and switch to Twitter. RSS just doesn’t cut it anymore.

      Steve isn’t averse to putting the cat among the pigeons with his blog posts and in this case I think he has a salient point. Whilst RSS isn’t a consumer technology (i.e. none of my none-techie friends have a clue what it is), Twitter most definitely is. One downside of RSS (in my opinion) is that most blog authors simply publish their outpourings then hope that it gets some Google juice and catches people’s attention. On the other hand, there are still a lot of people that use RSS readers and those people have a problem too – where do they find good bloggers and good blog material?

      So, consider this:

      • Lots of people are blogging great stuff but don’t have a way of telling people about it
      • Lots of people want to learn from great bloggers but might not know where to go and find that material

      Is there an opportunity to use Twitter to build bridges between bloggers and blog readers in a similar manner to how #SQLHelp has done between questioners and answerers? I think there is and that’s when I hit upon an idea – perhaps we as a community could (as https://ifttt.com/ expertly put it) put the internet to work for us.

      Here is my suggestion. If you as a blog author tweet a link to a newly published SQL Server related blog post and use the hashtags

      #sqlserver

      &

      #blogged

      and also a hashtag to indicate the language then that tweet (and the all important link) will be available at https://twitter.com/search?q=sqlserver%20blogged. One can then use Twitter’s ability to make search results available as an RSS feed and subscribe to that RSS feed in one’s RSS reader of choice.

      Is that a good idea? I think it is, but then again its my idea so I would, wouldn’t I? I hope a few people out there will get on board with this initiative (perhaps even blog and tweet about it) and hopefully if it can became a fraction as successful as SQLHelp.


      Call to action for bloggers

      If you as a blogger want to get involved with this initiative then its really very simple. Tweet a link to your SQL Server related blog posts along with a title and the following three hashtags

      • #sqlserver
      • #blogged
      • ISO 639-1 code indicating the language that the blog post is written in

      *ISO 639-1 is a standard for 2-digit language codes. You can view the complete list on the International Standards Organisations (ISOs) website at http://www.loc.gov/standards/iso639-2/php/English_list.php although here are a few to get you on your way:

      • en – English
      • de – German
      • fr – French
      • es – SPanish
      • zh – Chinese

      I would also encourage you to use other hashtags to more specifically define the subject matter as this might make for some interesting analysis later.

      As an example, here is a tweet that I just tweeted for my blog post Obtaining rowcounts when using Composable DML [T-SQL]

      image
      https://twitter.com/jamiet/status/308351607924486144

      Also, please blog about this yourselves (at the very least that gives you an opportunity to add your first tweet to the SQL Server twitter RSS stream).

      Call to action for blog readers

      If you are someone who enjoys reading SQL Server related blog posts wants to get involved in this initiative simply subscribe to the appropriate RSS feed in your RSS reader of choice and watch as (hopefully) great content flows into your RSS reader without you having to lift a finger. Here are a few such URLs:

      Thanks to Dan English for pointing out in the comments that the search URL can be amended to remove retweets.


      That’s all there is to it. Fingers crossed that this initiative catches on because there is a fantastic knowledge sharing opportunity here – let’s put the internet to work for us to make it happen.

      I have one more thing to say, a line that I stole from my ex-colleague Howard van Rooijen, one which I am a great believer in and which I believe is very pertinent here:

      Work smarter, not harder.

      @Jamiet

    • “There is now more data than water!” Err…what?

      I've heard some nonsensical marketing claims in my time but this one takes some beating:

      Digital data is now more vast than all the world’s oceans (there is 2.7 zetabytes of data on the planet versus the 1.37 zettalitres of seawater)
      http://www.microsoft.com/en-us/news/features/2013/feb13/02-13BigDataSolutions.aspx

      I could complain about the folly in comparing discrete and continuous values or in comparing physical volumes with abstract notions of data but I don’t really need to, do I? I do wish marketers wouldn’t sully our industry by coming out with ridiculous, meaningless statements like this.

      Adam Machanic expertly mocked this one with his deadpan:

      image

      I suspect Dilbert would have something similar to say…

      Dilbert.com
    • Traversing the Facebook Graph using Data Explorer

      As I mentioned yesterday Microsoft Data Explorer Preview for Excel has hit the streets and one new feature is a built-in adapter for talking to the Facebook Graph API:

      image

      I’ve been taking it for a spin for today and was quite interested in its capabilities. Even without landing any data into Excel this is still a handy interactive Facebook graph explorer as I’ll demonstrate herein.


      A good place to start is with a list of statuses. I have chosen “me” as the “Username or object id” (because frankly I didn’t know what else to put in there) and “statuses” as the dataset to return:

      image

      After the standard merry authentication dance with Facebook Connect one’s history of Facebook statuses is returned which on its own might make interesting reading:

      SNAGHTMLc6a91d4

      however let’s do something a bit more interesting. I remove the columns I’m not interested in:

      SNAGHTMLc6db2eb

      to leave just my statuses, likes and comments:

      SNAGHTMLc6e96f2

      Notice that the likes and comments fields contain either a hyperlinked “Table” or null, “Table” means we have a set of likes or comments respectively – in other words Data Explorer is letting me browse through the one-to-many relationships inherent in the data and I happen to think that is very cool indeed. For example, selecting a link shows me the likes or comments respectively for the status in question (“Want to hear my Elton John joke? Its a little bit funny…”):

      SNAGHTMLcb762f9

      What would be more interesting would be if I could expand to show all comments against a status and also who made that comment. Turns out we can do just that, first click on this button called “Expand”:

      image

      which displays a list of columns that I want to show in the resultset; I choose “from” & “message”:

      image

      and now I can see all the comments against the relevant status. For all you out there whose heads are in SQL-land, I’ve effectively LEFT JOINed statuses to comments:

      image

      Last step, the comments.from field tells me who left the comment:

      image

      So let’s do a similar trick to show the name of the commenter in my resultset and voila:

      SNAGHTMLc870b67

      we have a list of my statuses along with all the comments and the name of the person that made the comment! We have effectively joined three linked datasets (statuses, comments, person), all inside Excel, and all without writing a scrap of code. Personally I think that’s very cool.


      Now let’s go back to just our list of statuses:

      SNAGHTMLc6e96f2

      Let’s say we now want to count the number of likes and comments on each status and sort the data accordingly. I use the expand feature as before except this time I choose to aggregate rather than expand:

      image

      That gives me the number of likes and comments per status which I can then sort however I choose:

      image

      and with a couple of quick aesthetic column renames I have all my statuses with number of likes and comments in descending order of popularity (as measured by the count of likes):

      SNAGHTMLca00a5e

      I can then pull that resultset into Excel and chart it to see if there is any correlation between number of likes and number of comments:

      image

      Imagine aggregating over something more interesting than Facebook statuses and I think you can realise the potential here.


      Hopefully this has served as a useful introduction to Data Explorer’s ability to query Facebook data. What interests me most about this is not the ability to query Facebook per se, its the ability to query and traverse over graph data using a tool that isn’t aimed at developers. Now imagine being being able to do the same with the growing corpus of linked data graphs that exist on the web in RDF format and you start to realise the potential here; I truly hope that the ability to query RDF data is high on the agenda of the Data Explorer team*.

      If you want to have a go with Data Explorer yourself then go ahead and download it from http://www.microsoft.com/en-us/download/details.aspx?id=36803; its free, although you will need Excel 2010 or Excel 2013 installed.

      @Jamiet 

      * For what its worth I’ve had a watching brief on RDF, the Semantic Web and Linked Data since way back in November 2004 and I’ve been waiting since then for it to hit the big time. I don’t think that’s going to happen any time soon but having support from tools like Data Explorer would be a step in the right direction.

    • Data Explorer hits full preview

      A short recap

      At the PASS Summit 2011 a project that existed as part of the now-defunct SQL Azure Labs was announced – Data Explorer. I posted some initial thoughts at Thoughts on Data Explorer and continued to post some walkthroughs and feedback. Data Explorer was an interesting technology to me which in October 2011 I described as:

      Data Explorer is an ETL tool and given my obvious SSIS affiliations that makes it very interesting to me. That it runs as a cloud service and will be available to non-developers only makes it more intriguing and I can't wait until Data Explorer becomes available for us to tinker with later this year.

      That being said Data Explorer always felt uncomfortably as if it was an interesting piece of tech trying to find both a problem to solve and a place to live.

      Now in preview

      Today the second of those conundrums has been answered by the release of a new public preview; Data Explorer is going to live in Excel and its official title is now “Microsoft Data Explorer Preview for Excel” (I still dislike the name by the way):

      SNAGHTML6084ac5

      Here’s a video that describes Data Explorer’s pitch today:

      Microsoft Data Explorer Preview for Excel

      If the video doesn’t render properly check it out on YouTube instead.

      From the video:

      Data Explorer enhances the self-service experience in Excel by simplifying data discovery and access to a broad range of public and enterprise data sources enabling richer insights from data that has traditionally been difficult for users to get at.
      (emphasis is mine)

      In short, Data Explorer is targeted at Excel users (Microsoft calls them “Information Workers”) and now looks like this:

      image
      The data in this screenshot was imported from http://en.wikipedia.org/wiki/Super_league.

      which (I think) is neat and tidy and pretty intuitive whereas a year ago it looked like this:



      and didn’t seemingly know what it wanted to be other than yet another tool for developers. It didn’t have an identity and it didn’t have a raison d’etre. Today’s preview release goes some way to solving those things.

      Online Search

      One very interesting new feature is Online Search which enables you to search the web for datasets. In this example I use the feature to search for data on Apollo space missions:

      image

      I find a list of all the missions and with one click its in Excel:

      image

      where I can do all the normal stuff that one would normally do in Excel. Here I have used Excel 2013’s Flash Fill feature to extract the year and pivot the data to discover which rocket was most popular and in which year the most missions were launched:

      image

      You get the idea. This is powerful stuff! The workbook that I built for this simple demo is available for viewing online or download at 20130227 Apollo Missions - Simple Data Explorer Demo.xlsx.

      Today’s release is also the first from a product team that has a number of other interesting projects in the pipeline that you will see more of in 2013 – watch this space.

      Data Explorer. Is it A2P?

      One last point I’ll make is that Data Explorer as it exists today ably fulfils the “Acquire” and “Processing” aspects of Mike Reich’s A2P approach which I talked about only twelve days ago at ETL is dead, long live AP2 ?. Interestingly Data Explorer in its October 2011 guise also provided the last part of A2P – “Publishing” (I demo’d the Publishing ability at Data Explorer walkthrough – Parsing a Twitter list) however that ability has been removed in this release in favour of landing the data into Excel. I remain hopeful that the ability to publish a Data Explorer feed returns in the future.

      Get the bits

      Download “Microsoft Data Explorer Preview for Excel” from http://www.microsoft.com/en-us/download/details.aspx?id=36803

      Read more

      I’m not the only one blogging about this. Check out:

      @Jamiet

    • My first venture into Hadoop

      If you keep an eye on the tech industry you cannot have failed to notice that an Open Source technology called Hadoop is getting lots of coverage these days. You’ll be able to find lots of descriptions around the web about what Hadoop actually is so I won’t try and come up with one myself other than to say its a technology for storing and querying over large sets of data.

      Microsoft recently released a preview of their Hadoop offering which they are calling HDInsight; it can be installed on your own hardware (HDInsight Server) or you can sign up temporarily for their Azure-based offering (HSInsight Service). Learn more at http://www.microsoft.com/bigdata.

      Cindy Gross has provided an excellent step-by-step tutorial on how to store data on HDInsight (either via the service or server) and query it in her blog post Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View. A few days ago I downloaded HDInsight server and followed Cindy’s instructions for installing it, inserting data, querying it and then building a Power View report atop it and the whole thing took less than two hours. Hence, if you’re looking to get familiar with Hadoop and you’re primarily a Windows guy like me then this is a really quick and easy (and free) way of going about it.

      Cindy’s tutorial demonstrates how to combine data from Hadoop with data from SQL Server using PowerPivot and the output of that combined data is a Power View report:

      image

      My attempt is available on SkyDrive at http://sdrv.ms/Zhv4tB although be aware that Power View doesn’t work on SkyDrive so you’ll have to download the containing Excel workbook and open it in Excel 2013 if you want to see anything useful.


      As I said above the steps to installing HDInsight, sticking data in it, and querying it were very easy thanks to Cindy’s thorough blog post and hence I was able to make some observations about this murky world of Hadoop that I’ve been hearing so much about.

      Firstly, I gather that the “normal” way of querying Hadoop is to write programs in Java however a technology called Hive provides an abstraction over Hadoop that can make the data therein appear as a table with rows and columns just as those of us immersed in the RDBMS world are used to. For example, the data provided by Cindy was provided as a text file:

      SNAGHTML10fe22b5

      which literally gets stored in Hadoop as-is. Hive allows us to surface that file as a table using the following syntax:

      CREATE EXTERNAL TABLE census (
        State_FIPS int
      , County_FIPS int
      , Population bigint
      , Pop_Age_Over_69 bigint
      , Total_Households bigint
      , Median_Household_Income bigint
      , KeyID string
      )
      COMMENT 'US Census Data'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE LOCATION '/user/demo/census';

      Its fairly intuitive to look at this CREATE EXTERNAL TABLE definition and understand exactly what Hive is doing. Its taking a text file whose “columns” are delimited by tabs and defining data types for those columns.

      There also exists an ODBC driver for Hive that enables one to query the data from anything that can talk ODBC (e.g. Excel).

      I have often heard it said that Hadoop is for querying over unstructured data however, to me, that text file has got lots of structure to it; its got column delimiters and row delimiters, data in each row is stored consistently. That observation led me to question the use of the word "”unstructured” and I asked Cindy to clarify what that word meant in this context. She replied that its more a question of when structure is applied (in this case its being applied by Hive after the data is stored) and she correctly pointed out that a different structure could have been defined in the CREATE EXTERNAL TABLE command without modifying the underlying data.

      If nothing else this tutorial helped to crystalize some of the marketing fluff that gets bandied about regarding Hadoop. Yes, you can probably store anything you like in there (hence “unstructured”) however in order to get any value from it you’re going to have to apply structure at some point even if that’s as simple as breaking a document into all the individual words by specifying a whitespace delimiter.

      That’s a short rundown of what was an equally short investigation. If Hadoop is something that interests you then downloading HDInsight Server and following Cindy’s tutorial might be a good place to start.

      @Jamiet

    • Weekend reading: Improvements to ODBC Data Source Adminstrator & dacpac vs bacpacs

      I read a few articles over the weekend and learnt a few things that I think might be worth bringing to folks’ attention.


      ODBC Administrator gets a lick of paint in Windows 8/Windows server 2012

      The ODBC Administrator in Windows 8 has had some minor improvements. There’s now no longer a 32bit & a 64bit version of it, there’s only one and it has a new column to tell you whether the 32bit, 64bit drivers or both are installed.

      image

      There are also a few PowerShell cmdlets to help with administration of ODBC Data Sources which if you’re a fan of automating deployments (as I am) is great news. Read more at ODBC DSN Management in the Next Release of Windows (code-named Windows “8” and Windows Server “8”).


      Bacpacs days may be numbered

      Bacpacs are files that act as an easy portable backup of schema+data of a SQL Server database and are typically used to move data between an on-premise SQL Server instance & Windows Azure SQL Database (aka SQL Azure). Recently however Dacpacs gained the ability to deploy data also so I asked the question on the SSDT forum Given that a dacpac can now deploy data, why would I ever need bacpacs? Gert Drapers, who until recently ran the team that builds SSDT and the DAC Framework, replied with:

      BACPACs are a left over of the past more then anything.The main reason will be that most of the existing tooling, SSMS 2012, the Azure management portal and teh Azure Import Export service, only support BACPACs.
      Right now the data support between the two is identical, DACPAC are or were supposed to add incremental data deployment, as right now it expects an empty table to load the data in to. Overtime I expect BACPAC to fade away and its usage to be replaced by DACPAC with data.

      In other words the only reason right now to use dacpacs is that that is what the tooling supports however you should expect that to change in the future. Note that Gert is no longer on the team that builds those bits so don’t quote this as being a formal roadmap.


      @Jamiet

    • ETL is dead, long live AP2 ?

      Three days ago I posted What would a cloud-based ETL tool look like? where I wondered out loud about the sorts of tools data integration dudes like myself would be using in the future. I got some good feedback and already have a list of “stuff” to go and look at including:

      • Boomi – They claim 1million cloud integrations (whatever one of those is) per day
      • AWS Data Pipeline – A web service that incorporates a scheduler, a workflow engine and (as the name suggests) a data pipeline engine
      • Informatica Cloud – An extension to Informatica’s market-leading PowerCenter for SalesForce.

      Most interesting to me though was a link that Joe Harris provided to a a blog post by Mike Reich entitled Rethinking ETL for the API age. Mike outlined a number of points that really struck a chord with me; the key one was his message that the Extract-Transform-Load (ETL) mantra that has been trumpeted for years should be replaced by something that is more pertinent for “the cloud” – Mike offers Acquiring, Processing and Publishing (AP2) as a new acronym (we all love acronyms, right?). The idea of publishing data rather than loading it really resonated with me as making data easily available in non-proprietary formats so that people can consume it in whatever manner they choose has long been an interest of mine.

      Here are some other bulleted thoughts that came into my head as I read Mike’s blog post:

      • Flows are fluid and flexible, unlike structured, point-to-point ‘pipelines’” – My interpretation of “fluid and flexible” is that these “flows” can be plugged together to create a greater whole. This gives rise to the notion of composability; imagine being able to leverage flows that other people have constructed in your own flows. Yahoo Pipes (which I first blogged about almost five years ago in Taking Yahoo Pipes for a test drive) was an early incarnation of this notion of composability and is a great demonstrator of what the future holds for us.
      • Composability further gives rise to the notion of a marketplace where one could sell “flows”. For example, one could build a flow that aggregated data for a given search term from both Google and Bing, deduplicated the results then made them available as a single feed; expose that feed via a marketplace and charge on a pay-per-use basis. Its a simplistic, contrived example but in my opinion aptly demonstrates the opportunity here. I think data marketplaces, perhaps more pertinently data integration marketplaces, are going to be huge, I really do. Given the technology agnostic nature that is being proposed here these marketplaces would be totally interoperable too, unlike the hateful app stores that today’s xkcd expertly satirises.
      • by using APIs to move information around, we decouple the data from the underlying technology and vendor” Absolutely true. An API is essentially a well-understood interface/abstraction over a proprietary data store so really there’s nothing new here (isn’t this what SOA was all about?) but there’s no harm in reiterating the point.
      • information is stored in multiple structures and formats. Any effort to manage information should focus on translating between structures rather than trying to develop a common schema” I worked on a project from 2005-2008 where we attempted to adhere to a supposed industry standard schema. Eventually we realised that those attempts were futile given that no business can be fitted neatly into an industry-standard-shaped-box and that dovetails nicely with Mike’s point here.
      • There are four common processing tasks; combining multiple streams, translating data formats, QA information, integrate third party processing” – I wonder if there is a fifth that we might refer to as data caching; after all, if we’re pulling data out of multiple APIs we are at the mercy of the speed at which those APIs can provide the data – is a person going to be prepared to wait for the data or do we need regularly cache the transformed data for easy retrieval?
      • Publishing should be application/technology agnostic” It would be hard for me to agree more with this point.

      As you can tell this is an area that I’m particularly interested in and shall continue to keep a watching brief.

      @Jamiet

    • SSDT database projects were not always going to output T-SQL…

      Interesting the stuff you pick up from reading forums…

      “The BATCH seperators [in SSDT deployment scripts] are used to provide scoping when resolving object shapes. SSDT or VSDB scripts for that matter serve a different purpose, they provide the blueprint of the model you want, and do not represent how it will be used to construct the target site. We choose TSQL as our DSL, at first we wanted to use a different representation, but user research in 2005 told us loud and clear users wanted to use TSQL as their DSL. In order to make the compiler work, we had to impose some simplifications, one is a more frequent use of BATCH seperators to indentify object scopes.”

      by Gert Drapers (former lead of the SSDT product team) at http://social.msdn.microsoft.com/Forums/en-US/ssdt/thread/47d4124b-82cd-48f9-8ba7-c2ae4c73cbcf/

      Personally I’m glad that they chose to use T-SQL as it makes debugging easier but I’d be interested to know what else was on the table!

      @Jamiet

    • What would a cloud-based ETL tool look like?

      Given that “moving data to the cloud” is, rightly or wrongly, currently in vogue in our industry I have to think that pretty soon there will be a glaring need for tools that help us to move data between these heterogeneous sources – a cloud-based ETL tool for cloud-based data if you will. Perhaps such a thing already exists -I’ve talked about Kapow in the past which may well be considered a form of cloud ETL tool given that it fetches data from the web– if you know of anything that might fit this very loose description feel free to let me know in the comments.

      I started to ponder what capabilities a cloud ETL tool should have and here’s a quick brainstormed list:

      Any thoughts here? As I said this is a brainstormed list so I don’t mind being told that I am approaching this from the wrong angle or even that I’m completely wrong . Should I be concentrating on scenarios rather than technologies?. I’m only too aware that given my ETL heritage my brain is already wired to consider how traditional ETL tools could be supplanted into the cloud (my mention of a job scheduler sold me out there) – perhaps that is completely wrong too and that my heritage is actually a disadvantage here.

      I’m interested to know what people think and hopefully trigger a conversation. I’m especially keen to hear about scenarios that you might have where you need to move and transform data that lives “in the cloud”.

      @Jamiet 

      *where applicable

      UPDATE. Within seconds of publishing this post I’d already been alerted to InformaticaCloud.com & AWS Data Pipeline. Checking those out now!

      I’ve been recommended to check out the following articles:

    • Exploring earnings data for the UK [Open Data]

      I have a burgeoning interest in the world of Open Data which wikipedia describes as:

      Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open data movement are similar to those of other "Open" movements such as open source, open hardware, open content, and open access. The philosophy behind open data has been long established, but the term "open data" itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.
      http://en.wikipedia.org/wiki/Open_data

      To that end I follow @LondonDatastore on Twitter as they are actively publishing Open Data pertaining to the city in which I live, London. Four days ago they announced that a new dataset had been released that provided earnings information, not just for London but for the whole country and going back many years too. The provided link, http://data.london.gov.uk/datastore/package/earnings-workplace-borough, brings up a page from where one can download an Excel workbook containing some data. Unfortunately the data in that workbook is not, in my opinion, provided in such a manner that makes it easily explorable (which I have complained about before on this blog); the data is spread over multiple worksheets:

      image

      Worse, the data is already aggregated and pivoted:

      image

      In other words the workbook does not contain the raw row-level data from which this pivoted data is produced. Thankfully a link (https://www.nomisweb.co.uk/query/construct/summary.asp?mode=construct&version=0&dataset=99) is provided from where the raw data can be downloaded. I found the interface there to be slightly clunky but that’s a minor quibble – that raw data is available should be considered a major boon. I downloaded earnings data for:

      • Local Authority
      • Gender
      • Full Time or Part Time
      • Year (1999-2012)

      and have made that raw data available in a publicly accessible Excel workbook. You can view that workbook online (only a browser required) at AnnualPay_by_YearGenderLocalAuthorityPartorfulltime.xlsx (from there you can also download for your own analysis).

      image

      The workbook also provides summaries over the raw data by way of pivot tables and charts. Arguably its clear to see, for example, that a gender imbalance exists although perhaps the gap may be lessening somewhat

      image

      Its interesting to note that the average salary for Males dropped off in 2009/2010. Perhaps the economic events of 2008 are the cause for that, checking out data for only City of London (the traditional financial hub of the UK) suggests that may well be the case:

      image

      There will be many more nuggets of information available in this data, all we need to do is set the data free so that people can find it for themselves. That is the aim of this blog post so hit the link: AnnualPay_by_YearGenderLocalAuthorityPartorfulltime.xlsx and see what other nuggets you can find! If dataviz is an area that interests you then this is a cracking dataset to explore!


      One other point I want to make is that the raw data is provided as a mean average which means that each aggregated figure is a mean average of some mean averages. This isn’t good as it distorts the data as I demonstrate with this simple example:

      image

      We have an average salary for both male and female (5000 & 9000 respectively) and the average of those two averages is 7000. However if we take the total “total salary” / total “tally of people asked” (127000 / 23) then the average is quite different – 5521.73.

      That’s a large discrepancy even for only two rows of data and it highlights the problem of providing averages rather than the figures that created those averages. To me this is a data quality issue – the raw source data does not provide the requisite level of detail to enable accurate analysis. Quality of data is paramount.

      @Jamiet

    This Blog

    Syndication

    Powered by Community Server (Commercial Edition), by Telligent Systems
      Privacy Statement