THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Azure Automation – the beginning of cloud ETL on Azure?

I have maintained a watching brief on what I refer to as “cloud ETL”, that is the ability build ETL routines in a cloud environment and therefore leverage all the benefits that the cloud model brings*. Thus far my main opinion piece in this area is What would a cloud-based ETL tool look like? in which I laid out what features I thought a cloud ETL tool should have:

  • Data transformation would be done “in the cloud” i.e. I wouldn’t need to own my own hardware in order to run it
  • Ability to consume data from/push data to <many different data protocols>
  • Adapters (possibly with a plug-in model) for cloud storage and API providers
  • Job scheduler
  • Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)
  • An IDE (open to debate whether the IDE should be “in the cloud” as well)
  • Ability to carry out common transformations (join, aggregate, sort, projection) on those heterogeneous data sources
  • Ability to authenticate using different authentication mechanisms
  • Configurable logging
  • Ability to publish transformed data in a manner that makes it consumable rather than insert it into another data store

http://sqlblog.com/blogs/jamie_thomson/archive/2013/02/12/what-would-a-cloud-based-etl-tool-look-like.aspx

Given that I have spent the majority of my career working with Microsoft technologies (in particular their ETL tool, SSIS) I am interested to know whether Microsoft will offer a cloud ETL tool. With that in mind I was interested to discover a new service on Azure that is currently in preview called Azure Automation (read Announcing Microsoft Azure Automation Preview). Azure Automation is essentially a a cloud-based workflow tool and, as I said above, workflow is a feature that I believe a cloud-based ETL tool should encompass:

  • Workflow. (e.g. Do this, then do that. Do these things in parallel. Only do this if some condition is true. Restart from here in case of failure.)

SSIS developers will of course be aware that SSIS has its own workflow tool (termed the Control Flow). It always kind of bugged me that different Microsoft tools had their own workflow technology. SSIS had one, I believe BizTalk had one, there was another called Windows Workflow Foundation (WWF) and in fact there was a possibility within the SQL Server 2008 timeframe that SSIS would replace its Control Flow with WWF (that never happened and the Program Manager that wanted to do it has since left the SSIS product team).

Azure Automation is built upon Powershell Workflow which in turn is built upon WWF (now simply called Workflow Foundation – WF). It certainly seems as though WF is becoming the foundational workflow technology to rule them all within Microsoft and that is no bad thing in my opinion – it seems foolish to reinvent the wheel every time. Powershell Workflow has the following cmdlets for building workflows:

  • Workflow
  • Parallel
  • Foreach –parallel
  • Sequence
  • InlineScript
  • Checkpoint-workflow
  • Suspend-workflow

Those are all fairly self-explanatory. Of particular interest to me is Foreach –parallel (we’ve been asking for a native Parallel ForEach Loop in SSIS for years) and that might be even more useful in a scale-out infrastructure such as can be offered by the cloud (imagine firing off multiple FTP tasks in parallel, all working on different Azure nodes). Checkpoint-Workflow also sounds very interesting:

A checkpoint is a snapshot of the current state of the workflow, including the current values of variables, and any output generated up to that point, and it saves it to disk. You can add multiple checkpoints to a workflow by using different checkpoint techniques. Windows PowerShell automatically uses the data in newest checkpoint for the workflow to recover and resume the workflow if the workflow is interrupted, intentionally or unintentionally.
http://technet.microsoft.com/en-us/library/jj574114.aspx

Stateful restartability that you can control, all out-of-the-box. How cool is that? So much better than the awful checkpointing feature within SSIS.

It certainly appears to me that Azure Automation could satisfy my desire for a workflow engine for the purposes of cloud-ETL. Now if only Microsoft were working on cloud-based dataflows too we’d have something akin to SSIS-in-the-cloud Winking smile.

@Jamiet

*My own personal opinion is that the benefits of the cloud model can be summed up simply as “OPEX not CAPEX”. You may have your own definition, and that’s OK.

Published Thursday, April 24, 2014 1:42 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement