Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

ETL is dead, long live AP2?

Three days ago I posted What would a cloud-based ETL tool look like? where I wondered out loud about the sorts of tools data integration dudes like myself would be using in the future. I got some good feedback and already have a list of “stuff” to go and look at, including:

  • Boomi – They claim 1 million cloud integrations (whatever one of those is) per day
  • AWS Data Pipeline – A web service that incorporates a scheduler, a workflow engine and (as the name suggests) a data pipeline engine
  • Informatica Cloud – An extension to Informatica’s market-leading PowerCenter for Salesforce.

Most interesting to me though was a link that Joe Harris provided to a blog post by Mike Reich entitled Rethinking ETL for the API age. Mike outlined a number of points that really struck a chord with me; the key one was his message that the Extract-Transform-Load (ETL) mantra that has been trumpeted for years should be replaced by something more pertinent for “the cloud” – Mike offers Acquiring, Processing and Publishing (AP2) as a new acronym (we all love acronyms, right?). The idea of publishing data rather than loading it really resonated with me, as making data easily available in non-proprietary formats so that people can consume it in whatever manner they choose has long been an interest of mine.
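To make that concrete, here is a minimal sketch of an acquire/process/publish flow in Python. Everything in it is illustrative: the source URL, the field names and the output files are invented, and “publish” here simply means emitting open formats (JSON and CSV) rather than loading into a proprietary store.

```python
import csv
import json

import requests  # third-party HTTP client


# Acquire: pull raw records from a (hypothetical) source API.
def acquire(url: str) -> list[dict]:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


# Process: reshape/clean each record; here we just pick and tidy two fields.
def process(records: list[dict]) -> list[dict]:
    return [
        {"id": r.get("id"), "name": (r.get("name") or "").strip()}
        for r in records
    ]


# Publish: write the result in open, non-proprietary formats so anyone
# can consume it however they choose.
def publish(records: list[dict], basename: str) -> None:
    with open(f"{basename}.json", "w") as f:
        json.dump(records, f, indent=2)
    with open(f"{basename}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    publish(process(acquire("https://api.example.com/customers")), "customers")
```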

Here are some other bulleted thoughts that came into my head as I read Mike’s blog post:

  • “Flows are fluid and flexible, unlike structured, point-to-point ‘pipelines’” – My interpretation of “fluid and flexible” is that these “flows” can be plugged together to create a greater whole. This gives rise to the notion of composability; imagine being able to leverage flows that other people have constructed in your own flows. Yahoo Pipes (which I first blogged about almost five years ago in Taking Yahoo Pipes for a test drive) was an early incarnation of this notion of composability and is a great demonstrator of what the future holds for us.
  • Composability further gives rise to the notion of a marketplace where one could sell “flows”. For example, one could build a flow that aggregated data for a given search term from both Google and Bing, deduplicated the results, then made them available as a single feed; expose that feed via a marketplace and charge on a pay-per-use basis (see the sketch after this list). It’s a simplistic, contrived example but in my opinion it aptly demonstrates the opportunity here. I think data marketplaces, perhaps more pertinently data integration marketplaces, are going to be huge, I really do. Given the technology-agnostic nature of what is being proposed here, these marketplaces would be totally interoperable too, unlike the hateful app stores that today’s xkcd expertly satirises.
  • “by using APIs to move information around, we decouple the data from the underlying technology and vendor” – Absolutely true. An API is essentially a well-understood interface/abstraction over a proprietary data store, so really there’s nothing new here (isn’t this what SOA was all about?), but there’s no harm in reiterating the point.
  • “information is stored in multiple structures and formats. Any effort to manage information should focus on translating between structures rather than trying to develop a common schema” – I worked on a project from 2005-2008 where we attempted to adhere to a supposed industry-standard schema. Eventually we realised that those attempts were futile, given that no business can be fitted neatly into an industry-standard-shaped box, and that dovetails nicely with Mike’s point here.
  • “There are four common processing tasks; combining multiple streams, translating data formats, QA information, integrate third party processing” – I wonder if there is a fifth that we might refer to as data caching; after all, if we’re pulling data out of multiple APIs we are at the mercy of the speed at which those APIs can provide the data – is a person going to be prepared to wait for the data, or do we need to regularly cache the transformed data for easy retrieval (see the sketch after this list)?
  • “Publishing should be application/technology agnostic” – It would be hard for me to agree more with this point.
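As a rough illustration of the composability and caching ideas in the list above, here is a sketch in Python. The two search “flows” are hard-coded stand-ins rather than real Google/Bing API calls, the cache is nothing more than an in-process dictionary with a time-to-live, and all the names are invented for illustration.

```python
import time
from typing import Callable


# Two upstream "flows" - stand-ins for the Google and Bing feeds in the
# example above; in reality each would call out to a search API.
def google_results(term: str) -> list[str]:
    return [f"google:{term}:1", f"shared:{term}"]


def bing_results(term: str) -> list[str]:
    return [f"bing:{term}:1", f"shared:{term}"]


# Compose: combine multiple flows, de-duplicate, and expose a single feed.
def combined_feed(term: str, flows: list[Callable[[str], list[str]]]) -> list[str]:
    seen: set[str] = set()
    merged: list[str] = []
    for flow in flows:
        for item in flow(term):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Cache: avoid hammering the upstream APIs on every request by keeping the
# transformed result around for a fixed time-to-live.
_cache: dict[str, tuple[float, list[str]]] = {}


def cached_feed(term: str, ttl_seconds: int = 300) -> list[str]:
    now = time.time()
    hit = _cache.get(term)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]
    result = combined_feed(term, [google_results, bing_results])
    _cache[term] = (now, result)
    return result


print(cached_feed("etl"))   # hits both upstream flows
print(cached_feed("etl"))   # served from the cache
```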

As you can tell this is an area that I’m particularly interested in and shall continue to keep a watching brief.

@Jamiet

Published Friday, February 15, 2013 4:45 PM by jamiet

Comments

 

Chris Nelson said:

Jamie,

Pardon me if I'm laughing at Cloud Data APIs saving the world (aka ETL 2.0 with magic bullets!!!). We get thousands of different files from hundreds of different clients. These in most cases come from people who can barely use Excel and barely understand delimiters. So who is going to generate all these APIs???

:)

Chris  

February 15, 2013 2:09 PM
 

jamiet said:

Hello Chris,

Thanks for the comment.

I don't think I've insinuated anywhere that the scenarios I'm outlining here are going to solve any problems that we have with ETL solutions today; on the contrary, I believe that new technologies inevitably bring with them a whole raft of new problems that need solving. The problems I experience on a day-to-day basis can nearly always be put in the "data quality issue" bucket and I don't see that changing any time soon.

I made a similar, though more light-hearted, response to your comment on my previous post at http://sqlblog.com/blogs/jamie_thomson/archive/2013/02/12/what-would-a-cloud-based-etl-tool-look-like.aspx. I assure you I'm not advocating any "silver bullets" as you put it, merely stating my belief that the job of a data integrator may well be changing significantly.

Again, thanks for the comment. Debating this issue is good.

Regards

Jamie

February 15, 2013 3:51 PM
 

Frank Szendzielarz said:

Interesting. We have been tackling some obliquely related philosophical issues at the financial institution where I work.

We are slowly re-architecting things. Some of the components that are being decommissioned are in-house ETL types of software. Bit by bit these are becoming redundant because we are moving to more of an event-driven 'flow'.

As new data comes in (for example stock price updates), this can now be fed to other systems on an item-by-item basis using pub-sub messaging infrastructure.

What about joins and transformation? In at least one system, the ETL batch processing equivalent has become obsolete because we are using message-based correlation in combination with out-of-order queued messages (natively supported by BufferedReceive in WF4 with net.msmq binding) in WF4 state machines. The state machine guarantees that messages are dequeued in the correct order, and instances of the state machine correlate on specific keys.
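In outline the correlate-and-reorder idea looks something like the following (a plain Python sketch rather than WF4 or net.msmq, with made-up message fields; one OrderedStream instance plays the role of one state-machine instance per correlation key):

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    correlation_id: str   # e.g. the stock ticker the update belongs to
    sequence: int         # position within that correlation's stream
    payload: dict


@dataclass
class OrderedStream:
    """Buffers out-of-order messages and releases them strictly in sequence."""
    next_sequence: int = 0
    pending: dict = field(default_factory=dict)

    def accept(self, msg: Message) -> list:
        self.pending[msg.sequence] = msg
        released = []
        while self.next_sequence in self.pending:
            released.append(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
        return released


# One ordered stream per correlation key, mimicking one workflow instance per key.
streams: dict[str, OrderedStream] = {}


def on_message(msg: Message) -> None:
    stream = streams.setdefault(msg.correlation_id, OrderedStream())
    for m in stream.accept(msg):
        print(f"processing {m.correlation_id} #{m.sequence}: {m.payload}")


# Messages arriving out of order are still processed in order per key.
on_message(Message("MSFT", 1, {"price": 27.9}))
on_message(Message("MSFT", 0, {"price": 27.5}))
```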

In the cloud, Azure Service Bus has a pub-sub messaging system. Flows of data can be orchestrated using WF. Pub-sub messaging can allow the information to be published to subscribers. This way, information flow and transformation can be managed using dynamically or declaratively generated WF activities.

As for "AP2" on bulk data - I would call that "Reporting" if happens at intervals, or perhaps a continuously updated Report.

February 16, 2013 8:48 AM
 

jamiet said:

Thanks for the comment Frank, very interesting indeed, especially to hear what you are using in place of an ETL process.

February 16, 2013 8:56 AM
 

Chris Nelson said:

Jamie,

I'm in a different situation than Frank. The data I deal with is at most a day to a month old, so it works well for ETL processing. I'm also dealing with multiple time periods and long-term analysis.

February 18, 2013 8:23 AM
 

Sarath said:

Hello Chris,

So do you mean that emerging non-ETL methods like AP2 are not suitable for data that is historic in nature?

Thanks,

Sarath

February 20, 2013 6:07 PM
 

 

SSIS Junkie said:

A short recap: At the PASS Summit 2011 a project that existed as part of the now-defunct SQL Azure Labs …

February 27, 2013 12:47 PM
 

SSIS Junkie said:

Three months ago I published a fairly scathing attack on what I saw as some lacklustre announcements …

July 12, 2013 12:36 PM
 

Mark White said:

In terms of transforming data between structures... the one and only structure that *all* data can be transformed into is "flat and fine-grained". If the transmission of flat, fine-grained data was made efficient through columnar compression (like how VertiPaq stores data), then adopting "flat & fine-grained" as a standard wouldn't be a problem.
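Roughly what I mean, sketched in Python with made-up data; run-length encoding here is only a crude stand-in for the columnar compression a VertiPaq-style engine would actually apply:

```python
from itertools import groupby


# "Flat and fine-grained": any record can be decomposed into
# (entity, attribute, value) rows - a shape every structure can reach.
def flatten(entity_id: str, record: dict) -> list[tuple[str, str, str]]:
    return [(entity_id, attr, str(value)) for attr, value in record.items()]


rows = (
    flatten("cust-1", {"country": "UK", "status": "active"})
    + flatten("cust-2", {"country": "UK", "status": "active"})
    + flatten("cust-3", {"country": "FR", "status": "active"})
)
rows.sort(key=lambda r: (r[1], r[2]))  # sort so identical values sit in runs

# A toy columnar layout: one list per column...
columns = {
    "entity":    [r[0] for r in rows],
    "attribute": [r[1] for r in rows],
    "value":     [r[2] for r in rows],
}


# ...compressed with simple run-length encoding.
def run_length_encode(column: list[str]) -> list[tuple[str, int]]:
    return [(value, len(list(group))) for value, group in groupby(column)]


for name, col in columns.items():
    print(name, run_length_encode(col))
```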

September 2, 2013 3:15 AM
