Three days ago I posted What would a cloud-based ETL tool look like? where I wondered out loud about the sorts of tools data integration dudes like myself would be using in the future. I got some good feedback and already have a list of “stuff” to go and look at including:
- Boomi – They claim 1 million cloud integrations (whatever one of those is) per day
- AWS Data Pipeline – A web service that incorporates a scheduler, a workflow engine and (as the name suggests) a data pipeline engine
- Informatica Cloud – An extension to Informatica’s market-leading PowerCenter for Salesforce.
Most interesting to me though was a link that Joe Harris provided to a blog post by Mike Reich entitled Rethinking ETL for the API age. Mike outlined a number of points that really struck a chord with me; the key one was his message that the Extract-Transform-Load (ETL) mantra that has been trumpeted for years should be replaced by something that is more pertinent for “the cloud” – Mike offers Acquiring, Processing and Publishing (AP2) as a new acronym (we all love acronyms, right?). The idea of publishing data rather than loading it really resonated with me: making data easily available in non-proprietary formats, so that people can consume it in whatever manner they choose, has long been an interest of mine.
Here are some other bulleted thoughts that came into my head as I read Mike’s blog post:
- “Flows are fluid and flexible, unlike structured, point-to-point ‘pipelines’” – My interpretation of “fluid and flexible” is that these “flows” can be plugged together to create a greater whole. This gives rise to the notion of composability; imagine being able to leverage flows that other people have constructed in your own flows. Yahoo Pipes (which I first blogged about almost five years ago in Taking Yahoo Pipes for a test drive) was an early incarnation of this notion of composability and is a great demonstrator of what the future holds for us.
- Composability further gives rise to the notion of a marketplace where one could sell “flows”. For example, one could build a flow that aggregated data for a given search term from both Google and Bing, deduplicated the results then made them available as a single feed; expose that feed via a marketplace and charge on a pay-per-use basis. It’s a simplistic, contrived example but in my opinion it aptly demonstrates the opportunity here. I think data marketplaces, perhaps more pertinently data integration marketplaces, are going to be huge, I really do. Given the technology-agnostic nature of what is being proposed here these marketplaces would be totally interoperable too, unlike the hateful app stores that today’s xkcd expertly satirises.
- “by using APIs to move information around, we decouple the data from the underlying technology and vendor” Absolutely true. An API is essentially a well-understood interface/abstraction over a proprietary data store so really there’s nothing new here (isn’t this what SOA was all about?) but there’s no harm in reiterating the point.
- “information is stored in multiple structures and formats. Any effort to manage information should focus on translating between structures rather than trying to develop a common schema” I worked on a project from 2005-2008 where we attempted to adhere to a supposed industry standard schema. Eventually we realised that those attempts were futile given that no business can be fitted neatly into an industry-standard-shaped-box and that dovetails nicely with Mike’s point here.
- “There are four common processing tasks; combining multiple streams, translating data formats, QA information, integrate third party processing” – I wonder if there is a fifth that we might refer to as data caching; after all, if we’re pulling data out of multiple APIs we are at the mercy of the speed at which those APIs can provide the data – is a person going to be prepared to wait for the data or do we need to regularly cache the transformed data for easy retrieval?
- “Publishing should be application/technology agnostic” It would be hard for me to agree more with this point.
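
To make a few of these thoughts a little more concrete, here is a minimal sketch in Python of what composable “flows” might look like: two streams are combined, one is translated into the other’s structure, the results are deduplicated, and the published feed is cached so that consumers aren’t left waiting on slow upstream APIs. Everything in it is an assumption made for illustration; `engine_a` and `engine_b` are hypothetical stand-ins that return canned data rather than calling any real search API, and `rename`, `dedupe_by`, `compose` and `CachedFeed` are names I have made up, not part of any tool mentioned above.

```python
import json
import time
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
# A "flow" is simply a function from a stream of records to a stream of records.
Flow = Callable[[Iterable[Record]], Iterator[Record]]


def engine_a(term: str) -> List[Record]:
    """Hypothetical source; stands in for one search API."""
    return [
        {"url": "http://example.com/1", "title": f"{term} one"},
        {"url": "http://example.com/2", "title": f"{term} two"},
    ]


def engine_b(term: str) -> List[Record]:
    """Hypothetical source with a different structure ("link", not "url")."""
    return [
        {"link": "http://example.com/2", "title": f"{term} two"},
        {"link": "http://example.com/3", "title": f"{term} three"},
    ]


def combine(*sources: Iterable[Record]) -> Iterator[Record]:
    """Acquire: merge several streams into one."""
    for source in sources:
        yield from source


def rename(old: str, new: str) -> Flow:
    """Process: translate between structures instead of forcing a common schema."""
    def flow(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            out = dict(record)
            if old in out:
                out[new] = out.pop(old)
            yield out
    return flow


def dedupe_by(key: str) -> Flow:
    """Process: keep only the first record seen for each value of `key`."""
    def flow(records: Iterable[Record]) -> Iterator[Record]:
        seen = set()
        for record in records:
            if record[key] not in seen:
                seen.add(record[key])
                yield record
    return flow


def compose(*flows: Flow) -> Flow:
    """Plug flows together, left to right, to create a greater whole."""
    def composed(records: Iterable[Record]) -> Iterator[Record]:
        for flow in flows:
            records = flow(records)
        return iter(records)
    return composed


def search_feed(term: str) -> List[Record]:
    pipeline = compose(rename("link", "url"), dedupe_by("url"))
    return list(pipeline(combine(engine_a(term), engine_b(term))))


class CachedFeed:
    """Publish: serve JSON, rebuilding only when the cached copy is stale."""

    def __init__(self, build: Callable[[], List[Record]], ttl_seconds: float):
        self._build = build
        self._ttl = ttl_seconds
        self._payload = None
        self._built_at = 0.0

    def publish(self) -> str:
        now = time.monotonic()
        if self._payload is None or now - self._built_at > self._ttl:
            self._payload = json.dumps(self._build(), sort_keys=True)
            self._built_at = now
        return self._payload


if __name__ == "__main__":
    feed = CachedFeed(lambda: search_feed("etl"), ttl_seconds=60)
    print(feed.publish())
```

Because each flow has the same shape (records in, records out), a flow that somebody else built could be dropped into `compose` alongside my own, which is really all that composability means here. Publishing as plain JSON is just one technology-agnostic choice; any non-proprietary format would do.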
As you can tell this is an area that I’m particularly interested in, and I shall continue to keep a watching brief on it.
@Jamiet