
Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Tabular Data Packages and a BIML challenge

Update: The challenge set below has been answered by Scott Currie, who also happens to be the inventor of BIML. He has made his work available at https://github.com/bimlscript/BETDPI

In February 2004 I joined a company called Conchango and had the pleasure of working with some very very smart people (Mick Horne, Dan Perrin, Pete Spencer, Paul Mcmillan, Steve Wright amongst others) who schooled me in the ways and means of building data integration solutions. One piece of IP that Conchango owned was a means of transporting a collection of data files in a well-known format. We called it the "Generic Feed Interface" and it broke down like this:

  • Data was kept in CSV files. There was no limit to how many files were in a collection.
  • Each collection had two extra files, an inventory file and a metadata file. Both of these were also CSV files.
  • The inventory file listed all of the data files.
  • The metadata file defined the data type of each column in each data file.

Here’s a very simple example of what this looked like, a collection consisting of one datafile:

[Images: the example data file, its inventory file, and its metadata file]
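The original images aren't reproduced here, but to give a flavour of the idea, a single-datafile collection along these lines might have looked roughly like the following. The file names, column names and exact layout are illustrative guesses rather than the actual Conchango format:

  Customer.csv (the data file)
    CustomerId,CustomerName,DateOfBirth
    1,Alice,1980-01-01
    2,Bob,1975-06-30

  inventory.csv (lists the data files in the collection)
    FileName
    Customer.csv

  metadata.csv (defines the data type of each column in each data file)
    FileName,ColumnName,DataType
    Customer.csv,CustomerId,int
    Customer.csv,CustomerName,varchar(100)
    Customer.csv,DateOfBirth,date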

As well as this format for defining data, our tangible IP consisted of two DTS packages that could:

  • be pointed at a database and produce such a collection
  • be given a collection and import it into a specified database, creating target tables if required

This was used to extract data from systems of record which could then later be imported elsewhere for additional processing (e.g. as a source for a data warehouse). We liked this approach because we were loosely coupling our data warehouse from the systems of record, which brought benefits such as the ability to extract data from source at a different time from that at which it was loaded into the target. It was a great piece of IP and was used at a time which heralded a very successful period for us in the Microsoft BI space. Doubtless many of you reading this have invented and/or used similar mechanisms for moving data.


It was with interest, then, that earlier this week I was alerted (by Matt Smith) to an effort to achieve the same thing, called Tabular Data Package.


Tabular Data Packages use a single JSON file to serve the same purpose as our inventory and metadata files, but it is the same basic concept: a well-known format for transporting data in human-readable files. Tabular Data Packages are advocated by the Open Data Institute; see Jeni Tennison's blog post on the subject, 2014: The Year of CSV.
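For illustration, a minimal datapackage.json descriptor for the hypothetical Customer.csv above might look roughly like this (a simplified sketch only; consult the Tabular Data Package spec for the full set of properties):

  {
    "name": "customer-package",
    "resources": [
      {
        "path": "Customer.csv",
        "schema": {
          "fields": [
            { "name": "CustomerId",   "type": "integer" },
            { "name": "CustomerName", "type": "string" },
            { "name": "DateOfBirth",  "type": "date" }
          ]
        }
      }
    ]
  }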

It occurs to me that it would be really useful to have a couple of SSIS packages that work with a Tabular Data Package in the same manner that our DTS packages worked with our data file collections all those years ago. Then again, using the SSIS object model to dynamically generate packages based on some known metadata (which is what would be required here) is notoriously difficult; it would be better if there existed a code-generation tool for SSIS packages. Luckily such a thing exists: it's called BIML, it is free, and it is becoming a very very popular means of developing SSIS solutions.
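To give a sense of what such a script involves, here is a rough BimlScript sketch (not a working solution, and not Scott Currie's implementation linked above) that loops over a hard-coded list of table names and emits one export package per table. In a real solution the table list and column metadata would be read from the source database or from datapackage.json, and each dataflow would also need the flat file connection and format definitions that describe the CSV layout:

  <#@ template language="C#" #>
  <Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
      <!-- Placeholder connection string; in the challenge this would be supplied as a parameter -->
      <OleDbConnection Name="Source" ConnectionString="Provider=SQLNCLI11;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;" />
    </Connections>
    <Packages>
      <# // Hypothetical table list; in practice derive it from the database or the package descriptor
         foreach (var table in new[] { "Customer", "Orders" }) { #>
      <Package Name="Export <#=table#>" ConstraintMode="Linear">
        <Tasks>
          <Dataflow Name="DFT Export <#=table#>">
            <Transformations>
              <OleDbSource Name="SRC <#=table#>" ConnectionName="Source">
                <DirectInput>SELECT * FROM dbo.[<#=table#>]</DirectInput>
              </OleDbSource>
              <!-- A FlatFileDestination writing <#=table#>.csv would go here,
                   together with the FlatFileConnection and FileFormat it requires -->
            </Transformations>
          </Dataflow>
        </Tasks>
      </Package>
      <# } #>
    </Packages>
  </Biml>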

Therefore I’m issuing a friendly challenge. Can anyone out there build a BIML script that can, given a database connection string, generate a Tabular Data Package containing all the data in that database? Furthermore, can someone also build a BIML script that can consume a Tabular Data Package and push all the data therein into a SQL Server database?

The challenge is set. Is anyone game?

@Jamiet

Published Friday, March 27, 2015 10:58 PM by jamiet


Comments

 

sql joe said:

why so complicated? there are already definitions about how to store data in a flat file. take rfc 4180 for example. IMO it's very complete and handles all the corner cases (quotes within quotes, etc.)

SSIS will never work with CSV, it doesn't support new columns at runtime. so why even try using it? isn't it way easier to import the data outside ssis using a normal script? why?? it's like a one-liner against writing your own new single-purpose language?

March 31, 2015 3:21 AM
 

jamiet said:

Thx for the comment Joe.

"why so complicated? there are already definitions about how to store data in a flat file. take rfc 4180 for example. IMO it's very complete and handles all the corner cases (quotes within quotes, etc.)"

I presume what you mean is that the Tabular Data Package spec tries to reinvent rfc4180. Am I correct? Please do correct me if not.

I am guessing (which, given I didn't design the spec, is all that I can do) the answer would be that an rfc is only of use if people adhere to it which, in my experience, people that provide CSVs do not.

Ostensibly though your question is a question for whoever defined the spec for Tabular Data Packages and hence, if you have a grievance over that, take it up with them at https://github.com/dataprotocols/dataprotocols/issues

"SSIS will never work with CSV, it doesn't support new columns at runtime"

I presume you mean "SSIS's dataflow engine doesn't allow unexpected columns to appear in the source". Is conformance to a schema a bad thing? That's not a rhetorical question; I'd genuinely like to know (although it's digressing into a discussion about SSIS rather than a discussion about Tabular Data Packages per se)

"isn't it way easier to import the data outside ssis using a normal script? why??"

Perhaps so. If you'd prefer to not use SSIS, don't use it. I suspect there are folks out there that would like to use SSIS, hence this initiative.

March 31, 2015 3:40 AM
 

the same said:

Columns are not only "appearing" in flat files, you might want to use the same package for different flat files (or data sources). IMO this is the main reason why people start generating packages. I wasted weeks of my youth in such a project which exploded at the end. I am sure that with this BIML it's way easier, but isn't that fighting symptoms? Isn't it the same bad thing as a programmer copy/pasting a function because he doesn't know what a parameter is? (or the tool doesn't allow you to add a parameter, where parameter is an analogy for a data set)

For me at least, I've decided to never ever go down that road again. It doesn't mean you have to ditch SSIS, just do these parts outside or in a script task.

March 31, 2015 10:13 AM
 

jamiet said:

"you might want to use the same package for different flat files (or data sources)"

OK, personally I'd never want to do that but each to their own.

I assume that the reason you'd want to do this is so you can reuse some "stuff" (e.g. transformation logic) that you've built for multiple files. That's a fair shout; it's a shame SSIS has got such a diabolically poor story around code reuse.

March 31, 2015 10:45 AM
 

MarvinS said:

I like this idea actually! BIML is the perfect technique to define a 'template' package for your data tables. It could easily consume the JSON definitions and dynamically create a single package per definition. You'd only have to click 'regenerate' if one of your definitions changes and you should be good to go. The only challenge I see is datatype matching between the JSON definition and SSIS.

I'm wondering though... How would you get the source data into a CSV file? Don't you need to write some code (perhaps an SSIS package) to extract source system data and put it into a CSV file? What's the benefit of this approach in, let's say, a traditional data warehouse environment? Wouldn't this approach greatly increase your DWH load time?

May 13, 2015 2:34 AM
 

jamiet said:

Hi Marvin,

"How would you get the sourcedata in a CSV-file?"

As I said above the challenge is:

"Can anyone out there build a BIML script that can, given a database connection string, generate a Tabular Data Package containing all the data in that database? Furthermore, can someone also build a BIML script that can consume a Tabular Data Package and push all the data therein into a SQL Server database?"

In other words I'm looking for something that can both produce and consume the Tabular Data Package.

"What's the benefit of this approach in let's say a traditional data warehouse environment?"

You'll have to tell me what you mean by "traditional data warehouse environment" in this context and how it differs from what I'm proposing above.

"Wouldn't this approach greatly increase your DWH load time?"

Compared to what exactly?

Regards

Jamie

May 13, 2015 3:07 AM
 

MarvinS said:

I guess I misunderstood. I thought the CSV files would already be there, but they're part of the Tabular Data Package obviously.

Well, in a traditional DWH environment you would load your data from Source > Staging Area > DWH > DM. Aren't you introducing an extra layer by implementing the TDP? (Source > CSV > Staging Area > DWH > DM). How would this affect the load time?

I'm just trying to give this 'CSV layer' a place in the DWH environment of the client I'm currently working for, to see if it could be beneficial for us :-).

Regards,

Marvin

May 13, 2015 3:23 AM
 

jamiet said:

OK, I gotcha.

Perhaps if you just conceptualise your "Source > Staging Area > DWH" flow slightly differently. Does your staging area have to be a database per se? Could a Tabular Data Package act as a staging area of sorts? I'm not saying the answer is yes but what I am saying is that we shouldn't assume that dropping data into portable artefacts (aka "files") negates anything we've done before.

As I said previously I've seen great success when using similar techniques in the past. The ability to decouple your systems brings great benefits.

I also think that raw performance isn't always the be-all and end-all; in my experience there's always a trade-off to be made between speed, maintainability, restartability and god knows how many other *ilities one might dream up.

May 13, 2015 3:34 AM
