THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Kevin Kline

ETL Demo With Data from Data.Gov

A little over a month ago, I wrote an article (Is There Such a Thing as Easy ETL) about expressor software and their desktop ETL application, expressor Studio. I wrote about how it seemed much easier than the native ETL tools in SQL Server when I was reading up on the tool, but that the "proof would be in the pudding" so to speak when I actually tried it out loading some free (and incredibly useful) data from the US federal data clearinghouse, Data.Gov.

If you'd rather not read my entire previous article - quick recap, expressor Studio uses “semantic types” to manage and abstract mappings between sources and targets. In essence, these types are used for describing data in terms that humans can understand—instead of describing data in terms that computers can understand. The idea of semantic abstraction is quite intriguing and it gave me an excuse to use data from data.gov to build a quick demo. You can download the complete data set I used from the following location: International Statistics. (Note: I have this dream that I'm going to someday download all of this free statistical data sets, build a bunch of amazing and high-value analytics, and make a mint. If, instead, YOU do all of those things, then please pay to send at least one of my seven kids to college in repayment for the inspiration. I'm not kidding. I have SEVEN kids. God help me).

The federal government, to their credit, has made great progress in making data available. However, there is a big difference between accessing data and understanding data. When I first looked at one of the data files I downloaded, I figured it was going to take me years to decrypt the field names. Luckily, I did notice an Excel file with field names and descriptions. Seriously, there are single letter field names in these files where the field name “G” has a description of “Age group indicator” (Oh Wow). See the figure below.

 

It's stuff like this that reminds me why we have data quality and master data management tools. Ok, back to expressor Studio. I quickly mapped a couple of files into expressor Studio using their “Read File” operator. It was fairly simple and easy to use. My data included files with country area information, population, and gender information by year. Once I mapped these files I quickly wanted to shed the default cryptic, nay, nonsensical names. I could have just renamed the fields when I initially mapped them into the system but that would mean I would have to manage the names in three separate locations. Bah! It made more sense to create a common semantic type and reuse it across all three files.

There are two flavors of semantic types within expressor Studio to handle your mappings, atomic types or composite types. An atomic type is simply a single field name whereas a composite type is a combination of one more atomic types. Since the data files had many common fields, I decided to create a core set of atomic types that I could then roll up into composite types based on the files I was mapping. This kept the mappings simple and easy to understand and most importantly the whole exercise took about 5 minutes. Once the types were created I simply mapped the cryptic names from the files to the business friendly names in my semantic type. (I can't even begin to imagine how long this would've taken using native tools, but certainly not 5 minutes).

 

Now I was ready to move my data. I took the data from three files and combined them into one master dataset. From there, my international statistics from Data.Gov were pumped right into my waiting SQL Server database. Note that I could've used Excel or just about any other database as my target instead of SQL Server.

Now, you might be saying to yourself "That looks easy because you read all the help files first." Actually, no. In fact, some of my buddies like to lovingly tell me to "RTFM" from time to time. It's not that it offends my masculinity to read a manual. I just usually like to have a go first and then, if needed, go back to the manual. In fact, all I really used was this 5-minute demo video that in noticed when I was downloading the tool.

If you're tackling ETL and you want it fast and easy, then you might want to check out their website, www.expressor-software.com, to learn more about the expressor company and products.

Enjoy!

-Kev

P.S. Follow me on Twitter!    

Published Friday, August 05, 2011 5:45 PM by KKline

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

jonmcrawford said:

you'll notice that someone was being cute, and the fieldnames of the three columns for age (4,5,6) spell AGE.

Gotta love the gov

August 8, 2011 10:21 AM
 

KKline said:

I can't claim that gov't technical writing is above reproach.  But I do enjoy having real data close at hand.  =^)

Thanks for the comment!

-Kev

August 8, 2011 1:43 PM
 

JayB said:

thanks for the expressor software link! I've trying to find the best way to brief my interns on ETL and it was perfect!

August 12, 2011 3:16 PM

Leave a Comment

(required) 
(required) 
Submit

About KKline

Kevin Kline is a well-known database industry expert, author, and speaker. Kevin is a long-time Microsoft MVP and was one of the founders of PASS, www.sqlpass.org.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement