THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Merrill Aldrich

Technical content about Microsoft data technologies. All opinions expressed are purely my own and do not reflect positions of my employer or associates.

Automate ETL Testing: Cross-Server Data Comparison with PowerShell

In collaboration with a current client over the past year or so, I've been working on the issue of Test Driven Development for ETL work, especially in the area of automated testing for data. He graciously agreed to allow me to rewrite some of the code we originally developed in Ruby as a set of PowerShell demo scripts and share it through user group and SQL Saturday presentations.

Attached are the demo scripts and PowerPoint presentation. The demo works based on a copy of AdventureWorks2012, which plays the role of an EDW project's transactional source system, and second empty database on a separate SQL Server instance, which plays the role of a new EDW under development. 

There is a single SSIS package that pulls some trivial demo data from AdventureWorks and places it into a table with a different schema in the warehouse. There is also a SQL script, for demonstration purposes, that introduces some known errors into the warehouse table, in order to validate that we can find them using tests.

The idea is that we have untransformed source data on one system and post-ETL transformed data in our new warehouse, and we need a TDD-friendly, automated and repeatable way to compare the results of two queries - one from the source system and a second from the new warehouse, where each query is expected to reproduce exactly the same result set.

The two queries and the data comparison should work regardless of what the source and destination systems are - they could be different servers or even different RDBMS's. The comparison should run in a fairly automated and unattended fashion and output an "exception report" of all the differences in data between the two systems.

The PowerShell scripts, numbered in demo order, build in complexity from a single script that issues two queries, all the way to two different methods of automated comparison across servers - hash and merge join - that are generalized for any pairs of queries, and can be stored centrally in a source control system for automated testing.

Published Sunday, November 22, 2015 9:08 AM by merrillaldrich

Attachment(s): ETLTestingPres.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

andyleonard said:

This sounds awesome, Merrill! I can't wait to tinker with this code.

:{>

November 22, 2015 5:40 PM
 

merrillaldrich said:

Thanks Andy! Let me know if you have any suggestions. The technique has worked really well.

November 22, 2015 9:01 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Privacy Statement