In collaboration with a current client over the past year or so, I've been working on the issue of Test Driven Development for ETL work, especially in the area of automated testing for data. He graciously agreed to allow me to rewrite some of the code we originally developed in Ruby as a set of PowerShell demo scripts and share it through user group and SQL Saturday presentations.
Attached are the demo scripts and PowerPoint presentation. The demo works based on a copy of AdventureWorks2012, which plays the role of an EDW project's transactional source system, and second empty database on a separate SQL Server instance, which plays the role of a new EDW under development.
There is a single SSIS package that pulls some trivial demo data from AdventureWorks and places it into a table with a different schema in the warehouse. There is also a SQL script, for demonstration purposes, that introduces some known errors into the warehouse table, in order to validate that we can find them using tests.
The idea is that we have untransformed source data on one system and post-ETL transformed data in our new warehouse, and we need a TDD-friendly, automated and repeatable way to compare the results of two queries - one from the source system and a second from the new warehouse, where each query is expected to reproduce exactly the same result set.
The two queries and the data comparison should work regardless of what the source and destination systems are - they could be different servers or even different RDBMS's. The comparison should run in a fairly automated and unattended fashion and output an "exception report" of all the differences in data between the two systems.
The PowerShell scripts, numbered in demo order, build in complexity from a single script that issues two queries, all the way to two different methods of automated comparison across servers - hash and merge join - that are generalized for any pairs of queries, and can be stored centrally in a source control system for automated testing.