I received the following questions (paraphrased) from a friend:
I'm building an ETL process with SSIS in which I pull from about 40 tables in DB2. I put the data into staging tables in a SQL Server database.
Is it better to have one package with 40 data flows, 40 packages with one data flow each, or something in between? Or should there be one data flow with a lot of source->destination modules? What are the advantages and disadvantages of each approach?
The answer is: "It depends." That's one of the reasons you're having trouble finding a definitive answer online. It's not a bad question, so let me walk through the factors that would drive my design decisions:
First, are there dependencies? Are there foreign key relationships in the source database? Is referential integrity enforced (keep in mind there are non-database ways to enforce RI)? If there are dependencies or referential integrity exists (whether it's enforced or not), I load the parents first and then the children.
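One way to surface those relationships is to query the catalog views. The sketch below runs against SQL Server and assumes the foreign keys exist (or have been recreated) on the staging side; if the constraints only live in the DB2 source, you'd query its catalog instead (for example, SYSCAT.REFERENCES on DB2 LUW).

```sql
-- Sketch: list parent -> child relationships in the staging database
-- so parents can be loaded before children. Assumes the foreign keys
-- exist (or have been recreated) on the SQL Server side.
SELECT
    SCHEMA_NAME(p.schema_id) + '.' + p.name AS ParentTable,  -- load first
    SCHEMA_NAME(c.schema_id) + '.' + c.name AS ChildTable,   -- load after its parent
    fk.name                                 AS ForeignKeyName
FROM sys.foreign_keys AS fk
JOIN sys.tables AS c ON c.object_id = fk.parent_object_id      -- referencing (child) table
JOIN sys.tables AS p ON p.object_id = fk.referenced_object_id  -- referenced (parent) table
ORDER BY ParentTable, ChildTable;
```

Even if you never enforce the constraints in staging, a list like this gives you the load order you need.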
Second, I think about Operations: How much visibility do I want into this process? If it's going to take three hours to load, I probably want to break that down into several packages so I can at least see steps in the process completing. This will also assist in troubleshooting: "Package1.dtsx failed" isn't a pleasant message to troubleshoot if it contains 40 Data Flow Tasks. "Load Orders and OrderDetails.dtsx failed" is a much better starting place.
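That operational visibility is also easier to get when the work is split across packages. If the project is deployed to the SSIS catalog (SQL Server 2012 or later), a quick check like this sketch against SSISDB shows which packages finished and which failed; the status codes shown are the documented catalog values (4 = failed, 7 = succeeded).

```sql
-- Sketch: recent package executions from the SSIS catalog (SSISDB).
-- Smaller, well-named packages make this far more useful than a single
-- "Package1.dtsx failed" message.
SELECT TOP (50)
    e.execution_id,
    e.folder_name,
    e.project_name,
    e.package_name,
    e.start_time,
    e.end_time,
    CASE e.status
        WHEN 4 THEN 'Failed'
        WHEN 7 THEN 'Succeeded'
        WHEN 2 THEN 'Running'
        ELSE 'Other'
    END AS run_status
FROM SSISDB.catalog.executions AS e
ORDER BY e.start_time DESC;
```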
You've inspired another blog post - thanks very much! Please keep the questions coming. And let me know what you decide.