One of the first tasks we must do well as SSIS developers is really understand our source data. That seems rather obvious but it when we fail to do this task well, we frequently end up having to re-plan, re-work and re-do the packages we have designed. Generally speaking there are five facets to data that we need to have correct understanding of before building a package:
- Schema and the sub-facets of cardinality and ordinality
Schema that is what entities we have, what attributes those entities have and positional relationship do the attributes have. When we think about source data entities, we natural gravitate towards understanding them in terms of their persistence medium. We have things like Comma Separated Value files, database tables and so on. I think that is somewhat dangerous -- data is more that its container. The important thing to understand here what whole thing is is that persisted trying to represent. Any given entity may have many attributes and there is no guarantee that any one persistence has all or even most of the attributes. In such cases, we must be especially careful about how we design our packages since integrating streams of entities from multiple data sources can become a very expensive process when done incorrectly.
Attributes normally represent scalar facets of an entity. Attributes themselves will have many properties like name, value and the data type chosen to express that value in. A key but many times easily overlooked property of an attribute is if any given instance of that attribute is allowed to be undefined (null in database terms). A difficult and hard problem to solve occurs when the source data is allowed to have undefined values for a given attribute but its destination is not.
For all but the most trivial of data integration projects, I feel it absolutely essential to gather and build a correct understanding of both the logical and physical data model of the sources and destinations before attempting to construct a package. Once this process is done, it becomes easier to see how entities and their attributes map. Do this at the logical level first, then at the physical level. This avoids not seeing the forest for the trees.
Two other aspects of the Schema deserve special attention: Cardinality and Ordinality.
By cardinality I ask what are the restrictions on distinct values occurring with an attribute or attribute set as well as what are the relations between entities. For example, most database products allow for constraints requiring unique values with one or more columns within the context of the table itself and between tables. If a destination does not allow for repeated attribute value but a source does, we know we will need to address this somehow in our work.
A less commonly considered aspect of data, especially when persisted, is it ordinality. By this I mean what order the persisted values are in and especially if they are a given sort order. Consider a source -- simple CSV file -- that represents a key-value pair. One of the things we definitely want to know about that source is if the keys have been written in any order and, if so, is it ascending or descending. Of course, the persisted data may be written in such a way that preserves order by many attribute values. Certain SSIS transformations require sorted data. If a data source is already in order, then this one less expensive task we need to do in the transformation process.
The importance of being able to parse a persisted data set properly should be obvious, yet one non-obvious thing about SSIS is that it really struggles with dynamic data sources. In cases where the data source where the schema is unknown in advance or varies within the persistence, we have to plan to do additional work. SSIS optimizes the data flow pipeline assuming a series of statically defined buffers are passed (in a sense) between components. It is ultimately up to us bridge the gap between a dynamic data source or destination and the static nature of the data flow engine.
The sheer volume data must be considered. The default settings for memory and CPU allocation are based on source data load of one gigabyte and small number of transformational steps. Specific transformations like sorting and aggregation can easily and rapidly consume memory to the point of exhaustion for larger data sources. Identifying such potential bottlenecks before coding the solution can make the package development process go faster.
I have four data sources from my demonstration:
- Two XML files containing information about known postal codes in the United States, one of the files suffers from truncated data for the Postal Code.
- A Comma Separated Value formatted file representing a list of Postal Codes, City, State and County names.
- A list of State names derived by query from the AdventureWorks database
The destination is a SQL Server database and this Postal Code data will be loaded into three tables:
- States, key by the State's abbreviation and containing the name of given state
- Cities, keyed by a CityID and having the name of the city and the abbreviation for the matching state
- ZipCodeList, key by the Postal Code and having a CityID and the Latitude and Longitude
The States table is populated by query during the construction of the database using the AdventureWorks Database. The other two tables will be populated from the CSV and XML files. However, there are two problems found in that data:
- In one of the XML files, the Postal Code for some locations were written without a leading zero
- There are more Postal Codes known in the XML files than in the CSV file, so we will need to invent a way discover the city and state that is most local to that Postal Code. In the presentation, I will show a way to do this using Microsoft's MapPoint 2006 via its COM APIs.