Once upon a time I blogged at http://consultingblogs.emc.com/jamiethomson
but that ended in August 2009 when I left EMC. There is a lot of
(arguably) valuable content over there however certain events in the past leave me
concerned that that content is not well cared for and I don't have any
confidence that it will still exist in the long term. Hence, I have
taken the decision to re-publish some of that content here at SQLBlog so
over the coming weeks and months you may find re-published content
popping up here from time-to-time.
This is the second such blog post in which I discuss the internals of the SSIS Dataflow. The first post in this series can be found at [SSIS] OnPipelineRowsSent.
During my activity on the SSIS forum I've noticed that much of the content is in regard to the dataflow task
and that's not a surprise given that its the most useful tool in the
SSIS box and also the most complex. This post is me brainstorming some
of the stuff that I know about the dataflow and hopefully it proves
useful to some of you.
- Buffer Architecture. If I'm ever
interviewing you for a job as a SSIS developer you can lay a lot of
money to say that I'll ask you to tell me what a buffer is. Buffers are
fundamental to the dataflow - they are what the dataflow uses to move
data around. A buffer is essentially an area of memory and by default
consists of approximately 10000 rows (usually slightly less than that)
and that's why when you execute a dataflow within BIDS the row counts on
the data paths go up in approximate increments of 10000. Part of performance
tuning a SSIS dataflow is about manipulating various
properties until you find the optimum number of rows in each buffer and
you can read more (much more) about that here.
- Dataflows
contain components which are generally categorised into synchronous and
asynchronous. The most definitive description of these is that the
output from a synchronous component uses the same buffer as the input;
asynchronous components create a new buffer for their output. All source
adapters are asynchronous components, all destination adapters are
synchronous. Synchronous components are generally quicker than asynchronous components.
- Asynchronous
components are further categorised as partially-blocking or
fully-blocking. Fully-blocking components require all rows from upstream
before they put any data into the output; partially-blocking components
will start to output data before they receive all upstream rows.
- Execution
trees. Each asynchronous component creates what is called an execution
tree in the dataflow. In SSIS 2005 (but not in later versions) each execution
tree uses one execution thread so another part of performance tuning is
to fully utilise all processors on your hardware. Read more here.
- OnPipelineRowsSent.
All executables in a SSIS package throw events and one of the events
throws by the dataflow is OnPipelineRowsSent. When a component outputs a
buffer of data then it throws a OnPipelineRowsSent event and thus
enables us to know how many rows each component has processed. When you
execute a dataflow within the development environment (aka BIDS) these events are consumed and are used to
change the rowcounts that you see increasing as more rows are
processed.
- Spooling. I said earlier that all buffers are a
space in memory but of course memory is finite so if there is more data
in the pipeline than can fit in memory then buffers will get spooled to
disc. The location on disc is defined by BLOBTempStoragePath &
BufferTempStoragePath. Spooling will severely impact dataflow
performance so avoid if possible.
- A lot of people ask if its
possible to remove columns from the dataflow once they have finished
using them. For example, if columns called [FirstName] & [LastName]
are concatenated together to make [FullName] its likely that those two
columns won't be needed anymore. The simple answer though is no. Once
the data is in memory it would be an overhead to remove the data and
"squeeze" the buffer up to make it slower which is why those columns
still appear downstream. This is nothing to be concerned about - its
highly highly unlikely they are heavily impacting performance. Of
course, if an asynchronous component is encountered then a new buffer
will be created on the output and the unrequired columns will (probably)
be removed. This issue is further discussed here.
- Following
on from the previous point...its intuitive to think that columns that
begin at a component don't exist prior to the data being processed by
that component. In fact that's not true. Prior to dataflow execution the
execution plan for a dataflow is determined and it is at that point
that all columns are defined and thus created (i.e. space is set aside in memory). So, all columns that will
be used in a buffer exist even before the buffer gets any data.
- The
datatypes of columns in the dataflow are different from datatypes used
for SSIS variables. To this day I don't understand why the SSIS team
opted to use different datatypes in the control flow and data flow and I
hope this changes one day. (UPDATE: SSIS Development Manager Jeff Bernhardt addresses this issue in A potted SSIS history via Connect.)
- The stock components (i.e. those provided out-of-the-box) are mostly written in native code (XML Source and the Script Component are exceptions to this rule). SSIS provides a .Net API that enables
you and I to build our own components and hence it is tempting to think
that these custom components won't work as quickly as stock components.
This is probably true but really the difference is negligible. The
majority of the work (validation, memory management, buffer editing
etc...) is done by native code so you're not going to suffer severe
performance problems by implementing custom components.
- The
BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact
dataflow performance so try and avoid them if you can.
- Raw
files can be used to pass data from one dataflow to another - even if
those dataflows are in different packages. Raw files have a proprietary
file format that is essentially a match of the data in memory and hence
reading to and writing from them is extremely quick. People often seem
reticent to place data into raw files but I don't hesitate to recommend
using them if you need to.
- There is an important property on each component output called IsSorted. A lot of people think that setting this
property to TRUE will cause the data in that output to be sorted.
That's not true - this property only informs the dataflow engine that the data is
sorted, nothing more. If you set this property to TRUE and the data is
not sorted then you will probably be creating problems for yourself later on (for example a downstream Merge Join component will not fail but it won't produce the correct results either).
- Source
and destination adapters maintain external column collections which are
used to store the metadata of the external data sinks that those
adapters connect to. There are two reasons for this as far as I can
determine. Firstly to enable offline development (a big criticism of SSIS's predecessor DTS
was that offline development wasn't possible) and secondly to enable
the dataflow to validate itself. More information here.
- Although
it appears in BIDS as though the data in a buffer "moves" from one
component to another that isn't actually the case. Data in a buffer
doesn't actually move about in memory. My fellow MVP Phil Brammer (blog | twitter) once used an analogy of
cars travelling on a road to describe this. The buffers are analogous to
cars on the road and milestones along the road are analogous to the
components. Instead of thinking of the cars moving along the road to
reach the milestones, think of the cars as being stationery and the road
moving along underneath the cars.
- Back pressure is an important concept in an SSIS dataflow. Backpressure occurs when a dataflow is producing data to a destination faster than the destination can consume it (a common phenomenon when inserting into a relational database table) - this creates contention further back down the dataflow, hence the term "backpressure". Michael Entin (one of the original developer geniuses that built the dataflow engine) talks more about back pressure at SSIS Backpressure Mechanism.
I'll probably add to this
post over time as new things occur to me. In the meantime if you want a
more detailed description of how the dataflow works then Kirk Haselden's book has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the SSIS forum
where more people will be available to answer and where your question
may already have been answered.
@Jamiet