THE SQL Server Blog Spot on the Web


Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Dataflow mechanics [SSIS]

Once upon a time I blogged at http://consultingblogs.emc.com/jamiethomson but that ended in August 2009 when I left EMC. There is a lot of (arguably) valuable content over there, however certain events in the past leave me concerned that the content is not well cared for and I have no confidence that it will still exist in the long term. Hence I have taken the decision to re-publish some of that content here at SQLBlog, so over the coming weeks and months you may find re-published content popping up here from time to time.

This is the second such blog post in which I discuss the internals of the SSIS Dataflow. The first post in this series can be found at [SSIS] OnPipelineRowsSent.


During my activity on the SSIS forum I've noticed that much of the content is in regard to the dataflow task, and that's no surprise given that it's the most useful tool in the SSIS box and also the most complex. This post is me brainstorming some of the stuff that I know about the dataflow and hopefully it proves useful to some of you.

  • Buffer Architecture. If I'm ever interviewing you for a job as a SSIS developer you can bet a lot of money that I'll ask you to tell me what a buffer is. Buffers are fundamental to the dataflow - they are what the dataflow uses to move data around. A buffer is essentially an area of memory and by default holds approximately 10000 rows (usually slightly fewer), which is why when you execute a dataflow within BIDS the row counts on the data paths go up in approximate increments of 10000. Part of performance tuning an SSIS dataflow is manipulating various properties until you find the optimum number of rows in each buffer, and you can read more (much more) about that here.
  • Dataflows contain components which are generally categorised into synchronous and asynchronous. The most definitive description of these is that the output from a synchronous component uses the same buffer as the input; asynchronous components create a new buffer for their output. All source adapters are asynchronous components, all destination adapters are synchronous. Synchronous components are generally quicker than asynchronous components.
  • Asynchronous components are further categorised as partially-blocking or fully-blocking. Fully-blocking components require all rows from upstream before they put any data into the output; partially-blocking components will start to output data before they receive all upstream rows.
  • Execution trees. Each asynchronous component creates what is called an execution tree in the dataflow. In SSIS 2005 (but not in later versions) each execution tree uses one execution thread so another part of performance tuning is to fully utilise all processors on your hardware. Read more here.
  • OnPipelineRowsSent. All executables in a SSIS package throw events and one of the events thrown by the dataflow is OnPipelineRowsSent. When a component outputs a buffer of data it throws an OnPipelineRowsSent event, which enables us to know how many rows each component has processed. When you execute a dataflow within the development environment (aka BIDS) these events are consumed and used to update the row counts that you see increasing as more rows are processed.
  • Spooling. I said earlier that a buffer is an area of memory but of course memory is finite, so if there is more data in the pipeline than can fit in memory then buffers will get spooled to disc. The location on disc is defined by BLOBTempStoragePath & BufferTempStoragePath. Spooling will severely impact dataflow performance so avoid it if possible.
  • A lot of people ask if it's possible to remove columns from the dataflow once they have finished being used. For example, if columns called [FirstName] & [LastName] are concatenated together to make [FullName] it's likely that those two columns won't be needed anymore. The simple answer though is no. Once the data is in memory it would be an overhead to remove it and "squeeze" the buffer up to make it smaller, which is why those columns still appear downstream. This is nothing to be concerned about - it's highly unlikely that they are heavily impacting performance. Of course, if an asynchronous component is encountered then a new buffer will be created on the output and the unneeded columns will (probably) be removed. This issue is further discussed here.
  • Following on from the previous point... it's intuitive to think that columns that begin at a component don't exist prior to the data being processed by that component. In fact that's not true. Prior to dataflow execution the execution plan for a dataflow is determined, and it is at that point that all columns are defined and thus created (i.e. space is set aside in memory). So, all columns that will be used in a buffer exist even before the buffer gets any data.
  • The datatypes of columns in the dataflow are different from datatypes used for SSIS variables. To this day I don't understand why the SSIS team opted to use different datatypes in the control flow and data flow and I hope this changes one day. (UPDATE: SSIS Development Manager Jeff Bernhardt addresses this issue in A potted SSIS history via Connect.)
  • The stock components (i.e. those provided out-of-the-box) are mostly written in native code (the XML Source and the Script Component are exceptions to this rule). SSIS provides a .Net API that enables you and me to build our own components, and hence it is tempting to think that these custom components won't work as quickly as stock components. This is probably true but really the difference is negligible. The majority of the work (validation, memory management, buffer editing etc.) is done by native code so you're not going to suffer severe performance problems by implementing custom components.
  • The BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact dataflow performance so try and avoid them if you can.
  • Raw files can be used to pass data from one dataflow to another - even if those dataflows are in different packages. Raw files have a proprietary file format that is essentially a match of the data in memory, hence reading from and writing to them is extremely quick. People often seem reluctant to place data into raw files but I don't hesitate to recommend using them if you need to.
  • There is an important property on each component output called IsSorted. A lot of people think that setting this property to TRUE will cause the data in that output to be sorted. That's not true - this property only informs the dataflow engine that the data is sorted, nothing more. If you set this property to TRUE and the data is not sorted then you will probably be creating problems for yourself later on (for example a downstream Merge Join component will not fail but it won't produce the correct results either).
  • Source and destination adapters maintain external column collections which are used to store the metadata of the external data sources and destinations that those adapters connect to. There are two reasons for this as far as I can determine: firstly to enable offline development (a big criticism of SSIS's predecessor DTS was that offline development wasn't possible) and secondly to enable the dataflow to validate itself. More information here.
  • Although it appears in BIDS as though the data in a buffer "moves" from one component to another, that isn't actually the case. Data in a buffer doesn't actually move about in memory. My fellow MVP Phil Brammer (blog | twitter) once used an analogy of cars travelling on a road to describe this. The buffers are analogous to cars on the road and milestones along the road are analogous to the components. Instead of thinking of the cars moving along the road to reach the milestones, think of the cars as being stationary and the road moving along underneath the cars.
  • Back pressure is an important concept in an SSIS dataflow. Back pressure occurs when a dataflow produces data faster than the destination can consume it (a common phenomenon when inserting into a relational database table) - this creates contention that propagates back up the dataflow, hence the term "back pressure". Michael Entin (one of the original developer geniuses that built the dataflow engine) talks more about back pressure at SSIS Backpressure Mechanism.
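To make the buffer-sizing point above a little more concrete, here's a rough sketch in Python. The 10MB DefaultBufferSize and 10000-row DefaultBufferMaxRows figures are the engine's documented defaults, but the actual sizing algorithm is internal to SSIS, so treat this purely as an illustration of the idea:

```python
def estimate_rows_per_buffer(row_width_bytes,
                             default_buffer_size=10 * 1024 * 1024,
                             default_buffer_max_rows=10000):
    # How many rows of this width physically fit in one buffer,
    # capped at the max-rows setting.
    rows_that_fit = default_buffer_size // row_width_bytes
    return min(rows_that_fit, default_buffer_max_rows)

# Narrow rows hit the 10000-row cap, hence the ~10000 increments in BIDS...
print(estimate_rows_per_buffer(100))   # 10000
# ...whereas very wide rows are limited by the buffer's byte size instead.
print(estimate_rows_per_buffer(4096))  # 2560
```

This is also why widening your rows (e.g. adding BLOB columns) shrinks the number of rows per buffer and changes the increments you see on the data paths.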
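The synchronous/asynchronous distinction can be sketched in plain Python, with a list standing in for a buffer. This is an analogy for the behaviour described above, not the actual engine code:

```python
# Synchronous: the output reuses the input buffer (e.g. Derived Column).
def synchronous_transform(buffer):
    for row in buffer:  # rows are modified in place
        row["FullName"] = row["FirstName"] + " " + row["LastName"]
    return buffer       # the very same object flows downstream

# Asynchronous (fully blocking): all input rows must arrive before any
# output, and the output lands in a brand-new buffer (e.g. Sort).
def asynchronous_transform(buffer):
    return sorted(buffer, key=lambda row: row["LastName"])  # new object

rows = [{"FirstName": "Jamie", "LastName": "Thomson"},
        {"FirstName": "Phil", "LastName": "Brammer"}]
print(synchronous_transform(rows) is rows)   # True: same buffer reused
print(asynchronous_transform(rows) is rows)  # False: a new buffer
```

The copy into a new buffer is part of why asynchronous components are generally the slower of the two.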
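To see why setting IsSorted to TRUE on unsorted data silently corrupts a downstream Merge Join rather than failing it, consider a toy streaming merge join on integer keys. This is my own sketch of the textbook algorithm, not the actual component internals:

```python
# A streaming merge join relies entirely on both inputs being sorted on
# the join key; it never checks, so IsSorted is taken on trust.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            out.append(left[i])
            i += 1
            j += 1
    return out

print(merge_join([1, 2, 3], [2, 3]))  # [2, 3] - correct on sorted input

# Same rows, but the left input falsely claims to be sorted: no error is
# raised, the join just silently drops the match on key 2.
print(merge_join([3, 1, 2], [2, 3]))  # [3]
```

Note that the bad run produces no error at all, only wrong results - exactly the failure mode described above.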
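Back pressure itself is easy to demonstrate with a bounded queue standing in for the engine's limit on in-flight buffers. Again, this is just a Python analogy of the behaviour, not the real mechanism:

```python
import queue
import threading
import time

# The bounded queue models the engine's cap on in-flight buffers: once
# the destination falls behind, the queue fills and the source blocks.
buffers = queue.Queue(maxsize=5)

def source(row_count):
    for i in range(row_count):
        buffers.put(i)  # blocks while 5 buffers are already in flight

def destination(row_count, results):
    for _ in range(row_count):
        time.sleep(0.001)              # a slow consumer, e.g. a DB insert
        results.append(buffers.get())

results = []
producer = threading.Thread(target=source, args=(20,))
producer.start()
destination(20, results)
producer.join()
print(results == list(range(20)))  # True: nothing is lost, the source waits
```

The source thread simply stalls until the destination drains a buffer, which is why a slow destination throttles the whole dataflow rather than exhausting memory.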

I'll probably add to this post over time as new things occur to me. In the meantime if you want a more detailed description of how the dataflow works then Kirk Haselden's book has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the SSIS forum where more people will be available to answer and where your question may already have been answered.

@Jamiet

Published Tuesday, October 25, 2011 9:48 AM by jamiet


Comments

 

D. Pendleton said:

You may want to clarify the IsSorted variable tip: IsSorted is a property of an Output, not a Data Flow Path.

October 25, 2011 2:40 PM
 

jamiet said:

Good point, I'll change that now. Thanks for the heads-up.

October 25, 2011 8:13 PM
 

Geir-Arne said:

Nice brainstorm :)

Is Kirk Haselden's book still recommended for SSIS 2008?

November 3, 2011 5:37 AM
 

jamiet said:

Hi Geir-Arne,

There were some minor changes to how the dataflow works in SSIS2008:

http://blogs.msdn.com/b/michen/archive/2007/06/11/katmai-ssis-data-flow-improvements.aspx

http://blogs.msdn.com/b/michen/archive/2007/08/31/buffer.endofrowset.aspx

but other than those, the basic principles in Kirk's book should still be relevant.

regards

Jamie

November 3, 2011 6:21 AM
 

Tiago said:

Hi there,

Thanks for your great article. Keep going with this fantastic work.

Can I leave a question for you?

In our company we set the IsSorted property to true and order the data in the source, gaining a lot of performance.

We receive thousands of rows and we do some logic inside SSIS.

After this we have to use the Aggregate task, and after that the Merge Join task.

Have you ever had problems using Merge Join after an Aggregate task (with IsSorted set to true)? I'm getting wrongly joined data... :(

Thanks a lot

November 15, 2011 6:39 PM
 

Tiago said:

I forgot to say that I resolved the problem by using another Sort just before the Merge Join. But I hate using Sort tasks inside SSIS.

November 15, 2011 7:05 PM
 

jamiet said:

Hi Tiago,

I don't recall any such problems in that scenario, I'm afraid. You'll have to get more information as to exactly what is being lost, and how.

November 16, 2011 3:58 AM
 

Tiago said:

Hi Jamiet,

I have some info here that I hope is enough to understand my problem.

This is my dataflow :

http://imageshack.us/photo/my-images/72/dataflow.png/

I use data ordered in the source (IsSorted true):

Order by Data, Hour, campanha ASC

This is the Aggregate configuration:

http://imageshack.us/photo/my-images/855/agregate.png/

The data coming back from the Aggregate is no longer sorted by the fields that were sorted in the source:

http://imageshack.us/photo/my-images/269/unsortedafterusingaggre.png/

And here I have the sorted data (using a Sort task):

http://imageshack.us/photo/my-images/856/sorttaskafteraggregate.png/

Why doesn't the Aggregate send the data back ordered?

Thanks

November 17, 2011 1:16 PM
 

jamiet said:

Tiago,

Ah, I see. Sorry, you confused me with your mention of Merge Join.

Basically you want to know why the output of the Aggregate is not sorted, correct? Fair question; intuitively you might think that it should be.

The honest answer is that I don't know why the Aggregate output is not sorted, but I'd also say that I am not surprised. We don't know the internal algorithms used by the Aggregate component so all we can do is speculate about the cause of this behaviour; I would guess that as soon as a row is encountered on the input the aggregate goes through logic similar to:

IF a row already exists on the output with the same group-by column values as the input row:
    update the aggregated column values on that existing row
ELSE:
    create a new output row using the group-by columns from the input row

One thing I would ask you is, is the output from your Conditional Split sorted?

Regards

Jamie

November 18, 2011 4:11 AM
 

Chris Adkin said:

Jamie,

Another very good read. Regarding your comment about roughly 10,000 rows fitting in the buffer, Rob Farley blogged on this and how using a FAST(10000) hint, if the source is a query, can help improve performance.

Regards,

Chris

November 19, 2011 11:21 AM
 

