<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Search results matching tags 'sql server integration services' and 'dataflow'</title><link>http://sqlblog.com/search/SearchResults.aspx?o=DateDescending&amp;tag=sql+server+integration+services,dataflow&amp;orTags=0</link><description>Search results matching tags 'sql server integration services' and 'dataflow'</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>Dataflow mechanics [SSIS]</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/10/25/ssis-dataflow-mechanics.aspx</link><pubDate>Tue, 25 Oct 2011 07:48:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:39363</guid><dc:creator>jamiet</dc:creator><description>&lt;blockquote&gt;&lt;p&gt;&lt;i&gt;Once upon a time I blogged at &lt;a href="http://consultingblogs.emc.com/jamiethomson" target="_blank"&gt;http://consultingblogs.emc.com/jamiethomson&lt;/a&gt;
 but that ended in August 2009 when I left EMC. There is a lot of 
(arguably) valuable content over there however certain events in the past leave me 
concerned that that content is not well cared for and I don't have any 
confidence that it will still exist in the long term. Hence, I have 
taken the decision to re-publish some of that content here at SQLBlog so
 over the coming weeks and months you may find re-published content 
popping up here from time-to-time.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;This is the second such blog post in which I discuss the internals of the SSIS Dataflow. The first post in &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/tags/repost/default.aspx" target="_blank"&gt;this series&lt;/a&gt; can be found at &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/2011/08/19/ssis-onpipelinerowssent-event.aspx" target="_blank"&gt;[SSIS] OnPipelineRowsSent&lt;/a&gt;.&lt;br&gt;&lt;/i&gt;&lt;/p&gt;&lt;/blockquote&gt;


&lt;hr&gt;
&lt;p&gt;During my activity on the &lt;a href="http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&amp;amp;SiteID=1" target="_blank"&gt;SSIS forum&lt;/a&gt; I've noticed that much of the content is in regard to the &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/tags/dataflow/default.aspx" target="_blank"&gt;dataflow&lt;/a&gt; task
 and that's not a surprise given that its the most useful tool in the 
SSIS box and also the most complex. This post is me brainstorming some 
of the stuff that I know about the dataflow and hopefully it proves 
useful to some of you.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Buffer Architecture. If I'm ever 
interviewing you for a job as a SSIS developer you can lay a lot of 
money to say that I'll ask you to tell me what a buffer is. Buffers are 
fundamental to the dataflow - they are what the dataflow uses to move 
data around. A buffer is essentially an area of memory and by default 
consists of approximately 10000 rows (usually slightly less than that) 
and that's why when you execute a dataflow within BIDS the row counts on
 the data paths go up in approximate increments of 10000. Part of performance 
tuning a SSIS dataflow is about manipulating various 
properties until you find the optimum number of rows in each buffer and 
you can read more (much more) about that &lt;a href="http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx" target="_blank"&gt;here&lt;/a&gt;. &lt;/li&gt;&lt;li&gt;Dataflows
 contain components which are generally categorised into synchronous and
 asynchronous. The most definitive description of these is that the 
output from a synchronous component uses the same buffer as the input; 
asynchronous components create a new buffer for their output. All source
 adapters are asynchronous components, all destination adapters are 
synchronous. Synchronous components are &lt;i&gt;generally&lt;/i&gt; quicker than asynchronous components. &lt;/li&gt;&lt;li&gt;Asynchronous
 components are further categorised as partially-blocking or 
fully-blocking. Fully-blocking components require all rows from upstream
 before they put any data into the output; partially-blocking components
 will start to output data before they receive all upstream rows. &lt;/li&gt;&lt;li&gt;Execution
 trees. Each asynchronous component creates what is called an execution 
tree in the dataflow. In SSIS 2005 (but not in later versions) each execution
 tree uses one execution thread so another part of performance tuning is
 to fully utilise all processors on your hardware. Read more &lt;a href="http://blogs.conchango.com/jamiethomson/archive/2005/10/02/2227.aspx" target="_blank"&gt;here&lt;/a&gt;. &lt;/li&gt;&lt;li&gt;OnPipelineRowsSent.
 All executables in a SSIS package throw events and one of the events 
throws by the dataflow is &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/2011/08/19/ssis-onpipelinerowssent-event.aspx" target="_blank"&gt;OnPipelineRowsSent&lt;/a&gt;. When a component outputs a
 buffer of data then it throws a OnPipelineRowsSent event and thus 
enables us to know how many rows each component has processed. When you 
execute a dataflow within the development environment (aka BIDS) these events are consumed and are used to
 change the rowcounts that you see increasing as more rows are 
processed. &lt;/li&gt;&lt;li&gt;Spooling. I said earlier that all buffers are a 
space in memory but of course memory is finite so if there is more data 
in the pipeline than can fit in memory then buffers will get spooled to 
disc. The location on disc is defined by BLOBTempStoragePath &amp;amp; 
BufferTempStoragePath. Spooling will severely impact dataflow 
performance so avoid if possible. &lt;/li&gt;&lt;li&gt;A lot of people ask if its 
possible to remove columns from the dataflow once they have finished 
using them. For example, if columns called [FirstName] &amp;amp; [LastName] 
are concatenated together to make [FullName] its likely that those two 
columns won't be needed anymore. The simple answer though is no. Once 
the data is in memory it would be an overhead to remove the data and 
"squeeze" the buffer up to make it slower which is why those columns 
still appear downstream. This is nothing to be concerned about - its 
highly highly unlikely they are heavily impacting performance. Of 
course, if an asynchronous component is encountered then a new buffer 
will be created on the output and the unrequired columns will (probably)
 be removed. This issue is further discussed &lt;a href="http://forums.microsoft.com/MSDN/showpost.aspx?postid=1582077&amp;amp;siteid=1" target="_blank"&gt;here&lt;/a&gt;. &lt;/li&gt;&lt;li&gt;Following
 on from the previous point...its intuitive to think that columns that 
begin at a component don't exist prior to the data being processed by 
that component. In fact that's not true. Prior to dataflow execution the
 execution plan for a dataflow is determined and it is at that point 
that all columns are defined and thus created (i.e. space is set aside in memory). So, all columns that will
 be used in a buffer exist even before the buffer gets any data. &lt;/li&gt;&lt;li&gt;The
 datatypes of columns in the dataflow are different from datatypes used 
for SSIS variables. To this day I don't understand why the SSIS team 
opted to use different datatypes in the control flow and data flow and I 
hope this changes one day. (UPDATE: SSIS Development Manager Jeff Bernhardt addresses this issue in &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/2011/04/07/a-potted-ssis-history-via-connect.aspx" target="_blank"&gt;A potted SSIS history via Connect&lt;/a&gt;.)&lt;/li&gt;&lt;li&gt;The stock components (i.e. those provided out-of-the-box) are mostly written in native code (XML Source and the Script Component are exceptions to this rule). SSIS provides a .Net API that enables 
you and I to build our own components and hence it is tempting to think 
that these custom components won't work as quickly as stock components. 
This is probably true but really the difference is negligible. The 
majority of the work (validation, memory management, buffer editing 
etc...) is done by native code so you're not going to suffer severe 
performance problems by implementing custom components. &lt;/li&gt;&lt;li&gt;The 
BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact 
dataflow performance so try and avoid them if you can. &lt;/li&gt;&lt;li&gt;Raw 
files can be used to pass data from one dataflow to another - even if 
those dataflows are in different packages. Raw files have a proprietary 
file format that is essentially a match of the data in memory and hence 
reading to and writing from them is extremely quick. People often seem 
reticent to place data into raw files but I don't hesitate to recommend 
using them if you need to. &lt;/li&gt;&lt;li&gt;There is an important property on each&amp;nbsp;component output&amp;nbsp;called IsSorted. A lot of people think that setting this 
property to TRUE will cause the data in that&amp;nbsp;output to be sorted. 
That's not true - this property only&amp;nbsp;informs the dataflow engine that the data is
 sorted, nothing more. If you set this property to TRUE and the data is 
not sorted then you will probably be creating problems for yourself later on (for example a downstream Merge Join component will not fail but it won't produce the correct results either). &lt;/li&gt;&lt;li&gt;Source
 and destination adapters maintain external column collections which are
 used to store the metadata of the external data sinks that those 
adapters connect to. There are two reasons for this as far as I can 
determine. Firstly to enable offline development (a big criticism of SSIS's predecessor DTS
 was that offline development wasn't possible) and secondly to enable 
the dataflow to validate itself. More information &lt;a href="http://blogs.conchango.com/jamiethomson/archive/2006/05/23/SSIS_3A00_-The-difference-between-output-columns-and-external-columns.aspx" target="_blank"&gt;here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Although
 it appears in BIDS as though the data in a buffer "moves" from one 
component to another that isn't actually the case. Data in a buffer 
doesn't actually move about in memory. My fellow MVP Phil Brammer (&lt;a href="http://www.ssistalk.com/" target="_blank"&gt;blog&lt;/a&gt; | &lt;a href="http://twitter.com/#%21/philbrammer" target="_blank"&gt;twitter&lt;/a&gt;) once used an analogy&amp;nbsp;of 
cars travelling on a road to describe this. The buffers are analogous to
 cars on the road and&amp;nbsp;milestones&amp;nbsp;along the road&amp;nbsp;are analogous to&amp;nbsp;the 
components. Instead of thinking of the cars moving along the road to 
reach the milestones, think of the cars as being stationery and the road
 moving along underneath the cars.&lt;/li&gt;&lt;li&gt;Back pressure is an important concept in an SSIS dataflow. Backpressure occurs when a dataflow is producing data to a destination faster than the destination can consume it (a common phenomenon when inserting into a relational database table) - this creates contention further back down the dataflow, hence the term "backpressure". Michael Entin (one of the original developer geniuses that built the dataflow engine) talks more about back pressure at &lt;a href="http://blogs.msdn.com/b/michen/archive/2007/06/12/ssis-backpressure-mechanism.aspx" target="_blank"&gt;SSIS Backpressure Mechanism&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I'll probably add to this
 post over time as new things occur to me. In the meantime if you want a
 more detailed description of how the dataflow works then &lt;a href="http://www.amazon.com/Microsoft-Server-2005-Integration-Services/dp/0672327813/ref=pd_bbs_sr_1/104-4935346-0790357?ie=UTF8&amp;amp;s=books&amp;amp;qid=1191898179&amp;amp;sr=8-1" target="_blank"&gt;Kirk Haselden's book&lt;/a&gt; has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the &lt;a href="http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&amp;amp;SiteID=1" target="_blank"&gt;SSIS forum&lt;/a&gt;
 where more people will be available to answer and where your question 
may already have been answered.&lt;/p&gt;&lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@Jamiet&lt;/a&gt;&lt;/p&gt;</description></item><item><title>[SSIS] OnPipelineRowsSent event</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/08/19/ssis-onpipelinerowssent-event.aspx</link><pubDate>Fri, 19 Aug 2011 11:48:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:37927</guid><dc:creator>jamiet</dc:creator><description>&lt;blockquote&gt;
&lt;p&gt;&lt;i&gt;Once upon a time I blogged at &lt;a href="http://consultingblogs.emc.com/jamiethomson" target="_blank"&gt;http://consultingblogs.emc.com/jamiethomson&lt;/a&gt; but that ended in August 2009 when I left EMC. There is a lot of valuable content over there however certain events in the past leave me concerned that that content is not well cared for and I don't have any confidence that it will still exist in the long term. Hence, I have taken the decision to re-publish some of that content here at SQLBlog so over the coming weeks and months you may find re-published content popping up here from time-to-time.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;This is the first such blog post in &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/tags/repost/default.aspx" target="_blank"&gt;this series&lt;/a&gt; in which I talk about the little-known OnPipelineRowsSent event.&lt;/i&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;/blockquote&gt;

&lt;p&gt;An &lt;a href="http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1317964&amp;amp;SiteID=1"&gt;interesting discussion&lt;/a&gt; today on the &lt;a href="http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&amp;amp;SiteID=1"&gt;SSIS Forum&lt;/a&gt;
 made me realise that perhaps not&amp;nbsp;all SSIS developers&amp;nbsp;are aware of the 
OnPipelineRowsSent event in SSIS and how valuable it can be, certainly 
when debugging data-flows. This isn't all that surprising because &lt;a href="http://msdn2.microsoft.com/en-us/library/ms141026.aspx"&gt;Books Online&lt;/a&gt;&amp;nbsp;is very sparse in this area. Hence I thought it would make a useful blog topic.&lt;/p&gt;

&lt;p&gt;OnPipelineRowsSent
 is a type of&amp;nbsp;event in SSIS which means the&amp;nbsp;information that it provides
 can be logged&amp;nbsp;by whatever log provider you happen to be using.&amp;nbsp;In 
order&amp;nbsp;to understand the information that OnPipelineRowsSent provides it 
is critical to understand the buffer architecture of the &lt;a href="http://msdn2.microsoft.com/en-us/library/ms141122.aspx"&gt;SSIS data-flow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To
 explain&amp;nbsp;that&amp;nbsp;VERY simply, a buffer is an area of memory that contains 
rows of data as they pass through the pipeline. Each data-path in the 
data-flow&amp;nbsp;will contain one of more buffers and each of those buffers 
(except for the last one) will contain the same number of rows. If you 
want to understand more about the buffer architecture of the data-flow 
then &lt;a href="http://www.amazon.com/Microsoft-Server-2005-Integration-Services/dp/0672327813"&gt;Kirk Haselden's book&lt;/a&gt; has a chapter devoted wholly to it.&lt;/p&gt;

&lt;p&gt;Here is what Books Online &lt;a href="http://msdn2.microsoft.com/en-us/library/ms141122.aspx"&gt;DOES say&lt;/a&gt; about OnPipelineRowsSent:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;p&gt;&lt;b&gt;OnPipelineRowsSent&lt;/b&gt; &lt;/p&gt;
&lt;/td&gt;

&lt;td&gt;
&lt;p&gt;Reports the number of rows provided to a component input by a call to the &lt;b&gt;ProcessInput&lt;/b&gt; method. The log entry includes the component name. &lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;In
 slightly plainer english that means that for every buffer of data that 
is consumed by a component, you get a log entry. So, if you have&amp;nbsp;(e.g.) a
 &lt;a href="http://msdn2.microsoft.com/en-us/library/ms141069.aspx"&gt;Derived Column Component&lt;/a&gt;&amp;nbsp;that
 receives 105000&amp;nbsp;rows and the size of the buffer is 10000 rows, you'll 
get 11 OnPipelineRowsSent log entries for that&amp;nbsp;component&amp;nbsp;(there will 
only be 5000 rows in the last buffer).&amp;nbsp;Basically it enables you to know 
how many rows are passing through each data-path in the data-flow.&lt;/p&gt;

&lt;p&gt;The available information is the same as you get for any eventhandler. &amp;nbsp;i.e. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event&lt;/li&gt;

&lt;li&gt;computer&lt;/li&gt;

&lt;li&gt;operator&lt;/li&gt;

&lt;li&gt;source&lt;/li&gt;

&lt;li&gt;sourceid&lt;/li&gt;

&lt;li&gt;executionid&lt;/li&gt;

&lt;li&gt;starttime&lt;/li&gt;

&lt;li&gt;endtime&lt;/li&gt;

&lt;li&gt;datacode&lt;/li&gt;

&lt;li&gt;databytes&lt;/li&gt;

&lt;li&gt;message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and here is a portion of a logfile containing those records:&lt;/p&gt;

&lt;p&gt;&lt;font face="courier new"&gt;&lt;font color="#ff0000"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 1228 : Merge Join Output : 1225 : Union All 1 :
 1226 : Union All Input 1 : 9936&lt;br&gt;&lt;/font&gt;&lt;font color="#0000ff"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 1475 : Union All Output 1 : 1470 : Sort 3 : 
1471 : Sort Input : 9936&lt;br&gt;&lt;/font&gt;&lt;font color="#ff0000"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 1228 : Merge Join Output : 1225 : Union All 1 :
 1226 : Union All Input 1 : 2085&lt;br&gt;&lt;/font&gt;&lt;font color="#0000ff"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 1475 : Union All Output 1 : 1470 : Sort 3 : 
1471 : Sort Input : 9936&lt;br&gt;&lt;/font&gt;&lt;font color="#ff0000"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 1475 : Union All Output 1 : 1470 : Sort 3 : 
1471 : Sort Input : 2085&lt;br&gt;&lt;/font&gt;&lt;font color="#0000ff"&gt;OnPipelineRowsSent,CGOJATHOMSON,INT\Jamie.Thomson,Data
 Flow 
Task,{EB25C41A-EFE2-4902-9610-99C0309A3B59},{33B4926F-54A9-4E00-9B74-92720A2786E0},08/03/2007
 20:49:56,08/03/2007 20:49:56,0,0x,Rows were provided to a data flow 
component as input. :&amp;nbsp; : 2121 : Sort Output : 2113 : Merge Join 2 : 2114
 : Merge Join Left Input : 9936 &lt;/font&gt;&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;The important stuff is what comes in the last field, the message so&amp;nbsp;let's break that down. Here is an example message:&lt;/p&gt;

&lt;p&gt;&lt;font color="#0000ff" face="Courier New"&gt;Rows
 were provided to a data flow component as input. :&amp;nbsp; : 1030 : OLE DB 
Source Output : 1025 : Sort 2 : 1026 : Sort Input : 9972&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;And here's what it is constituted of:&amp;nbsp;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rows
 were provided to a data flow component as input. - That's consistent in
 each message, so to be honest they could have left it out&lt;/li&gt;

&lt;li&gt;&lt;div&gt;1030&amp;nbsp;- ID of the data-path providing the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;OLE DB Source Output - Name of the&amp;nbsp;data-path providing the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;1025 - ID of the component receiving the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;Sort 2 - Name of the component receiving the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;1026 - ID of the input receiving the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;Sort Input - Name of the input receiving the buffer&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;div&gt;9972&amp;nbsp;- Number of rows in the buffer&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As
 you can see, this information will enable you to determine exactly how 
many rows are output from each component. Thus, if you are not receiving
 the expected number of rows at a destination this event will help you 
to deduce where you are losing them from.&lt;/p&gt;

&lt;p&gt;You may not know this 
but&amp;nbsp;you actually encounter OnPipelineRowsSent every time you run a 
data-flow within the SSIS Designer in&amp;nbsp;BIDS. Notice how the designer 
shows the number of rows passing through the pipeline and through each 
component:&lt;/p&gt;
&lt;img src="http://i1177.photobucket.com/albums/x349/jamiekth/df.jpg" alt="Photobucket" border="0"&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;How&amp;nbsp;does the designer know how many rows are passing through the 
data-flow? Simple! Its because the SSIS Designer consumes the 
OnPipelineRowsSent event that the executing package "throws" up and 
then&amp;nbsp;presents that information visually&amp;nbsp;as the numbers that you see 
flashing in front of your eyes as the data-flow executes.&lt;/p&gt;

&lt;p&gt;I think 
that just about covers the simple stuff around OnPipelineRowsSent. If 
you have any questions then post them here as a comment.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@Jamiet&lt;/a&gt; &lt;br&gt;&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description></item><item><title>Why is there no CROSS JOIN component in SSIS?</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/07/08/why-is-there-no-cross-join-component-in-ssis.aspx</link><pubDate>Fri, 08 Jul 2011 09:02:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:36718</guid><dc:creator>jamiet</dc:creator><description>&lt;p&gt;I saw an &lt;a href="http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/88142e03-443d-4a93-8f33-1e09def4e97b" target="_blank"&gt;interesting question on the SSIS MSDN forum &lt;/a&gt;yesterday where someone was asking how to do a cartesian product (i.e. a CROSS JOIN) in a SQL Server Integration Services (SSIS) dataflow. Links were provided to various articles that cover this (see &lt;a href="http://feedproxy.google.com/%7Er/blogspot/twNO/%7E3/liRfkEU5Njo/performing-cross-join-cartesian-product.html" target="_blank"&gt;Performing a Cross Join (Cartesian Product) in SSIS&lt;/a&gt; which explains how you can fool the Merge Join component into doing a CROSS JOIN) after which the original poster replied saying:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;i&gt;Such a long procedure to perform a simple cartesian product. SSIS must really improve from this point of view.&lt;/i&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Well apparently others agree because there are two requests on Connect for a Cross Join component (&lt;a href="https://connect.microsoft.com/SQLServer/feedback/details/161154/ssis-add-cross-join-to-merge-join-types" target="_blank"&gt;161154&lt;/a&gt; &amp;amp; &lt;a href="https://connect.microsoft.com/SQLServer/feedback/details/127215/add-cross-join-option-to-merge-join" target="_blank"&gt;127215&lt;/a&gt;) - both have been closed with status "Won't Fix". I don't know why this is but I can speculate.&lt;/p&gt;&lt;p&gt;Think about what is being asked for here. A CROSS JOIN operation potentially 
requires masses of memory (and would even more so for SSIS given that we
 are typically processing large data volumes). CROSS JOINs work in SQL Server because we have
 this helpful little thing called [tempdb] and there is no analogous mechanism in SSIS (save for disk spooling). SSIS is deliberately not optimized for mass data spooling - it is optimized for processing small 
batches of data at a time before passing them off to
 some destination.&lt;/p&gt;
&lt;p&gt;I suspect this is why there is no Cross Join component to go alongside the Merge Join component. The capabilities to 
achieve a CROSS JOIN operation exist (see link above) but they want you to be 
deliberate about it -&amp;nbsp;they're not going to give you enough rope to hang 
yourself with by providing a component
 that could bring&amp;nbsp;the machine of an unwitting developer to its knees.&lt;/p&gt;&lt;p&gt;As I said this is speculation so don't assume this is the real reason for there being no Cross Join component.&lt;/p&gt;&lt;p&gt;What are your thoughts? Assuming its true is this justification enough for there being no Cross Join component? Should there be one? Do you even need one? Let me know in the comments.&lt;/p&gt;&lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@jamiet&lt;/a&gt;&amp;nbsp; &lt;br&gt;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description></item><item><title>Merge Join component sorted outputs [SSIS]</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/01/31/merge-join-component-sorted-outputs-ssis.aspx</link><pubDate>Mon, 31 Jan 2011 19:40:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:33070</guid><dc:creator>jamiet</dc:creator><description>&lt;P&gt;One question that I have been asked a few times of late in regard to performance tuning SSIS data flows is this:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Why isn’t the Merge Join output sorted (i.e.IsSorted=True)?&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is a fair question. After all both of the Merge Join inputs are sorted, hence why wouldn’t the output be sorted as well? Well here’s a little secret, the Merge Join output IS sorted! There’s a caveat though – it is only under certain circumstances and SSIS itself doesn’t do a good job of informing you of it.&lt;/P&gt;
&lt;P&gt;Let’s take a look at an example. Here we have a dataflow that consumes data from the [AdventureWorks2008].[Sales].[SalesOrderHeader] &amp;amp; [AdventureWorks2008].[Sales].[SalesOrderDetail] tables then joins them using a Merge Join component:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_7C25788A.png"&gt;&lt;IMG style="BACKGROUND-IMAGE:none;BORDER-BOTTOM:0px;BORDER-LEFT:0px;MARGIN:;PADDING-LEFT:0px;PADDING-RIGHT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;PADDING-TOP:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_7210D75F.png" width=628 height=256&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Let’s take a look inside the editor of the Merge Join:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_29E31B83.png"&gt;&lt;IMG style="BACKGROUND-IMAGE:none;BORDER-BOTTOM:0px;BORDER-LEFT:0px;MARGIN:;PADDING-LEFT:0px;PADDING-RIGHT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;PADDING-TOP:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_06D2AA13.png" width=705 height=705&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;We are joining on the [SalesOrderId] field (which is what the two inputs just happen to be sorted upon). We are also putting [SalesOrderHeader].[SalesOrderId] into the output. Believe it or not the output from this Merge Join component is sorted (i.e. has IsSorted=True) but unfortunately the Merge Join component does not have an Advanced Editor hence it is hidden away from us. There are a couple of ways to prove to you that is the case; I could open up the package XML inside the .dtsx file and show you the metadata but there is an easier way than that – I can attach a Sort component to the output. Take a look:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_7200B192.png"&gt;&lt;IMG style="BACKGROUND-IMAGE:none;BORDER-BOTTOM:0px;BORDER-LEFT:0px;MARGIN:;PADDING-LEFT:0px;PADDING-RIGHT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;PADDING-TOP:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_288E5CD7.png" width=998 height=435&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Notice that the Sort component is attempting to sort on the [SalesOrderId] column. This gives us the following warning:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Validation warning. DFT Get raw data: {992B7C9A-35AD-47B9-A0B0-637F7DDF93EB}: The data is already sorted as specified so the transform can be removed.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The warning proves that the output from the Merge Join is sorted!&lt;/P&gt;
&lt;P&gt;It must be noted that the Merge Join output will only have IsSorted=True if at least one of the join columns is included in the output.&lt;/P&gt;
&lt;P&gt;So there you go, the Merge Join component can indeed produce a sorted output and that’s very useful in order to avoid unnecessary expensive Sort operations downstream. Hope this is useful to someone out there!&lt;/P&gt;
&lt;P&gt;&lt;A href="http://twitter.com/jamiet"&gt;@Jamiet&lt;/A&gt;&amp;nbsp; &lt;BR&gt;&lt;/P&gt;
&lt;P&gt;P.S. Thank you to Bob Bojanic on the SSIS product team who pointed this out to me!&lt;/P&gt;</description></item><item><title>Always use dtexec.exe to test performance of your dataflows. No exceptions.</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/01/12/always-use-dtexec-exe-to-test-performance-of-your-dataflows-no-exceptions.aspx</link><pubDate>Wed, 12 Jan 2011 21:51:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:32572</guid><dc:creator>jamiet</dc:creator><description>&lt;P&gt;Earlier this evening I posted a blog post entitled &lt;A href="http://sqlblog.com/blogs/jamie_thomson/archive/2011/01/12/investigation-can-different-combinations-of-components-effect-dataflow-performance.aspx" target=_blank&gt;Investigation: Can different combinations of components effect Dataflow performance?&lt;/A&gt; where I compared the performance of three different dataflows all working to the same overall goal. I wanted to make one last point related to the results but I thought it warranted a blog post all of its own.&lt;/P&gt;
&lt;P&gt;Here is a screenshot of one of the dataflows that I was testing:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_4B0ECC80.png"&gt;&lt;IMG style="BACKGROUND-IMAGE:none;BORDER-BOTTOM:0px;BORDER-LEFT:0px;PADDING-LEFT:0px;PADDING-RIGHT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;PADDING-TOP:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_1DAD36B0.png" width=758 height=337&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Pretty complicated I’m sure you’ll agree. Now, when I executed this dataflow in the test it was executing in ~19seconds however in that case I was executing using the command-line tool &lt;A href="http://msdn.microsoft.com/en-us/library/ms162810.aspx" target=_blank&gt;dtexec&lt;/A&gt;. I also tried executing inside the BIDS development environment and in that case it took &lt;B&gt;much &lt;/B&gt;longer – 139seconds. That’s more than seven times as long.&lt;/P&gt;
&lt;P&gt;The point I want to make is very simple. If you are testing your dataflows for performance &lt;B&gt;please&lt;/B&gt; use dtexec. Nothing else will suffice.&lt;/P&gt;
&lt;P&gt;&lt;A href="http://twitter.com/jamiet"&gt;@Jamiet&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;UPDATE: Matt Masson from the SSIS team has posted a great&amp;nbsp;blog post&amp;nbsp;explaining exactly &lt;EM&gt;why&lt;/EM&gt; dtexec is quicker. Go read it at &lt;A href="http://blogs.msdn.com/b/mattm/archive/2011/01/15/why-does-my-package-run-slower-in-bids-than-dtexec.aspx?wa=wsignin1.0"&gt;Why does my package run slower in BIDS than DTEXEC?&lt;/A&gt;&lt;/P&gt;</description></item><item><title>Investigation: Can different combinations of components affect Dataflow performance?</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2011/01/12/investigation-can-different-combinations-of-components-effect-dataflow-performance.aspx</link><pubDate>Wed, 12 Jan 2011 20:28:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:32571</guid><dc:creator>jamiet</dc:creator><description>&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;The Dataflow task is one of the core components (if not &lt;i&gt;the&lt;/i&gt; core component) of SQL Server Integration Services (SSIS) and often the most misunderstood. This is not surprising, its an incredibly complicated beast and we’re abstracted away from that complexity via some boxes that go yellow red or green and that have some lines drawn between them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_04FD06F1.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_1B6FA578.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="SSIS Dataflow" alt="Example Dataflow" border="0" width="641" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Example dataflow&lt;/i&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this blog post I intend to look under that facade and get into some of the nuts and bolts of the Dataflow Task by investigating how the decisions we make when building our packages can affect performance. I will do this by comparing the performance of three dataflows that all have the same input, all produce the same output, but which all operate slightly differently by way of having different transformation components.&lt;/p&gt;

&lt;p&gt;I also want to use this blog post to challenge a common held opinion that I see perpetuated over and over again on the &lt;a href="http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/threads" target="_blank"&gt;SSIS forum&lt;/a&gt;. That is, that people assume adding components to a dataflow will be detrimental to overall performance. Its not surprising that people think this –it is intuitive to think that more components means more work- however this is not a view that I share. I have always been of the opinion that there are many factors affecting dataflow duration and the number of components is actually one of the less important ones; having said that I have never proven that assertion and that is one reason for this investigation. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;i&gt;I have actually seen evidence that some people think dataflow duration is simply a function of number of rows and number of components. I’ll happily call that one out as a myth even without any investigation!&lt;/i&gt;&amp;nbsp;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;The Setup&lt;/h3&gt;
&lt;p&gt;I have a 2GB datafile which is a list of 4731904 (~4.7million) customer records with various attributes against them and it contains 2 columns that I am going to use for categorisation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[YearlyIncome] &lt;/li&gt;

&lt;li&gt;[BirthDate] &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data file is a SSIS raw format file which I chose to use because it is the quickest way of getting data into a dataflow and given that I am testing the transformations, not the source or destination adapters, I want to minimise external influences as much as possible.&lt;/p&gt;

&lt;p&gt;In the test I will split the customers according to month of birth (12 of those) and whether or not their yearly income is above or below 50000 (2 of those); in other words I will be splitting them into 24 discrete categories and in order to do it I shall be using different combinations of SSIS’ Conditional Split and Derived Column transformation components. The 24 datapaths that occur will each input to a rowcount component, again because this is the least resource intensive means of terminating a datapath.&lt;/p&gt;

&lt;p&gt;The test is being carried out on a Dell XPS Studio laptop with a quad core (8 logical Procs) Intel Core i7 at 1.73GHz and Samsung SSD hard drive. Its running SQL Server 2008 R2 on Windows 7.&lt;/p&gt;
&lt;h3&gt;The Variables&lt;/h3&gt;
&lt;p&gt;Here are the three combinations of components that I am going to test:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;b&gt;One Conditional Split&lt;/b&gt; - A single &lt;a href="http://msdn.microsoft.com/en-us/library/ms137886.aspx" target="_blank"&gt;Conditional Split&lt;/a&gt; component &lt;i&gt;CSPL Split by Month of Birth and income category&lt;/i&gt; that will use expressions on [YearlyIncome] &amp;amp; [BirthDate] to send each row to one of 24 outputs. 
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_41AA7BA2.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_5593E86B.png" style="background-image:none;border:0px none;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="Dataflow 1 - Raw Source and Conditional Split" alt="Dataflow 1 - Raw Source and Conditional Split" border="0" width="244" height="131"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This next screenshot displays the expression logic in use:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_70D3C49A.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_36784EB9.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="578" height="222"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Derived Column &amp;amp; Conditional Split&lt;/b&gt; - A &lt;a href="http://msdn.microsoft.com/en-us/library/ms141069.aspx" target="_blank"&gt;Derived Column&lt;/a&gt; component &lt;i&gt;DER Income Category&lt;/i&gt; that adds a new column [IncomeCategory] which will contain one of two possible text values {“LessThan50000”,”GreaterThan50000”} and uses [YearlyIncome] to determine which value each row should get. A Conditional Split component &lt;i&gt;CSPL Split by Month of Birth and Income Category&lt;/i&gt; then uses that new column in conjunction with [BirthDate] to determine which of the same 24 outputs to send each row to. Put more simply, I am separating the Conditional Split of #1 into a Derived Column and a Conditional Split. 
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_7FBAF3B4.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_1A8719C1.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="Dataflow 2 - Raw Source, Derived Column and Conditional Split" alt="Dataflow 2 - Raw Source, Derived Column and Conditional Split" border="0" width="244" height="217"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next screenshots display the expression logic in use:&lt;/p&gt;

&lt;table cellpadding="2" cellspacing="0"&gt;

&lt;tr&gt;
&lt;td&gt;&lt;i&gt;DER Income Category&lt;/i&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_323E5127.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_4A61BB82.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="625" height="103"&gt;&lt;/a&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;i&gt;CSPL Split by Month of Birth and Income Category&lt;/i&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_491D22A3.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_358FC302.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="570" height="183"&gt;&lt;/a&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Three Conditional Splits &lt;/b&gt;- A Conditional Split component that produces two outputs based on [YearlyIncome], one for each Income Category. Each of those outputs will go to a further Conditional Split that splits the input into 12 outputs, one for each month of birth (identical logic in each). In this case then I am separating the single Conditional Split of #1 into three Conditional Split components. 
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_580451BD.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_5DDEF556.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="Dataflow 3 - 3 Conditional Splits" alt="Dataflow 3 - 3 Conditional Splits" border="0" width="450" height="197"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next screenshots display the expression logic in use:&lt;/p&gt;

&lt;table cellpadding="2" cellspacing="0"&gt;

&lt;tr&gt;
&lt;td&gt;&lt;i&gt;CSPL Split by Income Category&lt;/i&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;
&lt;blockquote style="margin-right:0px;" dir="ltr"&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_387501E8.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_77668283.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="556" height="139"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;i&gt;CSPL Split by Month of Birth 1&amp;amp; 2&lt;/i&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_4EE7A06F.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_148C2A8E.png" style="background-image:none;border-width:0px;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="607" height="205"&gt;&lt;/a&gt;&lt;/td&gt;

&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these combinations will provide an input to one of the 24 rowcount components, just the same as before. For illustration here is a screenshot of the dataflow containing three Conditional Split components:&lt;/p&gt;

&lt;blockquote&gt; 
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_4B0ECC80.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_1DAD36B0.png" style="background-image:none;border:0px none;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="746" height="331"&gt;&lt;/a&gt;&lt;/p&gt;
 &lt;/blockquote&gt;

&lt;p&gt;As you can these dataflows have a fair bit of work to do and remember that they’re doing that work for 4.7million rows.&lt;/p&gt;

&lt;p&gt;I will execute each dataflow 10 times and use the average for comparison. I foresee three possible outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The dataflow containing just one Conditional Split (i.e. #1) will be quicker &lt;/li&gt;

&lt;li&gt;There is no significant difference between any of them &lt;/li&gt;

&lt;li&gt;One of the two dataflows containing multiple transformation components will be quicker &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless of which of those outcomes come to pass we will have learnt something and that makes this an interesting test to carry out. Note that I will be executing the dataflows using dtexec.exe rather than hitting F5 within BIDS.&lt;/p&gt;
&lt;h3&gt;The Results and Analysis&lt;/h3&gt;
&lt;p&gt;The table below shows all of the executions, 10 for each dataflow. It also shows the average for each along with a standard deviation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_32C6BBB7.png"&gt;&lt;img src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_17ADECA9.png" style="background-image:none;border:0px none;padding-left:0px;padding-right:0px;display:inline;padding-top:0px;" title="image" alt="image" border="0" width="663" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;i&gt;All durations are in seconds.&lt;/i&gt; &lt;br&gt;&lt;i&gt;I’m pasting a screenshot because I frankly can’t be bothered with the faffing about needed to make a presentable HTML table.&lt;/i&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is plain to see from the average that the dataflow containing three conditional splits is significantly faster, the other two taking 43% and 52% longer respectively. This seems strange though, right? Why does the dataflow containing the most components outperform the other two by such a big margin? The answer is actually quite logical when you put some thought into it and I’ll explain that below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;i&gt;Before progressing, a side note. The standard deviation for the “Three Conditional Splits” dataflow is orders of magnitude smaller – indicating that performance for this dataflow can be predicted with much greater confidence too.&lt;/i&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;The Explanation&lt;/h3&gt;
&lt;p&gt;I refer you to the screenshot above that shows how &lt;i&gt;CSPL Split by Month of Birth and salary category&lt;/i&gt; in the first dataflow is setup. Observe that there is a case for each combination of Month Of Date and Income Category – 24 in total. These expressions get evaluated in the order that they appear and hence if we assume that Month of Date and Income Category are uniformly distributed in the dataset we can deduce that the expected number of expression evaluations for each row is &lt;b&gt;12.5&lt;/b&gt; i.e. 1 (the minimum) + 24 (the maximum) divided by 2 = 12.5.&lt;/p&gt;

&lt;p&gt;Now take a look at the screenshots for the second dataflow. We are doing one expression evaluation in &lt;i&gt;DER Income Category&lt;/i&gt; and we have the same 24 cases in &lt;i&gt;CSPL Split by Month of Birth and Income Category&lt;/i&gt; as we had before, only the expression differs slightly. In this case then we have 1 + 12.5 = &lt;b&gt;13.5&lt;/b&gt; expected evaluations for each row – that would account for the slightly longer average execution time for this dataflow.&lt;/p&gt;

&lt;p&gt;Now onto the third dataflow, the quick one. &lt;i&gt;CSPL Split by Income Category&lt;/i&gt; does a maximum of 2 expression evaluations thus the expected number of evaluations per row is 1.5. &lt;i&gt;CSPL Split by Month of Birth 1&lt;/i&gt; &amp;amp; &lt;i&gt;CSPL Split by Month of Birth 2&lt;/i&gt; both have less work to do than the previous Conditional Split components because they only have 12 cases to test for thus the expected number of expression evaluations is 6.5 There are two of them so total expected number of expression evaluations for this dataflow is 6.5 + 6.5 + 1.5 = &lt;b&gt;14.5&lt;/b&gt;.&lt;/p&gt;

&lt;p&gt;14.5 is still more than 12.5 &amp;amp; 13.5 though so why is the third dataflow so much quicker? Simple, the conditional expressions in the first two dataflows have two boolean predicates to evaluate – one for Income Category and one for Month of Birth; the expressions in the Conditional Split in the third dataflow however only have one predicate thus they are doing a lot less work. To sum up, the difference in execution times can be attributed to the difference between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;font face="Consolas"&gt;MONTH(BirthDate) == 1 &amp;amp;&amp;amp; &lt;/font&gt;&lt;b&gt;&lt;font face="Consolas"&gt;YearlyIncome &amp;lt;= 50000 &lt;br&gt;&lt;/font&gt;&lt;/b&gt;and &lt;br&gt;&lt;font face="Consolas"&gt;MONTH(BirthDate) == 1&lt;/font&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first two dataflows &lt;b&gt;YearlyIncome &amp;lt;= 50000&lt;/b&gt; gets evaluated an average of 12.5 times for every row whereas in the third dataflow it is evaluated once and once only. Multiply those 11.5 extra operations by 4.7million rows and you get a significant amount of extra CPU cycles – that’s where our duration difference comes from.&lt;/p&gt;
&lt;h3&gt;The Wrap-up&lt;/h3&gt;
&lt;p&gt;The obvious point here is that adding new components to a dataflow isn’t necessarily going to make it go any slower, moreover you may be able to achieve significant improvements by splitting logic over multiple components rather than one. Performance tuning is all about reducing the amount of work that needs to be done and that doesn’t necessarily mean use less components, indeed sometimes you may be able to reduce workload in ways that aren’t immediately obvious as I think I have proven here.&lt;/p&gt;

&lt;p&gt;Of course there are many variables in play here and your mileage will most definitely vary. I encourage you to &lt;a href="http://cid-550f681dad532637.office.live.com/self.aspx/Public/BlogShare/20110112/20110111%20Chaining%20Expression%20components.zip" target="_blank"&gt;download the package&lt;/a&gt; and see if you get similar results – let me know in the comments. The package contains all three dataflows plus a fourth dataflow that will create the 2GB raw file for you (you will also need the &lt;a href="http://sqlserversamples.codeplex.com/releases/view/45923" target="_blank"&gt;[AdventureWorksDW2008] sample database&lt;/a&gt; from which to source the data); simply disable all dataflows except the one you want to test before executing the package and remember, execute using &lt;a href="http://msdn.microsoft.com/en-us/library/ms162810.aspx" target="_blank"&gt;dtexec&lt;/a&gt;, not within BIDS.&lt;/p&gt;

&lt;p&gt;If you want to explore dataflow performance tuning in more detail then here are some links you might want to check out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www2.sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx" target="_blank"&gt;Inequality joins, Asynchronous transformations and Lookups&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www2.sqlblog.com/blogs/jamie_thomson/archive/2010/03/02/destination-adapter-comparison-ssis-video-nugget.aspx" target="_blank"&gt;Destination Adapter Comparison&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www2.sqlblog.com/blogs/jamie_thomson/archive/2010/08/31/don-t-turn-the-dataflow-into-a-cursor-ssis.aspx" target="_blank"&gt;Don’t turn the dataflow into a cursor&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://pragmaticworks.com/Resources/webinars/Month.aspx/23" target="_blank"&gt;SSIS Dataflow – Designing for performance (webinar)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any comments? Let me know!&lt;/p&gt;

&lt;p&gt;&lt;a href="http://twitter.com/jamiet"&gt;@Jamiet&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Don’t turn the dataflow into a cursor [SSIS]</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/08/31/don-t-turn-the-dataflow-into-a-cursor-ssis.aspx</link><pubDate>Tue, 31 Aug 2010 21:22:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:28477</guid><dc:creator>jamiet</dc:creator><description>&lt;p&gt;I saw a thread on the SSIS forum today that went something like this:&lt;/p&gt;  &lt;hr&gt;  &lt;blockquote&gt;   &lt;p&gt;I have the following dataset:&lt;/p&gt;    &lt;table cellspacing="0" cellpadding="2"&gt;       &lt;tr&gt;         &lt;td&gt;AccountNo&lt;/td&gt;          &lt;td&gt;Date&lt;/td&gt;          &lt;td&gt;DailyMovement&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;28/08/2010&lt;/td&gt;          &lt;td&gt;10&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;29/08/2010&lt;/td&gt;          &lt;td&gt;5&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;30/08/2010&lt;/td&gt;          &lt;td&gt;7&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000002&lt;/td&gt;          &lt;td&gt;28/08/2010&lt;/td&gt;          &lt;td&gt;8&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000002&lt;/td&gt;          &lt;td&gt;29/08/2010&lt;/td&gt;          &lt;td&gt;6&lt;/td&gt;       &lt;/tr&gt;     &lt;/table&gt;    &lt;p&gt;for which I want to compute a running total per [AccountNo] &amp;amp; [Date] like so:&lt;/p&gt;    &lt;table cellspacing="0" cellpadding="2"&gt;       &lt;tr&gt;         &lt;td&gt;AccountNo&lt;/td&gt;          &lt;td&gt;Date&lt;/td&gt;          &lt;td&gt;DailyMovement&lt;/td&gt;          &lt;td&gt;RunningTotal&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;28/08/2010&lt;/td&gt;          &lt;td&gt;10&lt;/td&gt;          &lt;td&gt;10&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;29/08/2010&lt;/td&gt;          &lt;td&gt;5&lt;/td&gt;          &lt;td&gt;15&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000001&lt;/td&gt;          &lt;td&gt;30/08/2010&lt;/td&gt;          &lt;td&gt;7&lt;/td&gt;          &lt;td&gt;22&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000002&lt;/td&gt;          &lt;td&gt;28/08/2010&lt;/td&gt;          &lt;td&gt;8&lt;/td&gt;          &lt;td&gt;8&lt;/td&gt;       &lt;/tr&gt;        &lt;tr&gt;         &lt;td&gt;00000002&lt;/td&gt;          &lt;td&gt;29/08/2010&lt;/td&gt;          &lt;td&gt;6&lt;/td&gt;          &lt;td&gt;14&lt;/td&gt;       &lt;/tr&gt;     &lt;/table&gt;    &lt;p&gt;How do I do that using a script component?&lt;/p&gt; &lt;/blockquote&gt;  &lt;hr&gt;  &lt;p&gt;That last comment:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;i&gt;How do I do that using a script component?&lt;/i&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;is a fairly common question. People assume that if a calculated value is dependant on prior rows then a script component needs to be involved because that running total needs to be tracked somewhere, but that isn’t necessarily the case. Ask yourself, how would you do this if the data were residing in a database table, would you do this?:&lt;code style="font-size:12px;"&gt;&lt;span style="color:black;"&gt;&lt;/span&gt;      &lt;/code&gt;&lt;/p&gt;&lt;code style="font-size:12px;"&gt;&lt;blockquote&gt;&lt;font size="2"&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;span style="color:red;"&gt;'00000001' &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;, &lt;/span&gt;&lt;span style="color:magenta;"&gt;CONVERT&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;date&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:red;"&gt;'20080828'&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;10 &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;INTO&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:#434343;"&gt;#balances            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;UNION&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;ALL            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;span style="color:red;"&gt;'00000001' &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;, &lt;/span&gt;&lt;span style="color:magenta;"&gt;CONVERT&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;date&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:red;"&gt;'20080829'&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;5 &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;UNION&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;ALL            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;span style="color:red;"&gt;'00000001' &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;, &lt;/span&gt;&lt;span style="color:magenta;"&gt;CONVERT&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;date&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:red;"&gt;'20080830'&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;7 &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;UNION&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;ALL            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;span style="color:red;"&gt;'00000002' &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;, &lt;/span&gt;&lt;span style="color:magenta;"&gt;CONVERT&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;date&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:red;"&gt;'20080828'&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;8 &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;UNION&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;ALL            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;span style="color:red;"&gt;'00000002' &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;, &lt;/span&gt;&lt;span style="color:magenta;"&gt;CONVERT&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;date&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:red;"&gt;'20080829'&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;span style="color:black;"&gt;[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;6 &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[AccountNo]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[Date]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:blue;"&gt;SELECT&amp;nbsp; &lt;/span&gt;&lt;span style="color:magenta;"&gt;SUM&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;DailyMovement&lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;)            &lt;br&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:blue;"&gt;FROM&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:#434343;"&gt;#balances &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b2            &lt;br&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:blue;"&gt;WHERE&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:black;"&gt;b1.[AccountNo] &lt;/span&gt;&lt;span style="color:blue;"&gt;= &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b2.[AccountNo]            &lt;br&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:gray;"&gt;AND&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:black;"&gt;b1.[Date] &lt;/span&gt;&lt;span style="color:gray;"&gt;&amp;gt;= &lt;/span&gt;&lt;span style="color:black;"&gt;b2.[Date]&lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:gray;"&gt;)            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;FROM&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:#434343;"&gt;#balances &lt;/span&gt;&lt;span style="color:black;"&gt;b1;&lt;/span&gt;&lt;/font&gt;&lt;/blockquote&gt;   &lt;/code&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;  &lt;p&gt;OK that works, and here’s a screenshot to prove it:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_57F5D682.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="image" border="0" alt="image" width="469" height="200" src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_556CA4C4.png"&gt;&lt;/a&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;But really, would any of you actually &lt;i&gt;do&lt;/i&gt; this? Hopefully not, the use of the &lt;a target="_blank" href="http://en.wikipedia.org/wiki/Correlated_subquery"&gt;correlated subquery&lt;/a&gt; has simply turned what should be a nice set operation into a cursor-in-disguise (something I have &lt;a target="_blank" href="http://sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx"&gt;talked about before&lt;/a&gt;) because that subquery will be getting executed for every row in the table. Instead you could run the following on that same dataset:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;code style="font-size:12px;"&gt;&lt;span style="color:blue;"&gt;&lt;font size="2"&gt;SELECT &lt;/font&gt;&lt;/span&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b1.[AccountNo]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b1.[Date]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b1.[DailyMovement]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:gray;"&gt;,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:magenta;"&gt;SUM&lt;/span&gt;&lt;span style="color:gray;"&gt;(&lt;/span&gt;&lt;span style="color:black;"&gt;b2.[DailyMovement]&lt;/span&gt;&lt;span style="color:gray;"&gt;) &lt;/span&gt;&lt;span style="color:blue;"&gt;AS &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;[RunningTotal]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;FROM&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:#434343;"&gt;#balances &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b1            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;INNER&amp;nbsp; JOIN &lt;/span&gt;&lt;span style="color:#434343;"&gt;#balances &lt;/span&gt;&lt;span style="color:black;"&gt;b2 &lt;/span&gt;&lt;span style="color:blue;"&gt;ON&amp;nbsp; &lt;/span&gt;&lt;span style="color:black;"&gt;b1.[AccountNo] &lt;/span&gt;&lt;span style="color:blue;"&gt;= &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b2.[AccountNo]            &lt;br&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color:gray;"&gt;AND &lt;/span&gt;&lt;span style="color:black;"&gt;b1.[Date] &lt;/span&gt;&lt;span style="color:gray;"&gt;&amp;gt;= &lt;/span&gt;&lt;/font&gt;&lt;font size="2"&gt;&lt;span style="color:black;"&gt;b2.[Date]            &lt;br&gt;&lt;/span&gt;&lt;span style="color:blue;"&gt;GROUP&amp;nbsp; BY &lt;/span&gt;&lt;span style="color:black;"&gt;b1.[AccountNo]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;b1.[Date]&lt;/span&gt;&lt;span style="color:gray;"&gt;,&lt;/span&gt;&lt;span style="color:black;"&gt;b1.[DailyMovement]&lt;/span&gt;&lt;span style="color:gray;"&gt;;&lt;/span&gt;&lt;/font&gt;&lt;/code&gt; &lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;and you get the same result but with a much more amenable execution plan (execute with &lt;code style="font-size:12px;"&gt;&lt;font size="2"&gt;&lt;span style="color:blue;"&gt;SET STATISTICS &lt;/span&gt;&lt;span style="color:black;"&gt;IO &lt;/span&gt;&lt;span style="color:blue;"&gt;ON&lt;/span&gt;&lt;/font&gt;&lt;/code&gt; if you don’t believe me)!&lt;/p&gt;  &lt;p&gt;The same principle applies in a SSIS dataflow. Often there is no need to resort to a script component, the same result can be achieved using some smart dataflow design. You can use, for example, the same approach as in the second SQL query (above) using a combination of SSIS’ Sort, Aggregate, Merge Join and Conditional Split components. That is not to say that one will be quicker than the other but at least you’ll have something that is more intuitive and arguably more maintainable. Of course if performance gain is your primary goal then the correct advice is, as always, “test and measure, test and measure, test and measure”!!!&lt;/p&gt;  &lt;p&gt;Ask yourself “If I could, how would I solve this using T-SQL?” and see if that influences your dataflow design at all. Invariably script components should be the &lt;i&gt;last&lt;/i&gt; choice, not the first!&lt;/p&gt;  &lt;p&gt;&lt;a target="_blank" href="http://twitter.com/jamiet"&gt;@jamiet&lt;/a&gt;&lt;/p&gt;</description></item><item><title>SSIS gotcha: Beware of multiple outputs from a synchronous script component</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/07/04/ssis-gotcha-beware-of-multiple-outputs-from-a-synchronous-script-component.aspx</link><pubDate>Sun, 04 Jul 2010 20:43:12 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:26727</guid><dc:creator>jamiet</dc:creator><description>&lt;p&gt;A few days ago I was &lt;a href="http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/c0f2b376-66a3-49d8-8925-ada9a362ba53/?prof=required" target="_blank"&gt;alerted&lt;/a&gt; to a peculiarity of SSIS’s Script Component that I believe people need to be aware of. Its a peculiarity that can rear its head when your script component is &lt;strong&gt;synchronous and has multiple outputs&lt;/strong&gt;. Here’s an example of a dataflow that contains such a component:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_71055C17.png"&gt;&lt;img style="border-bottom:0px;border-left:0px;display:inline;border-top:0px;border-right:0px;" class="wlDisabledImage" title="ssis dataflow with a script component with multiple synchronous outputs" border="0" alt="ssis dataflow with a script component with multiple synchronous outputs" src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_12A184E9.png" width="596" height="312" /&gt;&lt;/a&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;Some things to note about this dataflow that you can’t tell from this screenshot:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;The source component will create a 1-row,1-column dataset with the value “1” in it&lt;/li&gt;    &lt;li&gt;Even though it has three outputs “SCR Send data to multiple outputs” is still a synchronous component*&lt;/li&gt; &lt;/ul&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;em&gt;*If you don’t understand how a synchronous component can have multiple outputs and/or have never heard of an Exclusion Group then you should go and read my blog post from 2005 &lt;/em&gt;&lt;a href="http://consultingblogs.emc.com/jamiethomson/archive/2005/09/05/SSIS-Nugget_3A00_-Multiple-outputs-from-a-synchronous-script-transform.aspx" target="_blank"&gt;&lt;em&gt;Multiple outputs from a synchronous script transform&lt;/em&gt;&lt;/a&gt;&lt;em&gt; – the blog post you’re reading right now won’t make much sense without it! If you have ever used the Multicast component then it shouldn’t surprise you that a synchronous component can have multiple outputs – because that’s a pretty good definition of what the Multicast component does!&lt;/em&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;Here’s the important code inside “SCR Send data to multiple outputs”:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_055AFBD6.png"&gt;&lt;img style="border-bottom:0px;border-left:0px;display:inline;border-top:0px;border-right:0px;" class="wlDisabledImage" title="script component code multiple outputs" border="0" alt="script component code multiple outputs" src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_5D484CB6.png" width="592" height="198" /&gt;&lt;/a&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;The code:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;increments the incoming value and puts it into Output0&lt;/li&gt;    &lt;li&gt;increments the value again and puts it into Output1&lt;/li&gt;    &lt;li&gt;increments the value again and puts it into Output2&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Given that out starting value is “1” you might expect that our three outputs would contain the values “2”, “3” &amp;amp; “4” respectively but in fact that is not the case. Here is what we actually see in those three outputs when we execute the dataflow:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_0984D39B.png"&gt;&lt;img style="border-bottom:0px;border-left:0px;display:inline;border-top:0px;border-right:0px;" class="wlDisabledImage" title="SSIS data viewers, multiple outputs" border="0" alt="SSIS data viewers, multiple outputs" src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_07D407C7.png" width="475" height="175" /&gt;&lt;/a&gt;&lt;/p&gt;    &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/jamie_thomson/image_054AD609.png"&gt;&lt;img style="border-bottom:0px;border-left:0px;display:inline;border-top:0px;border-right:0px;" class="wlDisabledImage" title="ssis executed dataflow" border="0" alt="ssis executed dataflow" src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_428B8AD0.png" width="595" height="302" /&gt;&lt;/a&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;Each of our outputs has the same value “4”. Why is that? The trick is in understanding something fundamental about synchronous components, &lt;em&gt;they only ever output the same number of rows as are input&lt;/em&gt;.&amp;#160; The fact that in the data flow above it appears as though three rows have been output is simply an illusion that is best explained by former SSIS Development Manager Kirk Haselden who left the following comment on my &lt;a href="http://consultingblogs.emc.com/jamiethomson/archive/2005/09/05/SSIS-Nugget_3A00_-Multiple-outputs-from-a-synchronous-script-transform.aspx" target="_blank"&gt;blog post&lt;/a&gt; that I linked to earlier:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;&lt;em&gt;The additional or duplicate rows are an illusion. The Dataflow Task actually tracks what buffer columns and rows are visible to the downstream transforms, but doesn't copy any buffers or rows. It simply &amp;quot;exposes&amp;quot; them with row and column views.        &lt;br /&gt;Truly, the synchronous outputs only send the same number of columns* as they receive on their inputs.         &lt;br /&gt;&lt;/em&gt;&lt;em&gt;- Kirk Haselden, 15th March 2006       &lt;br /&gt;&lt;font size="1"&gt;*This is a typo. Kirk meant to say “rows” not “columns”&lt;/font&gt;&lt;/em&gt;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;In our case the script component has simply incremented the same value three times and we are looking at that same incremented value of “4” in each of our outputs.&lt;/p&gt;  &lt;p&gt;I have produced a short video that demonstrates this behaviour using the dataflow pictured above. Embedding videos here on SQLBlog is however a fiddly experience so for ease I’ll just direct you to view it on Vimeo at &lt;a href="http://vimeo.com/13081087" target="_blank"&gt;Multiple Outputs from a synchronous script component&lt;/a&gt;.&lt;/p&gt;  &lt;p&gt;Hope this helps. Any questions please put them in the comments.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@JamieT&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Inequality joins, Asynchronous transformations and Lookups : SSIS</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx</link><pubDate>Fri, 14 May 2010 21:13:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:25178</guid><dc:creator>jamiet</dc:creator><description>&lt;P&gt;It is pretty much accepted by SQL Server Integration Services (SSIS) developers that synchronous transformations are generally quicker than asynchronous transformations (for a description of synchronous and asynchronous transformations go read &lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2005/02/11/SSIS_3A00_-Asynchronous-and-synchronous-data-flow-components.aspx" target=_blank&gt;Asynchronous and synchronous data flow components&lt;/A&gt;). Notice I said “generally” and not “always”; there are circumstances where using asynchronous transformations can be beneficial and in this blog post I’ll demonstrate such a scenario, one that is pretty common when building data warehouses.&lt;/P&gt;
&lt;P&gt;Imagine I have a &lt;FONT face="Courier New"&gt;[Customer]&lt;/FONT&gt; dimension table that manages information about all of my customers as a slowly-changing dimension. If that is a type 2 slowly changing dimension then you will likely have multiple rows per customer in that table. Furthermore you might also have datetime fields that indicate the effective time period of each member record. &lt;/P&gt;
&lt;P&gt;Here is such a table that contains data for four dimension members {Terry, Max, Henry, Horace}:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image002_75E15D51.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="customer dimension table" border=0 alt="customer dimension table" src="http://sqlblog.com/blogs/jamie_thomson/clip_image002_thumb_129AD267.jpg" width=564 height=303&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice that we have multiple records per customer and that the &lt;FONT face="Courier New"&gt;[SCDStartDate] &lt;/FONT&gt;of a record is equivalent to the &lt;FONT face="Courier New"&gt;[SCDEndDate]&lt;/FONT&gt; of the record that preceded it (if there was one). (Note that &lt;A href="http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx"&gt;I am on record as saying I am not a fan of this technique&lt;/A&gt; of storing an &lt;FONT face="Courier New"&gt;[SCDEndDate] &lt;/FONT&gt;but for the purposes of clarity I have included it here.)&lt;/P&gt;
&lt;P&gt;Anyway, the idea here is that we will have some incoming data containing &lt;FONT face="Courier New"&gt;[CustomerName]&lt;/FONT&gt; &amp;amp; &lt;FONT face="Courier New"&gt;[EffectiveDate]&lt;/FONT&gt; and we need to use those values to lookup &lt;FONT face="Courier New"&gt;[Customer].[CustomerId]&lt;/FONT&gt;. The logic will be:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;I&gt;Lookup a [CustomerId] WHERE [CustomerName]=[CustomerName] AND [SCDStartDate] &amp;lt;= [EffectiveDate] AND [EffectiveDate] &amp;lt;= [SCDEndDate]&lt;/I&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The conventional approach to this would be to use a full cached lookup but that isn’t an option here because we are using inequality conditions. The obvious next step then is to use a non-cached lookup which enables us to change the SQL statement to use inequality operators:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image004_4FDB872E.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="no cache lookup" border=0 alt="no cache lookup" src="http://sqlblog.com/blogs/jamie_thomson/clip_image004_thumb_532CF909.jpg" width=689 height=486&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Let’s take a look at the dataflow:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image006_50A3C74B.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="dataflow lookup" border=0 alt="dataflow lookup" src="http://sqlblog.com/blogs/jamie_thomson/clip_image006_thumb_1A529F3C.jpg" width=264 height=404&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice these are all synchronous components. This approach works just fine however it does have the limitation that it has to issue a SQL statement against your lookup set for every row thus we can expect the execution time of our dataflow to increase linearly in line with the number of rows in our dataflow; that’s not good.&lt;/P&gt;
&lt;P&gt;OK, that’s the obvious method. Let’s now look at a different way of achieving this using an asynchronous Merge Join transform coupled with a Conditional Split. I’ve shown it post-execution so that I can include the row counts which help to illustrate what is going on here:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image008_1FC10FE0.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="dataflow merge join" border=0 alt="dataflow merge join" src="http://sqlblog.com/blogs/jamie_thomson/clip_image008_thumb_16F10794.jpg" width=508 height=443&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice that there are more rows output from our Merge Join component than on the input. That is because we are joining on &lt;FONT face="Courier New"&gt;[CustomerName] &lt;/FONT&gt;and, as we know, we have multiple records per &lt;FONT face="Courier New"&gt;[CustomerName] &lt;/FONT&gt;in our lookup set. Notice also that there are two asynchronous components in here (the Sort and the Merge Join).&lt;/P&gt;
&lt;P&gt;I have embedded a video below that compares the execution times for each of these two methods. The video is just over 8minutes long.&lt;/P&gt;
&lt;OBJECT&gt;
&lt;embed style="WIDTH:600px;HEIGHT:450px;" src="http://vimeo.com/moogaloop.swf?clip_id=11750066&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" allowfullscreen="true"&gt;&lt;/embed&gt;&lt;/OBJECT&gt;
&lt;BLOCKQUOTE&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;A href="http://vimeo.com/11750066" target=_blank&gt;View on Vimeo&lt;/A&gt;&lt;/EM&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For those that can’t be bothered watching the video I’ll tell you the results here. The dataflow that used the Lookup transform took &lt;STRONG&gt;36 seconds &lt;/STRONG&gt;whereas the dataflow that used the Merge Join took &lt;STRONG&gt;less than two seconds&lt;/STRONG&gt;. An illustration in case it is needed:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_34EF1588.png"&gt;&lt;IMG style="BORDER-BOTTOM:0px;BORDER-LEFT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_13FBA2E1.png" width=392 height=158&gt;&lt;/A&gt; &lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Pretty conclusive proof that in some scenarios it may be quicker to use an asynchronous component than a synchronous one. Your mileage may of course vary.&lt;/P&gt;
&lt;P&gt;The scenario outlined here is analogous to performance tuning procedural SQL that uses cursors. It is common to eliminate cursors by converting them to set-based operations and that is effectively what we have done here. Our non-cached lookup is performing a discrete operation for every single row of data, exactly like a cursor does. By eliminating this cursor-in-disguise we have dramatically sped up our dataflow.&lt;/P&gt;
&lt;P&gt;I hope all of that proves useful. You can download the package that I demonstrated in the video from my SkyDrive at &lt;A title=http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip href="http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip"&gt;http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Comments are welcome as always.&lt;/P&gt;
&lt;P&gt;&lt;A href="http://twitter.com/jamiet" target=_blank&gt;@Jamiet&lt;/A&gt;&lt;/P&gt;</description></item><item><title>Destination Adapter Comparison : SSIS Video Nugget</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/03/02/destination-adapter-comparison-ssis-video-nugget.aspx</link><pubDate>Tue, 02 Mar 2010 12:11:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:22728</guid><dc:creator>jamiet</dc:creator><description>
&lt;p&gt;In this, my latest video nugget, I demonstrate and compare three ways that you can insert data into a SQL Server table from the SSIS DataFlow:&lt;/p&gt;
  
&lt;ul&gt;   
&lt;li&gt;OLE DB Destination without Fast Load &lt;/li&gt;
    
&lt;li&gt;OLE DB Destination with Fast Load &lt;/li&gt;
    
&lt;li&gt;SQL Server Destination &lt;/li&gt;
 &lt;/ul&gt;
  
&lt;p&gt;The demo peels back the covers to see how these three methods actually operate against the database. The video is embedded below however at the time of writing we are still having issues with getting videos to show up on SQLBlog meaning that the video may appear too small to view hence in the meantime I’ll refer you to view the video on Vimeo at &lt;a href="http://vimeo.com/9806121" title="http://vimeo.com/9806121"&gt;http://vimeo.com/9806121&lt;/a&gt; (it is 7m19s long).&lt;/p&gt;
  
&lt;p&gt;If after watching this you want to know any more then you may glean some useful information from my blog post &lt;a href="http://consultingblogs.emc.com/jamiethomson/archive/2006/08/14/SSIS_3A00_-Destination-Adapter-Comparison.aspx" target="_blank"&gt;SSIS: Destination Adapter Comparison&lt;/a&gt; from August 2006 or feel free to post any questions in the comments below and I’ll do my best to answer them! &lt;/p&gt;
  
&lt;p&gt;Hope this is useful!&lt;/p&gt;
  
&lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@Jamiet&lt;/a&gt;&lt;/p&gt;
 
&lt;embed src="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" allowfullscreen="true" style="width:600px;height:450px;"&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://sqlblog.com/controlpanel/blogs/" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://sqlblog.com/controlpanel/blogs/" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://vimeo.com/moogaloop.swf?clip_id=9806121&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://sqlblog.com/controlpanel/blogs/" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;&lt;a href="http://sqlblog.com/controlpanel/blogs/" class="onxldjuucmcshidmscok"&gt;&lt;/a&gt;</description></item></channel></rss>