<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>SSIS Junkie : lookup</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/tags/lookup/default.aspx</link><description>Tags: lookup</description><dc:language>en</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>Inequality joins, Asynchronous transformations and Lookups : SSIS</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx</link><pubDate>Fri, 14 May 2010 21:13:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:25178</guid><dc:creator>jamiet</dc:creator><slash:comments>15</slash:comments><comments>http://sqlblog.com/blogs/jamie_thomson/comments/25178.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/jamie_thomson/commentrss.aspx?PostID=25178</wfw:commentRss><wfw:comment>http://sqlblog.com/blogs/jamie_thomson/rsscomments.aspx?PostID=25178</wfw:comment><description>&lt;P&gt;It is pretty much accepted by SQL Server Integration Services (SSIS) developers that synchronous transformations are generally quicker than asynchronous transformations (for a description of synchronous and asynchronous transformations go read &lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2005/02/11/SSIS_3A00_-Asynchronous-and-synchronous-data-flow-components.aspx" target=_blank&gt;Asynchronous and synchronous data flow components&lt;/A&gt;). Notice I said “generally” and not “always”; there are circumstances where using asynchronous transformations can be beneficial and in this blog post I’ll demonstrate such a scenario, one that is pretty common when building data warehouses.&lt;/P&gt;
&lt;P&gt;Imagine I have a &lt;FONT face="Courier New"&gt;[Customer]&lt;/FONT&gt; dimension table that manages information about all of my customers as a slowly-changing dimension. If that is a type 2 slowly changing dimension then you will likely have multiple rows per customer in that table. Furthermore you might also have datetime fields that indicate the effective time period of each member record. &lt;/P&gt;
&lt;P&gt;Here is such a table that contains data for four dimension members {Terry, Max, Henry, Horace}:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image002_75E15D51.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="customer dimension table" border=0 alt="customer dimension table" src="http://sqlblog.com/blogs/jamie_thomson/clip_image002_thumb_129AD267.jpg" width=564 height=303&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice that we have multiple records per customer and that the &lt;FONT face="Courier New"&gt;[SCDStartDate] &lt;/FONT&gt;of a record is equivalent to the &lt;FONT face="Courier New"&gt;[SCDEndDate]&lt;/FONT&gt; of the record that preceded it (if there was one). (Note that &lt;A href="http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx"&gt;I am on record as saying I am not a fan of this technique&lt;/A&gt; of storing an &lt;FONT face="Courier New"&gt;[SCDEndDate] &lt;/FONT&gt;but for the purposes of clarity I have included it here.)&lt;/P&gt;
&lt;P&gt;Anyway, the idea here is that we will have some incoming data containing &lt;FONT face="Courier New"&gt;[CustomerName]&lt;/FONT&gt; &amp;amp; &lt;FONT face="Courier New"&gt;[EffectiveDate]&lt;/FONT&gt; and we need to use those values to lookup &lt;FONT face="Courier New"&gt;[Customer].[CustomerId]&lt;/FONT&gt;. The logic will be:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;I&gt;Lookup a [CustomerId] WHERE [CustomerName]=[CustomerName] AND [SCDStartDate] &amp;lt;= [EffectiveDate] AND [EffectiveDate] &amp;lt;= [SCDEndDate]&lt;/I&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The conventional approach to this would be to use a full cached lookup but that isn’t an option here because we are using inequality conditions. The obvious next step then is to use a non-cached lookup which enables us to change the SQL statement to use inequality operators:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image004_4FDB872E.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="no cache lookup" border=0 alt="no cache lookup" src="http://sqlblog.com/blogs/jamie_thomson/clip_image004_thumb_532CF909.jpg" width=689 height=486&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Let’s take a look at the dataflow:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image006_50A3C74B.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="dataflow lookup" border=0 alt="dataflow lookup" src="http://sqlblog.com/blogs/jamie_thomson/clip_image006_thumb_1A529F3C.jpg" width=264 height=404&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice these are all synchronous components. This approach works just fine however it does have the limitation that it has to issue a SQL statement against your lookup set for every row thus we can expect the execution time of our dataflow to increase linearly in line with the number of rows in our dataflow; that’s not good.&lt;/P&gt;
&lt;P&gt;OK, that’s the obvious method. Let’s now look at a different way of achieving this using an asynchronous Merge Join transform coupled with a Conditional Split. I’ve shown it post-execution so that I can include the row counts which help to illustrate what is going on here:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/clip_image008_1FC10FE0.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH:0px;DISPLAY:inline;BORDER-TOP-WIDTH:0px;BORDER-BOTTOM-WIDTH:0px;BORDER-LEFT-WIDTH:0px;" title="dataflow merge join" border=0 alt="dataflow merge join" src="http://sqlblog.com/blogs/jamie_thomson/clip_image008_thumb_16F10794.jpg" width=508 height=443&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice that there are more rows output from our Merge Join component than on the input. That is because we are joining on &lt;FONT face="Courier New"&gt;[CustomerName] &lt;/FONT&gt;and, as we know, we have multiple records per &lt;FONT face="Courier New"&gt;[CustomerName] &lt;/FONT&gt;in our lookup set. Notice also that there are two asynchronous components in here (the Sort and the Merge Join).&lt;/P&gt;
&lt;P&gt;I have embedded a video below that compares the execution times for each of these two methods. The video is just over 8minutes long.&lt;/P&gt;
&lt;OBJECT&gt;
&lt;embed style="WIDTH:600px;HEIGHT:450px;" src="http://vimeo.com/moogaloop.swf?clip_id=11750066&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" allowfullscreen="true"&gt;&lt;/embed&gt;&lt;/OBJECT&gt;
&lt;BLOCKQUOTE&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;A href="http://vimeo.com/11750066" target=_blank&gt;View on Vimeo&lt;/A&gt;&lt;/EM&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For those that can’t be bothered watching the video I’ll tell you the results here. The dataflow that used the Lookup transform took &lt;STRONG&gt;36 seconds &lt;/STRONG&gt;whereas the dataflow that used the Merge Join took &lt;STRONG&gt;less than two seconds&lt;/STRONG&gt;. An illustration in case it is needed:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;A href="http://sqlblog.com/blogs/jamie_thomson/image_34EF1588.png"&gt;&lt;IMG style="BORDER-BOTTOM:0px;BORDER-LEFT:0px;DISPLAY:inline;BORDER-TOP:0px;BORDER-RIGHT:0px;" title=image border=0 alt=image src="http://sqlblog.com/blogs/jamie_thomson/image_thumb_13FBA2E1.png" width=392 height=158&gt;&lt;/A&gt; &lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Pretty conclusive proof that in some scenarios it may be quicker to use an asynchronous component than a synchronous one. Your mileage may of course vary.&lt;/P&gt;
&lt;P&gt;The scenario outlined here is analogous to performance tuning procedural SQL that uses cursors. It is common to eliminate cursors by converting them to set-based operations and that is effectively what we have done here. Our non-cached lookup is performing a discrete operation for every single row of data, exactly like a cursor does. By eliminating this cursor-in-disguise we have dramatically sped up our dataflow.&lt;/P&gt;
&lt;P&gt;I hope all of that proves useful. You can download the package that I demonstrated in the video from my SkyDrive at &lt;A title=http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip href="http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip"&gt;http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Comments are welcome as always.&lt;/P&gt;
&lt;P&gt;&lt;A href="http://twitter.com/jamiet" target=_blank&gt;@Jamiet&lt;/A&gt;&lt;/P&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=25178" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/sql+server+integration+services/default.aspx">sql server integration services</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/ssis/default.aspx">ssis</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/video/default.aspx">video</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/dataflow/default.aspx">dataflow</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/lookup/default.aspx">lookup</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/Merge+Join/default.aspx">Merge Join</category></item><item><title>SSIS Lookup component tuning tips</title><link>http://sqlblog.com/blogs/jamie_thomson/archive/2010/03/18/ssis-lookup-component-tuning-tips.aspx</link><pubDate>Thu, 18 Mar 2010 20:11:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:23534</guid><dc:creator>jamiet</dc:creator><slash:comments>5</slash:comments><comments>http://sqlblog.com/blogs/jamie_thomson/comments/23534.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/jamie_thomson/commentrss.aspx?PostID=23534</wfw:commentRss><wfw:comment>http://sqlblog.com/blogs/jamie_thomson/rsscomments.aspx?PostID=23534</wfw:comment><description>&lt;P&gt;Yesterday evening I attended &lt;A href="http://sqlserverfaq.com/events/215/SQL-2008-Data-Warehousing-Features-Vincent-Rainardi-Index-Views-and-Computed-Columns-Neil-Hambly-Business-Intelligence-in-Office-2010-ExcelPowerPivotSharePoint-Duncan-Sutcliffe-finishing-with-a-Q-A-on-BI-with-the-panel.as"&gt;a London meeting of the UK SQL Server User Group&lt;/A&gt; at Microsoft’s offices in London Victoria. As usual it was both a fun and informative evening and in particular there seemed to be a few questions arising about tuning the SSIS Lookup component; I rattled off some comments and figured it would be prudent to drop some of them into a dedicated blog post, hence the one you are reading right now.&lt;/P&gt;
&lt;H2&gt;Scene setting&lt;/H2&gt;
&lt;P&gt;A popular pattern in SSIS is to use a Lookup component to determine whether a record in the pipeline already exists in the intended destination table or not and I cover this pattern in my 2006 blog post &lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx" target=_blank&gt;Checking if a row exists and if it does, has it changed?&lt;/A&gt; (note to self: must rewrite that blog post for SSIS2008).&lt;/P&gt;
&lt;P&gt;Fundamentally the SSIS lookup component (when using FullCache option) sucks some data out of a database and holds it in memory so that it can be compared to data in the pipeline. One of the big benefits of using SSIS dataflows is that they process data one &lt;A href="http://msdn.microsoft.com/en-us/library/dd795224.aspx" target=_blank&gt;buffer&lt;/A&gt; at a time; that means that not all of the data from your source exists in the dataflow at the same time and is why a SSIS dataflow can process data volumes that far exceed the available memory.&lt;/P&gt;
&lt;P&gt;&lt;I&gt;However&lt;/I&gt;, that only applies to data in the pipeline; for reasons that are hopefully obvious ALL of the data in the lookup set must exist in the memory cache for the duration of the dataflow’s execution which means that any memory used by the lookup cache will not be available to be used as a pipeline buffer. Moreover, there’s an obvious correlation between the amount of data in the lookup cache and the time it takes to charge that cache; the more data you have then the longer it will take to charge and the longer you have to wait until the dataflow actually starts to do anything. For these reasons your goal is simple: &lt;B&gt;ensure that the lookup cache contains as little data as possible&lt;/B&gt;.&lt;/P&gt;
&lt;H2&gt;General tips&lt;/H2&gt;
&lt;P&gt;Here is a simple tick list you can follow in order to tune your lookups:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Use a SQL statement to charge your cache, don’t just pick a table from the dropdown list made available to you. (Read why in &lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2006/02/21/SSIS_3A00_-SELECT-_2A002E002E002E00_-or-select-from-a-dropdown-in-an-OLE-DB-Source-component_3F00_.aspx" target=_blank&gt;SELECT *... or select from a dropdown in an OLE DB Source component?&lt;/A&gt;)&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Only pick the columns that you need, ignore everything else&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;Make the database columns that your cache is populated from as narrow as possible. If a column is defined as VARCHAR(20) then SSIS will allocate 20 bytes for every value in that column – that is a big waste if the actual values are significantly less than 20 characters in length.&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Do you need DT_WSTR typed columns or will DT_STR suffice? DT_WSTR uses twice the amount of space to hold values that can be stored using a DT_STR so if you can use DT_STR, consider doing so. Same principle goes for the numerical datatypes DT_I2/DT_I4/DT_I8.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;Only populate the cache with data that you KNOW you will need. In other words, think about your WHERE clause!&lt;/LI&gt;&lt;/UL&gt;
&lt;H2&gt;Thinking outside the box&lt;/H2&gt;
&lt;P&gt;It is tempting to build a large monolithic dataflow that does many things, one of which is a Lookup. Often though you can make better use of your available resources by, well, &lt;I&gt;mixing things up a little&lt;/I&gt; and here are a few ideas to get your creative juices flowing:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;There is no rule that says everything has to happen in a single dataflow. If you have some particularly resource intensive lookups then consider putting that lookup into a dataflow all of its own and using raw files to pass the pipeline data in and out of that dataflow.&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Know your data. If you think, for example, that the majority of your incoming rows will match with only a small subset of your lookup data then consider chaining multiple lookup components together; the first would use a FullCache containing that data subset and the remaining data that doesn’t find a match could be passed to a second lookup that perhaps uses a NoCache lookup thus negating the need to pull all of that least-used lookup data into memory.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;Do you need to process all of your incoming data all at once? If you can process different partitions of your data separately then you can partition your lookup cache as well. For example, if you are using a lookup to convert a location into a [LocationId] then why not process your data one region at a time? This will mean your lookup cache only has to contain data for the location that you are currently processing and with the ability of &lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2007/11/16/Katmai_3A00_-SSIS_3A00_-Lookup-component-gets-a-makeover.aspx" target=_blank&gt;the Lookup in SSIS2008&lt;/A&gt; and beyond to charge the cache using a dynamically built SQL statement you’ll be able to achieve it using the same dataflow and simply loop over it using a ForEach loop.&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Taking the previous data partitioning idea further … a dataflow can contain more than one data path so why not split your data using a conditional split component and, again, charge your lookup caches with only the data that they need for that partition.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;Lookups have two uses: to (1) find a matching row from the lookup set and (2) put attributes from that matching row into the pipeline. Ask yourself, do you need to do these two things at the same time? After all once you have the key column(s) from your lookup set then you can use that key to get the rest of attributes further downstream, perhaps even in another dataflow.&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Are you using the same lookup data set multiple times? If so, consider the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/bb895316.aspx" target=_blank&gt;file caching option&lt;/A&gt;&lt;FONT color=#800000&gt;&lt;/FONT&gt;&lt;FONT color=#800000&gt; in SSIS 2008 and beyond.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;[From Sam Loud in the comments] Sometimes, it's better not to cache your lookup set at all. If you haver a very large, well indexed lookup set, that needs to be accessed by a relatively small number of pipeline rows, you may well be better off using No Cache, and doing the lookup row-by-row.&lt;/LI&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;Above all, experiment and be creative with different combinations. You may be surprised at what works.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;
&lt;H2&gt;Final&amp;nbsp; thoughts&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT color=#800000&gt;If you want to know more about how the Lookup component differs in SSIS2008 from SSIS2005 then I have a dedicated blog post about that at &lt;/FONT&gt;&lt;A href="http://consultingblogs.emc.com/jamiethomson/archive/2007/11/16/Katmai_3A00_-SSIS_3A00_-Lookup-component-gets-a-makeover.aspx" target=_blank&gt;&lt;FONT color=#800000&gt;Lookup component gets a makeover&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#800000&gt;.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;I am on a mini-crusade at the moment to get a BULK MERGE feature into the database engine, the thinking being that if the database engine can quickly merge massive amounts of data in a similar manner to how it can insert massive amounts using BULK INSERT then that’s a lot of work that wouldn’t have to be done in the SSIS pipeline. If you think that is a good idea then go and vote for &lt;A href="https://connect.microsoft.com/SQLServer/feedback/details/540038/bulk-merge" target=_blank&gt;BULK MERGE on Connect&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;If you have any other tips to share then please stick them in the comments.&lt;/P&gt;
&lt;P&gt;Hope this helps!&lt;/P&gt;
&lt;P&gt;&lt;A href="http://twitter.com/jamiet"&gt;@Jamiet&lt;/A&gt;&lt;/P&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=23534" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/sql+server/default.aspx">sql server</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/sql+server+integration+services/default.aspx">sql server integration services</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/ssis/default.aspx">ssis</category><category domain="http://sqlblog.com/blogs/jamie_thomson/archive/tags/lookup/default.aspx">lookup</category></item></channel></rss>