<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Alberto Ferrari : TableDifference</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx</link><description>Tags: TableDifference</description><dc:language>en</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>SSIS: living in a parallel world? Not yet...</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/2008/09/14/ssis-living-in-a-parallel-world-not-yet.aspx</link><pubDate>Sun, 14 Sep 2008 19:57:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:8899</guid><dc:creator>AlbertoFerrari</dc:creator><slash:comments>4</slash:comments><comments>http://sqlblog.com/blogs/alberto_ferrari/comments/8899.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/alberto_ferrari/commentrss.aspx?PostID=8899</wfw:commentRss><description>&lt;P&gt;Today I come into a nice bug (misfeature?) of SSIS 2008 and I'd like to share it with you. Let's start from the beginning:&lt;/P&gt;
&lt;P&gt;One of the keys to make something speedy is to parallelize it. In TableDifference I read from two inputs, compare them and produce several outputs for new, deleted, updated rows and so on. It is the classical "produce/consumer" paradigm that works fine in parallel, creating a consumer thread that compares rows and different threads to produce them. &lt;/P&gt;
&lt;P&gt;During the porting of the component in SQL 2008 I discovered that my code does not work as expected. It reads all the lines, then terminates abnormally without producing a single row of output. It took me a couple of hours and several coffes to discover the change in the way SSIS handles asyncronous components causing my bug.&lt;/P&gt;
&lt;P&gt;In order to make the problem easier to understand (for microsoft connect too), I produced the same behaviour with a much simpler script component. Imagine to have a script transformation that needs to make some very complex work on each row. You will end up with something like this:&lt;/P&gt;&lt;PRE class=csharpcode&gt;    &lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;override&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; Input0_ProcessInputRow(Input0Buffer Row) {
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction(Row);
        &lt;SPAN class=kwrd&gt;if&lt;/SPAN&gt; (Row.EndOfRowset ()) {
           OutputBuffer.SetEndOfRowset ();
        }
    }&lt;/PRE&gt;
&lt;P&gt;But, if you have a 64 CPU computer. You can imagine to detach many threads, each one handling just one very complex function, in this way you will produce a massive parallel computation. Wow! The solution to get maximum speed would be this:&lt;/P&gt;&lt;PRE class=csharpcode&gt;    &lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;override&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; Input0_ProcessInputRow(Input0Buffer Row) {
        Thread T = &lt;SPAN class=kwrd&gt;new&lt;/SPAN&gt; Thread(SendAsyncRow);
        T.Start(Row.EndOfRowset());
    }

    &lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; SendAsyncRow(Object O) {
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
        &lt;SPAN class=kwrd&gt;if&lt;/SPAN&gt; ((&lt;SPAN class=kwrd&gt;bool&lt;/SPAN&gt;)O == &lt;SPAN class=kwrd&gt;true&lt;/SPAN&gt;) {
            Output0Buffer.SetEndOfRowset();
        }
    }&lt;/PRE&gt;
&lt;P&gt;You detach a separate thread for each row, the thread will compute the very complex function and send it to the output buffer as soon as it is ready. In order to make SSIS understand when everything is finished, you signal SetEndOfRowset in the last thread.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Don't start crying, this code will never work, the SetEndOfRowset is called on the last read line and not on the last produced one, but this is just pseudocode to understand the topic, don't blame me for inconsistencies. :) Let's go on with the main topic.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Under SSIS 2005 this code (in VB, of course) would work fine as the engine detects the end of processing when it receives the EndOfRowset on the outputs. In SSIS 2008 the engine detects the end of the processing as soon as the &lt;EM&gt;&lt;STRONG&gt;last call to ProcessInputRow has finished&lt;/STRONG&gt;&lt;/EM&gt;. So, in our case, after the last thread has been detached SSIS believes that the processing is finished and calls the cleaning up of the component, completely ignoring both the presence of separate threads running and our williness to not terminate the component because we know that we need some more time.&lt;/P&gt;
&lt;P&gt;The result? As with any other parallel bug you will sometime get a crash, on the same inputs sometime finish corectly, sometime get an error... the nightmare of any programmer.&lt;/P&gt;
&lt;P&gt;If you want to try it by yourself, cut and paste this code to a transformation script in a data flow that process some lines ( rows 10/20 lines are enough):&lt;/P&gt;&lt;PRE class=csharpcode&gt;&lt;SPAN class=kwrd&gt;using&lt;/SPAN&gt; System;
&lt;SPAN class=kwrd&gt;using&lt;/SPAN&gt; System.Data;
&lt;SPAN class=kwrd&gt;using&lt;/SPAN&gt; Microsoft.SqlServer.Dts.Pipeline.Wrapper;
&lt;SPAN class=kwrd&gt;using&lt;/SPAN&gt; Microsoft.SqlServer.Dts.Runtime.Wrapper;

&lt;SPAN class=kwrd&gt;using&lt;/SPAN&gt; System.Threading;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
&lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;class&lt;/SPAN&gt; ScriptMain : UserComponent {

    &lt;SPAN class=kwrd&gt;bool&lt;/SPAN&gt; useThreads = &lt;SPAN class=kwrd&gt;false&lt;/SPAN&gt;;

    &lt;SPAN class=rem&gt;/// &amp;lt;summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// Processes input rows&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;/summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;param name="Row"&amp;gt;&amp;lt;/param&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;override&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; Input0_ProcessInputRow(Input0Buffer Row) {
        Thread T = &lt;SPAN class=kwrd&gt;new&lt;/SPAN&gt; Thread(SendAsyncRow);

        &lt;SPAN class=kwrd&gt;if&lt;/SPAN&gt; (useThreads) {
            T.Start(Row.EndOfRowset());
        } &lt;SPAN class=kwrd&gt;else&lt;/SPAN&gt; {
            SendRow();
        }
    }

    &lt;SPAN class=rem&gt;/// &amp;lt;summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// Sends a row in sync mode&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;/summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=kwrd&gt;private&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; SendRow() {
        &lt;SPAN class=rem&gt;// Thread.Sleep(100);&lt;/SPAN&gt;
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
    }

    &lt;SPAN class=rem&gt;/// &amp;lt;summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// Sends a row in async mode&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;/summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;param name="O"&amp;gt;&amp;lt;/param&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=kwrd&gt;public&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;void&lt;/SPAN&gt; SendAsyncRow(Object O) {
        Thread.Sleep(100);
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
        &lt;SPAN class=kwrd&gt;if&lt;/SPAN&gt; ((&lt;SPAN class=kwrd&gt;bool&lt;/SPAN&gt;)O == &lt;SPAN class=kwrd&gt;true&lt;/SPAN&gt;) {
            Output0Buffer.SetEndOfRowset();
        }
    }

    &lt;SPAN class=rem&gt;/// &amp;lt;summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// Not very complex... but it's a demo!&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;/summary&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=rem&gt;/// &amp;lt;returns&amp;gt;&amp;lt;/returns&amp;gt;&lt;/SPAN&gt;
    &lt;SPAN class=kwrd&gt;private&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;int&lt;/SPAN&gt; VeryComplexFunction() {
        &lt;SPAN class=kwrd&gt;return&lt;/SPAN&gt; 10;
    }
}&lt;/PRE&gt;
&lt;P&gt;using Serial mode (useThreads se to false) everything works fine. Using parallel mode (useThreads set to true) SSIS will crash and produce no output at all. The bug in TableDifference was very similar to that, I needed to handle the wait for termination of the consumer process during the ProcessInput function.&lt;/P&gt;
&lt;P&gt;It is not the first time I discover inconsistencies in the way SSIS handles parallelism, in order to try to make it easier to write code (using 2008 mode there is no need to call SetEndOfRowset and your program will run fine if you forget it) the architects of SSIS introduce a behavior that will make better and correct code crash. If you believe this is a problem, vote &lt;A href="https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=367692" target=_blank&gt;here&lt;/A&gt;, anyway... be aware of it, it mighe be useful in the future. :)&lt;/P&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=8899" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/Components/default.aspx">Components</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SQL/default.aspx">SQL</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SQL+Server/default.aspx">SQL Server</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SSIS/default.aspx">SSIS</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx">TableDifference</category></item><item><title>Table Difference 2.0 - Call for beta</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/2007/12/17/table-difference-2-0-call-for-beta.aspx</link><pubDate>Mon, 17 Dec 2007 11:53:37 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:4030</guid><dc:creator>AlbertoFerrari</dc:creator><slash:comments>0</slash:comments><comments>http://sqlblog.com/blogs/alberto_ferrari/comments/4030.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/alberto_ferrari/commentrss.aspx?PostID=4030</wfw:commentRss><description>&lt;p&gt;The new version 2.0 of SSIS SCD handling component &lt;a href="http://www.sqlbi.eu/Projects/TableDifference.aspx" target="_blank"&gt;TableDifference&lt;/a&gt; is in beta, I am searching for beta testers and some help in writing an installer for it. You will find all the informations at &lt;a href="http://www.sqlbi.eu"&gt;www.sqlbi.eu&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The final release of the component will be as usually freely available with sources as soon as it has been tested enough.&lt;/p&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=4030" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/Components/default.aspx">Components</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SSIS/default.aspx">SSIS</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx">TableDifference</category></item><item><title>Table Difference 1.2</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/2007/07/23/table-difference-1-2.aspx</link><pubDate>Mon, 23 Jul 2007 08:16:29 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:1900</guid><dc:creator>AlbertoFerrari</dc:creator><slash:comments>0</slash:comments><comments>http://sqlblog.com/blogs/alberto_ferrari/comments/1900.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/alberto_ferrari/commentrss.aspx?PostID=1900</wfw:commentRss><description>&lt;p&gt;&lt;a href="http://www.sqlbi.eu/Projects/TableDifference/tabid/74/Default.aspx"&gt;TableDifference&lt;/a&gt; has been updated to version 1.2. For those who want to read the main description of the component can be found &lt;a href="http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx"&gt;here&lt;/a&gt;&amp;nbsp;but take a look at the full story, this component and the documentaton has been updated several times.&lt;/p&gt; &lt;p&gt;You will find complete documentation about the changes at &lt;a href="http://www.sqlbi.eu"&gt;www.sqlbi.eu&lt;/a&gt;&amp;nbsp;and all the sources and binaries in the download section.&lt;/p&gt; &lt;p&gt;Have fun with your SCD handling, any comment&amp;nbsp;will&amp;nbsp;be&amp;nbsp;very welcome.&lt;/p&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=1900" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/Components/default.aspx">Components</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SSIS/default.aspx">SSIS</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx">TableDifference</category></item><item><title>Table Difference 1.1</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/2006/07/06/table-difference-1-1.aspx</link><pubDate>Thu, 06 Jul 2006 06:48:26 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:1890</guid><dc:creator>AlbertoFerrari</dc:creator><slash:comments>0</slash:comments><comments>http://sqlblog.com/blogs/alberto_ferrari/comments/1890.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/alberto_ferrari/commentrss.aspx?PostID=1890</wfw:commentRss><description>&lt;p&gt;&lt;a href="http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx"&gt;TableDifference &lt;/a&gt;has been updated to&amp;nbsp;&lt;a href="http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/3/Default.aspx"&gt;version 1.1.&lt;/a&gt; &lt;p&gt;The major improvement is the fact that now TableDifference can handle a field as  &lt;ul&gt; &lt;li&gt;&lt;strong&gt;Key&lt;/strong&gt;: a field that represent a key in the table, you must supply KeyOrder  &lt;li&gt;&lt;strong&gt;Compare&lt;/strong&gt;: a field that is to be compared between OLD and NEW, you must supply UpdateID  &lt;li&gt;&lt;strong&gt;PreferNEW&lt;/strong&gt;: the field appear in both flows but you want the value from the NEW flow  &lt;li&gt;&lt;strong&gt;PreferOLD&lt;/strong&gt;: the field appear in both flows but you want the value from the OLD flow.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;Mostly valuable implementation is PreferOLD, in this way you can use the ID of the OLD table in the subsequent flows to make any kind of update you want in an easier way. &lt;p&gt;You can find the whole article and source code on &lt;a href="http://www.sqlbi.eu"&gt;www.sqlbi.eu&lt;/a&gt;.&lt;/p&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=1890" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SQL+Server/default.aspx">SQL Server</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx">TableDifference</category></item><item><title>TableDifference: a solution to SCD handling</title><link>http://sqlblog.com/blogs/alberto_ferrari/archive/2006/06/08/tabledifference-a-solution-to-scd-handling.aspx</link><pubDate>Thu, 08 Jun 2006 06:48:26 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:1888</guid><dc:creator>AlbertoFerrari</dc:creator><slash:comments>0</slash:comments><comments>http://sqlblog.com/blogs/alberto_ferrari/comments/1888.aspx</comments><wfw:commentRss>http://sqlblog.com/blogs/alberto_ferrari/commentrss.aspx?PostID=1888</wfw:commentRss><description>&lt;p&gt;In&amp;nbsp;SSIS the SCD component&amp;nbsp;does a lot of work for you but it works so slowly that it is quite always convenient to create an “ad hoc” solution to handle the slowly changing dimensions. &lt;p&gt;As the main problem is that of determining the difference between the last snapshot of the SCD and the current data we decided to write an SSIS component that has two inputs, one for the “old” and one for the “new” flow of data. It compares all the rows and sends each one to a different output for new, deleted and user configurable&amp;nbsp;updated data. You can then decide how to handle the variations executing the correct UPDATE sequences to the dimension table or whatever you need to handle variations. &lt;p&gt;The following picture illustrates the component functionality better than thousand words. &lt;p&gt;&lt;a href="http://sqlblog.com/blogs/alberto_ferrari/WindowsLiveWriter/TableDifferenceasolutiontoSCDhandling_8A7D/TableDifference.png"&gt;&lt;img style="border-right:0px;border-top:0px;border-left:0px;border-bottom:0px;" height="285" alt="TableDifference" src="http://sqlblog.com/blogs/alberto_ferrari/WindowsLiveWriter/TableDifferenceasolutiontoSCDhandling_8A7D/TableDifference_thumb.png" width="619" border="0"&gt;&lt;/a&gt;  &lt;p&gt;Using TableDifference&amp;nbsp;two tables with 5.5 millions of records and 25 columns each were compared, using an AMD dual core processor with 4Gb of RAM and standard SATA disks running both SSIS and SQL Server in 7 minutes. The complexity of the algorithm is linea r, so you can expect to take more or less 15 minutes for a 10 millions record tables with the same hardware. &lt;p&gt;You can find the full article&amp;nbsp;that describes TableDifference &lt;a href="http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx"&gt;here&lt;/a&gt;. Full source code is available at &lt;a href="http://www.sqlbi.eu"&gt;www.sqlbi.eu&lt;/a&gt;.&lt;/p&gt;&lt;img src="http://sqlblog.com/aggbug.aspx?PostID=1888" width="1" height="1"&gt;</description><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/SQL+Server/default.aspx">SQL Server</category><category domain="http://sqlblog.com/blogs/alberto_ferrari/archive/tags/TableDifference/default.aspx">TableDifference</category></item></channel></rss>