THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

SSIS: living in a parallel world? Not yet...

Today I come into a nice bug (misfeature?) of SSIS 2008 and I'd like to share it with you. Let's start from the beginning:

One of the keys to make something speedy is to parallelize it. In TableDifference I read from two inputs, compare them and produce several outputs for new, deleted, updated rows and so on. It is the classical "produce/consumer" paradigm that works fine in parallel, creating a consumer thread that compares rows and different threads to produce them.

During the porting of the component in SQL 2008 I discovered that my code does not work as expected. It reads all the lines, then terminates abnormally without producing a single row of output. It took me a couple of hours and several coffes to discover the change in the way SSIS handles asyncronous components causing my bug.

In order to make the problem easier to understand (for microsoft connect too), I produced the same behaviour with a much simpler script component. Imagine to have a script transformation that needs to make some very complex work on each row. You will end up with something like this:

    public override void Input0_ProcessInputRow(Input0Buffer Row) {
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction(Row);
        if (Row.EndOfRowset ()) {
           OutputBuffer.SetEndOfRowset ();
        }
    }

But, if you have a 64 CPU computer. You can imagine to detach many threads, each one handling just one very complex function, in this way you will produce a massive parallel computation. Wow! The solution to get maximum speed would be this:

    public override void Input0_ProcessInputRow(Input0Buffer Row) {
        Thread T = new Thread(SendAsyncRow);
        T.Start(Row.EndOfRowset());
    }

    public void SendAsyncRow(Object O) {
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
        if ((bool)O == true) {
            Output0Buffer.SetEndOfRowset();
        }
    }

You detach a separate thread for each row, the thread will compute the very complex function and send it to the output buffer as soon as it is ready. In order to make SSIS understand when everything is finished, you signal SetEndOfRowset in the last thread.

Don't start crying, this code will never work, the SetEndOfRowset is called on the last read line and not on the last produced one, but this is just pseudocode to understand the topic, don't blame me for inconsistencies. :) Let's go on with the main topic.

Under SSIS 2005 this code (in VB, of course) would work fine as the engine detects the end of processing when it receives the EndOfRowset on the outputs. In SSIS 2008 the engine detects the end of the processing as soon as the last call to ProcessInputRow has finished. So, in our case, after the last thread has been detached SSIS believes that the processing is finished and calls the cleaning up of the component, completely ignoring both the presence of separate threads running and our williness to not terminate the component because we know that we need some more time.

The result? As with any other parallel bug you will sometime get a crash, on the same inputs sometime finish corectly, sometime get an error... the nightmare of any programmer.

If you want to try it by yourself, cut and paste this code to a transformation script in a data flow that process some lines ( rows 10/20 lines are enough):

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

using System.Threading;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent {

    bool useThreads = false;

    /// <summary>
    /// 
    /// Processes input rows
    /// 
    /// </summary>
    /// <param name="Row"></param>
    public override void Input0_ProcessInputRow(Input0Buffer Row) {
        Thread T = new Thread(SendAsyncRow);

        if (useThreads) {
            T.Start(Row.EndOfRowset());
        } else {
            SendRow();
        }
    }

    /// <summary>
    /// 
    /// Sends a row in sync mode
    /// 
    /// </summary>
    private void SendRow() {
        // Thread.Sleep(100);
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
    }

    /// <summary>
    /// 
    /// Sends a row in async mode
    /// 
    /// </summary>
    /// <param name="O"></param>
    public void SendAsyncRow(Object O) {
        Thread.Sleep(100);
        Output0Buffer.AddRow();
        Output0Buffer.LineID = VeryComplexFunction();
        if ((bool)O == true) {
            Output0Buffer.SetEndOfRowset();
        }
    }

    /// <summary>
    /// 
    /// Not very complex... but it's a demo!
    /// 
    /// </summary>
    /// <returns></returns>
    private int VeryComplexFunction() {
        return 10;
    }
}

using Serial mode (useThreads se to false) everything works fine. Using parallel mode (useThreads set to true) SSIS will crash and produce no output at all. The bug in TableDifference was very similar to that, I needed to handle the wait for termination of the consumer process during the ProcessInput function.

It is not the first time I discover inconsistencies in the way SSIS handles parallelism, in order to try to make it easier to write code (using 2008 mode there is no need to call SetEndOfRowset and your program will run fine if you forget it) the architects of SSIS introduce a behavior that will make better and correct code crash. If you believe this is a problem, vote here, anyway... be aware of it, it mighe be useful in the future. :)

Published Sunday, September 14, 2008 10:57 PM by AlbertoFerrari

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Ke Yang -MSFT said:

can you mail me to kyang@microsoft.com ? i'm an ssis dev, thanks!

October 31, 2008 2:21 AM
 

Yeow Seng Yap said:

Hi,

I faced the same problem. But i manage to use a hack for this.

in my main thread, i use Thread.Sleep(Timeout.Infinite).

then on my sub threads, i use the Thread.Interrupt() to wake up the main thread once the job is finish.

Hope this help.

Thanks.

September 2, 2009 2:03 PM
 

Mark White said:

Surely if you're writting an aynchronous (threaded) script component, then you should set the output buffer to asynchronous as well? I realize it's more work to copy all of the columns, but it does actually improve overall parallelism and it overcomes this bug.

August 3, 2010 4:08 AM
 

AlbertoFerrari said:

Mark,

Yes, you might create 64 asyncronous outputs for your 64 CPU box but, as you might imagine, this is not a very clean solution, since you are tying the package to the box architecture. Nevertheless, the issue can be easily solved using a technique like Yeow described (which is very similar to what I did in TableDifference, I only used a semaphore instead of Sleep(Infinite)). What I complain is that SSIS already had a correct interface to handle multithreaded components and, in order to make things easier for lazy programmers, they broke the architecture and introduced a buggy behavior.

August 3, 2010 4:22 AM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement