THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

TableDifference version 2.0 has been released

Table difference is an SSIS custom component designed to simplify the management of slowly changing dimensions and – in general – to check the differences between two tables or data flow with SSIS.

Its main advantage over Microsoft standard SCD component is its speed and ease of use, the component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.

The project is freeware, full source code is available at www.sqlbi.com.

Note to version 1.x users

The internal structure of the metadata of TableDifference is very different from the previous one. The component is able to read metadata from the previous version and write them with the new format as soon as it is called from inside BIDS so you will be able to reload all the previous metadata without loss of functionality.

However, in writing the new metadata, the component changes the lineage IDs of all the output columns so, when you open a package that contains a previous version of TableDifference, you will need to open its designer to check that everything worked fine and the correct the subsequent flows by double clicking on the components that use TableDifference outputs. I have converted a lot of packages without any problems but, if you encounter something strange, do not hesitate to contact me.

Introduction

One question that arises very often in Data Warehouse programming is “what are the differences between these two tables/flows of data?” It is often the case in SCD management; you receive several millions of customers and should decide what changes need to be done after your last successful load.

Using SSIS we have the SCD component, it works but it does it slow that in the production environment it is quite always convenient to create an “ad hoc” solution to handle the SCD.

We decided to write an SSIS component that has two inputs, one for the “old” and one for the “new” data, compares all the rows that come in from the old and the new flow and sends them to different outputs, namely:

  • Unchanged rows (are the same in both input)
  • Deleted rows (appear in old but not in new)
  • New rows (appear in new but not in old)
  • Updated rows (appear in both flows but something is changed)

The following picture illustrates the component functionality better than thousand words.

clip_image002

As not every update is to be handled the same way, it is possible to tag each column with an integer; the number of updated outputs is determined by the number of tags, one for each, in this way you can tag with “10” the historical attributes, with “20” other columns and decide what kind of operation to carry on with the different updates.

The inputs MUST be sorted and have a collection of fields (keys) that let the component decide when two rows from the inputs represent the same row, but this is easily accomplished by SQL Server with a simple “order by” and a convenient index; moreover the SCD do normally maintain an index by the business key, so the sorting requirement is easily accomplished and do not represent a problem.

The main structure

The structure of the component is pretty simple:

  • Old Flow Input: it should be attached to a convenient query that returns all the current rows from the SCD
  • New Flow Input: it can be connected to the flow where the new structure of the SCD has been computed.

Each input has a buffer where all the incoming rows are directed and maintained.

A separate thread starts as soon as data from both buffers is available and checks for the differences between them, sending the rows to the appropriate output and then removing the buffered rows from the buffers.

The buffer does not need to maintain all the data from the SCD, data is removed as soon as it can be compared with a corresponding row: the buffer, even for a several million record table, should be pretty small (but read the document to the end… some tricks may be useful to avoid memory consumption).

The outputs

The component has three standard outputs for new, deleted and unchanged rows. The number of updated rows output is determined by the user via the component editor, one output is provided for each different updateID that is inserted by the user. The outputs are named “UpdateID” followed by the updateID defined by the user.

Of course, you can easily change the name of the output to something more interesting like “Updated historical attribute” and we encourage you to do so!

The collection of output fields is computed by the component via the intersection of the two inputs: if one column appears in only one input it is not managed, if it appears in both then it will be compared and outputted.

Installing TableDifference

Installation is very simple:

· copy TableDifference.DLL into

“<Program Files>\Microsoft SQL Server\90\DTS\PipelineComponents”

· add the DLL to the GAC using “GACUTIL –I TableDifference.DLL”

No installer for the moment… sorry.

Note for Vista users: you need to run these command on a command line opened with administrator privileges to perform the tasks.

Using TableDifference

After the component is installed you should add it to the toolbar as for every SSIS component and then you are ready to use it.

After both inputs has been attached, by double clicking on the component, you reach the component editor that shows a single window like this:

clip_image004

A few words about what is happening just before the component editor pops up:

  1. TableDifference analyzes all the columns in both inputs and compares their names. If the name of two columns and their corresponding types are identical, TableDifference adds them to the available columns to manage.
  2. If the flows are sorted, their sort columns will be marked as key fields, using the same order in which they appear in the sort.
  3. All other columns are assigned a standard Update ID of 10 and are managed as comparable columns.

Using the component editor you need to provide these information for the columns:

  • Check Option: you can choose the column type between:
    • Key field: these column will be used to detect when two rows from the inputs represent the same row. Beware that the inputs must be sorted by those columns
    • Compare: these columns will be compared one by one to detect differences
    • Prefer NEW: these columns will be copied from the NEW input directly into the output, no check
    • Prefer OLD: these columns will be copied from the OLD input directly into the output, no check
  • KeyOrder: If a column is of type “Key Field” it is the order under which the field appear under the “order by” clause of your query. Beware that the component do not check for the correct sorting sequence, it is up to you to provide this information.
  • Update ID: each different UpdateID creates a different output flow. If you need to detect when a change appears in some column you can use different update ID. Beware that the lowest update ID wins, i.e. if AccountNumber has update id of 10 and AddressLine1 has update id of 20, then Accountnumber will be checked first and if a change is detected, the row will go to update output 10, no matter if AddressLine has a difference.

Clicking “OK” is enough for the component to generate the outputs and to define the metadata for all the outputs.

In version 2.0 there are a three new panels:

Outputs panel

In this panel you can choose which output to enable. If you are not interested, for example, in the unchanged output, then you can deselect it from this panel to avoid warnings for unused columns.

clip_image006

You can also rename outputs and provide a concise description of them. Renaming is very useful as it gives the data flow clearness.

Output Details

This panel let you select the columns for each output. You cannot add any column but you can disable columns for outputs that do not use them.

clip_image008

The upper combo box let you choose an output, in the grid you can select or unselect any column. This feature is useful as it avoids warnings for unused columns. In the picture, the deleted output will receive only the customer key as it will make no use of any other information.

Always remember that removing useless output columns increases the performances of the component.

Misc Options

Under this pane there are miscellaneous options.

clip_image010

Here you can define, for string comparisons:

  1. The culture ID to use to perform the comparison. If not specified TableDifference will use the culture ID of the running task. The default is “empty”.
  2. If you want it to ignore casing during string comparisons. The default is unchecked so TableDifference will perform comparison considering case.

Warnings

This panel will list any unused column from the input. As you might recall, if two columns are not identical regarding name and type, TableDifference will ignore them. This might be an error but the decision is up to you. By checking the warnings panel you can see if TableDifference is working with all the columns you need it to compare.

FlowSync

The component works fine and do not consume memory if and only if the input from both flows come in at a synchronized speed: if one input is much faster than the other then the component will start buffering data and consume memory that will be freed only when the buffer starts to shrink.

Using the component to compare two tables with 5 millions of records we had several problems with memory, because data came in from one buffer much faster than the other and, after the difference (and hence the buffer on one input) reached 1.5 millions of records, the whole memory of the DtExec process (in a 32 bit virtual space) was filled in.

As both inputs were from a simple “Select * From” and the speed from both lines was the same, we discovered that – for some obscure reason – SSIS prefers one input to the other and do not leave enough time to both tasks, resulting in memory consumption by the component.

FlowSync is a component that will make two or more flows of data in an SSIS data flow package run at the same speed, by stopping one flow if the others run too slow. It has been created as a convenient companion to TableDifference to resolve the problems with memory occupation, it can be used by its own in the case where you want flows to run at the same speed.

It makes use of semaphores to handle synchronization, so no CPU is ever wasted, when the faster flows is stopped all the CPU is free for others (more useful) processes.

You can find source code, executable and description of FlowSync at www.sqlbi.com. We normally use flowsync when we need to compare more than half a million rows while we use TableDifference without flow sync for all the smaller tables in a project.

Published Friday, February 22, 2008 2:45 PM by AlbertoFerrari

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Carol Dennis said:

I am new to SQL Server and found this very interesting

April 6, 2009 7:21 PM
 

Fredrik Gunnarsson said:

I like this component bigtime. When does it comming for SQL Server 2008? (SSIS 2008)

May 24, 2010 7:35 AM
 

AlbertoFerrari said:

Fredrik,

Since January 1st of 2010, TableDifference is now owned and supported by CozyRoc (www.cozyroc.com) as part of their SSIS+ set of tools.

I am currently using their library and all the SQL engines are supported, up to SQL2008R2 without any problem.

Please send any question about TableDifference directly to them, since I can no longer provide any information about this component.

May 24, 2010 7:43 AM
 

Rajesh said:

Hi ,

I facing an issue with TableDifference Component in our PROD machine.

Unable to get the object tabledifference component during runtime.As per the design of the interface this component handles 25 columns out of which 23 columns are used for comparison and 2 columns are used for getting old data. This kind of error donot occur regurlarly but twice or thrice in a week. Can you please help me on this issue. Below is the error message we get when the job fails.

Message

Executed as user: EUROPE\ssi...035.00 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  7:00:01 PM  Error: 2010-07-21 19:22:23.25     Code: 0x00000000     Source: DFT_STG_SLO TableDifference     Description: Object reference not set to an instance of an object.  End Error  Error: 2010-07-21 19:22:23.47     Code: 0xC0047062     Source: DFT_STG_SLO TableDifference [14846]     Description: System.Exception: Consumer Exception ---> System.NullReferenceException: Object reference not set to an instance of an object.     at Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSManagedComponentWrapper90.AddBLOBData(IDTSBuffer90 pIDTSBuffer, Int32 hRow, Int32 hCol, Byte[]& ppsaData)     at Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer.AddBlobData(Int32 columnIndex, Byte[] data)     at SqlBi.SSIS.Components.TableDifference.OutputInformation.AddRowToOutputBuffer(Row row)     at SqlBi.SSIS.Components.TableDifference.processB...  The package execution fa...  The step failed.

July 23, 2010 12:46 AM
 

AlbertoFerrari said:

Raiesh,

The issue you are facing is a pretty old bug in TableDifference, that has already been solved. That said, I can no longer support it, since TableDifference is now a CozyRoc product. so I think your best option is to contact them and receive their latest release, which should solve your problem.

Alberto

July 23, 2010 3:20 AM
 

Rajesh said:

Thank you very much Alberto. Noted your suggestion.

Thanks again

August 20, 2010 1:15 AM
 

Rajesh said:

Hi Alberto,

Thank you very much for your suggestion and help. One more doubt is once if the new version is installed do we need to make any changes in our existing package?

Thank you

Rajesh

August 20, 2010 1:53 AM
 

Michael said:

Hello Alberto,

nice work. :)

So as the Component moved to CozyRoc is it posible to get the source of the last available free version?

I am just interested in it for learning.

Cheers

Michael

September 8, 2010 11:08 AM
 

AlbertoFerrari said:

Sorry michael, from january 1st I can no longer distribute the sources.

Please refer to cozyroc for any kind of assistance.

Alberto.

September 8, 2010 6:21 PM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement