THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SQLBI - Marco Russo

SQLBI is a blog dedicated to building Business Intelligence solutions with SQL Server.
You can follow me on Twitter: @marcorus

How to get Distinct Count in SSIS

Almost one month ago I made a post about a brand new Distinct component for SSIS made by Alberto Ferrari and freely available on SQLBI.EU.

The most used way to get distinct rows in a Data Flow  that I've seen is to use the Sort component, asking only for distinct rows (there is a check box for that in the standard edit dialog box). The Sort component is a fully blocking one, because it requires to stop the flow until all the rows are read from the source. The Distinct component is not blocking, because it does not stop the flow and internally use hash functions and other structures to maintain a list of the alreay emitted rows. A previous post from Jamie Thomson describes pretty well pro and cons of the Sort component and raise the need for a better solution.

A couple of weeks ago, I realized that there is another option available. Using the Aggregate component, you can select all the fields you want in output (even all the columns) and select "Group By" as operations on them. Simply that. Not so intuitive to get a Distinct in such a way, but if you want performance... this seems the way to go.

From tests we made, using Aggregate component is the faster way to get a Distinct and it also doesn't consume as much memory as the Sort component. Unfortunately, also the Aggregate is a fully blocking component. The following picture shows the debug of a sample package with the three components: the Distinct is not a blocking one, while the Aggregate is blocking but, once it received all the rows, it produces output very fast, while the Distinct is still flushing the queue. The Sort start the sorting operation after he received all the rows and is still sorting at the moment of picture snapshot.

 

Now we have three options to get a Distinct:

  • Aggregate component: it is standard, it is fast, it is fully blocking
  • Distinct component: it needs a separate install, it is fast but sometimes not as fast as the Aggregate, it is partially blocking
  • Sort component: it is standard, it is the slowest, it is fully blocking, it also sort the data

I think that in general we can use the Aggregate component, just because it does not require a separate deployment of the Distinct component. However, when the fully blocking behavior is not desirable, the Distinct component may worth the time to deploy/install another data flow component.

Published Friday, March 09, 2007 9:06 PM by Marco Russo (SQLBI)
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

About Marco Russo (SQLBI)

Marco Russo is a consultant, writer and trainer specialized in Business Intelligence with Microsoft technologies. He runs the SQLBI.COM website, which is dedicated to distribute resources useful for BI developers, like Integration Services components, Analysis Services models, tools, technical information and so on. Marco is certified as MCT, MCDBA, MCSD.NET, MCSA, MCSE+I.

This Blog

Syndication

Archives

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement