
Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Sort transform arbitration [SSIS]

This post was prompted by a thread on the MSDN SSIS forum today where the poster was asking how he could replicate the behaviour of SSIS's Sort transform using T-SQL; specifically, he wanted to know how the Sort transform chooses which data to pass through when the "Remove Duplicates" option is checked.

Let me give a bit more background. Take the following dataset:

[Image: the source dataset, with columns [Mentor] and [Mentee]]

Note how we have repeated values (‘Simon’ & ‘Louis’) in the [Mentor] column.

Now let's pass that data through SSIS's Sort transform, sorting on [Mentor] and choosing to "Remove rows with duplicate sort values":

[Image: the Sort transform editor, sorting on [Mentor] with "Remove rows with duplicate sort values" checked]

Notice how [Mentee] is a Pass Through column.

When we execute this we get the following output from the Sort transform:

[Image: the Sort transform's output, de-duplicated on [Mentor]]

Notice anything? Two of our [Mentee] values have disappeared, namely 'Jedward' and 'Danyl'. The Sort transform has removed duplicates from [Mentor] and arbitrarily picked which value to pass through for [Mentee].

 

Speaking personally, I see no value at all in this arbitrary behaviour. Let me clarify that statement: the ability to remove duplicates is very, very valuable, but the practice of passing through an arbitrary value along with the de-duplicated values is pointless. The poster on the forum was asking how he could replicate this behaviour in T-SQL and I had to tell him (a) you can't, because we don't know what algorithm the Sort transform is using to choose its arbitrary value, and (b) why would you even want to? I don't know of any business rule that would value being able to say "Just give me any value back, I don't care which".
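For what it's worth, the closest T-SQL approximation I can think of makes the arbitrariness explicit by using ROW_NUMBER() with a deliberately meaningless tiebreak. This is only a sketch - [MentorMentee] is a hypothetical table holding the example data above:

WITH ranked AS
(
   SELECT [Mentor], [Mentee],
          ROW_NUMBER() OVER (PARTITION BY [Mentor] ORDER BY (SELECT NULL)) AS rn -- tiebreak is deliberately undefined
   FROM [MentorMentee] -- hypothetical table holding the example data
)
SELECT [Mentor], [Mentee]
FROM ranked
WHERE rn = 1;

Even this isn't guaranteed to match the Sort transform's choice - it just picks *some* row per [Mentor], which is exactly the problem.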

In my opinion the Sort transform should remove Pass Through columns when the “Remove rows with duplicate sort values” option is checked. They are pointless and thus may be confusing for new users of the product.

What do you think? Agree or disagree? Let me know in the comments!

@Jamiet

P.S. Apologies to the non-UK residents who are oblivious to the pop-culture references in this blog post! ;)

Published Thursday, November 12, 2009 8:40 PM by jamiet


Comments

 

Todd McDermid said:

I completely agree, Jamie - you can't do it with T-SQL's DISTINCT, because it just doesn't make sense. In fact, I'd go further and say that the "distinct" option on the Sort should be removed entirely! The better way to get that information would be to use the Aggregate component - but that's just not possible now, as the Aggregate doesn't support the kind of functions you'd need (min/max on strings) or some of the functions you'd want (first/last/first non-NULL/...). I have a Connect issue open on that - #311852.

November 12, 2009 3:18 PM
 

Rich said:

Well, at least it got rid of Jedward :)

So far I've avoided sorting in SSIS, doing it in stored procedure sources instead, but I'm sure it won't be long.

November 12, 2009 3:44 PM
 

Brandon Reno said:

Just the other day I came across a situation where this would have been useful (although we decided to perform the reduction in SQL).

Basically, we had a dataset along the lines of:

US United States

AN Netherlands Antilles

AN Antilles

...

...

There was a surrogate key on the data, and for a data warehouse load we were bringing over the country codes as used in this source system.

Now, obviously the source data needs to be cleaned in this case, and it will be fixed (outside the scope of this project), but at the end of the day the system allows bad data to be entered, and we need to account for this possibility on the way into the warehouse in order for the package to be robust.

Since it really didn't matter to us what the "Formatted" name came out as, we selected the one that came first alphabetically.

The ability to clean up bad data is important for SSIS and should not be gimped. That being said, I never would have considered looking in the Sort transformation for this feature; it makes more sense to me to use the Aggregate.

For those wondering, the query looked something like this:

WITH countries AS
(
   SELECT abbr, name,
          ROW_NUMBER() OVER (PARTITION BY abbr ORDER BY name) AS rownum
   FROM src_country_table
)
SELECT abbr, name FROM countries WHERE rownum = 1

Possibly a little overkill, but I *hate* putting subqueries in the SELECT clause.

November 12, 2009 3:45 PM
 

Brandon Reno said:

Forgot to mention: that query requires SQL Server 2005 or later

November 12, 2009 3:49 PM
 

Gil said:

I think passing through the other columns is a useful feature, but the user should have to specify which row to select (perhaps by marking a set of columns to order by and whether to take the first or last one). That way it would be predictable (unless there were ties :-).

November 12, 2009 7:16 PM
 

Duane Douglas said:

Well, it makes sense when sorting on all of the columns.

November 13, 2009 1:32 AM
 

jamiet said:

Gil,

Yes, and that is my whole point. If the user wants to see some values for those "passed through" columns then they *should* have to specify which row to select - and they can do perfectly well today using the existing components. I think you have just proved my point :)

Duane,

If you're sorting on all of the columns then there aren't any pass-through columns, so I don't really understand your point there, sorry.

-Jamie

November 13, 2009 1:45 AM
 

Kristian Wedberg said:

* Your nice example shows SSIS picking the first duplicated record to pass through in both cases. Have you ever seen it pass through anything _but_ the first duplicated record? If not, then maybe which record is passed through isn't random (which would be good)?

* Either way, I think it's correct to pass through all the non-sorted columns, since in some cases I will know that all duplicates have identical pass-through columns, and I want to use those columns downstream.

November 13, 2009 3:02 AM
 

Michael Riedmüller said:

Hi,

Even if it seems that it doesn't make sense, we can simply group our data by the sorted column and return the maximum or minimum of the pass-through column(s). As we require an arbitrary value, any one will do.

This should work in any SQL.
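To make that concrete, here is a minimal sketch of the GROUP BY approach using the example data from the post ([MentorMentee] is an assumed table name; MIN() is chosen here, but MAX() would do equally well):

SELECT [Mentor], MIN([Mentee]) AS [Mentee] -- any deterministic aggregate will do
FROM [MentorMentee] -- assumed table holding the example data
GROUP BY [Mentor];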

- Michael

November 13, 2009 3:03 AM
 

jamiet said:

Hi Kristian,

To be honest I've never taken the time to observe which row gets passed through - mainly because before writing this blog post I'd never used pass-through columns (for the reasons I've outlined in the blog post).

I kinda go by the school of thought that if there is no documentation telling us what we *will* get through 100% of the time then we shouldn't make any assumptions about it.

If I *do* want some of those columns coming through then I'll make sure I know exactly from which row I'm going to get them - and you can't do that using the Sort transform.

"all duplicates will have identical pass through columns"

If that's the case then shouldn't they be part of the sort key?

Hi Michael,

"even if it seems that it doesn't make sense, we can simply group our data by the sorted column and return the maximum or minimum of the pass-through column(s). As we require an arbitrary value any one will do."

Yes, very true - and you would use the Aggregate transform to do that. That is my point. When would we ever require an arbitrary value? As yet no-one has answered that question, and I am convinced that the answer is "never" (I'm happy to be proved wrong).

If you use an Aggregate then it wouldn't be an arbitrary value - it would be one that you pick!!

Thanks for the comments everyone, I remain unconvinced :)

-Jamie

November 13, 2009 5:05 AM
 

Siddharth Mehta said:

I do not feel that the pass-through columns should be removed from the stream; it should be completely at the discretion of the developer or user. These pass-through columns can be of value for debugging purposes, as they extend the detail of the record, even though they are less digestible from a business-rule implementation perspective.

November 13, 2009 6:21 AM
 

Gil said:

Jamie,

Which existing component lets you remove duplicates and determine which rows to pass through and keep additional columns (for example, columns from the row with the lowest InsertedDateTime for each group)?

November 13, 2009 10:27 PM
 

jamiet said:

Hi Gil,

You could use the Aggregate component to find out the lowest InsertedDateTime for each combination of key columns. Then, join the output from the Aggregate back to the original data, joining on the key columns and InsertedDateTime.
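In T-SQL terms, that pattern would look something like this (a sketch only; [SourceTable], [KeyCol] and [InsertedDateTime] are assumed names):

SELECT s.*
FROM [SourceTable] AS s
INNER JOIN
(
   SELECT [KeyCol], MIN([InsertedDateTime]) AS MinInserted
   FROM [SourceTable]
   GROUP BY [KeyCol]
) AS agg
   ON s.[KeyCol] = agg.[KeyCol]
  AND s.[InsertedDateTime] = agg.MinInserted; -- keeps the row(s) with the lowest InsertedDateTime per key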

-Jamie

November 14, 2009 3:41 AM
 

Gil said:

Jamie,

Wouldn't it be nice if that functionality were packaged into a single component?

November 14, 2009 11:32 AM
 

jamiet said:

Gil,

Yes, that would be a great candidate for a custom component.

-Jamie

November 14, 2009 11:52 AM
 

Brian said:

I came across a similar requirement recently. I've loaded dim_product from a flat file. When the product is entered, its attributes will be different throughout the day due to various factors. But the point is, at the end of the day we want to keep the last record and no other. So, after testing, I found the following data flow configuration to work satisfactorily: Flat File Source > Sort [Desc by Product Code, Timestamp, Remove Dups] (this places the newest record at the bottom of the set by time entered) > Sort [Asc/Desc (either) by Product Code, Remove Dups] > etc. Long story short, I found that the Sort transform keeps the bottom-most record of a dataset - and in my case, the "newest" record.
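If you'd rather not rely on the Sort transform keeping the bottom-most record (which, as discussed above, is undocumented behaviour), the deterministic T-SQL equivalent of "keep the newest record per product" would be something like this sketch ([dim_product_staging], [ProductCode] and [Timestamp] are assumed names):

WITH ranked AS
(
   SELECT [ProductCode], [Timestamp], -- plus the other product attributes
          ROW_NUMBER() OVER (PARTITION BY [ProductCode] ORDER BY [Timestamp] DESC) AS rn
   FROM [dim_product_staging] -- assumed staging table
)
SELECT [ProductCode], [Timestamp] -- plus the other product attributes
FROM ranked
WHERE rn = 1; -- exactly the newest record per product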

December 3, 2009 3:34 PM
 

Michael said:

Out of interest, can anyone tell me how the SSIS Sort component actually works?  I can't find any information on the inner workings of the component, i.e. the algorithm used or the way it buffers the data.  The answer to this question may actually provide insight to the way it selects the duplicates to keep/drop...

I'm interested as the sort component is REALLY bad at handling wide datasets, yet it appears to be fairly decent with narrow ones.  Would be great to know what it's actually doing underneath.

January 14, 2010 8:05 AM
 

Sid Mohapatra said:

Suppose you use the Sort transform to sort and de-duplicate on multiple columns, one of the columns is, say, a numeric column, and you want to get the record that has the highest (or lowest) value in that numeric column. You also need the values in the pass-through columns.

Sorting first on the numeric column can get you the appropriate record you are looking for, including all values in the pass-through columns.

The Sort transform should return the record with the highest (or lowest) value depending on how the results are sorted... which you would have to try out to find out.

October 17, 2012 11:25 AM
 

II ARROWS said:

QUOTE:

I don’t know of any business rule that would value being able to say “Just give me any value back, I don’t care which”.

I do... and they keep telling me that...

July 18, 2013 10:52 AM

