This post was prompted by a thread on the MSDN SSIS forum today where the poster was asking how he could replicate the behaviour of SSIS’s Sort transform using T-SQL, specifically he wanted to know how the Sort transform chooses what data to pass through when the “Remove Duplicates” option is checked.
Let me give a bit more background. Take the following dataset:
Note how we have repeated values (‘Simon’ & ‘Louis’) in the [Mentor] column.
Now let’s pass that data through SSIS’s Sort transform which we will sort on [Mentor] and choose to “Remove rows with duplicate sort values”:
Notice how [Mentee] is a Pass Through column.
When we execute this then we get the following output from the Sort Transform:
Notice anything? Two of our [Mentee] values have disappeared, namely ‘Jedward’ and ‘Danyl’. The Sort Transform has removed duplicates from [Mentor] and arbitrarily picked which value it should pass through for [Mentee].
Speaking personally I see no value at all in this arbitrary behaviour. Let me clarify that statement, the ability to remove duplicates is very very valuable but the practise of passing through an arbitrary value along with the de-duplicated values is pointless. The poster on the forum was asking how he could replicate this behaviour in T-SQL and I had to tell him (a) You can’t, because we don’t know what algorithm the Sort transform is using to choose its arbitrary value and (b) why would you even want to? I don’t know of any business rule that would value being able to say “Just give me any value back, I don’t care which”.
In my opinion the Sort transform should remove Pass Through columns when the “Remove rows with duplicate sort values” option is checked. They are pointless and thus may be confusing for new users of the product.
What do you think? Agree or disagree? Let me know in the comments!
P.S. Apologies to the none-UK residents that are oblivious to the pop-culture references in this blog post! ;)