THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

Does SSIS knows how to sort a flow of data? Well… no

Today I ran into a nice misfeature of SSIS that worths sharing.

If I want to use SSIS to sort a flow of data, containing both lowecase and uppercase strings, I would expect a standard behaviour or, at least, something with some “common sense”. As I am going to show, the final result is completely nonsense and will lead to incorrect computations.

Let us start with some information that I remember from the old time of ASCII code. Uppercase letters appears before the lowercase ones in the standard ASCII sequence and hence are sorted before them. Is it still true in the 2009 world with Unicode and all our new and fancy ways of storing strings? It seems so, if you run this simple VB script:

If "A" >= "a" Then
    MsgBox("A is greater or equal than a")
Else
    MsgBox("A is less than a")
End If

You will get the correct result: uppercase comes BEFORE lowercase characters. So, since SSIS sort component does not provide any “consider/ignore case” checkbox, it has two options for sorting strings:

  • It ignores casing and sorts data like SQL does.
  • It adhere to the standard sorting sequence and sorts uppercase BEFORE lowercase.

Well, armed with all this powerful knowledge, we can build a simpe package that receives three rows from this query:

SELECT Name = 'ALBERTO', Surname = 'FERRARI'
UNION ALL
SELECT Name = 'alberto', Surname = 'ferrari'
UNION ALL
SELECT Name = 'alberto', Surname = 'FERRARI'

Then it sends them into a Sort by Name, Surname and add a data viewer after the SORT, just to check what the component does:

image

I have been amazed to look at the data viewer:

image

SSIS believes that lowercase strings should be LESS than uppercase ones and produces a reverse sorted list of rows. This is pretty important because, if after the flow you trust data to be sorted, your system will fail, as it has been the case with some code I received to test just today.

The final lesson is pretty simple: do not trust common sense, for some (very) obscure reason the SSIS team decided to provide “Yet Another Sorting Method” to the IT World, generating some more confusion in the head of the poor SSIS programmer.

Published Monday, April 20, 2009 1:52 PM by AlbertoFerrari

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Uri Dimant said:

Hi Alberto

I think it depends on COLLATION..

SELECT * FROM

(

SELECT Name = 'ALBERTO'COLLATE Latin1_General_CI_AS, Surname = 'FERRARI' COLLATE Latin1_General_CI_AS

UNION ALL

SELECT Name = 'alberto'COLLATE Latin1_General_CI_AS, Surname = 'ferrari' COLLATE Latin1_General_CI_AS

UNION ALL

SELECT Name = 'alberto'COLLATE Latin1_General_CI_AS, Surname = 'FERRARI' COLLATE Latin1_General_CI_AS

) AS D

ORDER BY Name

Also , I see the following test

CREATE TABLE T1

(

 c1 VARCHAR(1) COLLATE Latin1_General_CI_AS,

 c2 VARCHAR(1) COLLATE Latin1_General_CS_AS

)

INSERT T1 VALUES('b', 'b')

INSERT T1 VALUES('B', 'B')

INSERT T1 VALUES('b', 'b')

INSERT T1 VALUES('B', 'B')

INSERT T1 VALUES('a', 'a')

INSERT T1 VALUES('A', 'A')

INSERT T1 VALUES('a', 'a')

INSERT T1 VALUES('A', 'A')

SELECT * FROM T1 ORDER BY c2

c1   c2

---- ----

a    a

a    a

A    A

A    A

b    b

b    b

B    B

B    B

SELECT * FROM T1 ORDER BY c1

April 20, 2009 7:16 AM
 

AlbertoFerrari said:

Uri,

I do agree that, from SQL Server point of view, sorting depends on collation. Nevertheless, when data is in the SSIS flow, it should not depend on SQL Server collating sequence (what if the flow comes from a flat file or another server without a valid collating sequence? I used a query to make it fast, but you can get the same result using a script as a source) but on the standard unicode sorting sequence. This is the reason for I tested "A" and "a" with VB before posting the code.

I think this behaviour is errating from the SSIS point of view but... clearly, I might be wrong. :)

April 20, 2009 7:29 AM
 

Uri Dimant said:

Alberto

Well,I do not use SSIS often and have not seen that before. Perhaps someone from MS can comment on...

April 20, 2009 8:42 AM
 

Alejandro Mesa` said:

Alberto,

I played a little bit setting the "comparison flags" for each column in the sort task to "ignore case", and I got the expected result or aat least different sorting.

Comparing String Data

http://msdn.microsoft.com/en-us/library/ms141038(SQL.90).aspx

AMB

April 20, 2009 10:07 AM
 

AlbertoFerrari said:

Alejandro,

What is great is that I have never seen those flags... blame to me. :) Nevertheless, using "ignore case" on both I get a results that looks more funny, now the lowercase version "alberto ferrari" comes in the middle.

Nevertheless, now SSIS behaves correctly, adding a third field (Age) it correctly ignores the case of the second column and sorts accordingly to the Age, which is definitely the expected result.

Thanks a lot for pointing me in the right direction.

Alberto

April 20, 2009 10:20 AM
 

AlbertoFerrari said:

Alejandro,

I was about to edit my post in order to tell that a solution exists and the post is somehow wrong... but I still think that, if "ignore case" is not set, then SSIS sorts rows in the wrong order, for what concern casing.

I'm very open to any comment about it.

April 20, 2009 10:27 AM
 

Alejandro Mesa said:

Alberto,

I agree with you that it is not intuitive and some how confusing. The comparison rules are driven by the settings of LocaleID and comparison rules.

AMB

April 20, 2009 11:08 AM
 

Alejandro Mesa said:

Correction:

The comparison rules are driven by the settings of LocaleID and comparison flags.

AMB

April 20, 2009 11:39 AM
 

sri said:

hi alberto,

i was puzzled by this same behavior while working on a data import job last week. i am using sql server 2008 with sp1. i submitted this to connect.

sri

April 20, 2009 3:45 PM
 

AlbertoFerrari said:

Sri, if you provide us the URL of the connect file, we can vote for it. :)

Thanks.

Alberto

April 20, 2009 3:49 PM
 

Josh Robinson said:

We actually ran into this issue this afternoon, ironically we found it while using it as an import to your TableDiff component, Alberto.  One of the guys on the team spent all afternoon trying to figure out why Table Diff wasn't working correctly and finally realized that his T-SQL sort on one side and SSIS sort on the other were producing a different sort order.  No matter what we chose from comparison flags, we still couldn't get them to sort the same.  Very strange. . .

April 22, 2009 5:30 PM
 

Dan said:

Same issue here, cannot get the SQL sorting to match the SSIS sort shape (Windows Vista OS):

SELECT N'AAA-111-AA' COLLATE Latin1_General_CI_AS AS Txt

UNION ALL

SELECT N'AAA111-AA' COLLATE Latin1_General_CI_AS

ORDER BY Txt

SQL 2005 sorts 'AAA-111-AA' first. The SSIS sort shape will do the reverse.

May 19, 2009 4:50 PM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement