THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Louis Davidson

Sorting and case sensitive collations

This is an addition to my book, on page 255, to the end of the section on collation (before the statement that I won't delve any deeper into collations, of course :)), I want to add the following bit of information about collations and how they affect sorting and searching. 

You can download the entire addendum with source code from the:
Pro SQL Server 2005 Database Design and Optimization Homepage

Sort and Searching

One of the more confusing parts of collation is how they affect the ordering of data when sorted and how that can affect other operations like the LIKE operator.  When it comes to case insensitive collations, it is pretty clear, data is sorted as if 'A' and 'a', are the same character.  The other common situation is accent insensitivity where 'a' and 'ấ' appear as the same character.  When you are using a case and accent insensitive collation there will be no guarantee when sorting data that either of these characters would come first in the list.  The great part about this is that when you search for where like '%A%', you know that you will get back: aardvark, Abel, Penấ with no worries. 

In some situations, this is not desirable to have this be the case, and you set up a column, table, or database to be case sensitive and accent sensitive.  This is where sometimes you can get confused when using between, like, or other range type queries on characters.

As an example, let me create the following table and seed it with some characters:

create table dbo.TestSorting
(
     value nvarchar(1) collate Latin1_General_CI_AI
)
insert into dbo.TestSorting
select 'A'
union all
select 'a'
union all
select nchar(256) --Ā
union all
select 'b'
union all
select 'B'

Doing a normal case insensitive search using the base collation:

select value
from dbo.TestSorting
where value like '[A-Z]%'

All rows are returned, as expected:

value
-----
A
a
Ā
b
B

The problem comes in when you want to do a case sensitive search.  You choose a case sensitive collation, and either use it in the WHERE clause or in your table declaration, and:

select value
from dbo.TestSorting
where value like '[A-Z]%' collate Latin1_General_CS_AI

This returns what seems a confusing set:

value
-----
A
Ā
b
B

Everyone who does this the first time thinks "Why?" And I did too (in fact it was why I wrote this section of the book!)  The key is to kook at how the values sort in this collation:

select value
from dbo.TestSorting
order by value collate Latin1_General_CS_AI

value
-----
a
Ā
A
b
B

Case sensitive sorts things aAbBcC, not, as most of us programmers expect intuitively: ABCabc. That is actually how a binary collation works, since it sorts based on the ASCII or Unicode value of the character.  Using a binary collation would work, but then you lose the ability to include the special characters in your search which is given to you by using accent insensitivity.

Instead, you can simply use a LIKE such as this:

select value
from dbo.TestSorting
where value like '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]%' collate Latin1_General_CS_AI

This returns only the capital letters including the accented capital letter I included.

value
-----
A
Ā
B

It is a good idea to take a look at the sorting of data when you choose a collation to make sure it is clear to you how data will be sorted.

First in the data you have in your table, like this, if you want to see how it will sort in a binary collation:

select value
from dbo.TestSorting
order by value collate Latin1_General_bin

value
-----
A
B
a
b
Ā


And then I have a query I use for this purpose that I can use to generate the characters I am interested and see how they sort

;with digits (i) as( select 1 as i union all select 2 as i union all select 3 union all
                           select 4 union all select 5 union all select 6 union all select 7 union all
                           select 8 union all select 9 union all select 0)
,sequence (i) as (
                           SELECT D1.i + (10*D2.i) + (100*D3.i) + (1000*D4.i) + (10000*D5.i)
                           FROM digits AS D1 CROSS JOIN digits AS D2 CROSS JOIN digits AS D3 
                                        CROSS JOIN digits AS D4 CROSS JOIN digits as D5
)
select i, nchar(i) as character
from sequence
where i between 48 and 122 --vary to include any characters in the character set of choice
order by nchar(i) collate Latin1_General_bin

Crossposted to http://drsql.spaces.live.com

Published Sunday, May 20, 2007 6:16 PM by drsql

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

duncan said:

you're awesome.  thanks mate

June 12, 2008 12:42 PM
 

greg said:

Lou, Do you have a link to machine locales that default in SQL as case-sensitive? I mean has anyone bothered to run them all to see what defaults as CS? Here is the current mappings link offered by Msft => http://msdn.microsoft.com/en-us/library/ms180175.aspx

January 29, 2010 1:07 PM
 

Senthilprabhu said:

Nice one.

but i need to sort lower case first then upper case.

May 14, 2013 6:57 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Links to my other sites

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement