THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Linchi Shea

Checking out SQL Server via empirical data points

String comparison: binary vs. dictionary

It is well known that you get better performance if you compare two strings in a binary collation than in a dictionary collation. But is the performance difference significant enough to warrant the use of an explicit COLLATE clause in the string comparison expression? That was a question came up in a conversation I had with a colleague recently.

 

To get a feel for the extent of the performance difference, I ran several tests comparing two commonly-used collations:

 

·      Latin1_General_BIN, and

·      SQL_Latin1_General_CP1_CI_AS

 

More specifically, I ran the following T-SQL script in a loop for 100 times:

 

DECLARE @s1 varchar(max), @s2 varchar(max), @i int

DECLARE @dummy int, @dt datetime

DECLARE @length int

 

SET @length = 1000000  -- or 100, or 1000000

 

SELECT @s1 = REPLICATE(CAST('a' as varchar(max)), @length)

SELECT @s2 = REPLICATE(CAST('a' as varchar(max)), @length)

 

SELECT @i = 1, @dt = GETDATE()

    WHILE @i < 1000

    BEGIN

       IF @s1 = @s2 COLLATE SQL_Latin1_General_CP1_CI_AS

       --IF @s1 = @s2 COLLATE Latin1_General_BIN

            SET @dummy =0

 

       SET @i = @i + 1

    END

SELECT DATEDIFF(ms, @dt, GETDATE())

 

The script was run in a number of scenarios:

 

·      The @length variable was set to 100, 10,000, and 1,000,000 to see the impact of the string length on the comparison method,

·      The string comparison was done in the IF clause with either the Latin1_General_BIN collation or the SQL_Latin1_General_CP1_CI_AS collation

 

The following table summarizes the test results. The value in each cell is the average elapsed time of the WHILE loop in the above script running 100 times for the corresponding test scenario. All values are in milliseconds.

 

 

String length (bytes)

 

100

10,000

1,000,000

Latin1_General_BIN

~86

~92

~3,104

SQL_Latin1_General_CP1_CI_AS

~89

~185

~20,516

 

The most salient point to note in the table is that as the string length increases, the performance difference between the two comparison methods increases. While there was no performance difference when the strings were 100 bytes long, the binary comparison was about twice as fast as the dictionary comparison when the string length was 10,000 bytes. When the string length increased to 1,000,000, the binary comparison was about seven times faster.

 

So, if you are comparing two short strings, you probably shouldn’t bother to explicitly cast the comparison into a binary collation. But when the strings can be very long, it can result in a huge performance improvement to explicitly specify a binary collation for string comparison.

 

Note that string comparison in COLLATE Latin1_General_BIN and string comparison in COLLATE SQL_Latin1_General_CP1_CI_AS have different semantics. But there are many cases where you have control over your data, and the difference in comparison semantics does not matter (e.g. when you know your data is all in lower case).

 

Nothing is particularly new here. Just want to contribute some data points to the community.

 

Published Saturday, January 31, 2009 11:46 PM by Linchi Shea
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

About Linchi Shea

Checking out SQL Server via empirical data points

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement