THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

How to make text searching go faster

...but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.

I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed strings patterns.

First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and found at a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’, you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sung (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.

But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.

Consider that I’m looking for rows in a table WHERE CommentText LIKE '%Farl%'. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.

For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(120). You can create it using code like this:

CREATE TABLE [Person].[Address_Big](
    [BigAddressID] [int] IDENTITY(1,1) NOT NULL,
    [AddressID] [int] NOT NULL,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL,
    [SpatialLocation] [geography] NULL,
    [rowguid] [uniqueidentifier] NOT NULL,
    [ModifiedDate] [datetime] NOT NULL
);
go

insert Person.Address_Big
select * from Person.Address;
go 100

I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.

select *
from Person.Address_Big
where AddressLine1 like N'%Farl%'
option (maxdop 1);

Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = 'SQL_Latin1_General_CP1_CI_AS'; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.

Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.

Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.

select *
from Person.Address_Big
where AddressLine1 like N'%Farl%' collate Latin1_General_BIN
option (maxdop 1);

I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations [citation needed].

But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.

This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.

Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If there were someone in my table who lived in FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column, but it’s definitely worth considering the benefits of asking for a collation-specific search.

And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.

If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing it away the index order (ORDER BY doesn’t get handled well by changing the collation) means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, then I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

select City, count(*)
from Person.Address_Big
group by City
option (maxdop 1);

select City collate Latin1_General_BIN, count(*)
from Person.Address_Big
group by City collate Latin1_General_BIN
option (maxdop 1);

Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.

TSQL2sDay150x150

@rob_farley

Published Tuesday, March 08, 2016 2:39 PM by Rob Farley
Filed under: , ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Uri Dimant said:

Hi Rob

It is great post, but question is not answered :-))) Waiting for your investigation.

March 7, 2016 11:56 PM
 

Rob Farley said:

You mean the question about why a Hash is 10% (ish) faster in a binary collation?

It's because with a non-binary collation, it needs to make a call to GetSortKey(), so that it can hash a value which is equal for all the things that sort to the same point in that collation. Like how in a CI collation, H and h need to end up in the same hash bucket. In a binary collation, it doesn't have to call GetSortKey(), because things are only equal when they have exactly the same binary value.

I worked this out by asking someone who had access to a debugger which they regularly hook SQL Server into. He's a good friend, and happens to live only a few time zones away from me. (Thanks Paul!)

March 8, 2016 12:19 AM
 

Uri Dimant said:

I am aware of this Rob.

Using binary collation as you pointed out may return incorrect data (not all data found)

March 9, 2016 1:56 AM
 

Rob Farley said:

Yes. If you need to find "Uri" and "URI" and "uri", then you can't use a binary collation. But if it's always stored as "URI", then you can.

March 9, 2016 2:06 AM
 

Saeid Hasani said:

There are many situation that we can use this method such as codes, barcodes, national security codes etc.

http://social.technet.microsoft.com/wiki/contents/articles/31238.t-sql-improve-the-performance-for-like-wildcard-by-changing-the-collation.aspx

March 25, 2016 7:54 PM
 

Rob Farley said:

Yes, Saeid. Lots of situations. Nice article - I wasn't aware of it. I presented this as one of my T-SQL Tips at TechEd AU back in 2008, but it wasn't new then either. Interesting that you refer to Brent's article on Sargability - he refers to one of mine in his, and was in the audience for my SQLBits session on it in 2010, which you can see at http://bit.ly/Sargability

March 25, 2016 8:16 PM
 

Saeid Hasani said:

Thanks Rob for your fast response. By referring to that article , I wanted to confirm your solution. I'm not against Uri's idea. He is a bright person in SQL world. But, this solution has its opponents (like few comments on my article). So, I was become happy when I read your great post. I didn't consider Brent's article references. I downloaded your SQLBits session and I saw it . It is an awesome session. Wish I saw it few years ago! :-) Thank you so much!

March 25, 2016 9:03 PM
 

Rob Farley said:

No worries, and thanks for supporting it.

March 25, 2016 9:12 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

News

News? Haven't you read my blog?

My Company


Can't find something?

Contact Me

IM: rob_farley@hotmail.com
Twitter: @rob_farley
Skype: rob_farley
E: rob_farley@hotmail.com

MVP (SQL Server)




Certifications








Adelaide SQL UG

Privacy Statement