THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

SQL Server Query Processing Puzzle: LIKE vs ?

How creative are you with manipulating your queries to produce more efficient plans? Try the following puzzle and e-mail your solution to me at [<do_not_mail> @ do_not_mail.com]. Make sure to include an explanation of why it works, as well as your mailing address. The best two solutions/explanations win a free copy of Expert SQL Server 2005 Development, a wonderful feeling of accomplishment, plus eternal fame and glory when I reveal your solutions here on the blog.

Run the following T-SQL to create two tables in TempDB:

USE TempDB
GO

CREATE TABLE b1 (blat1 CHAR(5) NOT NULL)
CREATE TABLE b2 (blat2 VARCHAR(200) NOT NULL)
GO

INSERT b1
SELECT LEFT(AddressLine1, 5) AS blat1
FROM AdventureWorks.Person.Address

INSERT b2
SELECT AddressLine1 AS blat2
FROM AdventureWorks.Person.Address
GO
Now consider the following query:
SELECT *
FROM b1
JOIN b2 ON
b2.blat2 LIKE b1.blat1 + '%'

This query takes around three minutes to run on my notebook, and does over 1.8 million logical reads. Can you figure out a way to re-write it so that it performs better? No modification of the base tables or addition of any other objects is allowed (sorry, no indexed views!) -- the challenge is to tune this by doing nothing more than re-writing the query.

Good luck! I'll leave the contest open for submissions until May 1.

Published Tuesday, April 22, 2008 11:18 AM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Adam Machanic said:

To people responding: Don't just send me a solution.  Explain to me WHY it works.  Why does your change allow the query processor to work more efficiently?  The goal of this challenge is not just to modify a query and happen upon a faster version, but rather to think about the reasons behind it and gain a deeper understanding of the QP.  Enjoy!

April 22, 2008 1:05 PM
 

Alexander Kuznetsov said:

Hi Paul,

Why did you post your solution here instead of e-mailing it to Adam? Don't you think it might have skewed the contest just a little bit?

April 23, 2008 10:55 AM
 

Adam Machanic said:

Paul D., next time think before you post.

April 23, 2008 11:37 AM
 

david wei said:

Adam, do you think change blat1 to VARCHAR(5) will make this more challenging?  People may take the advantage of CHAR(5). :-)

April 23, 2008 2:17 PM
 

Adam Machanic said:

David:

If you want more of a challenge, try changing all of the 5s to 7s ... I'm not sure if this will be more or less challenging than the VARCHAR(5) change; I'll let you tell me :-)

April 23, 2008 3:05 PM
 

david wei said:

Adam, not the length issue, is the trailing space issue.

Trailing space makes the difference.

for example:

if 'abcde' like 'a%'

print 'Y' -- returns Y

if 'abcde' like 'a  %'

print 'Yes' -- does not return Yes

So if we know the blat1 is fixed 5 characters, we can simply use the left(balt2,5) = balt1 for the join, and a simple HASH join hint will only sacn two tables once, avoid nested loop joins on two 19K rows tables(which caused 1.8M reads).

But if we change blat1 to VARCHAR(5), and truncate the trailing space to make the length vary (from 1 to 5), this will add the chalenge, hence you can not just compare the first 5 characters :-)

INSERT b1

SELECT RTRIM(LEFT(AddressLine1, 5)) AS blat1

FROM AdventureWorks.Person.Address

-- Note. you can comment out some sensitive part in my post if you think this will skew the contest.

April 23, 2008 3:51 PM
 

Adam Machanic said:

David,

Length is an issue here as well, due to some of the data.  Try creating everything with length 7, and play with some of the solutions (assuming you've come up with one -- I don't see a submission from you in my inbox).  If your solution happens to be the most common one people have been sending, you'll find that it will no longer return the same results as the example query.

April 23, 2008 4:07 PM
 

Denis Gobo said:

David, that didn't make a difference, I am getting the same number of rows back

SELECT blat1,len(blat1),datalength(blat1),b1.blat1 ,left(b2.blat2,5),

len(blat2),datalength(blat2)

FROM b1

JOIN b2 ON  <very secret indeed>

len and datalength reveal that there are some rows with 5 and 4 chars

April 23, 2008 4:21 PM
 

david wei said:

Adam, still the trailing space issue, because the minimal length in b2 is 5, not 7.

select min(len((blat2))) from b2

So when you change to 7, blat1 automatic added 2 trailing space to make it CHAR(7); that cause result difference.

I have sent my solution.

April 23, 2008 4:24 PM
 

Michelle said:

So Adam,

Do we need to resend you a new solution using CHAR(7), or do we still have a chance to win...?   :-)

Thanks,

Michelle.

April 23, 2008 4:36 PM
 

david wei said:

Denis, it does make the difference. The len() does not count the trailing space, but they affect the like operator.

Try create the b3 table use trimmed varchar(5) as below:

(CREATE TABLE b3 (blat3 VARCHAR(5) NOT NULL)

INSERT b3

SELECT RTRIM(LEFT(AddressLine1, 5)) AS blat1

FROM AdventureWorks.Person.Address

SELECT count(*) FROM b3 inner JOIN b2 ON    b2.blat2 LIKE b3.blat3 + '%'

--Trimmed VARCHAR(5) returns 111181 rows

SELECT count(*) FROM b1 inner JOIN b2 ON    b2.blat2 LIKE b1.blat1 + '%'

-- original CAHR(5) returns 109984 rows

April 23, 2008 4:44 PM
 

Denis Gobo said:

David,

I see I was talking about LEFT not LIKE, with LEFT it doesn't matter if it is char or varchar

April 23, 2008 4:51 PM
 

Fabiano Neves Amorim said:

Umm, Denis, I think with the answer you "spilled your beans" about your secret :-(

April 25, 2008 5:51 AM
 

Gordon said:

It's all about iteration.

May 2, 2008 1:23 AM
 

Igor Mikhalyev said:

So where are three best queries?  Who are the winners?

May 2, 2008 12:35 PM
 

Adam Machanic said:

Igor: Patience, please.  I got a -lot- of responses... Trying to figure out who won now.

May 5, 2008 12:14 PM
 

too embarrassed to say said:

Criminy. I didn't even NOTICE the char vs. varchar part. Now my answer could be wrong. Blah.

May 9, 2008 5:23 PM
 

Igor Mikhalyev said:

So where are three best queries?  Who are the winners?

June 3, 2008 5:55 PM
 

Magneto said:

We are waiting .......

June 5, 2008 7:24 AM
 

Jonathan Kehayias said:

If you follow this blog, you should understand that Adam has been both busy, and experienced a major setback like the drive failure.  If you don't speak publicly, I would recommend giving it a shot, and then you will understand just how much time is dedicated to doing a meaningful presentation.  I too was curious about the answers and emailed Adam a few weeks back, and he responded that he is trying to work this out, but he had to start over after the drive crash.

June 12, 2008 10:23 PM
 

BPhilip said:

set  statistics io on

SELECT *FROM b1 inner JOIN b2 ON    left(b2.blat2,5) = b1.blat1

I used this query and got results in 2 seconds. Following are my scan results.

(109984 row(s) affected)

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'b2'. Scan count 1, logical reads 74, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'b1'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

June 17, 2008 11:49 AM
 

Andrew said:

It would be very interesting to see the top performing query...

June 19, 2008 6:43 AM
 

Adam Machanic said:

In late April, I posted a puzzle to test readers' knowledge of SQL Server query processing internals

October 1, 2008 3:43 PM
 

Denis Gobo said:

Wow, it has been already a year since I wrote A year in review, The 21 + 1 best blog posts on SQLBlog

December 31, 2008 10:37 AM
 

gyanendra said:

how we can make like query and in like condition i want to give a hindi text which are exist in the database collumn. and the collumn type is nvarchar.

February 3, 2009 1:19 PM
 

NotNowJohn said:

Since the competition is already finished I would post here my explanation.

The execution plan of the query shows that nested loop join has been used. The reason why is simple - only this join operator is possible without an equijoin predicat. The solution for this problem would be to force usage of hash join operator and it requires at least one equijoin predicat. Simple transformation LIKE => LEFT solves this problem and hash join is efficient in this scenario.

BTW. the estimated plan for the query

SELECT * FROM b1

JOIN b2 ON b2.blat2 > b1.blat1

says that this estimated number of rows for the nested loops would be more than 85M!

February 17, 2009 7:12 PM
 

Adam Machanic said:

It's been quite a while since the LIKE vs ? Puzzle , and I feel like it's time for another one. Response

February 27, 2009 1:24 PM
 

SqlGuru said:

Please try this its 3sec

select * from b1

inner join b2

on b1.blat1=substring(b2.blat2,1,5)

March 2, 2009 6:06 PM
 

SqlGuru said:

typo mistake sorry its 1sec

March 2, 2009 6:07 PM
 

Vimvq1987 said:

SELECT DISTINCT *

FROM b1

JOIN b2 ON

   --b2.blat2 LIKE b1.blat1 + '%'

   b1.blat1 = LEFT(b2.blat2,5)

Sorry, I'm late. I just know this blog today. After 10 minutes trying to figure out what really happened inside the query, I found something:

JOIN (or Inner Join) return matching common values between two tables. So, in the original query, SQL Server has to read a row in the b1 table, and then reads all rows in the b2 table to find matching strings, and then, repeats until end of table b1. LIKE Operator is relatively slow, so It'll take a long time to execute the query (About 1m41s in my computer :P.)

One way to improve performance is replace LIKE operator with operator =. We already knew that blat1 column contains only CHAR(5) string, so that, Join condition can be rewrite like this:

b1.blat1 = LEFT(b2.blat2,5)

Even with LEFT function, this new query outperform the old one, it costs about 2s to run in my computer, but still affects 109984 rows, equals to the old one.

The fact is, when we insert values into blat1 column, it has been trimmed into CHAR(5). In example, the following addresses make sense in blat2 column, but they don't make any in blat1 column.

1, place Beaubernard

1, place de Brazaville

1, place de la République

They'll be inserted into blat1 with the same value as '1, pl',and when executing the query, SQL server has to repeat Join operator many times to same values that are exactly same. So I added DISTINCT keyword to prevent it from wasting time. Now my query returns only 13567 rows and takes only near 0s to run.

Hope this right!

March 7, 2009 3:48 AM
 

Brian Tkatch said:

Just saw this now. So i'll post:

SELECT *

FROM b1

WHERE b1.blat1 IN (SELECT LEFT(b2.blat2, 5) FROM b2);

It is faster because it can do 1 read on b2. The LIKE doesn't know that b1 is five chars and therefor, it has to re-evaluate the expression in b2 for each record in b1. When IN is used, it generates a sub-table first.

That my guess at least. :)

July 22, 2009 9:57 AM
 

Divya said:

Try this

SELECT *

FROM b1

JOIN b2 ON

LEFT(b2.blat2,5) =b1.blat1

June 21, 2010 2:25 AM
 

Olivier Comte said:

These solutions aren't equivalent to (SELECT *FROM b1JOIN b2 ON    b2.blat2 LIKE b1.blat1 + '%') if b1.blat1 contain a "special" character like '%'.

For example, if there is one record in b1 with blat1='%a',and a record in b2 with blat2='a',

SELECT * FROM b1 JOIN b2 ON    b2.blat2 LIKE b1.blat1 + '%'

will return this record but

SELECT * FROM b1 JOIN b2 ON LEFT(b2.blat2,5) =b1.blat1

won't.

May 2, 2011 8:50 AM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement