THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Solution for the "LIKE vs. ?" Puzzle

In late April, I posted a puzzle to test readers' knowledge of SQL Server query processing internals. The goal of the puzzle was to take a simple yet incredibly inefficient query and rewrite it, without changing the base tables or adding any additional database objects, to improve performance.

The deadline for entries was May 1. I expected to receive a few responses, sort through them, and get a solution post published within the week. But alas, that's not what happened. I received hundreds of submissions, many of them really, really good. I started sorting them, and during that process I spoke at a couple of conferences, my hard drive crashed and I lost my work (but none of the submissions, luckily), and I was generally too busy to sort through the hundreds of submissions.  Then I finally got some time and started working on this post on June 24, and then... Well, I have no further excuses.  I just didn't finish.

So I would like to start with an apology for the extensive delay. I know how impatient DBAs are, and several readers were kind enough to e-mail me and call me out for my laziness prod me along all this time.

I am happy to announce that today, finally, I finished sorting and judging the submissions, and I have to say that I am extremely impressed with everyone who submitted! After eliminating all but the very best submissions, I still had 20 on my list.  Choosing the top two was extremely difficult; I had to decide between excellence and excellence. After thinking about it for a while, I decided to give one award for the best code and a second for the best explanation (as both of these were required per the rules of the challenge).  And I still ended up awarding three people rather than just two.

Before I get to the solutions, let's take a second (or Nth) look at the original query:

SELECT *
FROM b1
JOIN b2 ON
b2.blat2 LIKE b1.blat1 + '%'

The difficulty here, from a query processing point of view, is the nature of the JOIN predicate. The task for the query processor is to figure out how, given a 'blat2' value, all of the associated 'blat1' values can be determined. The query processor has only three possible options for satisfying this (or any other) join:

  1. A merge join. In this case, it's not possible, as SQL Server's merge algorithm only supports equijoins. In this case, we're joining based on a range: all 'blat2' strings that start with the same characters asthose in 'blat1'.
  2. A hash join. Again, not possible, as we need an equijoin. SQL Server's hash algorithm is not ordered (which is not to say that it couldn't be in the future), so hash keys must be compared based on equality.
  3. A loop join. This join type can be used, for better or for worse, in all cases.  And in this case it does get used, for worse.  For every row of table b1, table b2 gets scanned and all of the matching rows are found. That's a lot of scans, and a lot of work.

There must be a better way. Clearly, the key is to try one of those other join types, but we need an equijoin!  Virtually everyone who submitted responses figured this out, whether by accident or on purpose, and almost everyone came up with a query approximating the following:

SELECT *
FROM b1
JOIN b2 ON
    LEFT(b2.blat2,5) = b1.blat1

This query solves the problem, and solves it well.  It turns the query from a two-minute nightmare to a half-second dream, all via the magic that is the hash join.  But don't take my word for it.  Instead, why don't we let contest winner #1, Chamindu Munasinghe explain?  Following is a slightly edited version of his winning submission.

The optimized query uses a hash match to perform the join operation. When a hash match is performed the table b1 is scanned and a hash table is built in memory. The hash key is computed using the b1.blat1 column and the rows from b1 are added to it. Then another table scan is performed on the b2 table and a hash key is computed using LEFT(b2.blat2, 5) value and then the previously built hash table is searched [ed: "probed"] on the key and matches are made.

But if you consider the query that uses the LIKE operator, the database engine cannot perform a hash join because we don't have a value from b2 to compute a hash. Instead it has to do a nested loops join where for each row in b1 it loops over the b2 table and tries to match the blat1 and blat2 columns using the wild card pattern.

If you have N.b1 rows in b1 table and N.b2 rows in the b2 table the number of operations required to do the join can be computed like this

Optimized query = N.b1 + N.b2  (Scan both tables)
LIKE query = N.b1 * N.b2 (Nested Loops)

Due to that the optimized query is much faster. I have assumed that the number of rows in b1 can be accommodated in memory and a the hash match is done in memory. If this is not the case then the computations will be different because it can use a grace hash match or a recursive hash match. But the hash match is still faster than the nested loop.

Chimandu pointed out in his submission that he is a .NET developer -- not a SQL Server professional -- and it's clear in his answer that it was his knowledge of the inner workings of data structures that helped him answer the question so well.

Now that we've looked at the winning explanation, which explains why fastest possible way to solve the puzzle works, I will reveal the second set of winners -- a tie, actually, between two creative coders.

Andrew den Hertog came up with the same solution as Chimandu and everyone else, but took it one step further. He was bothered by the fact that this solution was hardcoded and would only work with an input of five characters, so he smartly took a stab at making it dynamic:

SELECT
    b1.*,
    b2.*
FROM b1
CROSS JOIN
(
    SELECT
        MAX(LEN(blat1)) AS fieldSize
    FROM b1
) AS fs
JOIN b2 ON
    LEFT(b2.blat2, fs.fieldsize) = b1.blat1

This solution is a bit slower than the non-dynamic version, and almost works... But alas, it fails to return the same results if you modify the script that builds the two input tables, and change the number of characters to, say, 7 rather than 5. The reason this fails is due to the fact that some of the input values are only 5 characters long -- which is the reason I chose that as the initial length.

Around the same time that Andrew came up with his solution, Kevan Riley sent me a non-dynamic way of solving the problem generally for any input character length. The trick is to delimit both of the input strings:

SELECT *
FROM b1
JOIN b2 ON
    '#'+b1.blat1+'#' = '#'+LEFT(b2.blat2,7)+'#'

Great job solving that one, Kevan!  Combining this solution with Andrew's, we get the best possible solution, totally dynamic and viable for all possible input cases:

SELECT
    b1.*,
    b2.*
FROM b1
CROSS JOIN
(
    SELECT
        MAX(LEN(blat1)) AS fieldSize
    FROM b1
) AS fs
JOIN b2 ON
    '#'+LEFT(b2.blat2, fs.fieldsize)+'#' = '#'+b1.blat1+'#'

Great job, Chamindu, Andrew, and Kevan!!!  I would also like to name four runners-up, each of whom also provided excellent submissions:

Gordon Klundt
Creighton Kagey
Erik Eckhardt
Greg Bodden

Again, this was an extremely difficult choice and I'm sorry I was not able to award everyone, but if you're on this list please give yourself several pats on the back for your great work.

Thanks again to everyone who participated.  I was truly impressed by all of the submissions.  This contest was a lot of fun, and I have a few ideas for some other challenges for the near future. Stay tuned.

Published Wednesday, October 01, 2008 4:37 PM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Jason Massie said:

Gordon is a tsql madman. We tried but failed to get him to name is son Harry Klundt.

October 1, 2008 5:18 PM
 

Adam Machanic said:

I should probably delete your post to keep a sense of decorum about this blog, but ... that was a much-needed bit of humor after a tremendously boring day, so it stays.  Thanks for the comment!

October 1, 2008 5:33 PM
 

Jonathan Kehayias said:

Can you recheck this:

SELECT *

FROM b1

JOIN b2 ON

   '#'+b1.blat1+'#' = '#'+LEFT(b2.blat2,7)+'#'

That doesn't pass the test when I run it.  Is there a typo in it perhaps?  I only get 2 rows back from running that.

October 1, 2008 7:52 PM
 

Adam Machanic said:

Hi Jonathan,

Sorry, I left it coded for 7 characters which was the example Kevan sent me.  Change the 7 to a 5 and it should work.  Or, rebuild b1 with blat1 typed as CHAR(7) and change the insert to do LEFT(AddressLine1, 7), and then leave the 7 and you'll see the same result as if you'd run b2.blat2 LIKE b1.blat1 + '%'.  Making the same change to the base table and using the LEFT(b2.blat2,7) = b1.blat1 predicate will in fact NOT return the same exact results.

I hope that clarifies things?

October 1, 2008 9:34 PM
 

Bruce M said:

Thanks for the answers! Two questions:

1. What is the joke? I didn't get it :(

2. Can you please explain the trick of using delimiters "#" for the input strings?

October 2, 2008 1:49 PM
 

Adam Machanic said:

Hi Bruce,

The fact that you didn't get the joke is probably a good thing.  Hint: if you want to get it, just put your mind firmly in the gutter.  It's really quite sophomoric (but after my day yesterday, it was just what I needed).

The delimiters work because the data is CHAR(n), and if the input data length is < N, it will end up with trailing spaces.  Those trailing spaces will end up being ignored with an equality predicate, but with LIKE, when you append the '%' they are preserved.  So you end up getting different results for the equality (more rows returned) vs the LIKE predicate.  Delimiting the data makes those rows not matched, because just as when adding the '%' those trailing spaces are preserved.  I guess, in retrospect, you don't need to delimit the front of the string too, but arguably it's nice for consistency.

October 2, 2008 4:40 PM
 

Bruce M said:

Oooh! Now I get it...initially, I thought it had something to do with t-sql.

Thanks for the explanation of the delimiters. Isn't that the same as using LTRIM? Or is using a function in a join slow down performance?

October 4, 2008 3:30 PM
 

Brian said:

Hi Adam,

I don't imagine such an elegant solution is possible for the Like VS ? puzzle if the original question would have been:

SELECT *

FROM b1JOIN b2

ON    b2.blat2

LIKE '%' + b1.blat1 + '%'

Nonetheless, I found this blog post very interesting, but you have an idea let us know.

Thanks.

November 18, 2008 7:38 PM
 

Adam Machanic said:

Hi Brian,

Sure there is--it's called Full Text Search :-)

November 24, 2008 5:02 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement