THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

When correlation between columns fools the optimizer

When two columns are correlated, it may fool the optimizer and cause it to choose a wrong plan. Here is a simple script that demonstrates it.
The following script creates a table and populates some sample data:

 

CREATE TABLE dbo.Vehicles(ID INT NOT NULL CONSTRAINT PK_Vehicles PRIMARY KEY,
  
Make VARCHAR(10),
  
Model VARCHAR(10),
  
SomeOtherData CHAR(50)
);
GO
DECLARE @i INT;
SET NOCOUNT ON;
SET @i=0;
WHILE @i<100000 BEGIN
  INSERT 
dbo.Vehicles(IDMakeModelSomeOtherData)
  
SELECT @i
    
CASE WHEN @i%100 THEN 'Toyota' ELSE 'M'+CAST(@i AS VARCHAR(6)) END,
    
CASE WHEN @i%100 THEN 'Camry' 
       
WHEN @i%100 THEN 'Corolla' 
       
ELSE 'M'+CAST(@i AS VARCHAR(6)) END,
    
'Some Data'
  
SET @i @i 1;
END
GO
CREATE INDEX Vehicles_Make ON dbo.Vehicles(Make);
GO 

 

The following query chooses to scan the whole clustered index, because it considers the predicate Make='Toyota' AND Model='Camry' not selective enough to justify bookmark lookups.

 

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT IDMakeModelSomeOtherData
  
FROM dbo.Vehicles 
  
WHERE Make='Toyota'
  
AND Model='Camry';


 

Note: while this is true on my server, nobody can guarantee that the optimizer will behave in the same way for all editions and versions.

Clearly make and model of a car are strongly correlated - when you know car's model, you almost always know its make too. 2% of vehicles in the table are Toyotas, 1% are Camrys, and 1% are Toyota Camrys. If make and model were independent, there would be 0.02*0.01=0.0002=0.02% Toyota Camrys. Of course this is not the case, but let us see what happens when you create another index and rerun the same select:

 

CREATE INDEX Vehicles_Model ON dbo.Vehicles(Model);
 

 

The select now runs slower, and makes three times more reads that the clustered index scan.
The reason is simple - creating another index comes with statistics on Model column.
When I looked at the execution plan, I saw that the optimizer expected 20 rows, which is precisely 0.02%.In fact, there are 1000 Toyota Camry - the optimizer's estimate is off the mark. What happened?


In this case, for this version and edition, the optimizer assumed that make and model are independent, decided that the search criteria (Make='Toyota' AND Model='Camry') is very selective, and chosen nested loops.
Let me repeat myself - that might or might not be the case if you rerun my scripts on your system.

 

Over time the optimizer can and does change. in my opinion the most robust way to optimize such queries is to create covering indexes, so that the plan stays the same regardless of selectivity estimates. Of course there are other approaches, which might be better in some other cases.


Note: If you are still reading this, you might also want to read about DATE_CORRELATION_OPTIMIZATION setting in BOL.

Published Friday, June 05, 2009 5:28 PM by Alexander Kuznetsov
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Marc Brooks said:

Of course, a saner index would have had both Make and Model columns (in that order), which would be an indexed seek. You would then have to decide if the predominate access pattern is Make + Model AND/OR Model or if the predominate access pattern is Model + Make AND/OR Make to decide how to order the columns and if you need that second index.

Short version, know your data / query and build multi-segment indexes before you complain about "surprising" behavior from the optimizer.

I've learned that the MORE information you give SQL (in the form of indexes and hints and statistics) and the more explicit you are in the query, the faster it gets.

June 6, 2009 2:52 AM
 

jchang said:

start by reading the Statistics Used by the Query Optimizer paper by Eric Hanson, Lubor contributing

read both the 2000 and 2005 versions, as the 2000 one talks more about the actual data structures used to store data distributions.

If you understand what statistics data is kept (and how it is collected), then you can understand what it can predict and what it cannot.

June 6, 2009 3:32 PM
 

Alexander Kuznetsov said:

Marc,

Let me quote directly from documentation: "Histograms in SQL Server 2005 are only built for a single column—the first column in the set of key columns of the statistics object." This means that a composite index wont help.

Joe,

I agree the following is a must read:

http://technet.microsoft.com/en-us/library/cc966419.aspx

June 6, 2009 4:27 PM
 

Alexander Kuznetsov said:

In my previous post I described how correlation between columns may confuse the optimizer and cause it

June 9, 2009 10:41 PM

Leave a Comment

(required) 
(required) 
Submit

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on sqlblog.com, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on simple-talk.com and devx.com. Currently he works at DRW Trading Group in Chicago, where he leads a team of developers, practicing agile development, defensive programming, TDD, and database unit testing.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement