THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

When correlation between columns fools the optimizer

When two columns are correlated, it may fool the optimizer and cause it to choose a wrong plan. Here is a simple script that demonstrates it.
The following script creates a table and populates some sample data:


Make VARCHAR(10),
Model VARCHAR(10),
SomeOtherData CHAR(50)
SET @i=0;
WHILE @i<100000 BEGIN
CASE WHEN @i%100 THEN 'Camry' 
WHEN @i%100 THEN 'Corolla' 
'Some Data'
SET @i @i 1;
CREATE INDEX Vehicles_Make ON dbo.Vehicles(Make);


The following query chooses to scan the whole clustered index, because it considers the predicate Make='Toyota' AND Model='Camry' not selective enough to justify bookmark lookups.


SELECT IDMakeModelSomeOtherData
FROM dbo.Vehicles 
WHERE Make='Toyota'
AND Model='Camry';


Note: while this is true on my server, nobody can guarantee that the optimizer will behave in the same way for all editions and versions.

Clearly make and model of a car are strongly correlated - when you know car's model, you almost always know its make too. 2% of vehicles in the table are Toyotas, 1% are Camrys, and 1% are Toyota Camrys. If make and model were independent, there would be 0.02*0.01=0.0002=0.02% Toyota Camrys. Of course this is not the case, but let us see what happens when you create another index and rerun the same select:


CREATE INDEX Vehicles_Model ON dbo.Vehicles(Model);


The select now runs slower, and makes three times more reads that the clustered index scan.
The reason is simple - creating another index comes with statistics on Model column.
When I looked at the execution plan, I saw that the optimizer expected 20 rows, which is precisely 0.02%.In fact, there are 1000 Toyota Camry - the optimizer's estimate is off the mark. What happened?

In this case, for this version and edition, the optimizer assumed that make and model are independent, decided that the search criteria (Make='Toyota' AND Model='Camry') is very selective, and chosen nested loops.
Let me repeat myself - that might or might not be the case if you rerun my scripts on your system.


Over time the optimizer can and does change. in my opinion the most robust way to optimize such queries is to create covering indexes, so that the plan stays the same regardless of selectivity estimates. Of course there are other approaches, which might be better in some other cases.

Note: If you are still reading this, you might also want to read about DATE_CORRELATION_OPTIMIZATION setting in BOL.

Published Friday, June 5, 2009 5:28 PM by Alexander Kuznetsov
Filed under:



Marc Brooks said:

Of course, a saner index would have had both Make and Model columns (in that order), which would be an indexed seek. You would then have to decide if the predominate access pattern is Make + Model AND/OR Model or if the predominate access pattern is Model + Make AND/OR Make to decide how to order the columns and if you need that second index.

Short version, know your data / query and build multi-segment indexes before you complain about "surprising" behavior from the optimizer.

I've learned that the MORE information you give SQL (in the form of indexes and hints and statistics) and the more explicit you are in the query, the faster it gets.

June 6, 2009 2:52 AM

jchang said:

start by reading the Statistics Used by the Query Optimizer paper by Eric Hanson, Lubor contributing

read both the 2000 and 2005 versions, as the 2000 one talks more about the actual data structures used to store data distributions.

If you understand what statistics data is kept (and how it is collected), then you can understand what it can predict and what it cannot.

June 6, 2009 3:32 PM

Alexander Kuznetsov said:


Let me quote directly from documentation: "Histograms in SQL Server 2005 are only built for a single column—the first column in the set of key columns of the statistics object." This means that a composite index wont help.


I agree the following is a must read:

June 6, 2009 4:27 PM

Alexander Kuznetsov said:

In my previous post I described how correlation between columns may confuse the optimizer and cause it

June 9, 2009 10:41 PM
New Comments to this post are disabled

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on and Currently he works as an agile developer.

This Blog


Privacy Statement