When two columns are correlated, it may fool the optimizer and cause it to choose a wrong plan. Here is a simple script that demonstrates it.
The following script creates a table and populates some sample data:
CREATE TABLE dbo.Vehicles(
    ID INT NOT NULL CONSTRAINT PK_Vehicles PRIMARY KEY,
    Make VARCHAR(10) NOT NULL,
    Model VARCHAR(10) NOT NULL,
    SomeOtherData VARCHAR(10) NULL);
GO
SET NOCOUNT ON;
DECLARE @i INT;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT dbo.Vehicles(ID, Make, Model, SomeOtherData)
    SELECT @i,
        CASE WHEN @i%100 < 2 THEN 'Toyota' ELSE 'M'+CAST(@i AS VARCHAR(6)) END,
        CASE WHEN @i%100 = 0 THEN 'Camry'
             WHEN @i%100 = 1 THEN 'Corolla'
             ELSE 'M'+CAST(@i AS VARCHAR(6)) END,
        CAST(@i AS VARCHAR(10));
    SET @i = @i + 1;
END;
GO
CREATE INDEX Vehicles_Make ON dbo.Vehicles(Make);
The following query chooses to scan the whole clustered index, because it considers the predicate Make='Toyota' AND Model='Camry' not selective enough to justify bookmark lookups.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT ID, Make, Model, SomeOtherData
FROM dbo.Vehicles
WHERE Make='Toyota' AND Model='Camry';
Note: while this is true on my server, nobody can guarantee that the optimizer will behave the same way in all editions and versions.
Clearly the make and model of a car are strongly correlated: when you know a car's model, you almost always know its make, too. In this table, 2% of the vehicles are Toyotas, 1% are Camrys, and 1% are Toyota Camrys. If make and model were independent, only 0.02*0.01 = 0.0002 = 0.02% of the rows would be Toyota Camrys. Of course this is not the case, but let us see what happens when you create another index and rerun the same select:
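The actual distribution is easy to verify directly against the table populated above (the column aliases here are mine, added just for readability):

```sql
-- Count Toyotas, Camrys, and Toyota Camrys to check the 2%/1%/1% figures.
-- Rows where @i%100 < 2 are Toyotas; @i%100 = 0 rows are Toyota Camrys.
SELECT
    COUNT(CASE WHEN Make = 'Toyota' THEN 1 END)                     AS Toyotas,      -- expect 2000
    COUNT(CASE WHEN Model = 'Camry' THEN 1 END)                     AS Camrys,       -- expect 1000
    COUNT(CASE WHEN Make = 'Toyota' AND Model = 'Camry' THEN 1 END) AS ToyotaCamrys  -- expect 1000
FROM dbo.Vehicles;
```

Every Camry in this data set is a Toyota, so the combined predicate is 50 times less selective than the independence assumption would suggest.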
CREATE INDEX Vehicles_Model ON dbo.Vehicles(Model);
The select now runs slower and does three times as many reads as the clustered index scan did.
The reason is simple: creating the new index also creates a statistics object on the Model column.
When I looked at the execution plan, I saw that the optimizer expected 20 rows, which is precisely 0.02% of 100,000. In fact, there are 1,000 Toyota Camrys, so the optimizer's estimate is off by a factor of 50. What happened?
In this case, for this version and edition, the optimizer assumed that make and model are independent, decided that the search criteria (Make='Toyota' AND Model='Camry') were very selective, and chose nested loops.
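If you want to see the density information the optimizer is working from, you can inspect the statistics objects yourself; the statistics names below match the indexes created earlier:

```sql
-- Show the header and density vector for each single-column statistics object.
-- Assuming independence, the optimizer multiplies the selectivities of the
-- two single-column predicates instead of measuring them together.
DBCC SHOW_STATISTICS ('dbo.Vehicles', Vehicles_Make)  WITH STAT_HEADER, DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('dbo.Vehicles', Vehicles_Model) WITH STAT_HEADER, DENSITY_VECTOR;
```

Neither statistics object says anything about how Make and Model relate to each other, which is exactly why the combined estimate goes wrong.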
Let me repeat myself - that might or might not be the case if you rerun my scripts on your system.
Over time the optimizer can and does change. In my opinion, the most robust way to optimize such queries is to create covering indexes, so that the plan stays the same regardless of selectivity estimates. Of course there are other approaches, which might work better in other cases.
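For this particular query, a covering index might look like the following sketch (the index name and the choice of INCLUDE column are mine, not part of the original scripts):

```sql
-- Covers the query entirely: both predicate columns form the key, and
-- SomeOtherData is stored at the leaf level, so no bookmark lookups are
-- needed and the plan no longer hinges on the selectivity estimate.
-- ID, the clustering key, is carried in the index automatically.
CREATE INDEX Vehicles_Make_Model_Covering
    ON dbo.Vehicles(Make, Model)
    INCLUDE(SomeOtherData);
```

With this index in place, the same seek plan satisfies the query whether the optimizer expects 20 rows or 1,000.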
Note: If you are still reading this, you might also want to read about DATE_CORRELATION_OPTIMIZATION setting in BOL.