In some cases some of the columns involved in a query are highly correlated. If you manage to communicate to the optimizer that valuable information, it may come up with a more efficient plan. For example, consider the following table (the script that populates it is at the end of this post):
CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
EventTime DATETIME NOT NULL,
SomeMoreData CHAR(10)
);
Suppose that your system inserts events one by one, that EventID keeps increasing as your system keeps inserting, and that EventTime also keeps increasing most of the time, with possible minor fluctuations caused by delays between the client and the server. Clearly EventID and EventDate are highly correlated in this scenario. Consider the following queries:
DECLARE @dfrom DATETIME, @dto DATETIME;
SELECT @dfrom = '20050102 12:34:56', @dto = '20050103 12:34:56';
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN @dfrom AND @dto
AND SomeMoreData > '1'
)AS t;
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN '20050102 12:34:56' AND '20050103 12:34:56'
AND SomeMoreData > '1'
)AS t;
Table 'Events'. Scan count 9, logical reads 4088, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 107 ms, elapsed time = 21 ms.
In this case more than 8% or rows meet the criteria (the script that populates test data is at the end of this post, you can run it and see for yourself), so both queries are executed as clustered index scans. However, because of the typical way the rows are inserted into the table, the rows that actually meet the criteria are very much clustered together - most pages do not contain any rows meeting the criteria, while on some pages most or all rows are selected. Communicating this information to the optimizer is easy and the query runs much faster, because it scans just a range of the clustered index:
DECLARE @dfrom DATETIME, @dto DATETIME;
SELECT @dfrom = '20050102 12:34:56', @dto = '20050103 12:34:56';
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN @dfrom AND @dto
AND EventID BETWEEN (SELECT MIN(EventID) FROM dbo.Events AS e1 WHERE e1.EventTime = @dfrom)
AND (SELECT MAX(EventID) FROM dbo.Events AS e1 WHERE e1.EventTime = @dto)
AND SomeMoreData > '1'
)AS t;
Table 'Events'. Scan count 3, logical reads 848, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 36 ms.
Of course, if the assumption that EventID and EventDate are highly correlated is no longer true, this query will no longer be a better choice.
Setting up test data
CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO
DROP TABLE dbo.Events
GO
CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
EventTime DATETIME NOT NULL,
SomeMoreData CHAR(10)
);
GO
INSERT INTO dbo.Events(EventID, EventTime, SomeMoreData)
SELECT n, DATEADD(second, n, '20050101'), 'some data'
FROM dbo.Numbers;
GO
-- reshuffle EventTime just a little bit
UPDATE dbo.Events SET EventTime = DATEADD(second, EventID%20, EventTime);
GO
CREATE INDEX Events_EventTime ON dbo.Events(EventTime);