THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

Optimizing yet another query that involves highly correlated columns

In some cases some of the columns involved in a query are highly correlated. If you manage to communicate to the optimizer that valuable information, it may come up with a more efficient plan. For example, consider the following table (the script that populates it is at the end of this post):

 

CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
  
EventTime DATETIME NOT NULL,
  
SomeMoreData CHAR(10)
); 

 

Suppose that your system inserts events one by one, that EventID keeps increasing as your system keeps inserting, and that EventTime also keeps increasing most of the time, with possible minor fluctuations caused by delays between the client and the server. Clearly EventID and EventDate are highly correlated in this scenario. Consider the following queries:

 

DECLARE @dfrom DATETIME@dto DATETIME;
SELECT @dfrom '20050102 12:34:56'@dto '20050103 12:34:56';

SELECT COUNT(*) FROM(
SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
WHERE EventTime BETWEEN @dfrom AND @dto 
AND SomeMoreData '1'
)AS t;


SELECT COUNT(*) FROM(
SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
WHERE EventTime BETWEEN 
'20050102 12:34:56' AND '20050103 12:34:56'
AND SomeMoreData '1'
)AS t;
 

 

Table 'Events'. Scan count 9, logical reads 4088, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 107 ms,  elapsed time = 21 ms.
 

 In this case more than 8% or rows meet the criteria (the script that populates test data is at the end of this post, you can run it and see for yourself), so both queries are executed as clustered index scans. However, because of the typical way the rows are inserted into the table, the rows that actually meet the criteria are very much clustered together - most pages do not contain any rows meeting the criteria, while on some pages most or all rows are selected. Communicating this information to the optimizer is easy and the query runs much faster, because it scans just a range of the clustered index:

 

DECLARE @dfrom DATETIME@dto DATETIME;
SELECT @dfrom '20050102 12:34:56'@dto '20050103 12:34:56';

SELECT COUNT(*) FROM(
SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
WHERE EventTime BETWEEN @dfrom AND @dto
AND EventID BETWEEN (SELECT MIN(EventIDFROM dbo.Events AS e1 WHERE e1.EventTime @dfrom)
  AND (
SELECT MAX(EventIDFROM dbo.Events AS e1 WHERE e1.EventTime @dto)
AND 
SomeMoreData '1'
)AS t;
  

 

 Table 'Events'. Scan count 3, logical reads 848, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 36 ms.

 

Of course, if the assumption that EventID and EventDate are highly correlated is no longer true, this query will no longer be a better choice.

 

 Setting up test data

 

CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i 1;
INSERT INTO dbo.Numbers(nSELECT 1;
WHILE @i<1024000 BEGIN
  INSERT INTO 
dbo.Numbers(n)
    
SELECT @i FROM dbo.Numbers;
  
SET @i @i 2;
END;
GO 

DROP TABLE dbo.Events
GO
CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
  
EventTime DATETIME NOT NULL,
  
SomeMoreData CHAR(10)
);
GO

INSERT INTO dbo.Events(EventIDEventTimeSomeMoreData)
SELECT nDATEADD(secondn'20050101'), 'some data'
FROM dbo.Numbers;
GO
-- reshuffle EventTime just a little bit
UPDATE dbo.Events SET EventTime DATEADD(secondEventID%20EventTime);
GO
CREATE INDEX Events_EventTime ON dbo.Events(EventTime);
  

 

Published Monday, July 13, 2009 6:15 PM by Alexander Kuznetsov

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Uri Dimant said:

Alex

I think if we change to put the CI on EventTime it will dramatically increase performance

Thanks

July 14, 2009 1:34 AM
 

Alexander Kuznetsov said:

Uri,

Yes of course, having clustered index on (EventTime,<maybe something to make it unique like EventID>) would speed up this query. So would a covering index starting with EventTime. There are plenty of cases when we want to speed up queries without paying too much price in additional indexes.

Thanks.

July 14, 2009 9:11 AM

Leave a Comment

(required) 
(required) 
Submit

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on sqlblog.com, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on simple-talk.com and devx.com. Currently he works as an agile developer.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement