THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Alexander Kuznetsov

Using CROSS APPLY to optimize joins on BETWEEN conditions

Recently I encountered a case when I knew much more about the data than the optimizer. Originally the performance was horrible, this is why I had to have a look at the query in the first place. When I was able to share my knowledge with the optimizer, it produced a better plan, and the query ran dramatically faster.

 

The slow query

 

The following tables store one-munite commercials for every minut for one year, and customer calls, one call per minute, for the same year. The scripts that populate tables with test data are provided at the end of this post. Here are the tables:

 

CREATE TABLE dbo.Commercials(
  
StartedAt DATETIME NOT NULL 
   
CONSTRAINT PK_Commercials PRIMARY KEY,
  
EndedAt DATETIME NOT NULL,
  
CommercialName VARCHAR(30) NOT NULL);
GO
CREATE TABLE dbo.Calls(CallID INT 
  CONSTRAINT 
PK_Calls NOT NULL PRIMARY KEY,
  
AirTime DATETIME NOT NULL,
  
SomeInfo CHAR(300));
GO
CREATE UNIQUE INDEX Calls_AirTime
  
ON dbo.Calls(AirTimeINCLUDE(SomeInfo);
GO

Every commercial in my table lasts for at most one minute, and they do not overlap. I can easily enforce both conditions with constraints (

Storing intervals of time with no overlaps"

), which are omitted in this post just to keep it simple.

The following query retrieves only 181 rows, and it runs very slowly:

 SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  
ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
  

 Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 71704 ms,  elapsed time = 36316 ms.

Why is it so slow? I haven't mastered the fine art of adding images to my posts yet, so I have to explain verbally. For every call the DB engine scans all the commercials which begin before the time of the call, which is expensive. The reason is simple: the optimizer does not know that the commercials are short, and that the commercials do not overlap, so it must scan all the potential matches, which are all the commercials which begin before the time of the call.

 

Using CROSS APPLY  to tell the optimizer that commercials do not overlap.

 

 Because commercials do not overlap, we need at most one match. Translating this information into plain SQL is easy, and the query runs dramatically faster:

 

SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Calls c CROSS APPLY(
  
SELECT TOP 1 s.StartedAts.EndedAt FROM dbo.Commercials s 
  
WHERE c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
  
ORDER BY s.StartedAt DESCAS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'

 

Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 31 ms.

 

Note: if you needed only one column from Commercials table, you could easily use just a subquery. Because more than one column is needed, CROSS APPLY is a better, a more performant choice choice without redundant code.

Note: If you are using the assumption that the commercials do not overlap, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

Also let me put it differently: If you are using the assumption that the commercials do not overlap, use a unit test do document it, so that that you don't forget that your query relies on that assumption. Also you have to enforce that business rule in the database.

 

Using another range condition to tell the optimizer that commercials are short.

 

 Because commercials are short, there is no need to scan the commercials that start more than maximum commercial's length before the call. Again, translating this information into plain SQL is quite easy too, and again the query runs much faster, even faster than the previous one:

 

SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  
ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
 

Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 24 ms.
 

Note: If you are using the assumption that the commercials are short, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

Also let me put it differently: If you are using the assumption that the commercials are short, use a unit test do document that. Also you have to enforce that business rule in the database. 

 

 

Setting up tables and test data

 

CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i 1;
INSERT INTO dbo.Numbers(nSELECT 1;
WHILE @i<1024000 BEGIN
  INSERT INTO 
dbo.Numbers(n)
    
SELECT @i FROM dbo.Numbers;
  
SET @i @i 2;
END;
GO
INSERT INTO dbo.Commercials(StartedAtEndedAtCommercialName)
SELECT DATEADD(minute1'20080101')
   ,
DATEADD(minuten'20080101')
   ,
'Show #'+CAST(AS VARCHAR(6))
  
FROM dbo.Numbers
  
WHERE n<=24*365*60;
GO
INSERT INTO dbo.Calls(CallID,
  
AirTime,
  
SomeInfo)
SELECT 
   
,DATEADD(minute1'20080101')
   ,
'Call during Commercial #'+CAST(AS VARCHAR(6))
  
FROM dbo.Numbers
  
WHERE n<=24*365*60;
GO
  

 


Published Tuesday, July 07, 2009 8:47 PM by Alexander Kuznetsov

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Rob Farley said:

Nice post. I'm definitely a big fan of making sure that the database understands as much of the business logic as possible, and that queries reflect that where feasible too.

Rob

July 7, 2009 10:03 PM
 

Michael Zilberstein said:

Recently had very similar case with IP ranges - table with StartRange,  EndRange and Country columns and the task is to efficiently find which Country the given IP belongs to. DB programmer built an index on (StartRange, EndRange), performed BETWEEN query and was surprised to see many reads. Had to re-write it to the query very similar to yours - for each IP performed CROSS APPLY to

SELECT TOP 1 ...

FROM IPRanges

WHERE IP >= StartRange

...

ORDER BY StartRange DESC

July 8, 2009 2:10 AM
 

Robert L Davis said:

I've had similar results by converting a query to use a CTE to filter one half of the join before performing the join. This is another good example of a way to "trick" SQL into finding a better plan.

July 8, 2009 11:08 AM
 

SnipStorm said:

Thank you for submitting this cool story - Trackback from SnipStorm

July 8, 2009 1:55 PM
 

Alexander Kuznetsov said:

Michael,

How do you enforce that your intervals of IP addresses do not overlap?

July 11, 2009 8:16 PM
 

Peter said:

Alexander,

Thx for the article as it pointed me to the problem of SQL not optimizing my IPCountry queries as I intended. Given that my situation is virtually identical to that of Michael, I will show you how my solution for that problem now works.

create table dbo.IPcountry_fact

(

 IPFrom       bigint                                              not null

, IPTo         bigint                                              not null

, IdCountry    char(2)       collate SQL_Latin1_General_CP1_CI_AI  not null      

, IdAuthority  int

, constraint pk_IPcountry_fact           primary key clustered( IPFrom )

, constraint fk_IPcountry_fact_country   foreign key          ( IdCountry )   references dbo.IPcountry_country  ( Id )

, constraint ck_IPcountry_fact_IPTo      check ( IPTo >= IPFrom )

)

go

The Country table is kind of obvious, so i left it out here. The query I found even faster then a cross apply is:

select

 dr.*

from

 ( select top 1 * from ipcountry_fact where IPFrom <= 3663389509 order by IPFrom desc ) as dr

where

 dr.IPTo >= 3663389509

;

By using a derived table there is only one clustered index seek. No secondary indexes are needed. There is no overlap, but there can be gaps...all this is enforced by the process that fills the table. Before the old content is replaced a new one is set up and validated before it replaces the old in a single transaction.

I am kind of surprised the optimiser could not work as well with this original:

select top 1

 *

from

 ipcountry_fact

where

 3663389509 between IPFrom and IPTo

order by

 IPFrom asc

;

It is kind of obvious that the IPTo is in no way usable as a predicate, only IPfrom is available in a meaningful way. Given the top 1 and the order by, I was thinking the optimizer had no choice but to do as intended. Find the first record then test to see if the found record also matched tge IPTo condition.

In the optimized version of the query using the derived table I leave the optimizer no choice. It is sad tho that I have to do this, for a functional language one ought not to bother with many rewrites of the query to check which ones fits the optimizer best.

July 13, 2009 9:15 AM
 

Peter said:

I am not sure HOW precise your test matches the real world characteristics. But if I understand the code correctly and the test dat ais corect, this would be the fastest overall:

select

 s.StartedAt, s.EndedAt, c.AirTime

from

 dbo.Calls as c

   inner join dbo.Commercials as s on s.StartedAt = c.AirTime  -- s.StartedAt < c.AirTime and s.EndedAt >= c.AirTime

where

 c.AirTime BETWEEN {d '2008-07-01'} AND {ts '2008-07-01 03:00:00'}

;

Since every minute is represented and a call once it starts suring a commercial is always valid no matter when it ends, a direct merge join on is possible.

181 row(s) affected)

Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Commercials'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

  CPU time = 0 ms,  elapsed time = 1 ms.

July 13, 2009 10:35 AM
 

Peter said:

Ofcourse I was assuming calls and commercials are always at whole minutes :), then again that is the test data.

July 13, 2009 10:57 AM
 

Alexander Kuznetsov said:

Peter,

Commercials can start and end at any time, like from 20070101 02:35:17 to 20070101 02:35:59. Calls also can start at any time, like 20070101 02:35:37. So unfortunately joining on equality as you are suggesting, s.StartedAt = c.AirTime, is not an option.

July 13, 2009 10:58 AM
 

Alexander Kuznetsov said:

Peter,

I agree that test data should have seconds, to avoid confusion.

July 13, 2009 11:34 AM
 

Quassnoi said:

An excellent article.

> When I was able to share my knowledge with the optimizer, it produced a better plan.

This is what I call "SQL optimization in one short sentence" :)

July 21, 2009 2:52 AM
 

dd said:

"It is sad tho that I have to do this, for a functional language one ought not to bother with many rewrites of the query to check which ones fits the optimizer best."

Exactly, it should be up to the language to rewrite queries to make them fast.  Having to tweak queries like this is ridiculous.

September 8, 2009 6:12 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement