THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

Using CROSS APPLY to optimize joins on BETWEEN conditions

Recently I encountered a case when I knew much more about the data than the optimizer. Originally the performance was horrible, this is why I had to have a look at the query in the first place. When I was able to share my knowledge with the optimizer, it produced a better plan, and the query ran dramatically faster.

 

The slow query

 

The following tables store one-munite commercials for every minut for one year, and customer calls, one call per minute, for the same year. The scripts that populate tables with test data are provided at the end of this post. Here are the tables:

 

CREATE TABLE dbo.Commercials(
  
StartedAt DATETIME NOT NULL 
   
CONSTRAINT PK_Commercials PRIMARY KEY,
  
EndedAt DATETIME NOT NULL,
  
CommercialName VARCHAR(30) NOT NULL);
GO
CREATE TABLE dbo.Calls(CallID INT 
  CONSTRAINT 
PK_Calls NOT NULL PRIMARY KEY,
  
AirTime DATETIME NOT NULL,
  
SomeInfo CHAR(300));
GO
CREATE UNIQUE INDEX Calls_AirTime
  
ON dbo.Calls(AirTimeINCLUDE(SomeInfo);
GO

Every commercial in my table lasts for at most one minute, and they do not overlap. I can easily enforce both conditions with constraints (

Storing intervals of time with no overlaps"

), which are omitted in this post just to keep it simple.

The following query retrieves only 181 rows, and it runs very slowly:

 SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  
ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
  

 Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 71704 ms,  elapsed time = 36316 ms.

Why is it so slow? I haven't mastered the fine art of adding images to my posts yet, so I have to explain verbally. For every call the DB engine scans all the commercials which begin before the time of the call, which is expensive. The reason is simple: the optimizer does not know that the commercials are short, and that the commercials do not overlap, so it must scan all the potential matches, which are all the commercials which begin before the time of the call.

 

Using CROSS APPLY  to tell the optimizer that commercials do not overlap.

 

 Because commercials do not overlap, we need at most one match. Translating this information into plain SQL is easy, and the query runs dramatically faster:

 

SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Calls c CROSS APPLY(
  
SELECT TOP 1 s.StartedAts.EndedAt FROM dbo.Commercials s 
  
WHERE c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
  
ORDER BY s.StartedAt DESCAS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'

 

Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 31 ms.

 

Note: if you needed only one column from Commercials table, you could easily use just a subquery. Because more than one column is needed, CROSS APPLY is a better, a more performant choice choice without redundant code.

Note: If you are using the assumption that the commercials do not overlap, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

Also let me put it differently: If you are using the assumption that the commercials do not overlap, use a unit test do document it, so that that you don't forget that your query relies on that assumption. Also you have to enforce that business rule in the database.

 

Using another range condition to tell the optimizer that commercials are short.

 

 Because commercials are short, there is no need to scan the commercials that start more than maximum commercial's length before the call. Again, translating this information into plain SQL is quite easy too, and again the query runs much faster, even faster than the previous one:

 

SELECT s.StartedAts.EndedAtc.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  
ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
 

Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 24 ms.
 

Note: If you are using the assumption that the commercials are short, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

Also let me put it differently: If you are using the assumption that the commercials are short, use a unit test do document that. Also you have to enforce that business rule in the database. 

 

 

Setting up tables and test data

 

CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i 1;
INSERT INTO dbo.Numbers(nSELECT 1;
WHILE @i<1024000 BEGIN
  INSERT INTO 
dbo.Numbers(n)
    
SELECT @i FROM dbo.Numbers;
  
SET @i @i 2;
END;
GO
INSERT INTO dbo.Commercials(StartedAtEndedAtCommercialName)
SELECT DATEADD(minute1'20080101')
   ,
DATEADD(minuten'20080101')
   ,
'Show #'+CAST(AS VARCHAR(6))
  
FROM dbo.Numbers
  
WHERE n<=24*365*60;
GO
INSERT INTO dbo.Calls(CallID,
  
AirTime,
  
SomeInfo)
SELECT 
   
,DATEADD(minute1'20080101')
   ,
'Call during Commercial #'+CAST(AS VARCHAR(6))
  
FROM dbo.Numbers
  
WHERE n<=24*365*60;
GO
  

 

Published Tuesday, July 07, 2009 8:47 PM by Alexander Kuznetsov

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Rob Farley said:

Nice post. I'm definitely a big fan of making sure that the database understands as much of the business logic as possible, and that queries reflect that where feasible too.

Rob

July 7, 2009 10:03 PM
 

Michael Zilberstein said:

Recently had very similar case with IP ranges - table with StartRange,  EndRange and Country columns and the task is to efficiently find which Country the given IP belongs to. DB programmer built an index on (StartRange, EndRange), performed BETWEEN query and was surprised to see many reads. Had to re-write it to the query very similar to yours - for each IP performed CROSS APPLY to

SELECT TOP 1 ...

FROM IPRanges

WHERE IP >= StartRange

...

ORDER BY StartRange DESC

July 8, 2009 2:10 AM
 

Robert L Davis said:

I've had similar results by converting a query to use a CTE to filter one half of the join before performing the join. This is another good example of a way to "trick" SQL into finding a better plan.

July 8, 2009 11:08 AM
 

Alexander Kuznetsov said:

Michael,

How do you enforce that your intervals of IP addresses do not overlap?

July 11, 2009 8:16 PM
 

Peter said:

Alexander,

Thx for the article as it pointed me to the problem of SQL not optimizing my IPCountry queries as I intended. Given that my situation is virtually identical to that of Michael, I will show you how my solution for that problem now works.

create table dbo.IPcountry_fact

(

 IPFrom       bigint                                              not null

, IPTo         bigint                                              not null

, IdCountry    char(2)       collate SQL_Latin1_General_CP1_CI_AI  not null      

, IdAuthority  int

, constraint pk_IPcountry_fact           primary key clustered( IPFrom )

, constraint fk_IPcountry_fact_country   foreign key          ( IdCountry )   references dbo.IPcountry_country  ( Id )

, constraint ck_IPcountry_fact_IPTo      check ( IPTo >= IPFrom )

)

go

The Country table is kind of obvious, so i left it out here. The query I found even faster then a cross apply is:

select

 dr.*

from

 ( select top 1 * from ipcountry_fact where IPFrom <= 3663389509 order by IPFrom desc ) as dr

where

 dr.IPTo >= 3663389509

;

By using a derived table there is only one clustered index seek. No secondary indexes are needed. There is no overlap, but there can be gaps...all this is enforced by the process that fills the table. Before the old content is replaced a new one is set up and validated before it replaces the old in a single transaction.

I am kind of surprised the optimiser could not work as well with this original:

select top 1

 *

from

 ipcountry_fact

where

 3663389509 between IPFrom and IPTo

order by

 IPFrom asc

;

It is kind of obvious that the IPTo is in no way usable as a predicate, only IPfrom is available in a meaningful way. Given the top 1 and the order by, I was thinking the optimizer had no choice but to do as intended. Find the first record then test to see if the found record also matched tge IPTo condition.

In the optimized version of the query using the derived table I leave the optimizer no choice. It is sad tho that I have to do this, for a functional language one ought not to bother with many rewrites of the query to check which ones fits the optimizer best.

July 13, 2009 9:15 AM
 

Peter said:

I am not sure HOW precise your test matches the real world characteristics. But if I understand the code correctly and the test dat ais corect, this would be the fastest overall:

select

 s.StartedAt, s.EndedAt, c.AirTime

from

 dbo.Calls as c

   inner join dbo.Commercials as s on s.StartedAt = c.AirTime  -- s.StartedAt < c.AirTime and s.EndedAt >= c.AirTime

where

 c.AirTime BETWEEN {d '2008-07-01'} AND {ts '2008-07-01 03:00:00'}

;

Since every minute is represented and a call once it starts suring a commercial is always valid no matter when it ends, a direct merge join on is possible.

181 row(s) affected)

Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Commercials'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

  CPU time = 0 ms,  elapsed time = 1 ms.

July 13, 2009 10:35 AM
 

Peter said:

Ofcourse I was assuming calls and commercials are always at whole minutes :), then again that is the test data.

July 13, 2009 10:57 AM
 

Alexander Kuznetsov said:

Peter,

Commercials can start and end at any time, like from 20070101 02:35:17 to 20070101 02:35:59. Calls also can start at any time, like 20070101 02:35:37. So unfortunately joining on equality as you are suggesting, s.StartedAt = c.AirTime, is not an option.

July 13, 2009 10:58 AM
 

Alexander Kuznetsov said:

Peter,

I agree that test data should have seconds, to avoid confusion.

July 13, 2009 11:34 AM
 

Quassnoi said:

An excellent article.

> When I was able to share my knowledge with the optimizer, it produced a better plan.

This is what I call "SQL optimization in one short sentence" :)

July 21, 2009 2:52 AM
 

dd said:

"It is sad tho that I have to do this, for a functional language one ought not to bother with many rewrites of the query to check which ones fits the optimizer best."

Exactly, it should be up to the language to rewrite queries to make them fast.  Having to tweak queries like this is ridiculous.

September 8, 2009 6:12 AM
 

MGeek said:

Thanks very much this article helped alot and we are utilizing Cross Apply in our stored procedures and gained alot of performance benefit.

June 6, 2010 10:42 PM
 

Alberto said:

Given your data, I think you should have one table that just has start and end times and an ID, and your table of commercials should just have an ID.  This will make it clear than commercials don't overlap, and make the table easier to query.  Just my 2c.  Thanks for the info, I didn't know this.

January 27, 2011 9:51 AM
 

AlexK said:

Alberto,

I don't really see how your suggestion could work. Can you give a more detailed example? Thanks.

February 4, 2011 5:54 PM
 

Bala said:

good one.

Thanks

February 22, 2011 4:35 PM
 

John said:

Thanks for the nice article Alexander. Also, appreciate the nice gesture of providing the test data for running the queries.

June 1, 2012 6:10 AM
 

Erik Eckhardt said:

I've used CROSS APPLY to cause join to a linked server to perform hundreds of times better--it basically stopped the server from bringing entire tables across the network and instead made it query for only the exact data needed.

December 26, 2012 10:18 PM
 

Chris Williamson said:

I like your article, but your two queries cannot be compared equally. Your performance improvement is because of the TOP/ORDER BY, not the APPLY. With different data, these queries will return different results.

Add this record (an overlapped commercial):

INSERT  Commercials

VALUES  ('2008-07-01 00:00:59' ,'2008-07-01 00:01:59' ,'Show #262082b ')

The JOIN query will now return 182 records and the APPLY will return only 181. Removing the TOP/ORDER BY portion of the APPLY statement will provide the same result set in about the same amount of time as the JOIN. The execution plans are very similar. The APPLYs CPU, reads and duration were negligably higher (we're talking 0.2% at the worst).

APPLY is a very handy join type and I have run into several times it had better performance. It has been used best when needing to return a TOP X of another table. For example, needing to join 'Users' to TOP 1 of 'Contracts'. If a user happen to have more than one contract, a JOIN would duplicate the user record. Or, join 'Users' to TOP 5 'Invoices' to show a purchase history, etc.

January 9, 2013 5:27 PM
 

Chris Williamson said:

Let me restate myself a little bit... The APPLY doesn't provide the performance improvement, but it provides the ability to use TOP/ORDER BY, which provides the performance improvement. As you state in the description, the commercials "do not overlap". This assumption allows you to do the TOP for the performance increase.

Another method to implement TOP, though about twice as expensive as APPLY because there are 2 SELECT TOP statements. If only 1 field from Commercials was necessary, this method would have the same performance as APPLY.

SELECT  StartedAt = (SELECT TOP 1

                           StartedAt

                    FROM   dbo.Commercials

                    WHERE  c.airtime >= StartedAt

                           AND c.airtime < EndedAt

                    ORDER BY StartedAt DESC

                   )

      ,EndedAt = (SELECT TOP 1

                           EndedAt

                  FROM     dbo.Commercials

                  WHERE    c.airtime >= StartedAt

                           AND c.airtime < EndedAt

                  ORDER BY StartedAt DESC

                 )

      ,c.AirTime

FROM    dbo.Calls c

WHERE   c.AirTime BETWEEN '20080701' AND '20080701 03:00'

Another method, though still twice as expensive as APPLY, but won't get significantly more expensive by retreiving more fields from commercials:

SELECT s.StartedAt, s.EndedAt, t.AirTime, s.CommercialName

FROM    (SELECT StartedAt = (SELECT TOP 1

                                   StartedAt

                            FROM   dbo.Commercials

                            WHERE  c.airtime >= StartedAt

                                   AND c.airtime < EndedAt

                            ORDER BY StartedAt DESC

                           )

              ,c.AirTime

        FROM   dbo.Calls c

        WHERE  c.AirTime BETWEEN '20080701' AND '20080701 03:00'

       ) AS t

INNER JOIN dbo.Commercials AS s

       ON s.StartedAt = t.StartedAt

Great article and great exercise. Thank you.

January 9, 2013 6:29 PM

Leave a Comment

(required) 
(required) 
Submit

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on sqlblog.com, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on simple-talk.com and devx.com. Currently he works at DRW Trading Group in Chicago, where he leads a team of developers, practicing agile development, defensive programming, TDD, and database unit testing.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement