THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

Correlated subqueries do not execute once per row.

Correlated subqueries do not have to execute once per row - on the contrary, they are equivalent to outer joins, and they may have the same execution plans and the same real execution costs (if we retrieve only one column via a correlated subquery, of course)

Let me provide a script which compares the performance of a query with one correlated subquery vs. an equivalent out join. First, we need two tables for our example:

CREATE TABLE dbo.Parent
    
(
      
ID INT NOT NULL
             
PRIMARY KEY ,
      
ParentNumber INT NOT NULL
    ) ;
GO
CREATE TABLE dbo.Child
    
(
      
ID INT NOT NULL
             
PRIMARY KEY ,
      
ParentID INT NOT NULL ,
      
ChildNumber INT NOT NULL
    ) ;
  

Next, let us populate them with 512K rows each:

DECLARE @adder INT ;
SET @adder ;
INSERT  INTO dbo.Parent
        
IDParentNumber )
VALUES  1) ;

WHILE @adder 500000 
    
BEGIN ;
        
INSERT  INTO dbo.Parent
                
ID ,
                  
ParentNumber
                
)
                
SELECT  ID @adder ,
                        
ParentNumber
                
FROM    dbo.Parent ;
        
SET @adder @adder ;
    
END ;
GO

INSERT INTO dbo.Child
        
IDParentIDChildNumber )
SELECT ID-- ID - int
          
ID-- ParentID - int
          
0  -- ChildNumber - int
FROM dbo.Parent 

We want to compare the performance of two queries without the overhead of retrieveing and or materializing result sets, so I will make sure the result sets are empty. The correlated subquery approach is as follows:

SET STATISTICS IO ON ;
SET STATISTICS TIME ON ;
GO
               
SELECT  ChildNumber ,
        
ParentNumber
INTO    #t1
FROM    SELECT    ChildNumber ,
                    ( 
SELECT    ParentNumber
                      
FROM      dbo.Parent AS p
                      
WHERE     p.ID c.ParentID
                    
AS ParentNumber
          
FROM      dbo.Child AS c
        
AS t
-- no rows meet this criteria
WHERE   ChildNumber ParentNumber -123 

The execution costs are:

 Table 'Child'. Scan count 9, logical reads 2502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Parent'. Scan count 9, logical reads 2215, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The equivalent outer join looks like this (run it in another tab):

SET STATISTICS IO ON ;
SET STATISTICS TIME ON ;
GO

SELECT  ChildNumber ,
        
ParentNumber
INTO    #t2
FROM    SELECT    ChildNumber ,
                    
ParentNumber
          
FROM      dbo.Child AS c
                    
LEFT OUTER JOIN dbo.Parent AS ON p.ID c.ParentID
        
AS t
-- no rows meet this criteria
WHERE   ChildNumber ParentNumber -123 ;

Its real execution costs are the same.

More to the point, when I hit Ctrl+L in both tabs, I see the same execution plan.

Conclusion 

As we have seen, correlated subqueries do not have to execute once per each row. In fact, they are just macros - the optimizer can flatten them out and rewrite the queries involving them as outer joins.

 

 

 

 

 

 

Published Wednesday, January 20, 2010 6:08 PM by Alexander Kuznetsov
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Remus Rusanu said:

OUTER/CROSS APPLY are also flattened out:

SELECT  ChildNumber ,

       ParentNumber

INTO    #t3

FROM    ( SELECT    ChildNumber ,

                   ParentNumber

         FROM      dbo.Child AS c

                   OUTER APPLY

                   ( SELECT    ParentNumber

                     FROM      dbo.Parent AS p

                     WHERE     p.ID = c.ParentID

                   ) AS ParentNumber

       ) AS t

-- no rows meet this criteria

WHERE   ChildNumber + ParentNumber = -123 ;

January 20, 2010 8:30 PM
 

Alex K said:

Remus,

I completely agree. Thanks for pointing this out.

January 20, 2010 10:07 PM
 

Alex K said:

Remus,

I was disagreeing with the following quote from books online: "In queries that include a correlated subquery (also known as a repeating subquery), the subquery depends on the outer query for its values. This means that the subquery is executed repeatedly, once for each row that might be selected by the outer query."

January 20, 2010 10:44 PM
 

Adam Machanic said:

Remus: It depends on what you're doing. I often use APPLY to force a nested loop when I want one and the QO won't do it for me with joins... It usually works, although sometimes I do see a merge or hash instead--but when I'm doing it the inner query is generally much more complex than what you've shown, and I align everything with an outer key. It's too bad that it's not easier to force some of these things when needed, but also good that the QO can use a number of options for people who aren't trying to be control freaks.

January 21, 2010 10:11 AM
 

Alex K said:

Hi Adam,

Are you also using multi statement table valued UDFs to force nested loops? Or is it too slow in your situation?

January 21, 2010 11:29 AM
 

Dylan Finney said:

I don't believe that this is entirely correct.  Here is an example that follows Microsofts example here: http://msdn.microsoft.com/en-us/library/ms187638.aspx

SELECT [ParentNumber]

FROM [dbo].[Parent]

WHERE [Parent].[ID] = (

SELECT [ParentID]

FROM [Child]

WHERE [Child].[ParentID] = [Parent].[ID]

)

January 21, 2010 1:42 PM
 

Dylan Finney said:

Sorry, I forgot to include the statistics IO data:

Table 'Worktable'. Scan count 524288, logical reads 2837515, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

-----------------------------------------------

Table 'Child'. Scan count 1, logical reads 1368, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

-----------------------------------------------

Table 'Parent'. Scan count 5, logical reads 1213, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

January 21, 2010 1:47 PM
 

Alex K said:

Hi Dylan,

What do you mean by "I don't believe that this is entirely correct." Which "this" are you referring to?

January 21, 2010 2:08 PM
 

Dylan Finney said:

Oh sorry, I don't think that it's correct to say that correlated subqueries don't execute once per row.  Some do not, however some do.  In the example that I posted you can see that it's executing the subquery in the where clause (the same patter as the Microsoft example I posted) once per row.

January 21, 2010 2:27 PM
 

Adam Machanic said:

Alex: No, not using multistatement TVFs. Usually when I'm trying to force nested loops it's because the QO wants to use a hash match iterator, I'm seeing excessive tempdb utilization, and there's an index that can be used to do efficient seeks instead of building a hash table. In that case I think the TVF's use of tempdb would defeat the goal, but I can't say that I've tested. Are you seeing good results doing something like that?

January 21, 2010 2:32 PM
 

Alex K said:

Dylan,

I agree that the title is somewhat misleading. The first line of the post reads "Correlated subqueries do not _have_to_ execute once per row", and that "have to" should be included in the title too. I'll fix it on Monday.

BTW, after tweaking your example just a little bit we get merge join again:

SELECT  [ParentNumber]

FROM    [dbo].[Parent]

WHERE   [Parent].[ID] = ( SELECT    max([ParentID])

                         FROM      [Child]

                         WHERE     [Child].[ParentID] = [Parent].[ID]

                       ) ;

January 21, 2010 2:58 PM
 

Alex K said:

Adam,

I cannot say I have good results, that depends on the criteria, but multistatement TVFs do execute once per row, which is equivalent to nested loops. There is considerable overhead of calling them though.

January 22, 2010 10:05 AM
 

amiya said:

banda

May 6, 2010 7:33 AM

Leave a Comment

(required) 
(required) 
Submit

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on sqlblog.com, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on simple-talk.com and devx.com. Currently he works at DRW Trading Group in Chicago, where he leads a team of developers, practicing agile development, defensive programming, TDD, and database unit testing.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement