THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Andrew Kelly

Exists Vs. Count(*) - The battle never ends...

    I am still amazed at how many of the database applications written today still disregard some basic rules of thumb when it comes to accessing the data. One in particular is the use of COUNT(*) to check to see if there are any rows that match some criteria. The technique of using EXISTS over COUNT(*) has been widely publicized and is in pretty much every best practices document I have come across. So why are database developers still using COUNT(*) instead of EXISTS all over the place?  If it is because people just don't believe or recognize the benefits of it? Or is the concept of EXISTS difficult for new programmers to grasp?  I am not really sure since I have heard both views. So to cover both bases I will show a little demo of why EXISTS is almost always a better way to code when you simply need to see if there is at least 1 row that matches some condition in the WHERE clause. Of course if you really need to know exactly how many match that condition then COUNT(*) is appropriate so hopefully this won't confuse anyone in that regard.

Lets use the Adventureworks database and turn statistics IO on so we can see the number of reads associated with each query. We will then compare COUNT(*) with EXISTS so there is no mistake on how much work is being done in relation to each other.

USE Adventureworks
GO

SET STATISTICS IO ON

GO

 

First lets look at a situation in which there is an index to satisfy the WHERE clause and there are only 2 matching rows:

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ProductID = 870 ) > 0

    Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ProductID = 870)
    Print 'Yes'

Table 'SalesOrderDetail'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

As we can see from the Logical reads there was only a difference of 1 between the two queries since the amount of matching rows was so small. So in this case there was not a significant difference although it was still 1/3 more expensive to use COUNT(*) instead of EXISTS.

 

Now lets do the same but with 4688 matching rows:

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ProductID = 897 ) > 0

    Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ProductID = 897)
    Print 'Yes'

 

Table 'SalesOrderDetail'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

Now the cost for the COUNT(*) is over 5 times the EXISTS. This still may not seem like a lot to you. But if you were calling this queries thousands of times a second this would be a big deal. And remember this index is pretty small overall still.

 

OK now lets try this on a column with no index in that same table. There are 357 rows that match but there are a total of 121,317 in the table.

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ModifiedDate = '20010701 00:00:00.000' ) > 0

    Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ModifiedDate = '20010701 00:00:00.000')
    Print 'Yes'

 

Table 'SalesOrderDetail'. Scan count 1, logical reads 1241, physical reads 0, read-ahead reads 331, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

This is now almost 250 times more expensive to do a COUNT(*) vs. an EXISTS. Both queries scanned the table but the EXISTS was able to at least do a partial scan do to the fact it can stop after it finds the very first matching row. Where as the COUNT(*) must read each and every row in the entire table to determine if they match the criteria and how many there are. That is the key folks. The ability to stop working after the first row that meets the criteria of the WHERE clause is what makes EXISTS so efficient.  The optimizer knows of this behavior and can factor that in as well. Now keep in mind that these tables are relatively small compared to most databases in the real world. So the figures of the COUNT(*) queries would be multiplied many times on larger tables. You could easily get hundred's of thousands of reads or more on tables with millions of rows but the EXISTS will still only have just a few reads on any queries that can use an index to satisfy the WHERE clause.

Hopefully this will help to persuade those last holdouts who insist on using COUNT(*) everywhere even when EXISTS is the clear choice.  One last note, make sure to turn off the statistics IO when done.

SET STATISTICS IO OFF


Published Saturday, December 15, 2007 9:27 PM by Andrew Kelly

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Adam Machanic said:

Hi Andy,

It's interesting to note that in SQL Server 2005 if there is a index available to allow a seek, the COUNT(*) > 0 test will be optimized and behave the same as EXISTS.  That said, I totally agree -- never use COUNT(*)!  But at least the optimizer can help sometimes...

Following is a demo of this in case anyone is interested:

---

create index ix_modified

on sales.salesorderdetail (modifieddate)

include (CarrierTrackingNumber,OrderQty,ProductID,SpecialOfferID,

UnitPrice,UnitPriceDiscount,LineTotal,rowguid)

go

set statistics io on

go

raiserror ('full count', 10, 1) with nowait

SELECT COUNT(*) FROM sales.salesorderdetail

WHERE ModifiedDate = '20010701 00:00:00.000'

go

raiserror ('if count(*)', 10, 1) with nowait

IF (SELECT COUNT(*) FROM sales.salesorderdetail

WHERE ModifiedDate = '20010701 00:00:00.000' ) > 0

   Print 'Yes'

go

raiserror ('if exists', 10, 1) with nowait

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ModifiedDate = '20010701 00:00:00.000')

   Print 'Yes'

go

drop index ix_modified on sales.salesorderdetail

go

---

December 15, 2007 8:53 PM
 

Andrew Kelly said:

Yes I debated if I should add that tidbit or not :).  It does not work that way in 2000 and to me EXISTS is much clearer as to the intent to me and makes it easier to read the code.

December 15, 2007 9:45 PM
 

Linchi Shea said:

DB2 has an interesting query rewrite feature that is built into the database engine. "IF (SELECT COUNT(*) FROM sales.salesorderdetail WHERE ModifiedDate = '20010701 00:00:00.000' ) > 0" would be a perfect candidate for such a rewrite. I wonder whether SQL2005 has started the work to add a query rewrite layer, or this improvement is built into the query optimizer. It seems to me that it would make more sense to rewrite it behind the scene before sending it to the optimizer than to tinker with the optimizer for these specific cases.

December 16, 2007 1:35 AM
 

Denis Gobo said:

December 16, 2007 6:55 AM
 

DotNetKicks.com said:

You've been kicked (a good thing) - Trackback from DotNetKicks.com

December 16, 2007 7:49 AM
 

Jason Haley said:

December 16, 2007 6:29 PM
 

gOODiDEA said:

SQLServer: ExistsVs.Count(*)

Resharper: RGreatEx-TheReSharperGreatExtension[阅读全文]文章来...

December 17, 2007 10:29 PM
 

Russ Bell said:

I'm using SQL Server 9.0.3175 and the last select yields the same with either method:

Table 'SalesOrderDetail'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Yes

That said, I do like the EXISTS syntax.

Another not too well publicized fact is that selecting a constant instead of * in the COUNT function can dramatically reduce IO.  For example:  SELECT COUNT(0) FROM sales.salesorderdetail.  Any constant will do.  This is particularly helpful when querying a wide table with many columns.

December 24, 2007 9:27 AM
 

dbaduck said:

Also, to mention that it is also helpful when using EXISTS to use a constant instead of a *, especially for wide tables as was indicated above.

EXISTS (SELECCT

January 4, 2008 12:53 PM
 

dbaduck said:

Let me try that again.

Also, to mention that it is also helpful when using EXISTS to use a constant instead of a *, especially for wide tables as was indicated above.

EXISTS (SELECT 1 FROM sales.salesorderdetail WHERE conditionexists)

January 4, 2008 12:54 PM
 

Andrew Kelly said:

Actually a * is preferred in this case.  It does not mean the same as in a select list where it returns all columns. It gives the optimizer it's choice of how to best run the query. A constant like 1 should work fine as well but never use a column name.

January 4, 2008 3:33 PM
 

ML said:

Actually, the SELECT clause of the query used in the EXISTS predicate is irrelevant.

Compare the execution plans. The optimizer is smart enough.

January 8, 2008 4:54 AM
 

Andrew Kelly said:

ML is correct.  Inside the EXISTS sub query it has little meaning.  I was still thinking of the outer query and count(x).

January 8, 2008 9:05 AM
 

TheSQLGuru said:

As you say, this is indeed a common practice by the unknowing.  Good examples, and I agree with your choice to not mention a few of the deeper issues pointed out by posters.

One very minor point is that 2 reads to 3 reads is a 50% increase as opposed to a 33% increase - which makes your point even more strongly. :-))

January 12, 2008 1:46 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement