THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alexander Kuznetsov

Defensive database programming: rewriting queries with NOT IN().

The behavior of NOT IN clause may be confusing and as such it needs some explanations. Consider the following query:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)

 

Although there are more than a thousand distinct last names in AdventureWorks.Person.Contact, the query returns nothing. This may look counterintuitive to a beginner database programmer, but it actually makes perfect sense. The explanation consist of several simple steps. First of all, consider the following two queries, which are clearly equivalent:

 

SELECT LastName, FirstName FROM Person.Contact

WHERE LastName IN('Hedlund', 'Holloway', NULL)

 

SELECT LastName, FirstName FROM Person.Contact

WHERE LastName='Hedlund' OR LastName='Holloway' OR LastName=NULL

 

Note that both queries return expected results. Now, let us recall DeMorgan's theorem, which states that:

 

not (P and Q) = (not P) or (not Q)

not (P or Q) = (not P) and (not Q)

 

I am cutting and pasting from Wikipedia (http://en.wikipedia.org/wiki/De_Morgan_duality). Applying DeMorgan's theorem to this queries, it follows that these two queries are also equivalent:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)

 

SELECT LastName, FirstName FROM Person.Contact

WHERE LastName<>'Hedlund' AND LastName<>'Holloway' AND LastName<>NULL

 

Note that when I run both queries against AdventureWorks database, they both return empty result sets. The reason is simple - the condition LastName<>NULL can never be true (unless you play with ANSI_NULLS setting, which you should not be doing anyway).  Now that you understand why queries with NOT IN can return empty result sets, consider this query:

 

SELECT <some columns> FROM SomeTable s

WHERE s.LastName NOT IN(SELECT LastName FROM Person.Contact WHERE <some conditions>)

At the time of this writing Person.Contact.LastName is not nullable, and in the short term it is perfectly safe to write such queries. In the long term, however, database schemas can and do change. At some time later on the column Person.Contact.LastName might become nullable. Come to think of it, Person.Contact might become a view instead of a table. To be on the safe side, you can either rewrite your query using NOT EXISTS clause instead of NOT IN, or you can explicitly ensure that  Person.Contact.LastName is not nullable, as follows:

SELECT <some columns> FROM SomeTable s

WHERE s.LastName NOT IN(SELECT LastName FROM Person.Contact

  WHERE LastName IS NOT NULL AND <some conditions>)

 

Either way, your modified query will not break when the underlying database schema changes.

 Next post in this series:

Defensive database programming: qualifying column names
Published Tuesday, October 21, 2008 10:36 PM by Alexander Kuznetsov

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Cliff said:

NULLS are the bane of my existence. =)

For the most part, I have found that using NOT EXISTS has better performance than NOT IN unless you are working with a smaller dataset (less than 1000 rows). Mileage may very.

October 23, 2008 1:54 PM
 

Matt said:

Cliff,

Good point.  NOT EXISTS stops query execution (releases server resources and returns the result) upon the first encounter.  While there may be instances to use IN and NOT IN, they are better replaced with EXISTS or NOT EXISTS whenever possible for performance reasons aside from this articles point of clarity of function.

Nice article Alexander but the performance difference would be a good supporting argument and I think would "win over" more developers to avoid using IN and NOT IN.

October 27, 2008 9:45 AM
 

Alexander Kuznetsov said:

Cliff and Matt,

I think that in almost all the cases there is no performance difference between NOT IN() and NOT EXISTS() approaches, at least with SQL Server 2000, 2005 and 2008. To double check myself, I played with a few pairs of queries right now - I would replace NOT IN with NOT EXISTS and see exactly the same plan and the same real execution costs.

The optimizer is not stupid, it usually recognizes that both approaches are equivalent.

Can you post a repro when NOT EXISTS() performs better?

October 27, 2008 10:14 AM
 

Alexander Kuznetsov said:

The following pattern is quite common in database programming: IF EXISTS(some query) BEGIN DO SOMETHING;

November 27, 2008 10:29 PM
 

Alexander Kuznetsov said:

It is well known that UPDATE ... FROM command does not detect ambiguities. Also it well known that ANSI

December 8, 2008 11:04 AM
 

Alexander Kuznetsov said:

Suppose that you need to implement the following logic: IF(row exists) Update the row ELSE Insert a new

December 14, 2008 8:37 PM
 

Denis Gobo said:

Wow, it has been already a year since I wrote A year in review, The 21 + 1 best blog posts on SQLBlog

December 31, 2008 10:38 AM
 

Alexander Kuznetsov said:

Comparing SET vs. SELECT is a very popular topic, and much of what I have to say has been said before.

January 25, 2009 5:57 PM
 

Alexander Kuznetsov said:

I have been posting examples of defensive database programming for some time now. I am by no means done

March 8, 2009 9:49 PM
 

Alexander Kuznetsov said:

I have written up two examples when a SET ROWCOUNT command breaks a seemingly working stored procedure

March 21, 2009 11:05 PM
 

Niloct said:

Thanks man, I pinpointed a weakness in my procedure with took me 3 days of debugging, and I was thinking about server configuration or exploits, but this was the real cause.

Thanks indeed!

May 4, 2009 10:42 AM
 

Niloct said:

Alternatively, you could issue an IsNull(column, 0) as the main column selected by the subquery, so it will also never breaks outer code.

Of course, those things only come to surface after a lot of pain.... :-/

May 4, 2009 10:51 AM
 

Alexander Kuznetsov said:

There are three kinds of triggers: those which blow up and those which fail silently ;). Seriously, there

May 11, 2009 9:19 PM
 

Alexander Kuznetsov said:

My query used to work, but it blows up after I have added an index? The following query is not safe:

July 11, 2009 11:14 PM
 

Alexander Kuznetsov said:

You cannot assume that the conditions in your WHERE clause will evaluate in the left-to-write order -

July 16, 2009 5:41 PM

Leave a Comment

(required) 
(required) 
Submit

About Alexander Kuznetsov

Alex Kuznetsov has been working with object oriented languages, mostly C# and C++, as well as with databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. He regularly blogs on sqlblog.com, mostly about database unit testing, defensive programming, and query optimization. Alex has written a book entitled "Defensive Database Programming with Transact-SQL" and several articles on simple-talk.com and devx.com. Currently he works at DRW Trading Group in Chicago, where he leads a team of developers, practicing agile development, defensive programming, TDD, and database unit testing.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement