Two months ago, I posted the first two parts of a series about NULL. After that, I went quiet. Much to do and little time was one excuse. But to be honest, I also lost interest. However, I felt I owe my readers to conclude the series, so I have now forced myself to write and publish the last parts before moving on to more interesting subjects J.
Before reading on, I suggest that you first read (or reread) the first and second part of this series, so that we’re all on the same level again.
Finished reading? Okay, good to see you back. J
How to turn logical into illogical
In his comment to the second part of this series, David Portas pointed out the danger of equating the behaviour of Unknown in SQL’s three-valued logic to the behaviour of humans when confronted with unknown (or rather: missing) data. The problem is that the behaviour of NULL and Unknown is defined such that it will mostly produce the logical result – but that defined behaviour is then in true computer-fashion applied to all situations. A computer lacks the brain power to assess when the rules stop to make sense.
Here are two examples where the logical treatment of NULL and Unknown, when applied to a specific situation, renders completely illogical results:
DECLARE @Var int;
SET @Var = NULL;
IF @Var = 8 OR @Var <> 8
PRINT 'It either is or isn''t 8';
ELSE PRINT 'Huh? It''s not 8, but it''s not not 8 either!';
IF @Var = @Var
PRINT 'Duh, of course it''s equal to itself';
ELSE PRINT 'What the hey, it''s not even equal to itself!!';
Both conditions in the code above are immediately recognisable as tautologies for us humans. But SQL Server doesn’t attempt to identify tautologies, it just does as instructed. So in the case of the first IF statement, it first evaluated each of the individual parts: “@Var = 8” is Unknown, since the value for @Var is missing and might or might not be 8. Likewise, the evaluation of “@Var <> 8” results in Unknown as well. And then the logic table that I explained in my last post is consulted to get a result of Unknown for “Unknown OR Unknown” – and hence the ELSE is executed (and the PRINTed output should actually have read “I’m not really sure whether it’s either equal or unequal to 8!” – which of course makes no sense either). Note also that changing “@Var <> 8” to “NOT(@Var <> 8)” will not change the outcome.
The second example takes the oddity yet a step further. Surely, anything is always to itself? Well, yeah, sure it is – but the specifications of the ANSI standard disallow SQL Server from knowing that. What SQL Server does here, is that first the variable reference is replaced by its value, so that the equation to evaluate becomes “NULL = NULL” – and there is of course no way to know if two missing values are equal to each other or not.
A different, slightly better hidden form of tautology is presented in this example:
CREATE TABLE Test
(Col1 int NOT NULL PRIMARY KEY,
Col2 int NULL,
CHECK(Col2 < 10));
INSERT INTO Test (Col1, Col2)
SELECT 1, 3
UNION ALL
SELECT 2, NULL;
SELECT Col1, Col2
FROM Test
WHERE Col2 < 10;
go
For the row with Col1 = 2, the value of Col2 is missing – but because of the CHECK constraint, we know that it can never be 10 or more. And yet, SQL Server had to reject this row, based on the handling of NULL that I have been labelling as “logical” in my previous posts!
An idea that pops up in newsgroups from time to time, is to have the RDBMS recognise tautologies and react accordingly, regardless of whether the input is NULL or not. I must say that I am glad that this has, so far (touches wood), not been incorporated in either the ANSI standard or SQL Server. Sure, I do appreciate that the current situation leads to weird effects, but it is at least 100% logical. But extending the database with a list of recognised tautologies would introduce another weirdness, for I consider it to be highly unlikely that a database will ever be able to recognise all possible tautologies. Would you instantly recognise the WHERE clause below as a tautology?
WHERE Foo BETWEEN 0 AND Bar
OR SQUARE(Foo) > Foo * Bar
OR Bar <= 10;
And remember – even if you could, and if you could make a query parser do it too, I can always come up with a more complex way of saying “True”.
Unequal, but not distinct
Database developers often have to deal with testing if values are unequal – but they also have to deal with testing if values are distinct. Now you may think that “distinct” is the same as “unequal”, and you may even find that Wikipedia agrees with you – but in database land, “distinct” is in fact both distinct from and unequal to “unequal”.
The difference between equality and distinction, in database terms, is related to NULL values. For an equality comparison, the result is Unknown if either or both the arguments is NULL, as explained previously. For a distinction comparison, though, a NULL value is considered to be distinct from any non-NULL value, but the same as another NULL. It is easy to see this in practice – just run a query with a GROUP BY or a DISTINCT clause, and you will see that all NULL values are lumped together in a single group. (Unfortunately, Microsoft has also seen fit to use distinction rather than equality in its implementation of UNIQUE constraints, a clear violation of the ANSI standard that often forces us to implement awkward workarounds – vote here if you agree with me that this should be changed).
There are various occasion where a database developer will want to know if two values are distinct. A typical example would be a trigger to track changes and store them in a separate auditing table – if only two columns in a 40-column table are actually changed, you prefer not to waste log space on the other 38 columns. But a straight comparison between the old and the new value would only catch changes from a non-NULL to a different non-NULL value; changes from NULL to non-NULL and vice versa would not be audited since this comparison would result in Unknown. As would a non-change of a NULL value.
Unfortunately, Microsoft has not yet implemented the IS [NOT] DISTINCT FROM operator that was added in the SQL:1999 version of the ANSI standard. That means that instead of being able to write
WHERE OldValue IS DISTINCT FROM NewValue;
we are stuck with writing this more clumsy (and less efficient) variant:
WHERE OldValue <> NewValue
OR (OldValue IS NULL AND NewValue IS NOT NULL)
OR (OldValue IS NOT NULL AND NewValue IS NULL);
Implementing the IS [NOT] DISTINCT FROM clause has been suggested on Microsoft Connect two months ago by Steve Kass. There are already 49 votes in favour of this suggestion and Microsoft is considering it for SQL Server 2008 – you too can increase the chance that this will happen by adding your vote here.
WHERE WeAre [NOT] IN (trouble)
Another common cause of NULL-related woes is related to the use of [NOT] IN with a subquery. To understand the root cause of this troubles, it’s useful to first have a look at the exact meaning of [NOT] IN with a hardcoded value list.
In the ANSI standard, the IN clause is defined such that
WHERE Something IN (3, 5, 7);
should be considered as shorthand for
WHERE Something = 3
OR Something = 5
OR Something = 7;
and should therefore return the exact same result. Likewise,
WHERE Something NOT IN (3, 5, 7);
should return the same result as
WHERE NOT (Something = 3
OR Something = 5
OR Something = 7);
or
WHERE Something <> 3
AND Something <> 5
AND Something <> 7;
This definition is to be taken very literally. Even when a NULL is inserted in the value list, the definition still holds. So
WHERE Something IN (3, NULL, 7);
is considered as shorthand for
WHERE Something = 3
OR Something = NULL
OR Something = 7;
and returns the same results. If Something = 3, the result will be True (True OR Unknown OR False) and the row is included. If Something = 2, the result is Unknown (False OR Unknown OR False), and the row is not included. No surprises so far. And if Something is NULL, the result ends up as Unknown (Unknown OR Unknown OR Unknown) as well, so this row is excluded as well – this may be slightly surprising, but not really a big problem for most.
The nasty surprise sits in the NOT IN variant:
WHERE Something NOT IN (3, NULL, 7);
is shorthand for
WHERE Something <> 3
AND Something <> NULL
AND Something <> 7;
and this will never return any row at all! See, if Something is 3, the result is of course False (False AND Unknown AND True) and the row is omitted, as expected. For a NULL value of Unknown, the result is Unknown (Unknown AND Unknown AND Unknown), so this row is omitted as well. But the kicker is what happens if Something = 2 … in that case, the result is not True (as most would expect), but Unknown (True AND Unknown AND True) and this row is also omitted from the result. After all, the database can’t be sure that the missing value is accidentally equal to 2, and doesn’t want to risk returning rows that don’t qualify!
Mind you, I have never ever seen anyone actually use a NULL in the hardcoded value list of a NOT IN clause, so what’s the deal anyway? Well, it’s simple – the defined behaviour of a [NOT] IN (subquery) clause is to evaluate the subquery, replace it with the list of values returned from the subquery, and then evaluate the [NOT] IN according to the rules discussed above. (Note that the query optimizer is free to, and will often, choose another method of actually executing the query as long as it doesn’t change the results). So if Foo.Bar can be NULL, the result of
WHERE Something NOT IN
(SELECT Bar
FROM Foo
WHERE Yadda yadda);
will be just as empty as the result of
WHERE Something NOT IN (3, NULL, 7);
The easy workaround is of course to explicitly exclude the NULLs from the subquery:
WHERE Something NOT IN
(SELECT Bar
FROM Foo
WHERE Yadda yadda
AND Bar IS NOT NULL);
A different solution is to rewrite the NOT IN to a subquery with NOT EXISTS:
WHERE NOT EXISTS
(SELECT *
FROM Foo
WHERE Yadda yadda
AND Bar = Something);
This also provides a nice workaround for the current non-existing support for row value constructors in SQL Server (also on Connect, so vote here if you deem this important). You can not write
WHERE (Something, Splunge) NOT IN
(SELECT Bar, Splat
FROM Foo
WHERE Yadda yadda);
but you can use this:
WHERE NOT EXISTS
(SELECT *
FROM Foo
WHERE Yadda yadda
AND Bar = Something
AND Splat = Splunge);
My advice is to strictly avoid [NOT] IN (subquery) and always use [NOT] EXISTS (subquery) instead. Not only because the latter avoids the NULL issues, but also because [NOT] EXISTS can be used in cases where [NOT] IN isn’t possible. Plus, I have never seen a case where [NOT] IS performs better than [NOT] EXISTS (but I have seen the reverse).
There is of course nothing wrong with using [NOT] IN (value, value, …)
What if null if null is null null null is null?
I promised to answer this question – and now I see that this is already my longest blog post ever, and I’m not even near the answer yet. So I’ll just go ahead and tell you that, if null if null is null null null is null, everything is okay and the product is working as it should be, but for one caveat. Why, and what caveat, that’ll have to wait until I find the time to wrap up and post the final part of this series on NULL. (And no, I won’t keep you waiting for another two months – promise!)