I once read a scientific article on black holes that started of on the observation that, since a hole is the absence of anything and black is invisible in space, a black hole is in fact an invisible nothing – so what the heck are we talking about? Well, almost the same can be said about NULL in databases.
Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard:
null value: A special value that is used to indicate the absence of any data value.
This definition differs significantly from some common misconceptions about NULL:
· NULL is not the same as the numeric value 0, even though they are pronounced the same in some languages (like Dutch or German).
· NULL is also not the same as the string value ‘’ (also known as the empty string), even though many Oracle developers would like to believe so. However, they can’t be blamed for this as this is completely Larry Ellison’s fault.
· NULL is definitely not the same as either of the date values January 1st 1753, January 1st 1900, or December 31st 9999, even though there might be valid (performance related) reasons to use either of those values as a magic value instead of NULL in a specific situation.
· Moving to the more controversial stuff, NULL does not mean “not applicable”. Of course, a NULL in a table is often a result of the attribute not being applicable for a specific occurrence of the entity stored in the table (e.g., a column “birthday” in a customer table that stores details of businesses as well as humans) – but in other columns and other tables, or even in another row of the same table, there might be a completely different reason for the data being missing (absent)!
· And saving the best (read: most controversial) for the last, NULL is also definitely not meant to signify “unknown”. Again, a NULL in a table might result from the value being unknown at data entry time (e.g., when we forget to ask a customer for his or her birthday), but there might be other reasons as well. Unfortunately, many text books insist on explaining the behaviour of NULL in expressions by describing NULL as unknown, rather than missing or absent, causing this misconception about NULL to be the most widespread and the hardest to combat.
Allow me to dwell some more on this whole unknown/not applicable thing, as it’s able to confuse even highly-appraised scientists – as is aptly demonstrated in “Much Ado About Nothing”, an exchange between Dr. E.F. Codd and C.J. Date about NULL, in which Date gets Codd to agree that there should be more than one kind of NULL. The fact that both Codd and Date apparently missed, is that the examples used in their debate failed to meet the basic rules of normalization! Moving back to the birthday example to illustrate this, it is true that NULL in this column can have different causes – but that doesn’t change the meaning of this NULL, which is limited to “the birthday for this particular customer is not in the database”. If the business doesn’t care why a birthday is missing, then the fact that there might be different causes doesn’t have any consequences on the database. If, on the other hand, the business does case about the reason that a birthday is missing, then this reason should of course be modeled and stored in the database – but not in the same column as the birthday! “Birthday” is one attribute and “Reason birthday is missing” is a different attribute – heck, they even have completely disjunctive domains! The fact that these attributes are mutually exclusive doesn’t warrant violating first normal form by stuffing these two attributes in a single column! In this case, a proper design for a SQL Server table would look like this:
CREATE TABLE Customers
(CustomerID int NOT NULL,
Birthday datetime NULL,
ReasonNoBirthday int NULL,
-- Other columns,
CONSTRAINT PK_Customers PRIMARY KEY (CustomerID),
-- Time part for birthday has to be midnight
CONSTRAINT CK_Birthday CHECK
(CONVERT(char(8),Birthday,108)='00:00:00'),
-- Birthday mutually exclusive with ReasonNoBirthday
CONSTRAINT CK_ReasonNoBirthday CHECK
((Birthday IS NULL AND ReasonNoBirthday IS NOT NULL)
OR (Birthday IS NOT NULL AND ReasonNoBirthday IS NULL))
);
Things get more complex when nullable columns are used in expressions and predicates. In a procedural language, this wouldn’t have been a problem – if a procedural program fails to find the information it needs, it enters a conditional branch to handle this situation, as defined by the programmer. In a declarative, set-based language such as SQL, this was not possible. The alternatives were either to have the SQL developer add conditional expressions for each nullable column in a query to handle missing data, or to define a decent default behavior in SQL for missing data so that developers only have to write explicit conditional expressions if they need to override the default behavior.
The default NULL handling in expressions is very much based on how humans would handle similar situations. For instance, stop and think a moment what you would reply if I asked you to calculate my age, and to mimic database behavior, I’d also constrain your answer to be within the domain of integer values. You’d probably first ask me for my birthday. But if I refuse to specify my birthday, you would be unable to answer the question – so you wouldn’t answer it at all. And this is exactly what a database does – if any value to be used in an expression is missing (NULL) , there won’t be any result of the expression; in other words, the result is missing (NULL) as well. This is known as the rule of NULL propagation: any expression in SQL returns NULL if any of its input arguments is NULL (with the notable exception of a few functions that are specifically conceived for NULL handling). And that is a second similarity between NULL and a black hole: the gravity of a black hole pulls in everything that gets too close, causing it to “disappear” in the hole, and the rule of NULL propagation causes any expression that has a NULL in it to have no result, as if the other input values disappear as well.
Just to confuse matters more, default handling of NULL in a predicate is different. There is a valid reason for this, though. First of all, expressions are supposed to return a value that adheres to the rule of a datatype, but “Hey dude, I can’t answer that” is obviously not a valid value in any numeric, date, time, or datetime domain. It would of course be valid in a domain for character strings, but you can’t just overload what might already be a valid “regular” value with a special meaning. NULL however is valid, in any domain. So there really was no other choice but to return NULL if input data for an expression is missing. The reason that predicates can’t be handled the same way, is that a predicate is typically used in a WHERE expression, so there has to be a result for each row – even if some input data is missing, the DBMS still has to decide whether or not to include the row in the output. This problem was also solved by mimicking what humans would do. Suppose I’d ask you to tell me whether I am older than thirty-five. This time, I won’t restrict your answer to any domain, but I do force you to give an answer. And of course, I still refuse to disclose my date of birth. So, since you obviously can’t say “yes” or “no” without having a 50% chance of being wrong, what will your answer be? Probably some more or less polite variation of “bugger off, dude, how am I supposed to know that if you don’t disclose your birthday?” Translated back to database terms, you answer would not be true or false, but unknown. Not unknown because some input data is unknown (as some text books write), but unknown because some input data is missing.
This choice solves the problem, since we are now able to evaluate each predicate in a WHERE clause for each row, with the result being either true (row is included), false (row is omitted), or unknown (in which case the row is omitted as well). In a future post, I will cover how this affects logical expressions that involve AND, OR, or NOT to modify predicates, and how unknown is sometimes treated the same as false, yet other times treated the same as true, depending on the context of the predicate.