All data warehouses have the notion of an Unknown member somewhere in their make-up. That is, a dimension member to which we can map fact records when the member that we need to map to is, well, unknown. The canonical example of where the Unknown member becomes useful is in a retail environment where the retailer uses a customer loyalty card that enables them to track who purchased what items; if the customer does not have a loyalty card then they will get mapped to the unknown member in the Customer dimension.
Surrogate keys (SKs) are also particularly prevalent in data warehouses. SKs are meaningless values that get generated as part of the warehouse population process to uniquely identify a record in a dimension table and are usually integers (aside: Mladen Prajdić has a terrific discussion of the merits of SKs in a recent blog post at Why I prefer surrogate keys instead of natural keys in database design – don’t miss the comments).
It is common practice to pre-populate a dimension table with its Unknown member prior to the first real data population and moreover it is almost equally as common to artificially set that member’s surrogate key value to be –1. Indeed, Ralph Kimball even talks about doing this in his book “The Data Warehouse Toolkit”:
Map all the bad fact rows to the Unknown member for that dimension by supplying a key of –1 (or whatever your Unknown member key is)
Ralph Kimball, The Data Warehouse Toolkit
Now, that’s not quite a recommendation but it is very much in line with what is considered to be a well-known, consistent, convention.
Consistency is good. Convention is good. We like both of those things and indeed I have employed this practice of using –1 for the Unknown member surrogate key on many many projects in the last few years. Of late though I’ve started to question whether it really is a good idea and here I’ll explain where my doubt emanates from.
The main issue I have with using –1 is quite simply this: in doing so you are giving meaning to something that is inherently supposed to be meaningless. The whole point of generating surrogate key values is that they don’t mean anything and therefore have no relevance to the real-world thing they are representing. By stipulating that -1=Unknown we are violating one of the justifications for using surrogate keys in the first place. That doesn’t sit easily with me.
The common method used to discover SK values is to lookup against a dimension table using the natural key. Its not uncommon however for developers to say to themselves “I know that this is an unknown member so I won’t bother with a lookup, I’ll just hardcode –1 instead”. In other words the unknown members get treated differently to everything else; all of a sudden some inconsistency has crept into our system.
“What’s that?” I hear you say, “Inconsistency? You just told me that using –1 was supposed to be consistent?”
Hmmmm… Not only has this practice introduced some inconsistency but we’re hardcoding values into our code as well, another practice that makes me shudder slightly. Now, I will concede that this exhibits a lack of developer discipline rather than an inate problem with –1 values but still, removing temptation is no bad thing.
Ignoring half your range of values
When –1 is being used for the Unknown member you can be as sure as night following day that the data population process will start generating SK values from either 0 or 1 and count upwards. One question, WHY? What is so special about –1, 0 & 1? If you’re using 4-byte integer SKs then your range of values is –2147483648 to 2147483647 each of which is equally as meaningless as all the others so what’s the point in instantly disregarding half of them?
Again I’ll concede that this will create a problem in only a miniscule number of cases but still, if you have taken the time to choose a data type with a suitably large value range I don’t really see the point in excluding half of that range before you have even begun.
All that being said its hard to rail against common convention and for that reason I suspect I’ll continue to use –1 for Unknown members in the future because my colleagues will pressure me into it, I’ll just think twice about it before I do.
What say you dear reader? Have I managed to persuade you that using –1 is a bad idea?
UPDATE 2010-09-13: Almost a year on and I have a counter argument to the main argument that I put forward above. That is, a reason why using -1 to indicate Unknown might actually be a good thing.
On my current project we have to add a new dimension to an existing fact table. That fact table does of course have data in it and the new dimension column needs to be NOT NULLable - the typical way of handling this is to default the column to the Unknown member using a DEFAULT constraint. However, DEFAULT constraints cannot be defined with a SELECT statement to get the Unknown member surrogate key (an attempt to do so will return an error: "Subqueries are not allowed in this context. Only scalar expressions are allowed.") thus that Unknown member surrogate key needs to be a known value e.g. -1.
This limitation can be worked around using a
scalar function, but its something else to think about when considering Unknown members and their surrogate keys.
Any other issues one should be considering?