THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Considering surrogate keys for Unknown members

All data warehouses have the notion of an Unknown member somewhere in their make-up. That is, a dimension member to which we can map fact records when the member that we need to map to is, well, unknown. The canonical example of where the Unknown member becomes useful is in a retail environment where the retailer uses a customer loyalty card that enables them to track who purchased what items; if the customer does not have a loyalty card then they will get mapped to the unknown member in the Customer dimension.

Surrogate keys (SKs) are also particularly prevalent in data warehouses. SKs are meaningless values that get generated as part of the warehouse population process to uniquely identify a record in a dimension table and are usually integers (aside: Mladen Prajdić has a terrific discussion of the merits of SKs in a recent blog post at Why I prefer surrogate keys instead of natural keys in database design – don’t miss the comments).

It is common practice to pre-populate a dimension table with its Unknown member prior to the first real data population and moreover it is almost equally as common to artificially set that member’s surrogate key value to be –1. Indeed, Ralph Kimball even talks about doing this in his book “The Data Warehouse Toolkit”:

Map all the bad fact rows to the Unknown member for that dimension by supplying a key of –1 (or whatever your Unknown member key is)
Ralph Kimball, The Data Warehouse Toolkit

Now, that’s not quite a recommendation but it is very much in line with what is considered to be a well-known, consistent, convention.

 

Consistency is good. Convention is good. We like both of those things and indeed I have employed this practice of using –1 for the Unknown member surrogate key on many many projects in the last few years. Of late though I’ve started to question whether it really is a good idea and here I’ll explain where my doubt emanates from.

Meaningless meanings

The main issue I have with using –1 is quite simply this: in doing so you are giving meaning to something that is inherently supposed to be meaningless. The whole point of generating surrogate key values is that they don’t mean anything and therefore have no relevance to the real-world thing they are representing. By stipulating that -1=Unknown we are violating one of the justifications for using surrogate keys in the first place. That doesn’t sit easily with me.

Lax development

The common method used to discover SK values is to lookup against a dimension table using the natural key. Its not uncommon however for developers to say to themselves “I know that this is an unknown member so I won’t bother with a lookup, I’ll just hardcode –1 instead”. In other words the unknown members get treated differently to everything else; all of a sudden some inconsistency has crept into our system.

“What’s that?” I hear you say, “Inconsistency? You just told me that using –1 was supposed to be consistent?”

Hmmmm… Not only has this practice introduced some inconsistency but we’re hardcoding values into our code as well, another practice that makes me shudder slightly. Now, I will concede that this exhibits a lack of developer discipline rather than an inate problem with –1 values but still, removing temptation is no bad thing.

Ignoring half your range of values

When –1 is being used for the Unknown member you can be as sure as night following day that the data population process will start generating SK values from either 0 or 1 and count upwards. One question, WHY? What is so special about –1, 0 & 1? If you’re using 4-byte integer SKs then your range of values is –2147483648 to 2147483647 each of which is equally as meaningless as all the others so what’s the point in instantly disregarding half of them?

Again I’ll concede that this will create a problem in only a miniscule number of cases but still, if you have taken the time to choose a data type with a suitably large value range I don’t really see the point in excluding half of that range before you have even begun.

 

All that being said its hard to rail against common convention and for that reason I suspect I’ll continue to use –1 for Unknown members in the future because my colleagues will pressure me into it, I’ll just think twice about it before I do.

What say you dear reader? Have I managed to persuade you that using –1 is a bad idea?

@JamieT

UPDATE 2010-09-13: Almost a year on and I have a counter argument to the main argument that I put forward above. That is, a reason why using -1 to indicate Unknown might actually be a good thing.

On my current project we have to add a new dimension to an existing fact table. That fact table does of course have data in it and the new dimension column needs to be NOT NULLable - the typical way of handling this is to default the column to the Unknown member using a DEFAULT constraint. However, DEFAULT constraints cannot be defined with a SELECT statement to get the Unknown member surrogate key (an attempt to do so will return an error: "Subqueries are not allowed in this context. Only scalar expressions are allowed.") thus that Unknown member surrogate key needs to be a known value e.g. -1.

This limitation can be worked around using a scalar function, but its something else to think about when considering Unknown members and their surrogate keys.

Any other issues one should be considering? 

Published Wednesday, October 14, 2009 11:35 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Mladen said:

the only really great reason i was given for not using int min value for the unknown is that the data compresion in 2008 is useless on it.

if you start at -1, 0 you get at least great compression for first "few" rows.

October 14, 2009 5:45 PM
 

SkullKiller said:

Hi.

You posted a very valid issue. With the "unknown member SK" we are really giving a meaning to something meaningless. But is it really true? It can mean so many things. It's like the null value: null != null, null = null, null = "nothing else", null != "nothing else". It means so many things and means nothing at the same time. We are always trying to give it a meaning (absence of value, no value exists, ...). The key to this is definition: it means what we define.

So I see nothing wrong with the "unknown member SK". On the other hand, when we use the "unknown member SK" in hierarquies sometimes we are identifying it as the "unknown member" of something, let's say, "unknown member of laptop category". This way we sometimes use "-[0,9]*" member.

To ease things a bit when I use this kind of SK I rather call it "unindentified @atributemember" instead of "unknown @atributemember", this way I don't even think of the meaning issue given that it has a meaning.

October 14, 2009 6:48 PM
 

Ben E said:

Hear!Hear!  There is a reasonable rationale for this in OLAP frameworks that don't natively support the concept of an unknown member, but SSAS doesn't have this limitation.  I accept that in 10+ years of working with relational databases I have never once seen a negative integer surrogate key; but then neither have I seen SQL injection attempted against any websites I've developed, cross-site scripting attacks, et al.

It is about inferring meaning, which in modern era of not having business logic enforced in the database is even more important.

October 15, 2009 5:39 AM
 

Eric Wisdahl said:

Great question!  I have thought about the negative SKs before and must say that the main reason why I believe that people don't use them is that they don't expect to ever need a SK larger than the max int value.  

I suspect as more an more data is generated and examined we will begin to see a trend towards using "the other half" of the int region for SKs as well.

As to the actual -1 for an unknown member?  Well, I would say that there is little harm in the consistent use of this value.  The main risk is from lazy programming...

October 15, 2009 9:02 AM
 

Dave said:

I'll just be happy if I can get my team to stop hardcoding SK values generally in our production code.  For example in a report, rather than joining to the lookup table and filtering on the business descriptive fields all too often the code will filter on an SK.

Where

Store = 5

Apparently it's fine, the SK will never change.  Well we better hope it doesn't as if it did we'd have to recode half our system.  

I agree that there is no need to use -1, but there is no great need to not use it either.  Making use of -ve numbers seems to make sense though, especially as we recently hit the upper limit on a table with a few hundred million rows!  Big Int ftw \o/

October 16, 2009 9:52 AM
 

jamiet said:

Dave,

Ouch, hardcoding SK values in REPORTS? Jeez...that's a recipe for disaster. Does the report work in different environments? Surely the SKs aren't guaranteed?

-Jamie

October 16, 2009 10:29 AM
 

Everest said:

I not only used negative SKs recently, but purposefully let the mainframe creating + surrogates so SQL Server could assign negative surrogates whenever the key was generated from the internet. In either case it's as easy to go forwards and backwards and avoided collision.

Good stuff, Jamie! Thanks.

Lee

October 19, 2009 9:48 PM
 

Ralph Wilson said:

I haven't thought about using the -1 (or other negative values) for surrogate keys before reading this post.  Now, though, I am wondering about using zero (0) instead of the -1.  It seems to me that there could be a psychological and "reasonable" basis for using zero . . . since you have zero information about the Unkown. ;-)

October 23, 2009 11:03 AM
 

Kristian Wedberg said:

Hey Jamie, on the meaninglessness:

IMHO a so called key decision in a data warehouse with SKs is whether Surrogate Keys are static, or if Business Keys are static, i.e. the warehouse should guarantee that one or the other won't change over time (without also changing its meaning.) This way the warehouse users (e.g. reports) know whether to hard code Surrogate Keys or Business Keys in filters etc. The decision doesn't have to be the same across the whole warehouse, but consistency is obviously preferable.

Examples:

* If an organisation (or country, or person, or...) changes its name (assuming this is the Business Key), and I still want to treat it as the same organisation, I should make the Surrogate Keys static and hard code them in reports, but allow Business Keys to vary. This way the reports will still filter on the correct entry, even though the displayed name has changed.

* If I want the Business Key change to imply this is a separate organisation, then reports should hard code business keys instead.

In the first example we're adding even more meaning to SKs (beyond the -1 meaning), and as long as it's a conscious and communicated decision I believe it's a very good one.

Tossing in even more meaning into the SK mix, instead of just using -1 = (Unknown), it's often good to use -1 = (Not Available) and -2 = (Not Applicable) to better show _why_ there is missing data.

Thoughts?

November 7, 2009 10:53 AM
 

jamiet said:

Hi willshak,

Thanks for the comments.

Regarding static SKs or business keys...yes, one of them should definitely stay static/stable/be immutable. Generally I use business keys in my filter predicates as it reinforces the point that SKs are supposed to be meaningless. I would expect this to be the norm though readily accept that there may be other circumstances (possible unforeseen) where the business key changes.

Regarding breaking down the reason for something being Unknown - yes, I think this is a good idea. On my most recent engagement most dimensions had an 'Unknown' & a 'Not applicable' member.

Of course, if you're using the "Not Applicable" member then it suggests that the data you're storing against it is being stored at the wrong fact grain - however there may be good justifications for doing this.

Thanks again for the cvomments, good thought provoking stuff.

-Jamie

November 8, 2009 5:19 AM
 

JamesH said:

I have been using 'Unknown' -1 : and 'Uncategorized' -2 for many years but only when the project owner (or profiling) states that it is necessary to balance or keep track of these types of data.  Unknown has primarily been used in the past (even before SSAS) to facilitate auditing and correcting the master data, if it's ignored in the onset of a DW/DM project, you can bet that over time a manager that is concerned about his bonus will eventually want to know what the heck 'Unknown' means and how it's impacting his performance rating (sales and margins).  Uncategorized is a bit different and is usually provided by the source system but can have the same effect (positive/negative) on a system.  The navigational

aspect of these comes into play with balancing and without them can create the 'blackhole of death' for a BI project.  

Concerning SK's, it is and always has been necessary for a warehouse to 'shield' itself from the inconsistencies and changes from an underlying system(s); moreover, carrying natural keys might be useful with a single source but when you start integrating multiple systems you will inevitably run into the same problems as a relational database unless you use a completely generic key.  I could care less if it is a string or numeric as long as it can be controlled.  A good ETL process is, IMHO, the proper place for assignment and control of SK's.

November 10, 2009 1:18 PM
 

JustinS707 said:

Is it too much to ask SSIS to create an inferred member surrogate key in the dimensions when the fact arrives too early?  Use of the "-1" value for the Unknown member is inefficient:

1. You must keep your natural keys in the fact table to be able to update the surrogate key at some point.  Effectively, you defeat the purpose of using the surrogate key because you must keep the natural key column for every dimension that has an unknown member.

2. You must have a process to check the dimension for all Unknown member natural keys in the fact table and update the fact table with the correct surrogate key.

I believe this convention is antiquated and a lazy coding practice.

September 5, 2010 3:12 PM
 

SSIS Junkie said:

Earlier today I posted the following question on Twitter: Foreign keys in a data warehouse. Yes or no?

April 16, 2013 5:36 PM
 

James Snape said:

Hi Jamie,

I read this blog post when you originally posted it and I used to be with the -1 crowd but in my last two projects I've decided to try "letting null be null"... We do apply some rules but overall it was most liberating.

1. If Null or <empty string> have meaning in the source system then replace the null or empty string with the meaning (e.g. NULL > N/A, '' > Empty) BEFORE the surrogate key mapping stage. Then these values will be mapped using the same mapper.

2. We consider null to be unknown once it reaches the data warehouse. We configure SSAS Unknown member handling properly.

I was worried about doing this but in reality we have far fewer issues than we ever did with -1s and -2s. It even helps us maintain a ROLAP realtime partition because if rows get added but the dimension hasn't caught up then the cube still works - we just get some unknown data for a while which magically fixes itself when the dimension is processed.

I do start keys at 1 because of a) data compression and b) until sequences came about it was really useful to be able to create an identity in the opposite direction when your modelling needs change without worrying about clashes - particularly around inheritance hierarchies.

Oh yeah - FK's are a must in the DW too. Otherwise none of the tools work and new starters have no idea how your model fits together. I consider it unprofessional to not have them although if there is a proven performance hit I might keep them as metadata only FK's.

April 17, 2013 6:01 AM
 

jamiet said:

Great comment, thank you James. Definitely some food for thought there.

The "data compression" argument for starting keys at 1 is something I've heard before. I'm wrestling with whether that's a good enough justification or not. Everything's a trade-off.

cheers

Jamie

April 17, 2013 6:06 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement