Those who are familiar with Ralph Kimball’s theories on data warehouse modelling will be aware of his assertion that a type 2 dimension member record should have a StartDate and an EndDate that define the effective period of that record. He outlines this approach in his paper Slowly Changing Dimensions, Types 2 & 3 from the October 2008 edition of DMReview:
I recommend adding five additional fields to a dimension that is undergoing Type 2 processing… begin-effective-datetime & end-effective-datetime are full time stamps that represent the span of time between when the change became effective and when the next change becomes effective. The end-effective-datetime of a Type 2 dimension record must be exactly equal to the begin-effective-datetime of the next change for that dimension member. The most current dimension record must have an end-effective-datetime equal to a fictitious datetime far in the future.
My opinion, based on experience of implementing this technique, is that it is a bad idea, and in this blog post I’ll explain why. Let me clarify that I do not think the practice of managing dimension members as Type 2 slowly changing dimensions is a bad idea, only that maintaining an effective period using a start date and an end date is.
Framing the problem
Let’s look at an example where I am modelling customers as a type 2 dimension. Following Kimball’s advice we would implement a table something like this:
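The table illustration from the original post is not reproduced here, but a Kimball-style [Customer] dimension might look something like this (the column names follow the example discussed below; the dates and values are illustrative):

```
Id  NId      Name   HomeTown   MaritalStatus  NumberOfChildren  SCDStartDate  SCDEndDate
1   CUST001  Henry  Cambridge  S              0                 01/04/2008    27/11/2009
2   CUST001  Henry  Cambridge  M              1                 27/11/2009    31/12/9999
```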
Note how we have two records for “Henry” because at some point he got married and had a child (attributes that we treat as type 2 changes) and how the [SCDEndDate] of the first record equals the [SCDStartDate] of the second record.
Can you spot the obvious problem here? We’ve got the same piece of information in more than one place; namely the date ‘27/11/2009’. What’s the point? We know that the [SCDEndDate] of a record equals the [SCDStartDate] of the record that supersedes it, so using our old friend SQL we can easily work out the [SCDEndDate] of any record. I’ve seen too many examples of Kimball’s approach being used and I’ve hardly ever seen it not be a problem; just recently I worked on a data migration project where we had untold problems caused by the use of [SCDStartDate] and [SCDEndDate], because the database could not constrain the relationship between them and neither, seemingly, could the code that had been written to manage the data.
At the root of this problem is the fact that Kimball’s method cannot be constrained in a relational database management system (RDBMS) without a complicated and debilitating set-based check constraint; it ignores the goodness of relational theory and normalisation that data professionals have had drummed into them for years.
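To see why, consider what it would actually take to enforce Kimball’s contiguity rule in SQL Server. A CHECK constraint can only see the row being checked, so you would need something like the following trigger. This is only a sketch, assuming a Kimball-style table that does store [SCDEndDate]; the trigger name is my invention:

```sql
-- Sketch only: enforcing "each record's SCDEndDate must equal the
-- SCDStartDate of the superseding record" needs a trigger, because
-- a CHECK constraint cannot compare one row against another.
CREATE TRIGGER Customer_EnforceContiguity ON [Customer]
AFTER INSERT, UPDATE, DELETE
AS
IF EXISTS (
   SELECT 1
   FROM   [Customer] c
   CROSS APPLY ( SELECT MIN(c2.[SCDStartDate]) AS NextStart
                 FROM   [Customer] c2
                 WHERE  c2.[NId] = c.[NId]
                 AND    c2.[SCDStartDate] > c.[SCDStartDate] ) n
   WHERE  n.NextStart IS NOT NULL
   AND    c.[SCDEndDate] <> n.NextStart
)
BEGIN
   RAISERROR('SCDEndDate must equal the SCDStartDate of the superseding record', 16, 1)
   ROLLBACK TRANSACTION
END
```

Note that the trigger has to re-scan every row for the affected dimension members on every data modification; that is the “complicated and debilitating” cost referred to above.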
A better way
Here follows how I would solve this problem; note that the syntax here is for SQL Server but this method would work perfectly well for any RDBMS.
Firstly, let’s create a table and put some data in it:
CREATE TABLE [Customer] (
   [Id] INT PRIMARY KEY
,  [NId] NVARCHAR(50)
,  [Name] NVARCHAR(50)
,  [HomeTown] NVARCHAR(50)
,  [MaritalStatus] CHAR(1)
,  [NumberOfChildren] INT
,  [SCDStartDate] DATETIME --Note only [SCDStartDate], no [SCDEndDate]
,  CONSTRAINT Customer_UK UNIQUE ([NId],[SCDStartDate])
)
--Illustrative data: apart from the 27/11/2009 date mentioned above,
--the values here are invented for the demo
INSERT [Customer] ([Id],[NId],[Name],[HomeTown],[MaritalStatus],[NumberOfChildren],[SCDStartDate])
VALUES (1, N'CUST001', N'Henry', N'Cambridge', 'S', 0, '20080401')
,      (2, N'CUST001', N'Henry', N'Cambridge', 'M', 0, '20091127')
,      (3, N'CUST001', N'Henry', N'Cambridge', 'M', 1, '20100309')
I’ve created three rows here for Henry for demo purposes. Over time we notice that Henry got married and then later had a child; note that there is no [SCDEndDate]:
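The result-set screenshot from the original post is not reproduced here; with illustrative values (only the 27/11/2009 date is taken from the example above, the rest are invented for the demo) the three rows might look like this:

```
Id  NId      Name   HomeTown   MaritalStatus  NumberOfChildren  SCDStartDate
1   CUST001  Henry  Cambridge  S              0                 01/04/2008
2   CUST001  Henry  Cambridge  M              0                 27/11/2009
3   CUST001  Henry  Cambridge  M              1                 09/03/2010
```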
Now let’s write a view that gets us the [SCDEndDate] for each record:
CREATE VIEW vCustomer
AS
SELECT   c.[Id],c.[NId],c.[Name],c.[HomeTown],c.[MaritalStatus],c.[NumberOfChildren],c.[SCDStartDate]
,        COALESCE(MIN(c2.[SCDStartDate]),CONVERT(DATETIME,'99991231')) AS [SCDEndDate]
,        CONVERT(BIT,CASE WHEN MIN(c2.[SCDStartDate]) IS NULL THEN 1 ELSE 0 END) AS [IsLatest]
FROM     [Customer] c
LEFT OUTER JOIN [Customer] c2
   ON    c.[NId] = c2.[NId]
   AND   c.[SCDStartDate] < c2.[SCDStartDate]
GROUP BY c.[Id],c.[NId],c.[Name],c.[HomeTown],c.[MaritalStatus],c.[NumberOfChildren],c.[SCDStartDate]
Here is what we get if we select from that view:
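The screenshot of the view’s output is not reproduced here; assuming the illustrative three-row data for Henry, it would look something like this:

```
Id  NId      Name   HomeTown   MaritalStatus  NumberOfChildren  SCDStartDate  SCDEndDate   IsLatest
1   CUST001  Henry  Cambridge  S              0                 01/04/2008    27/11/2009   0
2   CUST001  Henry  Cambridge  M              0                 27/11/2009    09/03/2010   0
3   CUST001  Henry  Cambridge  M              1                 09/03/2010    31/12/9999   1
```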
Note that each row for Henry has an [SCDEndDate] that is guaranteed to be the same as the [SCDStartDate] of the record that supersedes it. Note also that we are guaranteed there will be no gaps in the time period since Henry first became known to us. (Edit: Another advantage of this approach was pointed out to me by Phil Nolan in the comments below: we do not have to update a record when it gets superseded, which we would have to do were we explicitly storing [SCDEndDate].)
We have achieved this by joining the [Customer] table to itself and, for each row, finding the row that supersedes it, using the superseding row’s [SCDStartDate] as the current row’s [SCDEndDate]; if there is no superseding row we simply use a fictitious future date, exactly as Kimball suggests. Moreover, we can also derive an [IsLatest] field, which is another field that Kimball recommends we use and another whose value cannot be constrained by the relational model. We have achieved the same as Kimball’s bunkum [SCDStartDate] & [SCDEndDate] table columns and still maintained the integrity of our data.
The counter-argument to doing this is that Kimball’s method will be quicker because we don’t need an extra join to achieve the data that we want. Well, that is true, but ask yourself what is more important: querying speed or the integrity of your data? If your answer is ‘querying speed’ then you should probably ask yourself why you’re bothering to use an RDBMS at all; data integrity is the reason that we build a relational data model. [That’s a pretty controversial opinion so I’ll look forward to debating it in the comments of this blog post :)]
If you absolutely need the [SCDEndDate] persisted somewhere then I’ll concede that the view’s resultset can be materialised into a table, but really I don’t think there is much need to do so; the main use of an [SCDEndDate] is to make your ETL easier and I’m sure it’s extremely rare that an ETL process cannot cope with the slightly increased querying time created by using the method I outline herein.
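If persistence really is required, the simplest option is to rebuild a snapshot of the view as part of the ETL. A sketch only; the target table name here is my invention, not from the original example:

```sql
-- Sketch: persist the derived end dates as an ETL step.
-- [CustomerWithEndDate] is an illustrative name.
IF OBJECT_ID('CustomerWithEndDate') IS NOT NULL
   DROP TABLE [CustomerWithEndDate]
SELECT *
INTO   [CustomerWithEndDate]
FROM   vCustomer
```

An indexed view is not an option here, by the way: SQL Server’s indexed views cannot contain outer joins or self-joins, both of which vCustomer relies on.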
In this blog post I have outlined the problems that I believe are inherent in Kimball’s method of defining a time period for a slowly-changing-dimension record, and an alternative that derives the same information in a purer way. I welcome your thoughts in the comments below.
I have uploaded the script that I used herein so that you can try this technique for yourself. Download it from my Skydrive: http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20091128/DebunkingKimballMethodEffectiveDates.sql
P.S. There are some great debates going on in the comments section of this blog post so if you’re reading this using a blog aggregator/feedreader click through to http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx to read more opinions on this.
UPDATE: The debate continued over on KimballGroup.com (where all respondents disagreed with me, unsurprisingly).