
Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Debunking Kimball Effective Dates

Those who are familiar with Ralph Kimball’s theories on data warehouse modelling will know his assertion that a type 2 dimension member record should have a StartDate and an EndDate that define the effective period of that record. He outlines this approach in his paper Slowly Changing Dimensions, Types 2 & 3 from the October 2008 edition of DMReview:

I recommend adding five additional fields to a dimension that is undergoing Type 2 processing… begin-effective-datetime & end-effective-datetime are full time stamps that represent the span of time between when the change became effective and when the next change becomes effective. The end-effective-datetime of a Type 2 dimension record must be exactly equal to the begin-effective-datetime of the next change for that dimension member. The most current dimension record must have an end-effective-datetime equal to a fictitious datetime far in the future.

My opinion, based on experience of implementing this technique, is that it is a bad idea, and in this blog post I’ll explain why. Let me clarify that I do not think the practice of managing dimension members as Type 2 slowly changing dimensions is a bad idea, only that maintaining an effective period using a start date and an end date is.

Framing the problem

Let’s look at an example where I am modelling customers as a type 2 dimension. Following Kimball’s advice we would implement a table something like this:

[Image: a Customer dimension table implemented Kimball-style, with two records for Henry, each carrying both an [SCDStartDate] and an [SCDEndDate]]

Note how we have two records for “Henry” because at some point he got married and had a child (attributes that we treat as type 2 changes) and how the [SCDEndDate] of the first record equals the [SCDStartDate] of the second record.

Can you spot the obvious problem here? We’ve got the same piece of information in more than one place; namely the date ‘27/11/2009’. What’s the point? We know that the [SCDEndDate] of a record equals the [SCDStartDate] of the record that supersedes it, so using our old friend SQL we can easily work out the [SCDEndDate] of any record. I’ve seen too many examples of Kimball’s approach being used and I’ve hardly ever seen it not be a problem; just recently I worked on a data migration project and we had untold problems created by the use of [SCDStartDate] and [SCDEndDate] because the database could not constrain this relationship and nor, seemingly, could the code that had been written to manage the data.

At the root of this problem is the fact that Kimball’s method cannot be constrained in a relational database management system (RDBMS) without a complicated and debilitating set-based check constraint; it ignores the relational theory and normalisation that data professionals have had drummed into them for years.
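To make the failure mode concrete, suppose the table pictured above were implemented with both columns (call it [CustomerKimball], purely for the sake of illustration). Nothing in the schema prevents statements like the following from quietly corrupting the effective periods:

UPDATE [CustomerKimball]
SET    [SCDEndDate] = CONVERT(DATETIME,'20091125') --creates a gap before the superseding record
WHERE  [Id] = 1;

UPDATE [CustomerKimball]
SET    [SCDEndDate] = CONVERT(DATETIME,'99991231') --leaves two "open-ended" records for the same customer
WHERE  [Id] = 2;

Both statements succeed, and from that point on any query that relies on the stored [SCDEndDate] returns wrong answers.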

A better way

Here follows how I would solve this problem; note that the syntax here is for SQL Server but this method would work perfectly well for any RDBMS.

Firstly, let’s create a table and put some data in it:

CREATE TABLE [Customer] (
   [Id]                INT             PRIMARY KEY
,  [NId]               NVARCHAR(50)
,  [Name]              NVARCHAR(50)
,  [HomeTown]          NVARCHAR(50)
,  [MaritalStatus]     CHAR(1)
,  [NumberOfChildren]  INT
,  [SCDStartDate]      DATETIME   --Note only [SCDStartDate], no [SCDEndDate]
,  CONSTRAINT Customer_UK UNIQUE ([NId],[SCDStartDate])
);
INSERT [Customer] ([Id],[NId],[Name],[HomeTown],[MaritalStatus],[NumberOfChildren],[SCDStartDate])
VALUES (1,'Cust001','Henry','London','S',0,CONVERT(DATETIME,'20050324'))
,      (2,'Cust001','Henry','London','M',0,CONVERT(DATETIME,'20070726'))
,      (3,'Cust002','Sarah','Birmingham','M',2,CONVERT(DATETIME,'20060213'))
,      (4,'Cust001','Henry','London','M',1,CONVERT(DATETIME,'20091127'));

I’ve created three rows here for Henry for demo purposes. Over time we notice that Henry got married and then later had a child; note that there is no [SCDEndDate]:

[Image: the four inserted rows, three of them for Henry and one for Sarah, with no [SCDEndDate] column]

Now let’s write a view that gets us the [SCDEndDate] for each record:

CREATE VIEW vCustomer  
AS
SELECT
c.[Id],c.[NId],c.[Name],c.[HomeTown],c.[MaritalStatus]
,      c.[NumberOfChildren],c.[SCDStartDate]
,      COALESCE(MIN(c2.[ScdStartDate]),CONVERT(DATETIME,'99991231')) AS [SCDEndDate]
,      CONVERT(BIT,CASE    WHEN MIN(c2.[ScdStartDate]) IS NULL THEN 1 ELSE 0 END) AS [IsLatest]
FROM   [Customer] c
LEFT   OUTER JOIN [Customer] c2
ON     c.[Nid] = c2.[Nid]
AND        c.[SCDStartDate] < c2.[SCDStartDate]
GROUP  BY c.[Id],c.[NId],c.[Name],c.[HomeTown],c.[MaritalStatus]
,      c.[NumberOfChildren],c.[SCDStartDate];

Here is what we get if we select from that view:

[Image: the output of vCustomer, each row now showing a derived [SCDEndDate] and an [IsLatest] flag]

Note that each row for Henry has an [SCDEndDate] that is guaranteed to be the same as the [SCDStartDate] of the record that supersedes it. Note also that we are guaranteed there will be no time period gaps since when Henry first became known to us. (Edit: Another advantage of this approach was pointed out to me by Phil Nolan in the comments below: we do not have to update a record when it gets superseded which we would have to do were we explicitly storing [SCDEndDate])
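To illustrate the sort of query an ETL process would typically issue against the view, here is a point-in-time lookup (the literal date is just for demonstration); it returns the version of Henry that was in effect on 1st August 2008:

SELECT [Id],[Name],[MaritalStatus],[NumberOfChildren]
FROM   vCustomer
WHERE  [NId] = 'Cust001'
AND    CONVERT(DATETIME,'20080801') >= [SCDStartDate]
AND    CONVERT(DATETIME,'20080801') <  [SCDEndDate];

The same half-open predicate (>= start, < end) works for surrogate key lookups during fact loads, and WHERE [IsLatest] = 1 gives the current member.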

We have achieved this by joining the [Customer] table to itself and, for each row, finding the row that supersedes it and using the superseding row’s [SCDStartDate] as the current row’s [SCDEndDate]; if there is no superseding row we simply use a fictitious future date, exactly as Kimball suggests. Moreover we can also derive an [IsLatest] field, another field that Kimball recommends we use and another field whose value cannot be constrained by the relational model. We have achieved the same as Kimball’s bunkum [SCDStartDate] & [SCDEndDate] table columns and still maintained the integrity of our data.

The counter argument to doing this is that Kimball’s method will be quicker because we don’t need a self-join to derive the data that we want. Well, that is true, but ask yourself what is more important, querying speed or the integrity of your data? If your answer is ‘querying speed’ then you should probably ask yourself why you’re bothering to use an RDBMS at all; data integrity is the reason that we build a relational data model. [That’s a pretty controversial opinion so I’ll look forward to debating it in the comments of this blog post :)]

If you absolutely need the [SCDEndDate] persisted somewhere then I’ll concede that the view resultset can be materialised into a table, but really I don’t think there is much need to do so; the main use of an [SCDEndDate] is to make your ETL easier and I’m sure it’s extremely rare that an ETL process cannot cope with the slightly increased querying time created by the method I outline herein.
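For completeness, here is one way that materialisation could be done if it really is needed; this is just a sketch (the target table name is illustrative) that simply rebuilds the table from the view after each load:

IF OBJECT_ID('CustomerWithEndDate') IS NOT NULL
    DROP TABLE [CustomerWithEndDate];
SELECT *
INTO   [CustomerWithEndDate]
FROM   vCustomer;

Because the table is derived entirely from [Customer] it can never disagree with it, which is the whole point.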

Conclusion

In this blog post I have outlined the problems that I believe are inherent in Kimball’s method of defining a time period for a slowly-changing-dimension record, and an alternative that achieves the same information in a purer way. I welcome your thoughts in the comments below.

I have uploaded the script that I used herein so that you can try this technique for yourself. Download it from my Skydrive: http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20091128/DebunkingKimballMethodEffectiveDates.sql

@Jamiet

P.S. There are some great debates going on in the comments section of this blog post so if you’re reading this using a blog aggregator/feedreader click through to http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx to read more opinions on this.

UPDATE: The debate continued over on KimballGroup.com (where all respondents disagreed with me, unsurprisingly).

Published Saturday, November 28, 2009 4:59 PM by jamiet


Comments

 

Alex K said:

Hi Jamie,

I am not following this: "Kimball’s method cannot be constrained in a relational database management system (RDBMS) without a complicated and debilitating set-based check constraint".

The following approach does just that via a FK constraint:

http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/08/storing-intervals-of-time-with-no-overlaps.aspx

Am I missing something?

November 28, 2009 11:13 AM
 

jamiet said:

Hi Alex,

Fair point, but it doesn't guarantee that the value being constrained is the *correct* one.

Take the example of Henry above. I could add the FK that you describe but it wouldn't stop me from specifying the [SCDEndDate] of the record where Id=1 to be '2009-11-27'. It satisfies the constraint but it's not the correct value.

I notice that your blog post shows a method with which this can be achieved using a series of check constraints, but I still prefer the method that I outline above because my personal choice is not to put too much logic into check constraints, and the method I outline reflects that.

Thanks for pointing that out - it gave me the chance to clarify.

-Jamie

November 28, 2009 12:04 PM
 

merrillaldrich said:

Jamie - I've faced the same issue, and I like your solution. Why use space and memory, and introduce the opportunity for gaps in the date ranges, when the logic indicates that gaps should not be possible?

One additional challenge with this is that if you have an automated BI tool that auto-gens SQL select queries, it might be challenging to join to the table that has only one date. For example, a tool we have at my workplace has a "join between" GUI widget that can handle the scenario "table1 join table2 where somedate between table2.start and table2.end" but perhaps not the more complex query required on the single-date version. That could mean having to fall back on the view, and then performance might be a question.

November 28, 2009 2:09 PM
 

Alex K said:

Jamie,

I still have a few questions. Regarding this: "what is more important, querying speed or the integrity of your data?" - these are not the only two choices, we can have both speed and integrity. Also the businesses which pay us money to design and implement data may have quite explicit requirements about speed. We may come up with the most elegant and brilliant solution in the world, and it will be a failure if it does not meet performance requirements.

Also can you elaborate why you made your "personal choice is not to put too much logic into check constraints".

Also there are different requirements - of course your approach makes it easy to handle this case: "over time we notice that Henry got married and then later had a child", and this is an advantage of your approach. However, consider another case:

Starting 1/1/2003, ABC company reimburses its employees up to $100 per day for meals on business trips. On 11/20/2009 they up the limit to $120 per day. However, instead of 11/20/2009, a wrong date of 11/20/2001 is entered, and there is nothing, no constraints in the database schema, to prevent that. The integrity of your data is compromised and you don't know about it. What do you think?

November 28, 2009 2:57 PM
 

Ben E said:

I think you've made a good point here - one that works all the way back into the application layer.  Your point about not (over)using constraints to enforce integrity is also a valid one - one of the key principles of data management in n-tier design (and thus implicitly also SOA) is that business logic should not be handled by the database (despite the fact that I see this principle regularly violated by people who should know better).

As for Alex's comments, quality checking data being entered into the system is a separate issue from ensuring the integrity of the data once it's there - a subtle distinction for sure, but as someone who wears both BI and application developer hats, I can assure you it makes a big difference to maintainability and reliability.  There is also the small matter of performance and scalability - the more work being done by the database that could be done elsewhere (i.e. in the business logic layers), the less scalable the database is.

The only flaw with not including an end date would happen if the man from Gallifrey became a DBA.

November 28, 2009 5:20 PM
 

jamiet said:

Hi Alex,

Happy to elaborate - can't beat a good technical debate :)

"We may come up with the most elegant and brilliant solution in the world, and it will be a failure if it does not meet performance requirements."

Very true. And I will refer to the get-out-of-jail-card that we always use, "it depends".

My personal choice is that I would sacrifice a smidgeon of performance for a solution that is perhaps purer and/or more maintainable. That's not the right approach, nor is it the wrong approach - it always depends on the scenario.

As always it's a juggling act, right? You need to weigh the perceived benefits against the requirements.

I will definitely say though that all things being equal, I would invariably sacrifice that tiny bit of performance for what I see as a more maintainable and, yes, purer solution. Of course, we have to balance purity with pragmatism - and that's what makes these debates so interesting!

"personal choice is not to put too much logic into check constraints".

I prefer to use keys wherever possible. Personal preference is sometimes a hard thing to qualify in words. I guess because:

a) I think keys are more "front-and-centre". By that I mean they're quite often the first things people look for when they want to understand a data model. First thing I do when I encounter a new database? Open up a database diagram and throw all the tables into it and see what the FKs tell me.

b) More intuitive, they explain the structure of a data model better than a check constraint can because the scope of what a check constraint can do is pretty large. Keys (both UKs & FKs) have a very specific job to do and the information they provide is unambiguous and easily understood.

"this is an advantage of your approach"

Yes, I won't disagree with that.

Regarding the scenario you outlined - user error is a hard thing to protect against in any data model would you not agree? I don't think that negates my approach tho unless there is something I have misunderstood (which is eminently possible).

I'll look forward to debating it over a beer at the summit in February :) , but in the meantime I await your reply here with interest. Thanks for being such a willing commenter - it's much appreciated.

-Jamie

November 28, 2009 5:24 PM
 

Jim said:

Why not end the first row with an SCDEndDate of 26/11/2009?

November 28, 2009 7:11 PM
 

Alex K said:

Jamie,

I was thinking about performance and scalability and such, and I distinguish several different patterns in how the data is going to be used. Two of the common patterns relevant for this discussion are as follows:

1. Security video in a cinema: the data must save real time with no delays. Storage should not be wasted, we need an efficient way to store as much data per gigabyte of storage as possible. Data integrity is important. However, the data is read very infrequently, and optimizing the speed of selects is not a high priority either. This is where your approach shines.

2. Movie showtimes: the data is infrequently inserted, rarely updated, and very frequently viewed. It is common to have a requirement that selects must return in specified time, otherwise the clients would turn away. On the other hand, it is typically OK if the modifications are slowish. This is where it might make sense to denormalize.

Most systems are somewhere in between these two extreme cases, and IMO we cannot choose one approach that best fits all the cases. Makes sense?

Also, I am no longer an MVP, so let us have the discussion here.

Ben,

Regarding your comments on scalability, Jamie and I are choosing whether to have slower modifications (the overhead of more constraints) vs. slower selects (the overhead of joins). In both cases the work is performed by the database engine. Also I cannot agree with the following: "the more work being done by the database that could be done elsewhere (i.e. in the business logic layers), the less scalable the database is." Filtering, joining, and aggregating can all be done outside the database, but it is usually very inefficient to do so.

November 28, 2009 8:49 PM
 

Todd McDermid said:

Great food for thought Jamie.  I was looking forward to how far you could stuff your foot into your mouth.  :)  I'm kidding - you have some good points - yet I'd still like to debate them.  I've had much, much less practical practice at this than just about anybody.  Scares the willies out of me when people assume I know what I'm talking about.

First, "we’ve got the same piece of information in more than one place."  Of course we do - it's a Kimball DW Dimension table.  We've got his hometown stored _three_ times FFS - and it's a longer char column to boot.  Duplication of data to simplify user querying is the point of the Kimball model, isn't it?

Second, using redundant columns/not using RDBMS relational integrity features "ignores the goodness of relational theory and normalisation that data professionals have had drummed into them for years."  Of course it does - it's a Kimball DW Dimension table.  (Wait - I used that argument already, didn't I?)  Dimension tables are denormalized, have immense column widths, and store insanely redundant information - completely at odds with normalization.  This practice is almost at the core of Kimball data warehousing.  They are even OK with NOT maintaining PK/FK relationships in the DW to make ETL more performant.  This is based on the knowledge that ONE and only ONE process updates the dimension table - designed by a data professional, not multiple operational systems designed by completely unqualified devs.  (IMO, the argument worked again.)

Finally, your solution has one problem with it that simply can't be overcome.  What happens when Henry is removed from the source operational system?  There's no way to "close" the last "version" of Henry.  Fundamentally, this is part of the issue I have with the design of the SCD Wizard in SSIS - there is no way to represent "deletions" in an SCD using the Wizard or your method.  Sure, you can add another date column that contains some kind of date as to when the information is considered "deleted", but that is essentially attempting to duplicate the SCD 2 End Date's mission - while not accomplishing it.  Without an end date, it is simply not possible to count the number of "active" members in a dimension without jumping through other hoops.

Regardless of the pros and cons presented, I agree with Kimball's recommendation to store and maintain both dates.  Because query speed is important.  But moreso is data understandability.  Sure, we could store the entire DW completely normalized and throw views on top to denormalize it.  But that's missing the point of why Kimball is suggesting a denormalized structure.  With a smidgen of thoughtfulness in the ETL, and/or constraints, and/or other integrity checks, gaps in the dates are simply an unacceptable lapse in what is supposed to be a data professional's job.  Shouldn't happen.  Ever.

Buy you a beer in Feb!

November 28, 2009 10:46 PM
 

Peter Schmitz said:

The end/start date scenario you described above sounds more as if it was a problem that's caused by the approach taken in programming, rather than a problem existing in the technique.

The problem in the initial post could easily be solved by having the enddate be 1 second less than midnight on the next startdate. I.e. the enddate for Henry with ID 1 would be 26/11/2009 23:59:59.

As pointed out before, speed and performance is why the two dates exist in Kimball's model. By enforcing an additional join on a dimensional table to come up with the end date, you sacrifice speed.

November 29, 2009 5:39 AM
 

jamiet said:

Peter,

Solving the problem (i.e. that a piece of data exists twice) by simply changing one of them to something else is not solving the problem.

-Jamie

November 29, 2009 6:00 AM
 

jamiet said:

Jim,

"Why not end the first row with an SCDEndDate of 26/11/2009?"

Not sure I understand what you mean, sorry. 26/11/2009 isn't a valid value for the record WHERE Id=1. Could you elaborate?

Todd,

You raised exactly the two points that I was expecting someone to raise. Thank you for being that guy! :)

I'll take the easy one first. Henry being removed from the system is an attribute of Henry, not of an individual dimension member record. In other words, we need to have an [EndDate], which is different from an [SCDEndDate]. [EndDate] is an attribute of a Customer in the same way that [NumberOfChildren] and [HomeTown] are attributes of Henry.
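Something like the following is what I have in mind (the column name is purely illustrative); a nullable attribute that records when the customer ceased to exist in the source, quite separate from any SCD metadata:

ALTER TABLE [Customer]
ADD [CustomerEndDate] DATETIME NULL; --NULL means the customer still exists in the source

The view above is unaffected because it only ever derives [SCDEndDate] from [SCDStartDate]; the new column is simply another attribute of the dimension member.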

Correct about [HomeTown] being stored 3 times, that is a different situation than the one for [SCDEndDate] though - it's not comparing apples with apples (I have a strong suspicion you will disagree with that assertion :). I accept that the Kimball model is all about speed and that is why I caveated what I said above with "the view resultset can be materialised into a table".

On the recent project that I spoke of we were doing exactly that (materialising data into a denormalised form), however the [SCDEndDate] & [SCDStartDate] were still being used *prior* to that (possibly called the ODS in Kimball parlance, not sure) and it was *that* practice that I had issues with.

I am generally OK with Kimball's type 1 & 2 dimensional theory though I evaluate each aspect of it on its merits and in the case of [SCDStartDate] & [SCDEndDate] I'm of the opinion that they are not warranted. That does not mean that I disagree with the practice of managing type 1 and type 2 changes per se.

Alex,

"Most systems are somewhere in between these two extreme cases, and IMO we cannot choose one approach that best fits all the cases. Makes sense?"

Yep, it makes sense. I completely agree that "we cannot choose one approach that best fits all the cases"; I'll fall back, again, to my favourite answer of "it depends".

This blog post is largely dedicated to situations where Kimball dimensional modelling is used, which is typically data warehousing. In such situations I believe that the method I've outlined here is the right way to go in the majority of cases.

Again, thanks all for the comments! This is turning into a really good debate and I'm loving it. It's not often that the really good stuff is in the comments section, though it certainly is the case here.

-Jamie

November 29, 2009 6:07 AM
 

Jim said:

Jamie,

I used your first example and was stating what Pete said about ending the previous row one day before.  That is exactly the approach we take in our DW.

November 29, 2009 9:09 AM
 

jamiet said:

Jim,

Fair enough. But the problem I'm describing here is not specifically that the two values are the same, it's that you're storing information that doesn't need to be stored. That is true irrespective of whether the logic is either (a) [SCDEndDate] is the same as the [SCDStartDate] of the next record or (b) [SCDEndDate] is the same as the [SCDStartDate]-1 of the next record.

-Jamie

November 29, 2009 9:21 AM
 

SSIS Junkie : Debunking Kimball Effective Dates said:

November 29, 2009 10:41 AM
 

Cade Roux said:

I'm not understanding why we're optimizing this column out?  After all, in the dimensional model there are all kinds of things which are duplicated and not normally constrained, but which we know are true from the metabehavior of dimensional change.  The choice of dimensions can be arbitrary and duplicative.  Dimensional modelling is all about trading off storage for performance, whether it's current dimension flags and effective dates or a date table with precalculated days of the week.  For small dimensions, this might be acceptable, but, particularly in the given case, you are likely to have millions of active customer dimensions - when you account for all the changes - that's overhead I'd rather tradeoff for storage.

November 29, 2009 11:19 AM
 

Cade Roux said:

I'd also add that, in practice, we do not use such columns in our statistical and accounting analyses (except for dimensional behavior analyses) anyway.  Nor do we use the current indicator or any of the other metadata stored in the dimensions (including the load identifier which caused the record to be altered/created).

So in those cases, your technique saves space and the performance hit never occurs because no one queries the dimensions except as attributes of facts, and metadata like date of change is not typically relevant - all that matters is which dimensional attributes are attached to the facts.

November 29, 2009 12:08 PM
 

jamiet said:

Cade roux,

"all that matters is which dimensional attributes are attached to the facts."

I completely agree. If a column in the data model is not relevant to the user then there is no sense in denormalising it out. Defining SCD effective periods (regardless of the method used to define them) is done for the purposes of ETL, not end-user querying.

-Jamie

November 29, 2009 1:25 PM
 

John Welch said:

Interesting post, and even more interesting comments. I have to admit, on my first read, my reaction was similar to Todd's. However, on a little reflection, I think you are making some good points. I have never actually used the Start and End dates in an SCD 2 dimension for anything but determining which version of a dimension member to match to a fact row in the ETL process.

However, it seems you are still relying on some outside process to enforce that a new row's SCD Start Date is newer than the previous row's SCD Start Date. If you have a process managing the integrity of the Start Date, why can't it manage the integrity of the End Date as well?

November 29, 2009 2:18 PM
 

Najm Hashmi said:

Hi Jamie,

It is an interesting article. However, the article does not provide any facts showing that using this approach will improve a data load. I can see that for a large dimension having 800,000,000 rows I will save about 6 GB of disk space (the datetime data type is 8 bytes).

select 8*800000000.0/(1024*1024) SaveMB -->6103.52

Do you agree that for large dimensions this can have an adverse impact on processing time?

So it is about speed vs disk space???

I do see that your approach will work for smaller dimensions having a few thousand rows.

I would like to see some facts how this approach performs in very large data warehouses.

Thanks,

Najm

November 29, 2009 2:24 PM
 

jamiet said:

John/Najm,

Thanks for the comments guys!

--------

John,

I'll answer your last question first. I have seen (on my recent project that I mentioned above) where the SCDStartDate was managed correctly but the SCDEndDate was not - I could fill pages and pages talking about the problems that this has caused us (and have provided some info in my reply to Najm below). I'm *generally* of the opinion that reducing what I call "moving parts" in the system is a good way to go - the less you implement then the less that can go wrong. Some people may call it K.I.S.S :)

--------

Najm,

Yeah, I don't have any comparative stats to share which is a shame but then, you can prove anything with stats, right? The notion of "improving a dataload" is another nebulous concept that is subject to debate. What does it mean to improve a dataload? Making it quicker certainly improves it but then reducing the time and effort to implement improves it also, as does reducing the amount of disk space required - and quite often these things are not mutually compatible.

"Do you agree that for large dimensions this can have an adverse impact on processing time?"

Possibly, yes. But then it could also have a positive effect. Again we have to question what we mean by "processing". Do we mean time taken to load the table or time taken to extract data from it? Which of these facets drives your data model?

All great questions, and all subject to conjecture and debate.

What I WILL state categorically is this. My personal experience is that the management (woah, there's another nebulous, subjective word :) of both [SCDStartDate] & [SCDEndDate] is more difficult than just [SCDStartDate] alone. On my recent project that I keep alluding to I inherited a system that (in some cases) had multiple records per dimension member which all had [SCDEndDate]='9999-12-31'. We had to migrate those to another system that also used [SCDStartDate] & [SCDEndDate] and we had to fix these broken dimension records in-flight - not a pleasant experience I can tell you.

As I said above, the less moving parts the better (IMO)!

"So it is about speed vs disk space???"

I can truthfully say that disk space considerations never factored into my thoughts when I wrote this blog post. My main motivation for the method I outlined is simplicity and purity of the data model.

---------

Thanks again everyone for the comments. I'm beginning to feel like a lynch mob is going to come and hunt me down at the next SQL gathering but regardless, keep them coming. :)   Debating this stuff is a good thing.

-Jamie

November 29, 2009 3:40 PM
 

philnolan said:

great post jamie and enjoying reading the ensuing debate.

thought i'd point out a performance gain that jamie's solution provides and which hasn't been highlighted.

jamie's solution allows a gain to be achieved within the etl as it removes the need to write back to the existing row (to update the end date and current flag) when the scd change is identified.

that said, pretty sure the lynch mob are sharpening their forks!

November 29, 2009 4:33 PM
 

jamiet said:

Phil,

never thought about that. One more tick in the box :)

I won't even mention the extra code that needs to be written and maintained in order to do the update to the previous row. Oh whoops...I just did! :)

-Jamie

November 29, 2009 4:54 PM
 

Michelette Kimballe' said:

Jamie,

Start and end dates are not just there to make the ETL easier, they also help to answer business questions related to the state of a dimension at a particular time (how many children did Henry have on Jan 06?). Imagine trying to answer this question on a dimension with millions of members without an end date field that you can just add to your query.

Throughout the years I have seen that database guys just don't get dimensional modelling. They start asking questions on misused space, redundancy of data, etc, etc. The whole point of dimensional modelling is to answer questions as quickly as possible in the simplest way. It doesn't matter if a 6GB model shouldn't be "bigger than 500MB". It's all about the users.

Michelette Kimballe'

November 29, 2009 5:47 PM
 

Craig Wilson said:

Jamie,

Good article. I'm not a DBA, but rather a programmer, and have never read Kimball nor heard of Kimball. However, one thought occurs to me as we were attempting to model something like this. Your theory is correct from a programmer's point of view, DRY -> don't repeat yourself. But what happens when the two pieces of data are not contiguous?

In other words, using your example, what if your "customer", or "employee" Henry left for a while and then came back.  He is the same person, so you would naturally use his same id again, but he definitely wasn't effective while not being at the company.  Anyways, just a thought.

November 29, 2009 6:27 PM
 

Greg Linwood said:

Michelette,

Throughout the years I have seen that so many database designers have head-in-the-clouds notions that they can forget all aspects of physical modelling. It seems obvious to me this is mainly because they rarely have to support the designs they fling around.

It is just as important that a database system be operational as it is that it be "correct".

There are MANY customers out there who suffer having to keep their systems running under severe pain from designers who thought they didn't have to consider the performance implications of their designs.

November 30, 2009 12:02 AM
 

Nick said:

Jamie-

Thanks for the post (and the ensuing debate).  I think there may be some situations where this technique is better overall, but I'd have to guess that most situations would be best tackled with Kimball's approach.

However, one suggestion that I haven't seen mentioned is the use of the new Date data type (if business rules allow).  From what I've seen, most dimensions are processed daily, so the time data could be dropped and thus saving 2 x 5 bytes per record while maintaining the SCDEndDate.
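For example, a sketch against the table from the post (the table name here is only illustrative); DATE is 3 bytes against DATETIME's 8, hence the saving per date column:

CREATE TABLE [Customer_DateOnly] (
   [Id]            INT          PRIMARY KEY
,  [NId]           NVARCHAR(50)
,  [SCDStartDate]  DATE         --DATE instead of DATETIME; fine if the dimension is only processed daily
,  CONSTRAINT Customer_DateOnly_UK UNIQUE ([NId],[SCDStartDate])
);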

Just my .02.

Enjoying your blog,

Nick.

November 30, 2009 10:55 AM
 

CVarela said:

Good points Jamiet but (there's always one isn't it ;) )

Why do you LEFT JOIN using C.Name instead of C.NId?

Joining by C.Name would only work if you replicate the changes to all records when it changes and IF it was a unique value ...

I take it was just to make the point !

Regards,

Cris.

November 30, 2009 11:13 AM
 

Pete w said:

This is good stuff man.

I've done this kind of optimization before, but I've never taken the time to document it such as you have in this well-written article.

Who cares about the space savings of a date column in a dimension table? I could care less, space is cheap. Your solution is attractive because it is one less column to maintain and it guarantees consistency, and that is your big payoff.

November 30, 2009 11:44 AM
 

AaronBertrand said:

Pete, space is cheap but I/O is not... even on high performance EMC2 Clariions we are constantly hitting bottlenecks in I/O.  When I can save 4 or 8 bytes per row, I do it.

November 30, 2009 12:24 PM
 

jamiet said:

CVarela,

No, it wasn't done to make a point - it was done cos I'm an idiot! :)

In other words, I didn't deliberately join on [Name], that was a mistake. Indeed the join *should* be done on [NId].

I've updated this blog post and the downloadable script accordingly. THANK YOU for pointing this out and well done for being the first person to spot it!!

-Jamie

November 30, 2009 2:42 PM
 

Peter Schmitz said:

Jamie,

In your reply to my post you referred to the point that you want to avoid storing data twice. Does that mean you are snowflaking your dimensional model in order to avoid doing that?

The whole purpose of using a star schema is to enable business users with hardly any knowledge of the SQL language to pull out the data they need, and as fast as possible.

Normalizing data for transactional systems is meant to ensure that transactional data can get committed to the system as fast as possible. A dimensional model has a completely different purpose, and therefore requires a completely different approach. Usually whenever I talk to clients who have experience using transactional databases, I'll use the comparison of a transactional system being a racewagon and a dimensional database being a big truck loaded with data. Sure, both are cars, but they serve completely different purposes, and therefore are intrinsically different.

As pointed out, it is far easier to allow end users to figure out relevant questions by storing data redundantly (though by subtracting a second, the meaning of the date actually changes from a business perspective, which, hey, is exactly what is intended), than to insist on joining a potentially huge table with itself. Sure we store a piece of data "twice", but we do it knowing full well that the purpose of this "bad behaviour" is to allow ease of use, and performance in pulling out the data.

Just my 2 cents.

December 1, 2009 3:11 AM
 

jamiet said:

Hi Peter,

Your 2 cents are very welcome :)

"In your reply to my post you referred to the point that you want to avoid storing data twice. Does that mean you are snowflaking your dimensional model in order to avoid doing that?"

No. When I say I don't want to store data twice I am referring to these metadata fields (i.e. [SCDStartDate]&[SCDEndDate]) only, not to the attributes of the dimension member.

"The whole purpose of using a star schema is to enable business users with hardly any knowledge of the SQL language to pull out the data they need, and as fast as possible."

I'd mostly agree with that. Yes, a star schema is there to pull the data out as fast as possible, but I disagree that the SQL-authoring abilities of the end-user have any influence over our data model because the end-user will (in the majority of cases) be abstracted away from the actual SQL anyway by a reporting/analysis tool. Therefore, if it were me I would rephrase your sentence as:

"The whole purpose of using a star schema is pull out the data as fast as possible."

"it is far easier to allow end users to figure out relevant questions by storing data redundantly"

Sorry Peter, I disagree with that! It makes it quicker, that is not the same as easier.

I should clarify something. I am not disregarding Kimball's dimensional model, nor denormalisation, as a means of enabling end-user querying and I believe I clarified that in the above blog post when I said: "I do not think the practice of managing dimension members as Type 2 slowly changing dimensions is a bad idea". Apologies if that wasn't clear. Furthermore I advocated denormalisation of dimension record effective dates if absolutely necessary when I said: "If you absolutely need the [SCDEndDate] persisted somewhere then the view resultset can be materialised into a table"

What I do object to is the practice of unnecessarily storing data (i.e. [SCDEndDate]), especially when (IMO) a better method exists. I suspect you will say that storing that data is NOT unnecessary and IS indeed the best method - and that's OK. There's no correct answer here and that's why this comment section is currently 34-comments-strong.

Thanks for the comment Peter. As I keep alluding to I'm happy for the debate to continue.

-Jamie

December 1, 2009 7:22 AM
 

Cade Roux said:

I will second the suggestion that this technique is not suitable for cases where dimensions are expired when no facts arrive (monthly snapshot paradigm, for example).  In this case, there is no place to put the end date, and no expired flag.  There can be a debate about what the dimensional behavior should be of course, and why one would want all the dimensions which did not arrive to be expired.

When the technique can be used, although there is less information to update during a dimension processing step (INSERT only), it would seem that identifying the current dimension during the ETL to do the comparison to decide to insert or re-use the lookup would be significantly slower, since this view would need to be used (although I assert it would rarely be used in reporting and analysis).  So are we optimizing for an uncommon problem and hurting the performance of our regular ETL?

Whilst I have used similar design techniques for OLTP systems which had complete temporal consistency where new versioned entries are always created and old ones simply hang around, I have not used it in a reporting-user facing traditional data warehouse.

In addition, the Kimball methodology will still work with loads which are loaded in incorrect orders (reloads of a day which turned out to be bad, etc.) - although the dates will get screwy, whereas I'm not so sure what the behavior would be here since there's no independence of the current flag and the dates.  While most DW are loaded back to front, I have not seen a case in our environment where the DW was rolled back and then loaded forward to preserve the dimensional behavior at all costs - only the bad fact data was usually purged in the middle and reloaded with the regular package - dimensional behavior was not accounted for, nor was any attempt made to ensure that the dimension chosen was one which would have been active.  Instead, a regular dimension cycle was allowed to occur with whatever the active dimensions were.

I'm not 100% in agreement with the design of the DW I have to use every day, but I think I would continue to use both date columns and a current flag.  Now we don't use datetime columns - we use foreign keys into date dimension (with the YYYYMMDD natural representation of the dimension IDs which Kimball discusses), and this technique generally enhances the performance of most date-only columns.
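For anyone not familiar with that style, a rough sketch (table and column names are illustrative, not taken from our warehouse):

CREATE TABLE [DimDate] (
   [DateId]   INT          PRIMARY KEY   --e.g. 20091127
,  [FullDate] DATE         NOT NULL
,  [DayName]  NVARCHAR(10) NOT NULL
);
CREATE TABLE [DimCustomer] (
   [CustomerKey]    INT PRIMARY KEY
,  [NId]            NVARCHAR(50) NOT NULL
,  [SCDStartDateId] INT NOT NULL REFERENCES [DimDate]([DateId])
,  CONSTRAINT DimCustomer_UK UNIQUE ([NId],[SCDStartDateId])
);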

What this whole debate highlights to me is that right now there still is no one single best methodology in data warehousing and all the best practices we have are not all in agreement.

December 1, 2009 11:55 PM
 

Stuart said:

I don't like the view option.  Firstly, this is larger overhead than just getting the data directly from the table.  Secondly, the ETL process should not govern the design of the warehouse based on updates.  I always define the alternate key by way of a constraint or unique index on the table.  Updating one column or two columns is of little consequence in the bigger picture.  Alternatively, you could use a computed column to manage the IsLatest column; checking the latest effective date indicates whether the row is the latest.  Data warehouses are all about performance for cubes, data mining, and queries. The use of views in this manner to me undermines that.  The use of indexes to enhance performance becomes a very much more complicated issue with views.  For me this is a developer solution without looking at DBA issues on this matter. Placement of indexes on views has its own implications and starts to complicate the whole matter.

December 3, 2009 9:33 AM
 

jamiet said:

Stuart,

All good points. I'd like to ask you something about the following:

"the ETL process should not govern the design of the warehouse based on updates"

Is that still the case even if the columns in question are nothing to do with end-user reporting? After all, the purpose of Kimball's dimensional modelling is to make end-user reporting quicker; if a column is not to be used for end-user reporting then does this still apply?

"For me this is a developer solution"

If you mean, "this is a solution that makes for easier development" then I would definitely agree. I don't see that as a bad thing if it is not to the detriment of the goals of the warehouse and I'm of the opinion that that is invariably the case here. I expect you will disagree :)

Thanks for the comment.

-Jamie

December 3, 2009 9:52 AM
 

JaggedEdge said:

Wow, that enormous group by does not look very appealing.  Not only is there another layer of abstraction, but when columns get added I have to update a view, and the group by is gonna make it crawl.

January 6, 2010 6:31 PM
 

SSIS Junkie said:

I’ve been using Analysis Services a lot lately in my current day job and recently encountered an issue

February 15, 2010 2:29 PM
 

Mike C said:

Using Jamie's method of storing just a start date you can *accurately* infer the end dates.  As has been pointed out already here, accurately enforcing end dates when you store the start and end dates is a lot tougher.  Even if you want to store the start and end dates in the Kimball-style datamart, you would do well to store just the start date in the DW or ODS from which you're populating the datamart.  Then you can *accurately* calculate the end date in a view or in the ETL.

How fast you answer a business question doesn't really matter if the result is incorrect.  Random number generators are relatively easy to create, but not usually what you want to run the business on.

February 21, 2010 4:58 PM
 

Mike C said:

@JaggedEdge - Don't be skurred.

If GROUP BY skurrs you, then try a CTE with a windowing function (ROW_NUMBER) instead.

Not sure why you'd find GROUP BY skurry.  If something were to keep me up at night it would be the possibility of incorrect results, not set-based SQL syntax.
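For example, a sketch against the [Customer] table from the post:

WITH Numbered AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY [NId] ORDER BY [SCDStartDate]) AS rn
    FROM   [Customer]
)
SELECT c.[Id], c.[NId], c.[Name], c.[HomeTown], c.[MaritalStatus], c.[NumberOfChildren], c.[SCDStartDate]
,      COALESCE(nxt.[SCDStartDate], CONVERT(DATETIME,'99991231')) AS [SCDEndDate]
,      CONVERT(BIT, CASE WHEN nxt.[Id] IS NULL THEN 1 ELSE 0 END) AS [IsLatest]
FROM   Numbered c
LEFT   OUTER JOIN Numbered nxt
ON     c.[NId] = nxt.[NId]
AND    nxt.rn = c.rn + 1;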

February 21, 2010 5:02 PM
 

jamiet said:

Hey Mike,

Thanks for the replies.

I guess it comes down to making the trade-off between perf and accuracy - I seem to spend my whole work life wondering exactly where to make that trade-off, y'know. In the ODS layer it's definitely an easy decision to make (for me anyway)!

-J

February 21, 2010 5:11 PM
 

Mike C said:

No problem.  The thing people miss is that even in a denormalized DM, speed is a *secondary* consideration.  It's easy to get fast results if you don't care how wrong they are.  In every DM (at least the ones I've built) correct results are the *primary* consideration.

February 21, 2010 5:29 PM
 

Bruce Kirk said:

Hi AaronBertrand,

In reference to I/O being expensive: using the proposed logic of having to join back to the table to find the successor record is going to cost you a lot more than the storage of the end date on the record.

With the premise of rejoining we would need to reparse the table index (which on a large table is still large) to find the appropriate record and then read the table to get the record. Under the best of circumstances this would be at least 3 additional I/O's that otherwise would not be required.

This is a big reason why compression of tables can have a big impact on the DW performance. It reduces the amount of I/O, but at the cost of CPU cycles to compress and uncompress data.

March 25, 2010 4:17 PM
 

Nick McDermaid said:

Interesting thread. In response to some of the posts I'd like to point out that the view does not impact the majority of end users' queries.

During actual usage of the data warehouse, the majority of business questions are answered by joins on the surrogate key, not the start/end date. So the view is never used.

The start/end dates are just 'helpers' to find the appropriate fact record to assign the surrogate key. The view would only be executed 'in batch' to assist with the ETL process, not to answer business questions.

It would be nice to assume that the ETL procedures were ironclad enough to keep these dates in line (as they also have to be ironclad enough not to mess up surrogate keys).

May 2, 2010 9:30 PM
 

Kevin said:

We use a similar schema in a production database storing only the start date.  Recently, we ran into an issue where the data was changing in the table between the first read of the table and the second read of the table to join on itself to get the end date.  Are you using a different transaction level or are you not concerned with this happening?  It seems like it would be rare but we have seen it happen and it's a tough issue to track down!

June 25, 2010 7:12 PM
 

jamiet said:

Kevin,

Interesting. I'm not really concerned because typically I would use this technique in BI systems which only get written to during an overnight batch. Definitely would be an issue for a highly trafficked OLTP site tho.
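If it ever were a concern, one option (just a sketch; it assumes snapshot isolation has been enabled on the database, and [MyDW] is a placeholder name) would be to read the view inside a snapshot transaction so that both reads of [Customer] see the same version of the data:

ALTER DATABASE [MyDW] SET ALLOW_SNAPSHOT_ISOLATION ON; --one-off database setting

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT * FROM vCustomer; --the self-join inside the view sees one consistent snapshot
COMMIT;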

-Jamie

June 28, 2010 2:56 PM
 

SSIS Junkie said:

Reigniting the [SCDEndDate] debate In November 2009 I wrote a blog post entitled Debunking Kimball Effective

July 13, 2011 1:34 PM
 

Baskar said:

Hi, I am new to DW. This is really good. But I would like to know: what will be the fact table structure if I use SCD 2 with a start date and end date, and which date (start date or end date) will be pointing to the fact table date attribute?

December 5, 2011 8:03 AM
 

Emil Glownia said:

Jamie,

I agree with your post completely. Even though we have a view on top of the table with the logic, it is just much simpler and less error prone compared to trying to handle it in the ETL.

I'm not worried about performance at all. If you have a huge dimension then you have a problem anyway.

Also I will trial this approach with a more complicated SCD Type 2 load. I need to perform inserts (Type 2 change) of the same natural key multiple times during one load (a valid reason to have more than one SK for the same NK in the same day) and my problem was "SCDEndDate" which I couldn't easily update in the SSIS data flow. This approach will nicely resolve it for me without losing data integrity or spending hours trying to do all of this and have it 100% accurate in every single scenario (now + future).

Regards

Emil

February 14, 2012 6:56 AM
 

jamiet said:

Hi Emil,

Thanks for the reply. Your situation sounds exactly like the sort of scenario that can be prone to error.

I find it interesting that you are prioritising the validity of your data ahead of ETL performance - that's usually my viewpoint too. Others' mileage will of course vary as I think this discussion has borne out :)

Jamie

February 14, 2012 7:06 AM
 

Emil Glownia said:

Update.... I have just used and tested this method and it was easy to implement and worked perfectly with my unusual scenario (BusinessKey > 1 in one load).

It also has the extra benefit in my case of filling in gaps by default without extra work, which can happen in my case when there is an expected issue with a load or I redirect one row with a BusinessKey but not the remaining rows with the same business key (I can have more than one in one load).

For extra information: the initial load (with my for loop) will also be straightforward. Kimball tip #9, "Processing Slowly Changing Dimensions during Initial Load", suggests a weekly load and accepts a small % of error, which won't work in my case.

February 15, 2012 5:13 AM
 

jamiet said:

Great to hear Emil, thanks.

February 15, 2012 5:23 AM
 

Vishy said:

I followed this but it updates all the previous rows with the startdate of most recent row.

March 9, 2012 3:46 AM
 

jamiet said:

Hello Vishy,

I'm not sure I understand, the blog post above does not talk about issuing UPDATEs. Can you explain a bit more perhaps?

JT

March 9, 2012 4:15 AM
 

Hang Liu said:

Thank you Vishy, for pointing me to the site, as I really like Jamiet's scripts and the comments. The topic will definitely win a medal in Kimball forum.

I think I remember seeing similar structure (one date attribute) in some OLTP system to store the temporal data, but never thought about leveraging the cross join (popular for PTD) logic to derive the end date, and IsLatest flag. What a perfect example of set based thinking.

However, I am still skeptical about using the approach to get rid of the end date in all SCD dimensions. I guess it will work for most of them. However I have two issues about the approach.

1. How do you handle the odd situation where the end date is a future date? With a physical end date field, you could always wrap a few DML passes in the ETL to update the end date field to achieve customised SCD logic.

2. How do you handle logical deletes? In my view, even an SCD end date is not sufficient for deletes just by expiring a dimension member, although we could include a flag as part of the dimension attributes as commented earlier. I am always thinking about adding an additional couple of attributes to indicate deletion as part of the SCD housekeeping fields. Hang on, I might have resolved the logical delete with the initially proposed approach, extending the script by another select max ... group by.

I guess it comes down to the principles of dimensional modeling. What is the proper balance between physical and logical implementation? It's true that SCD attributes are supposed to be used only by the ETL. However at the end of the day, you would still prefer to look at a physical table directly, instead of a view, for data profiling, testing and debugging etc, although I am not encouraging end users to venture around table fields meant to be used by the system. I agree that implementing the proposed script involves good relational thinking by minimising redundant data. But remember, the primary goals of dimensional modeling are ease of use and performance, irrespective of the concerns about data redundancy which are the core of relational modeling. Kimball would only think hard about data redundancy when performance becomes critical because of the increasingly massive data volume caused by redundancy. Typical examples are monster dimensions and fact tables.

Having said all that, I would definitely borrow the idea to validate my ETL or even incorporate it into the ETL framework to ensure its robustness.

March 9, 2012 6:25 PM
 

SSIS Junkie said:

I recently inherited a data warehousing SQL Server solution from my good buddy James Rowland-Jones (you

July 13, 2012 10:18 AM
 

Eric said:

That is profanely ignorant reasoning against Kimball's Type 2 SCD handling. Kimball calls for a date time stamp that provides a distinct time range for each dimensional change for a related record set. There are many programmatic means to manage this cascading date logic which work successfully.

August 30, 2013 6:26 PM
 

jamiet said:

"profanely ignorant". Wow.

Is it "profanely ignorant" to have an opinion that differs from that of ... ummm... someone else?

August 31, 2013 5:34 PM
 

donoseven said:

Jamie,

Your concept overlaps a bit with the Data Vault methodology.   I think what is missing in much of the comments is "EDW" versus "MRT".   In Kimball's world the marts are most often copies or subsets of your dimension model.  In the Data Vault and in your model (view) the MRT resolves the end date and current records, either creating them for joining as type II or flattening them into type I.   Direct user access of the DIM is not the intent.

Adding to the confusion, the data vault model links facts to a master record key and not the changing type II key.   The resolution of type II or Type I is deferred from ETL load to MRT creation, either by view or physical table.   This last decision is based on performance, storage and business need.

October 2, 2013 10:58 AM
 

Kyle Hale said:

Can we get an update to this blog using a windowing function to get the next row's start date to avoid the join?

Just to render that whole "INNER JOIN" argument moot.

March 13, 2014 4:45 PM
 

Pradeep Saraf said:

Kyle Hale,

I am not sure why you are referring to getting the "next row's Start date" with a windowing function; here is how to derive each row's "SCDEndDate":

SELECT [Id]
,      [NId]
,      [Name]
,      [HomeTown]
,      [MaritalStatus]
,      [NumberOfChildren]
,      [SCDStartDate]
,      COALESCE(LEAD([SCDStartDate]) OVER (PARTITION BY [NId] ORDER BY [SCDStartDate]), '99991231') AS [SCDEndDate]
,      CASE WHEN LEAD([SCDStartDate]) OVER (PARTITION BY [NId] ORDER BY [SCDStartDate]) IS NULL THEN 1 ELSE 0 END AS [IsLatest]
FROM   dbo.Customer C;

April 8, 2014 1:44 PM
 

Mark Manville said:

While I agree that it's hard to enforce the end effective date reliably -- that's why they are never recorded in ERPs like PeopleSoft -- on the other hand, my job as a DW designer is to give semi-technical end users data that is as easy to use as possible. Let's not forget, in the quest for technical purity, that our goal is customer service.  Other warehouses at our site do not store computed end dates, and their users hate that; that my schemas do is one of the few design aspects of our schema that gets universal gratitude.

In that spirit, let me make this observation, reflecting how I have approached this problem: your argument is that stored end dates may get corrupted, so they should be computed on the fly. Let me flip that and say: if you believe the on-the-fly code is rock solid, then ... why can't that same code be trusted enough to put in your ETL?  In other words, as the last step of any update to that dimension, run this code to recalculate end dates, as part of an atomic commit. My "flip" of your point is that, if the ultimate goal to be measured by is that users get the right data answers, then that goal is FAR more likely to be achieved by one data expert (me or you) writing that code, than by forcing every user to write it themselves. I have also gone further than deriving the end date for my users, I also derive a "current flag" on dimension rows, since that is the typical access pattern for end users that want to access dimensions individually & not as part of a star join (for which the effective dates are irrelevant).
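For example, a sketch of that last step (it assumes a table that persists both dates; the table name [CustomerWithEndDate] is illustrative and reuses the naming from the post):

UPDATE c
SET    c.[SCDEndDate] = COALESCE(
           (SELECT MIN(c2.[SCDStartDate])
            FROM   [CustomerWithEndDate] c2
            WHERE  c2.[NId] = c.[NId]
            AND    c2.[SCDStartDate] > c.[SCDStartDate])
         , CONVERT(DATETIME,'99991231'))
FROM   [CustomerWithEndDate] c;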

And then of course there is query performance ...

July 19, 2014 11:50 AM
 

Mark Manville said:

p.s. I should also add that in PeopleSoft, in addition to effective date, there is the notion of "effective sequence".  In other words multiple rows with the same effective date (no time component), ordered by that seq.  For purposes of dimension date range queries ("what did Joe Blow look like this day"), you want to return the data in the highest sequence number record for that effective date.  But you also need to keep the intermediate seq nbrs because some transactional events (facts) may have happened that could be tied explicitly to the intermediate state (aka not the highest seq nbr for the date).  In my approach of precomputing and storing, I can create a variation of the code you show to make the end effective date of those intermediate rows the day BEFORE the begin date, so they automatically disappear from range queries, and my users don't have to deal with the extra headaches of sequence number sub-ordering. This is not something I feel I, as a data professional, should leave for my end users to deal with. Like I said, I've seen others do that, and I've heard the frustration from the users.

July 19, 2014 12:01 PM
 

Jeremy Huppatz said:

Hi folks,

Just thought I'd post an alternative approach which guarantees the correct dates and also provides a row-versioning metric. For the record - this approach is optimized for processing the data during OLTP processes rather than ETL. In doing so, it makes the ETL process much simpler (just load anything that hasn't already been loaded).

I am not a fan of including an end date in a type-II SCD table unless it has a definitive value - I generally leave Date-Effective-To (or ActiveTo in my example below) NULL until a value has been applied. Filtering for NULL in this column effectively replaces the search for an "Is Current" flag.

Anyway - without any majorly complex GROUP BY operations or CTEs, I present my solution for implementing SCD type-II dimension processing using a simple trigger. This trigger structure will work cleanly against any number of columns, and it also has some logic which prevents ID collisions and allows accurate processing of a number of rows simultaneously. This logic helps to minimize the overhead of processing lots of atomic transactions against the dimension table.

The RowVer field allows you to link fact table entries (or the transactional prototype tables from which fact tables are defined) to not just the ID but the version of the record associated with that ID, which is really the point of Type II SCDs - linking facts to the record version at a particular point in time (a sketch of such a join follows the demo data below). I'm not saying this solution is perfect, but it enforces all of the constraints required and has been tested over multiple concurrent queries to verify the ID collision prevention. Feedback on further fine-tuning is welcome. Without further ado - here it is:

=========================================================================

IF (SELECT OBJECT_ID('Test')) IS NOT NULL
    DROP TABLE test
GO

CREATE TABLE test
(
    [Id]         BIGINT NULL,
    [RowVer]     INT NOT NULL DEFAULT @@DBTS,
    [ActiveFrom] DATETIME2 NOT NULL DEFAULT GETDATE(),
    [ActiveTo]   DATETIME2 NULL,
    [Descriptor] NVARCHAR(200) NOT NULL UNIQUE CLUSTERED
)
GO

CREATE TRIGGER test_trigger ON test
INSTEAD OF INSERT, UPDATE, DELETE
AS
SET NOCOUNT ON

-- Classify each affected row as an insert ('I'), update ('U') or delete ('D')
SELECT
    ROW_NUMBER() OVER (ORDER BY i.[Id], d.[Id]) AS RowNumber,
    CASE
        WHEN i.[Id] IS NULL AND d.[Id] IS NULL     THEN 'I'
        WHEN i.[Id] IS NOT NULL                    THEN 'U'
        WHEN i.[Id] IS NULL AND d.[Id] IS NOT NULL THEN 'D'
    END AS Operation,
    i.[Id]         AS [InsertedID],
    d.[Id]         AS [DeletedID],
    i.[RowVer]     AS [InsertedRowVer],
    d.[RowVer]     AS [DeletedRowVer],
    i.[Descriptor] AS [InsertedDescriptor],
    d.[Descriptor] AS [DeletedDescriptor]
INTO #RowsToProcess
FROM inserted i
FULL JOIN deleted d ON i.[Id] = d.[Id]

DECLARE @RowToProcess INT = 1;
DECLARE @RowCount INT = (SELECT COUNT(*) FROM #RowsToProcess)
DECLARE @CurrentVersion INT
DECLARE @CurrentID BIGINT
DECLARE @Operation CHAR(1)
DECLARE @maxid INT
DECLARE @CurrentDescriptor NVARCHAR(200)

-- Empty copy of the table, used to shape each row before it is inserted
SELECT * INTO #INS FROM test
WHERE 1 = 0;

WHILE (@RowToProcess <= @RowCount)
BEGIN
    SELECT
        @Operation         = Operation,
        @CurrentID         = COALESCE(InsertedID, DeletedID, NULL),
        @CurrentVersion    = COALESCE([DeletedRowVer], 0),
        @CurrentDescriptor = COALESCE([InsertedDescriptor], [DeletedDescriptor])
    FROM #RowsToProcess
    WHERE RowNumber = @RowToProcess;

    IF @Operation = 'I'
    BEGIN
        -- Allocate the next Id under UPDLOCK/HOLDLOCK to prevent ID collisions
        SELECT @maxid = ISNULL(MAX(Id), 0) + 1 FROM test WITH (UPDLOCK, HOLDLOCK);
        PRINT CAST(@maxid AS NVARCHAR(10));
        INSERT INTO #ins SELECT * FROM inserted WHERE Descriptor = @CurrentDescriptor;
        UPDATE #ins SET Id = @maxid, RowVer = 1
        WHERE Descriptor = @CurrentDescriptor
        INSERT INTO test
        SELECT * FROM #ins;
        DELETE FROM #ins;
    END
    ELSE IF @Operation = 'U'
    BEGIN
        -- Close off the current version, then insert the new version with an incremented RowVer
        UPDATE test
        SET ActiveTo = GETDATE()
        WHERE Id = @CurrentID
        AND ActiveTo IS NULL;
        INSERT INTO #ins SELECT * FROM inserted WHERE Id = @CurrentID;
        UPDATE #ins
        SET ActiveFrom = GETDATE(),
            RowVer = @CurrentVersion + 1
        WHERE Descriptor = @CurrentDescriptor;
        INSERT INTO test
        SELECT * FROM #ins
        WHERE Id = @CurrentID;
        DELETE FROM #ins;
    END
    ELSE IF @Operation = 'D'
    BEGIN
        -- Logical delete: just close off the active version
        UPDATE test
        SET ActiveTo = GETDATE()
        WHERE Id = @CurrentID
        AND ActiveTo IS NULL
    END

    SET @RowToProcess = @RowToProcess + 1;
END
GO

=========================================================================

Some demo data to show off the capabilities of the approach.

=========================================================================

INSERT INTO test (Id, Descriptor)
VALUES (1, 'Bob'), (2, 'Mary'), (3, 'Marianne'), (4, 'Peter'), (5, 'Alison'), (6, 'Jane');
GO

SELECT * FROM test;

INSERT INTO test (Descriptor) VALUES ('Brian'), ('Arthur'), ('Albert');

UPDATE test
SET Descriptor = REPLACE(Descriptor, 'Mar', 'Cal') WHERE Descriptor LIKE 'Mar%';

DELETE FROM test
WHERE Descriptor IN ('Mary', 'Bob', 'Brian');

SELECT * FROM test;
GO
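For illustration, a hypothetical fact table could then carry both the Id and the RowVer so that each fact stays tied to the exact record version that was current when it happened (the fact table name and columns below are made up):

CREATE TABLE FactOrder
(
    [OrderId]     BIGINT        NOT NULL,
    [CustomerId]  BIGINT        NOT NULL,  -- test.[Id]
    [CustomerVer] INT           NOT NULL,  -- test.[RowVer] in effect at order time
    [OrderDate]   DATETIME2     NOT NULL,
    [Amount]      DECIMAL(18,2) NOT NULL
);
GO

-- Join facts back to the exact dimension record version they were captured against
SELECT f.[OrderId], f.[OrderDate], d.[Descriptor], d.[ActiveFrom], d.[ActiveTo]
FROM   FactOrder f
JOIN   test d ON d.[Id] = f.[CustomerId]
             AND d.[RowVer] = f.[CustomerVer];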

October 28, 2014 9:05 PM
 

Jeremy Huppatz said:

Oops - just realized - no DEFAULT required on the RowVer field as it gets handled by the insert logic. :)

October 28, 2014 9:07 PM
 

ivan jefton said:

I think what has been lost in all this discussion is that the purpose of the start/end dates is to make QUERYING the data easier. A one-time hit during ETL (if there is a performance hit at all) is significantly preferable to an every-time hit during querying.

That, plus the ability to keep discontinuous date ranges (e.g. the start date of row 3 is greater than the end date of row 2, or the end date of the final row for a key is before current_timestamp), supports Kimball's design.

That storing an additional date column "takes up too much space" is a flawed argument: disk space is cheap.

November 24, 2014 11:06 AM
 

jamiet said:

Hi Ivan,

Thanks for the comment.

"I think what has been lost in all this discussion is that the purpose of the start/end dates is to make QUERYING the data easier."

I disagree. The primary purpose of SCD start/end dates is not end-user querying, it is to do lookups during the ETL process. I am of the opinion that there is no value to the end user in exposing lineage columns such as these.

If you *do* require your users to be able to query such columns for whatever reason then that implies they have value to your end users beyond being SCD start/end dates, and in such circumstances, yes, store them explicitly as you do all other attributes that are available to end users.
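For example, a lookup during ETL against the [Customer] table from the post needs no end date at all (the parameter values here are just for illustration):

DECLARE @NId NVARCHAR(50) = N'1234',      -- business key of the incoming fact row
        @FactDate DATETIME = '20091115';  -- date the fact occurred

-- The dimension record in effect on the fact date is simply the most recent
-- record whose [SCDStartDate] is on or before that date
SELECT TOP (1) c.[Id]
FROM   [Customer] c
WHERE  c.[NId] = @NId
AND    c.[SCDStartDate] <= @FactDate
ORDER  BY c.[SCDStartDate] DESC;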

"that, plus the ability to keep discontinuous date ranges (e.g. start date of row 3 is greater than end date of row 2, or end date of the final row for a key's value is before current_timestamp), supports Kimball's design."

Interesting perspective. Could you perhaps outline such a scenario where this would be a requirement?

"That storing an additional date column "takes up too much space" is a flawed argument: disk space is cheap."

Disk space is cheap. Memory is not cheap though. Neither is I/O. For those reasons, if I can save a few bytes on each record then I'll do it.

Thanks again for sharing your thoughts Ivan.

November 24, 2014 11:20 AM
