
Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Debunking Kimball Effective Dates part 2 – Windowing Functions

Reigniting the [SCDEndDate] debate

In November 2009 I wrote a blog post entitled Debunking Kimball Effective Dates where I stated that I was against the Kimball-recommended practice of maintaining a StartDate & EndDate value for every record in a type 2 slowly changing dimension (SCD) table (i.e. the values highlighted in blue in the following image):

[Image: a type 2 SCD Customer table with the [SCDStartDate] and [SCDEndDate] values highlighted in blue]
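For reference, here is a minimal sketch of the conventional Kimball-style table that this debate is about, i.e. one that stores both columns (the table name and the 9999-12-31 default are purely illustrative; the other column names mirror the script further down):

CREATE TABLE [Customer_Kimball] (       --hypothetical name, for illustration only
   [Id]              INT            IDENTITY(1,1) PRIMARY KEY
 , [NId]             NVARCHAR(50)
 , [Name]            NVARCHAR(50)
 , [Home]            NVARCHAR(50)
 , [MaritalStatus]   CHAR(1)
 , [Descendants]     INT
 , [SCDStartDate]    DATE
 , [SCDEndDate]      DATE DEFAULT '99991231'   --the column this post argues you don't need to store
);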

The post was deliberately inflammatory so as to provoke a debate and it certainly got the desired response as it received 47 comments with a roughly 50-50 split between folks agreeing/disagreeing with me (which was especially interesting given I discussed the same technique back in 2005 but without the inflammatory tone and then received only 6 comments). I really don’t mind if people agree or not, it certainly made for a good debate in those 47 comments.

The crux of the post was thus: the column [SCDEndDate] is unnecessary because it can be derived from the [SCDStartDate] of the subsequent record for the same natural key. Eliminating that column is therefore beneficial because it eases the pain of managing those values in an ETL process. Here’s some code that creates the table (without [SCDEndDate]), populates it, then derives [SCDEndDate] in a query:

IF OBJECT_ID('dbo.Customer') IS NOT NULL
   DROP TABLE [Customer]; --drop if exists

CREATE TABLE [Customer] (
   [Id]              INT            IDENTITY(1,1) PRIMARY KEY
 , [NId]             NVARCHAR(50)
 , [Name]            NVARCHAR(50)
 , [Home]            NVARCHAR(50)
 , [MaritalStatus]   CHAR(1)
 , [Descendants]     INT
 , [SCDStartDate]    DATE           --Note only [SCDStartDate], no [SCDEndDate]
);
          
CREATE UNIQUE INDEX Cust_idx ON [Customer]([NId],[SCDStartDate]);
          
INSERT [Customer] ([NId],[Name],[Home],[MaritalStatus],[Descendants],[SCDStartDate])
VALUES ('Cust001','Henry','London','S',0,CONVERT(DATETIME,'20050324'))
 ,     ('Cust001','Henry','London','M',0,CONVERT(DATETIME,'20070726'))
 ,     ('Cust002','Sarah','Birmingham','M',2,CONVERT(DATETIME,'20060213'))
 ,     ('Cust001','Henry','London','M',1,CONVERT(DATETIME,'20091127'));

SELECT c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]
 ,     COALESCE(MIN(c2.[SCDStartDate]),CONVERT(DATE,'99991231')) AS [SCDEndDate]
FROM   [Customer] c
LEFT   OUTER JOIN [Customer] c2 ON  c.[NId] = c2.[NId]
                                AND c.[SCDStartDate] < c2.[SCDStartDate]
GROUP  BY c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate];

Go and run it now if you’re not au fait with what we’re talking about here! Here is the resultset:

[Image: resultset of the query above, showing the derived [SCDEndDate] values]
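For reference, the output should look roughly like this (the [Id] values assume the table was freshly created so the IDENTITY starts at 1, and the row order from the GROUP BY query is not guaranteed):

Id  NId      Name   Home        MaritalStatus  Descendants  SCDStartDate  SCDEndDate
1   Cust001  Henry  London      S              0            2005-03-24    2007-07-26
2   Cust001  Henry  London      M              0            2007-07-26    2009-11-27
4   Cust001  Henry  London      M              1            2009-11-27    9999-12-31
3   Cust002  Sarah  Birmingham  M              2            2006-02-13    9999-12-31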

Notice how the [SCDEndDate] of each record pertaining to "Henry" equals the [SCDStartDate] of the following record.


A new way in Denali

Amongst those 47 comments was the following:

Wow, that enormous group by does not look very appealing.  Not only is there another layer of abstraction, but when columns get added I have to update a view, and the group by is gonna make it crawl.
-JaggedEdge

That’s a fair point. Not only is there a GROUP BY, but if we look at the execution plan we can see that there are also two index scans:

[Image: execution plan for the JOIN/GROUP BY query, showing two index scans]

We’re only accessing one table but still get two scans? Yeah, that doesn’t feel good, does it? Happily, a new feature coming in SQL Server Denali (and included in yesterday’s CTP3 release) makes this easier: windowing functions. Below is the same query rewritten to use the new LEAD windowing function:

SELECT  c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]
 ,      ISNULL( LEAD([SCDStartDate]) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC)
              , CONVERT(DATE,'99991231')
              ) AS [SCDEndDate]
FROM    [Customer] c;

and here is the resultset and execution plan:

[Image: resultset of the LEAD query]

[Image: execution plan for the LEAD query, showing a key lookup]

We have the same results (just in a different order) with a different execution plan. There are more physical operators in this plan but at least we have eliminated one of those scans. Having said that, we have introduced a key lookup operator instead, which is easily eliminated by including some columns in our index (using DROP_EXISTING to rebuild the one we created earlier):

CREATE UNIQUE INDEX Cust_idx ON [Customer]([NId],[SCDStartDate]) INCLUDE ([Id],[Name],[Home],[MaritalStatus],[Descendants]) WITH (DROP_EXISTING = ON);

Run it again and our new execution plan is:

[Image: execution plan for the LEAD query after adding the INCLUDE columns; the key lookup has gone]

The key lookup operator has been eliminated and we have a nice method of deriving [SCDEndDate] without resorting to JOINs and multiple index scans.
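If you would rather consumers didn't have to write the LEAD expression themselves, one option is to wrap it in a view; here is a minimal sketch (the view name is purely illustrative):

CREATE VIEW [dbo].[CustomerWithEndDate]   --hypothetical name, for illustration only
AS
SELECT  c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]
 ,      ISNULL( LEAD([SCDStartDate]) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC)
              , CONVERT(DATE,'99991231')
              ) AS [SCDEndDate]
FROM    [Customer] c;

Queries can then use the familiar half-open pattern (WHERE @SomeDate >= [SCDStartDate] AND @SomeDate < [SCDEndDate]) without caring that the end date is derived rather than stored. Because the derived [SCDEndDate] equals the next record's [SCDStartDate], a BETWEEN predicate would match two records on the changeover date, so the half-open form is the one to use.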

 


Perf testing

OK, we have eliminated that nasty-looking additional Index Scan and Nested Loop Join, but the real question is whether this has proved beneficial or not. For that we need a suitably large dataset with which to test. Here's some code that inserts 140,000 rows into our [Customer] table, each time modifying the [NId] column slightly. It's a crude dataset but for our purposes it will suffice:

DECLARE @i INT = 0;

WHILE @i < 35000
BEGIN
   INSERT [Customer] ([NId],[Name],[Home],[MaritalStatus],[Descendants],[SCDStartDate])
   VALUES ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','S',0,CONVERT(DATETIME,'20050324'))
    ,     ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','M',0,CONVERT(DATETIME,'20070726'))
    ,     ('Cust002' + CAST(@i AS VARCHAR(6)),'Sarah','Birmingham','M',2,CONVERT(DATETIME,'20060213'))
    ,     ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','M',1,CONVERT(DATETIME,'20091127'));

   SET @i += 1;
END
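If you want to sanity-check the volume, a quick count should come back with 140,004 rows (the 140,000 generated above plus the original 4):

SELECT COUNT(*) AS [NumberOfRows] FROM [Customer];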

I ran our two queries with STATISTICS IO and STATISTICS TIME turned on; the sketch below shows one way to capture the figures, and the results follow:
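A minimal sketch of the measurement harness (paste each of the two queries from earlier in this post where indicated and read the numbers off the Messages tab):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

--1. run the LEFT OUTER JOIN / GROUP BY query from the top of this post
--2. run the LEAD query
--   then compare the Scan count, logical reads, CPU time and elapsed time

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;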

Method               Scan Count   Logical Reads   CPU time (ms)   Elapsed time (ms)
JOIN with GROUP BY   18           2902            1637            8232
LEAD                 1            2129            718             1853

There are a few stats there but really the most important one is elapsed time; as you can see, the new LEAD method was orders of magnitude faster: it took less than a quarter as long as the old method that I outlined in the previous blog post.


Conclusion

I still maintain that even pre-Denali you are better off not storing an [SCDEndDate], but now in Denali we have even less reason to do so because the new windowing functions are such a massive boost to performance in these sorts of scenarios. Get to know the new windowing functions well; something tells me you'll be glad that you did!

@jamiet

 

Update

After reading Jeremiah Peschka's post Leaving the Windows Open earlier today I realised that I could make the query containing the LEAD() function a little bit neater. Previously I was using ISNULL() to compensate for LEAD() returning NULL when there is no subsequent row, like so:

SELECT  c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]
 ,      ISNULL( LEAD([SCDStartDate]) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC)
              , CONVERT(DATE,'99991231')
              ) AS [SCDEndDate]
FROM    [Customer] c;

Here's the execution plan again for easy comparison (it's different to the one above because this one was generated against the much larger dataset):

[Image: execution plan for the ISNULL/LEAD query against the larger dataset]

Jeremiah's post taught me that you don't need ISNULL() because LEAD has a similar ability built in: its optional third argument specifies a default value to return when there is no subsequent row. Hence I can write the query a little more succinctly, like so:

SELECT  c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]
 ,      LEAD([SCDStartDate], 1, CONVERT(DATE,'99991231'))
            OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC) AS [SCDEndDate]
FROM    [Customer] c;

and the query plan:

[Image: execution plan for the LEAD query using the built-in default (identical to the previous plan)]

OK, the query plan isn't any different but I think we've got a much neater SQL query, and that makes me happy :)

Published Wednesday, July 13, 2011 7:31 PM by jamiet


Comments

 

Gonçalo Ferreira said:

It wasn't orders of magnitude faster -- it actually was in the same order of magnitude.

July 13, 2011 1:40 PM
 

Richard Back said:

I think that is a really neat feature to draw our attention to! I was on a Kimball course recently taught by the man himself, and they really drummed into us that their methods were for ease of use of the model, and performance of queries. They fully admit that sometimes this causes a bit more work 'in the kitchen', i.e. ETL development, but this is what works for them.

They also stressed that traditionally, SQL is easy to write (hence easy for navigation/self service tools to automatically write too) when you're pulling stuff out of the same row, with all the stuff you need in different columns, but harder when you have to do work across more than one row.

From a denormalisation perspective it's totally fine to have the end date in the same row as the start date, although it causes a bit of ETL pain, and I'll probably carry on with that to make the model easy to use.

But still, it's a really neat feature nonetheless!

July 13, 2011 3:45 PM
 

Cliff said:

To really show the benefit of your view that the EndDate is unnecessary, you would need to compare use of the new windowing function vs having the EndDate column. I'd rather add very little additional ETL to manage an EndDate than have to self-join or use a windowing function to get a value that you know at the time the type 2 change occurs.

July 13, 2011 4:25 PM
 

jamiet said:

Cliff,

Yes, that's right. I won't try and pretend this will be as quick as simply selecting SCDEndDate from the table - but hopefully I've proved it is a viable option; 140k rows in under two seconds is not to be disregarded. I fully expect that many people would rather just materialise the value in the table - and I'm totally OK with that.

If nothing else this blog post introduces a very handy new tool in Denali.

JT

July 13, 2011 4:34 PM
 

Peter Schott said:

I was wondering the same thing Cliff was wondering about, especially since any plans we may have to upgrade to Denali are definitely for the far future at this point. We're still on mostly 2005 boxes here....  An interesting comparison and a great example of using the Windowing functions, but we're not even close to attempting those except in a lab / proof of concept environment (aka local machine).

July 13, 2011 5:52 PM
 

Rob Farley said:

You can rewrite your first query to make it way nicer. It'll run about the same, but doesn't need the GROUP BY.

Try:

SELECT *, ISNULL((SELECT TOP (1) cNext.SCDStartDate FROM Customer AS cNext WHERE cNext.Nid = c.Nid AND cNext.SCDStartDate > c.SCDStartDate ORDER BY cNext.SCDStartDate),'99991231') AS SCDEndDate

FROM Customer AS c;

--No GROUP BY needed here, because our FROM clause only has one table, and we want a row outputted for every row in said table. Can even use SELECT * (but with care), because we want every column plus our extra one (even if new columns are added). If your underlying table changes, you may want those changes to go straight through to the view...

But LAG/LEAD are designed for this kind of situation, and definitely encouraged here. Nice post!

But for users living in the now (not the future), a correlated subquery in place of a GROUP BY is best.

July 14, 2011 2:31 AM
 

Craig said:

So glad they finally added this functionality. This was a real pain with DW queries previously.

July 15, 2011 2:36 PM
 

Tamera said:

OMG I LOVE YOU!!!!!  I have been struggling for about a week (I have had it in my mind to work with for over a month) with a query that I need to calculate time in status.  But some jackass (our BI team, who should know better) built the SCD without an end date.  Everyone kept telling me this was going to be hard and nowhere could I find good samples to help me.  This morning I was reading my newsletter and your post title caught my eye.  Since I had all the "logistics" worked out already it took me < 15 min to create what I need.  Oh don't get me wrong, I will ask the director to move my database to Denali so I can try out the NEW solution.

THANK YOU!!!!!!!

July 16, 2011 10:43 AM
 

jamiet said:

Thanks Tamera - I love getting comments like that :)

July 16, 2011 10:46 AM
 

Kristian Wedberg said:

Great stuff on LEAD! I ran a quick test with your example, with and without adding a physical SCDEndDate, just to make a point:

GROUP BY: 2390ms

Physical SCDEndDate: 141ms

This makes the physical SCDEndDate query consume about 17 times less CPU vs. GROUP BY, and about 8 times less vs. the LEAD version. So _if_ query performance and resource usage are important, _and_ the queries use non-current records, then physical SCDEndDate is definitely the way to go.

Cheers/Kristian

August 9, 2011 6:15 AM
 

Nicki D said:

Hi Jamie,

I think this design misses one of the fundamental points of Kimball's Star Schema design, i.e. denormalisation simplifies queries in reports.  Granted I'm not yet familiar with the windowing functions in Denali, and the efficiencies they introduce, but the neatest query is going to be one that doesn't need LEAD or GROUP BY or anything but SELECT c.SCDStartDate, c.SCDEndDate ... because they're both in the same row.  Surely the overhead of denormalising the dates into one row in the ETL process is significantly outweighed by the overhead of joining two rows together every time you run a query that requires both Start & End Date?

Another option, in this example, could be to avoid the Type 2 SCD altogether, by moving the Demographic attributes into a separate "mini-dimension" and handling the time variance through the fact(s).

October 20, 2011 8:47 AM
 

jamiet said:

Hi Nicki,

That was pretty much the issue that prompted the debate in the comments to the original post. My position is this - SCDStartDate & SCDEndDate are there for the purposes of ETL - they are not there to be reported upon. If they're not going to be reported on, why bother denormalising them?

Others are free to disagree of course - that's what makes it a good debate.

Thanks for the comment

JT

October 20, 2011 9:24 AM
 

Nicki D said:

Thanks Jamie,

That's a very good point. Those dates are seldom if ever required for reporting.  I'll have to rethink my position!

Your posts are always thoughtful, helpful and well written.

Keep up the good work.

Nicki

October 30, 2011 7:06 AM
 

Emil Glownia said:

I doubt Kimball would be as strongly against this method as some people in here.... From my point of view the goal is to get the start/end date; the method you use should be the simplest of all (for you) so it doesn't cause unexpected results in the short & long run.

The problem with SCD types that I found is that you update EndDate multiple times for one dimension load, and sometimes they change from type 1 to 2, or new SCD types are introduced, or you get unexpected issues or just a bug and you end up with a wrong date..... Jamie's post gives you a very neat solution that is maintained in one centralized place, and most of us love a centralized place for logic :) for obvious reasons....

February 14, 2012 11:13 AM
 

Cecil Dkal said:

This won't scale nearly as well as actually storing an end date; it highly complicates the actual retrieval of data (now you use extra joins or window functions rather than a BETWEEN), forces you to read minimally twice as many rows as necessary on retrieval (you need to read both the target and the adjacent row in order to evaluate the effective record), and decreases the readability/maintainability of the reading code.

Adopting patterns that actually result in higher overall maintenance, more complicated queries and slower performance in the most common aspect (read) of data access, so that you can save a little in maintenance in the significantly less frequent operation (writing and/or updating of ETL procedures), is a terrible practice to adopt...

This is a cool pattern and would be excellent to use in an ETL process when pulling from a system that did not originally have an end date, but not storing one is a poor design decision.  Not from the logical perspective (there it makes sense), but from the physical perspective of how SQL can leverage indexes, how much IO it has to perform, how many CPU cycles it would need to perform, etc.

April 22, 2013 2:34 PM
