|
|
|
|
This is the blog of Jamie Thomson, a freelance SQL Server developer in London
Reigniting the [SCDEndDate] debate In November 2009 I wrote a blog post entitled Debunking Kimball Effective Dates where I stated that I was against the Kimball-recommended practice of maintaining a StartDate & EndDate value for every record in a type 2 slowly changing dimension (SCD) table (i.e. the values highlighted in blue in the following image):  The post was deliberately inflammatory so as to provoke a debate and it certainly got the desired response as it received 47 comments with a roughly 50-50 split between folks agreeing/disagreeing with me (which was especially interesting given I discussed the same technique back in 2005 but without the inflammatory tone and then received only 6 comments). I really don’t mind if people agree or not, it certainly made for a good debate in those 47 comments. The crux of the post was thus: the column [SCDEndDate] is unnecessary because it can be derived from [SCDStartDate] of the preceding record. Eliminating that column is therefore beneficial because it can ease the pain of managing those values in an ETL process. Here’s some code that creates the table (without [SCDEndDate]), populates it, then derives [SCDEndDate] in a query: DROP TABLE [Customer]; --drop if exists CREATE TABLE [Customer] ( [Id] INT PRIMARY KEY IDENTITY(1,1) , [NId] NVARCHAR(50) , [Name] NVARCHAR(50) , [Home] NVARCHAR(50) , [MaritalStatus] CHAR(1) , [Descendants] INT , [SCDStartDate] DATE --Note only [SCDStartDate], no [SCDEndDate] ); CREATE UNIQUE INDEX Cust_idx ON [Customer]([NId],[SCDStartDate]) INCLUDE ([Id],[Name],[Home],[MaritalStatus],[Descendants]); INSERT [Customer] ([NId],[Name],[Home],[MaritalStatus],[Descendants],[SCDStartDate]) VALUES ('Cust001','Henry','London','S',0,CONVERT(DATETIME,'20050324')) , ('Cust001','Henry','London','M',0,CONVERT(DATETIME,'20070726')) , ('Cust002','Sarah','Birmingham','M',2,CONVERT(DATETIME,'20060213')) , ('Cust001','Henry','London','M',1,CONVERT(DATETIME,'20091127')); SELECT c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate] , COALESCE(MIN(c2.[ScdStartDate]),CONVERT(DATE,'99991231')) AS [SCDEndDate] FROM [Customer] c LEFT OUTER JOIN [Customer] c2 ON c.[Nid] = c2.[Nid] AND c.[SCDStartDate] < c2.[SCDStartDate] GROUP BY c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate]; Go and run it now if you’re not au fait with what we’re talking about here! Here is the resultset:  Notice how the [SCDEndDate] for all records pertaining to “Henry” equal the [SCDStartDate] of the following record. A new way in Denali Amongst those 47 comments was the following: Wow, that enormous group by does not look very appealing. Not only is there another layer of abstraction, but when columns get added I have to update a view, and the group by is gonna make it crawl. -JaggedEdge That’s a fair point. Not only is there a GROUP BY but if we look at the execution plan we can see that there are two index scans also:  We’re only accessing one table but still get two scans? Yeah, that doesn’t feel good does it? Happily a new feature is coming to SQL Server Denali that makes this easier and is included in yesterday’s CTP3 release; that feature is windowing functions. Below is the same query rewritten to use the new LEAD windowing function: SELECT c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate], ISNULL( LEAD([SCDStartDate]) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC), CAST('20501231' AS DATETIME) ) AS [SCDEndDate] FROM [Customer] c and here is the resultset and execution plan:   We have the same results (just in a different order) with a different execution plan. There are more physical operations going on here but at least we have eliminated one of those scans. Having said that we have introduced a key lookup operator instead which is easily eliminated by including some columns in our index: CREATE UNIQUE INDEX Cust_idx ON [Customer]([NId],[SCDStartDate]) INCLUDE ([Id],[Name],[Home],[MaritalStatus],[Descendants]); Run it again and our new execution plan is:  The key lookup operator has been eliminated and we have a nice method of deriving [SCDEndDate] without resorting to JOINs and multiple index scans. Perf testing Ok, we have eliminated that nasty looking additional Index Scan and Nested Loop Join but the real question is whether this has proved beneficial or not - for that we need a suitably large dataset with which to test. Here’s some code that inserts 140000 rows into our [Customer] table , each time modifying the [NId] column slightly. Its a crude dataset but for our purposes, it will suffice: DECLARE @i INT = 0 WHILE @i < 35000 BEGIN INSERT [Customer] ([NId],[Name],[Home],[MaritalStatus],[Descendants],[SCDStartDate]) VALUES ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','S',0,CONVERT(DATETIME,'20050324')) , ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','M',0,CONVERT(DATETIME,'20070726')) , ('Cust002' + CAST(@i AS VARCHAR(6)),'Sarah','Birmingham','M',2,CONVERT(DATETIME,'20060213')) , ('Cust001' + CAST(@i AS VARCHAR(6)),'Henry','London','M',1,CONVERT(DATETIME,'20091127')); SET @i += 1; END I ran our two queries with STATISTICS IO ON & STATISTICS TIME ON, here are the results: | Method | Scan Count | Logical Reads | CPU time | Elapsed time | | JOIN with GROUP BY | 18 | 2902 | 1637 | 8232 | | LEAD | 1 | 2129 | 718 | 1853 | There are a few stats there but really the most important one is Elapsed time; as you can see the new LEAD method was orders of magnitude faster, it took less than a quarter as long as the old method that I outlined back in the previous blog post. Conclusion I still maintain that even pre-Denali you are better off not storing an [SCDEndDate] but now in Denali we have even less reason to do so because the new windowing functions are such a massive boost to performance in these sorts of scenarios. Get to know the new windowing functions well, something tells me you’ll be glad that you did! @jamiet Update After reading Jeremiah Peschka’s post Leaving the Windows Open earlier today I realised that I could make the query containing the LEAD() function a little bit neater. Previously I was using ISNULL() to compensate for there being no match from LEAD like so: SELECT c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate], ISNULL( LEAD([SCDStartDate]) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC), CAST('20501231' AS DATETIME) ) AS [SCDEndDate] FROM [Customer] c Here’s the execution plan again for easy comparison (its different to the one above because this was for a much larger dataset):  Jeremiah’s post taught me that you don’t need ISNULL because LEAD has a similar ability built into it hence I can write the query a little more succinctly like so: SELECT c.[Id],c.[NId],c.[Name],c.[Home],c.[MaritalStatus],c.[Descendants],c.[SCDStartDate], LEAD([SCDStartDate], 1, CAST('20501231' AS DATETIME)) OVER(PARTITION BY [NId] ORDER BY [SCDStartDate] ASC) AS [SCDEndDate] FROM [Customer] c and the query plan:  OK, the query plan isn’t any different but I think we’ve got a much neater SQL query. And that makes me happy 
Comment Notification
If you would like to receive an email when updates are made to this post, please register here
Subscribe to this post's comments using
|
|
|
|
|