THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

SQL: Merge two "history tracking" tables

If you have two columns in a table that "remember" their old values in two different historical tables, as it is the case in AdventureWorks with EmployeePayHistory and EmployeeDepartmentHistory, you may need (or just wonder how) to merge these two tables into only one historical table that will keep track of both columns.

This has been my "problem of the day" and, after some thoughts, I came up with a pattern of solution that I'd like to share with you both to see if you have a better solution and to have the code at hand when I'll need later. :)

Before using the real table, I made up a simple test case with a Product table with only two fields a couple of variation tables (VariationA and VariationB):

--
--  Sample table Products, only two columns A and B. The current value of the row is (D, 4)
--
CREATE TABLE Products (
    A CHAR (1),
    B INT
)
INSERT INTO products (A, B) VALUES ('D', 4)

--
--  Variations for column A
--
--      ------------------
--      Year    Old Value
--      ------------------
--      2001        A
--      2002        B
--      2003        C
--      2005        D
--
CREATE TABLE VariationA (   
    Position INT,
    OldValue VARCHAR (5),
    ChangeDate DATETIME )

INSERT VariationA (Position, OldValue, ChangeDate) VALUES (1, 'A', '01/01/2001')
INSERT VariationA (Position, OldValue, ChangeDate) VALUES (2, 'B', '01/01/2002')
INSERT VariationA (Position, OldValue, ChangeDate) VALUES (3, 'C', '01/01/2003')
INSERT VariationA (Position, OldValue, ChangeDate) VALUES (4, 'D', '01/01/2005')

--
--  Variations for column B
--
--      ------------------
--      Year    Old Value
--      ------------------
--      2000        1
--      2003        2
--      2004        3
--
CREATE TABLE VariationB (   
    Position INT,
    OldValue INT,
    ChangeDate DATETIME )

INSERT VariationB (Position, OldValue, ChangeDate) VALUES (1, 1, '01/01/2000')
INSERT VariationB (Position, OldValue, ChangeDate) VALUES (2, 2, '01/01/2003')
INSERT VariationB (Position, OldValue, ChangeDate) VALUES (3, 3, '01/01/2004')

 

The query that mixes all these variations into a single table is pretty easy but it took me some time to discover. My solution is to detect what was the value of "B" when a variation in "A" happened and vice versa. As the values stored are "old" values, we know that the value of A at a certain date is the value stored in the first variation for A AFTER that date. The only special case is the current time: if no variation record is found then we know that the value to use is the current value of the record in the product table. The same is obviously true for B. 

I ended up with this pattern query:

--
--  AllEvents contains all the events from both VariationA and VariationB,
--  and will detect, for each variation that happened, what was
--  the value of the other column at that time, building in this way the
--  merged variation list.
--  The final SELECT will return the sorted and DISTINCTed result. 
--  Then final COALESCE is needed because if a value is NULL it 
--  means that it should contain the "current" value of the column
--
WITH AllEvents AS (
    SELECT 
        OldValueOfA  = Events.OldValueOfA,    -- Old value of A
        OldValueOfB  = Events.OldValueOfB,    -- Old value of B
        ChangeDate   = Events.ChangeDate      -- Date of change
    FROM 
            (SELECT 
                OldValueOfA = OldValue, 
                OldValueOfB = (SELECT TOP 1 OldValue 
                                 FROM VariationB V
                                WHERE V.ChangeDate >= VariationA.ChangeDate
                                ORDER BY ChangeDate), 
                ChangeDate 
            FROM VariationA 
        UNION ALL 
            SELECT 
                OldValueOfA = (SELECT TOP 1 OldValue 
                                 FROM VariationA V
                                WHERE V.ChangeDate >= VariationB.ChangeDate
                                ORDER BY ChangeDate), 
                OldValueOfB = OldValue, 
                ChangeDate 
            FROM VariationB
        ) Events
    )
SELECT DISTINCT
    ChangeDate   = ChangeDate,
    OldValueOfA  = COALESCE (OldValueOfA, (SELECT A FROM Products)),
    OldValueOfB  = COALESCE (OldValueOfB, (SELECT B FROM Products))
FROM
    AllEvents
ORDER By ChangeDate

 

Clearly, i could have written some VB code in SSIS to solve the same problem. Nevertheless, having a pattern query at hand is useful because the pattern can be easily adapted to any real world situation with only SQL Management Studio at hand. The following code is the implementation of the pattern for AdventureWorks to merge Department and Payment history for Employees:

WITH AllEvents AS (
    SELECT 
        EmployeeID      = Events.EmployeeID,
        OldValueOfRate  = Events.OldValueOfRate,  
        OldValueOfDep   = Events.OldValueOfDep,   
        ChangeDate      = Events.ChangeDate     
    FROM 
            (SELECT 
                EmployeeID      = EmployeeID,
                OldValueOfRate  = Rate, 
                OldValueOfDep   = (SELECT TOP 1 DepartmentID
                                     FROM HumanResources.EmployeeDepartmentHistory V
                                    WHERE V.StartDate >= PayHistory.RateChangeDate
                                      AND V.EmployeeID = PayHistory.EmployeeID
                                    ORDER BY V.StartDate), 
                ChangeDate      = RateChangeDate 
            FROM HumanResources.EmployeePayHistory PayHistory
        UNION ALL 
            SELECT 
                EmployeeID      = EmployeeID,
                OldValueOfRate  = (SELECT TOP 1 Rate
                                     FROM HumanResources.EmployeePayHistory V
                                    WHERE V.RateChangeDate >= DepHistory.StartDate
                                      AND V.EmployeeID = DepHistory.EmployeeID
                                    ORDER BY V.RateChangeDate), 
                OldValueOfDep   = DepHistory.DepartmentID, 
                ChangeDate      = StartDate
            FROM HumanResources.EmployeeDepartmentHistory DepHistory
        ) Events
    )
SELECT DISTINCT
    EmployeeID     = EmployeeID,
    ChangeDate     = ChangeDate,
    OldValueOfDep  = COALESCE (
                         OldValueOfDep,
                         (SELECT DepartmentID 
                           FROM HumanResources.EmployeeDepartmentHistory DepHistory
                          WHERE EndDate IS NULL 
                            AND DepHistory.EmployeeID = AllEvents.EmployeeID)),
    OldValueOfRate = COALESCE (
                         OldValueOfRate,
                         (SELECT TOP 1 Rate
                           FROM HumanResources.EmployeePayHistory PayHistory
                          ORDER BY RateChangeDate))
                    
FROM
    AllEvents
ORDER By EmployeeID, ChangeDate

 

The code can be easily checked filtering for a single EmployeeID (4 is a good candidate for this check). Please note that - in this case - the "current value" is kept in the history table and so the final COALESCE is a bit more intricated but still pretty easy both to write and to understand.

If you have a better pattern for this kind of situation or any comments on it... I'll be glad to read your comments.

Published Monday, October 15, 2007 9:36 PM by AlbertoFerrari
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Davide Mauri said:

Hi Alberto

shouldn't the table VariationB contain the "4" value (along with the year of change)?

Otherwise the two table (VariantionA & VariationB) are not coherent with each other, in fact VariationA contains the current value, where VariationB doesn't

October 16, 2007 5:50 AM
 

Davide Mauri said:

Btw, this is a classic "Time-Dependent" problem.

You can solve it switching to a more appropriate view where you have the column "ValidFrom" and "ValidTo". After doing this is just a matter of choosing the rows where times does overlaps:

with cte_v1 as

(

select

Position = va2.Position,

OldValue = va2.OldValue,

ValidFrom = coalesce(va1.ChangeDate, '19000101'),

ValidTo = va2.ChangeDate

from

[VariationA] as va1 right outer join [VariationA] as va2 on va1.Position = va2.Position - 1

),

cte_v2 as

(

select

Position = vb2.Position,

OldValue = vb2.OldValue,

ValidFrom = coalesce(vb1.ChangeDate, '19000101'),

ValidTo = vb2.ChangeDate

from

[VariationB] as vb1 right outer join [VariationB] as vb2 on vb1.Position = vb2.Position - 1

)

select

ChangeDate = case when c1.ValidTo < c2.ValidTo then c1.ValidTo else c2.ValidTo end,

OldValueOfA = c1.OldValue,

OldValueOfB = c2.OldValue

from

[cte_v1] c1, [cte_v2] c2

where

(c1.ValidFrom < c2.ValidTo and c1.ValidTo >= c2.ValidFrom)

and

(c2.ValidFrom < c1.ValidTo and c2.ValidTo >= c1.ValidFrom)

October 16, 2007 8:17 AM
 

AlbertoFerrari said:

Nice solution Davide, it was my first try but I messed myself up writing the WHERE condition, so I choose the other tecnique.

Thanks for the comments.

October 16, 2007 9:00 AM
 

Davide Mauri said:

Hi Alberto, yes the problem with this kind of temporal queries are the qualifying clauses. I wish we could use a nice "OVERLAPS" operator :-)

October 16, 2007 9:38 AM
 

LudwikG said:

What about this approach:

WITH allDates AS

(SELECT ChangeDate FROM VariationA

UNION

SELECT ChangeDate FROM VariationB)

SELECT

allDates.ChangeDate,

OldValueOfA = COALESCE(

(SELECT TOP 1 OldValue

FROM VariationA V

       WHERE V.ChangeDate >= allDates.ChangeDate

       ORDER BY ChangeDate),

(SELECT A FROM products)),

OldValueOfB = COALESCE(

(SELECT TOP 1 OldValue

FROM VariationB V

       WHERE V.ChangeDate >= allDates.ChangeDate

       ORDER BY ChangeDate),

(SELECT B FROM products))

FROM allDates

October 18, 2007 6:30 AM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement