THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

PowerPivot and the Slowly Changing Dimensions

Slowly changing dimensions are very common in the data warehouses and, basically, they store many versions of the same entity whenever a change happens in the columns for which history needs to be maintained. For example, the AdventureWorks data warehouse has a type 2 SCD in the DimProduct table. It can be easily checked for the product code “FR-M94S-38” which shows three different versions of itself, with changing product cost and list price.

image

This is exactly what we can expect to find in any data warehouse: each row contains the historical value of the attribute, so that we can perform analysis on how the attribute has changed over time. Moreover, a column (status, in the example) holds information about which is the current version of the product. Thus, looking at the table, it is very easy to detect that the current price of the product is 1,346.50. Nevertheless reporting on the current list price is not very easy because there is no way to use the current value to slice data in a PivotTable: no columns hold it; it has to be computed in some way.

The issue might not be evident with the list price but, if you think at a slowly changing dimension with the customer address as an historical attribute, it might be interesting to know where a customer lived but it is much more interesting to know where he lives now. Moreover, the best would be to be able to get both information, which might be interesting for different purposes. Unfortunately, in the AdventureWorks company they seem to be interested only in price changes, so I had to use this as an example. Sorriso

Now, if you live in a corporate BI environment and want to perform current vs historical price comparisons, then somebody will create a view for you which, through some complex JOIN, will be able to expose both the historical and the current version of some columns. If you dare to ask him to consolidate the values in the dimension table, he will complain about ETL time, UPDATES needed on a dimension, log file usage and, at the end, he will probably refuse to do that.

Luckily, in the Self-Service BI world, things are much easier. If we want to add a calculated column to the table that computes, for each version of the product, which is the current list price (which happens to be the same for all the instances of the same product), it will be enough to use this formula:

=CALCULATE (
    VALUES (DimProduct[ListPrice]),
    ALL (DimProduct),
    DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]),
    DimProduct[Status] = "Current"
)

Its behavior is very easy to understand: compute the value of the listPrice searching in all products which have the same ProductAlternateKey as the current product and the Status equal to" “Current”. The only part worth noting is the usage of ALL (DimProduct) inside the CALCULATE filters. This is needed because the initial CALCULATE will consolidate the row context of the calculated column into a filter one, which we will need to remove in order to replace it with our specific filter. The result is straightforward:

image

Now CurrentListPrice is a calculated column and can be used to slice data in a PivotTable, as any other column:

image

Some words need to be spent if you ever face a data warehouse which does not hold a “Current” column like AdventureWorks. There is really no standard technique to identify the current version of a product: sometimes we have an “historical/current” column, as in this example, sometimes we need to infer it from the EndDate containing NULL, sometimes we need to find the last version of the record and use that. I am writing here the various flavors of the formula, just to keep them at hand in case they are needed.

If we need to rely on the emptiness of the EndDate column, the formula is easily adapted:

=CALCULATE (
    VALUES (DimProduct[ListPrice]),
    ALL (DimProduct),
    DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]),
    ISBLANK (DimProduct[EndDate])
)

While, if we need to search for the last occurrence of the record with the same code, we need some more DAX acrobatics to express the formula:

=CALCULATE (
VALUES (DimProduct[ListPrice]),
FILTER (
ALL (DimProduct),
DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]) &&
DimProduct[StartDate] = CALCULATE (
MAX (DimProduct[StartDate]),
ALL (DimProduct),
DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey])
)
)
)

This last formula is the most interesting one, since it relies on a minimum of information (just the start date) and performs the whole logic of current value detection, searching, for each row, the one containing the same code but the last date of validity. Needless to say the three formulas return the very same value.

As a side note, I think that in BISM this will be a very easy and convenient method to handle slowly changing dimensions even for the corporate BI environment since the current and historical values of any attribute can be easily computed with a minimal CPU effort and without the need to perform any update on the historical dimension on the database. I wish I had a similar functionality long time ago, when I needed to write ETL code to handle this scenario and spent some time in optimizations and locking issues… Sorriso

This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops I and Marco Russo are bringing all over Europe in the next few months.

You can find more info on www.powerpivotworkshop.com. Hope to see you  there!

Published Wednesday, February 09, 2011 9:00 AM by AlbertoFerrari
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Lee Hawthorn said:

Thanks Marco, this article is very helpful.

February 9, 2011 2:04 PM
 

Javier Guillen said:

Hi Alberto

Interesting solution to SCD Type 2 in PowerPivot.  I am wondering, though, as the product FR-M94S-38 now has three rows with the same (current) price, when aggregating the data in a pivot table the value will overstate the product price at is being added three times.

If the report in thiss case is meant to display only current products, wouldn't it be best to filter the in-memory table to only load the rows where [Status] = Current?

February 22, 2011 4:42 PM
 

AlbertoFerrari said:

Javier,

Yes, there are three rows for the same product, but this is how the AdventureWorks DWH (and any DWH AFAIK, which uses SCD2) stores the SCD. The computations are right, since aggregations are carried on using column values, not physical rows. Thus, the existence of three different rows does not change the values.

Clearly, as it is always the case with SCD, if you create a DISTINCT COUNT measure you need to pay attention to what you count, the product code is fine, the product ID not.

Moreover, if the report is designed to show current products only... then you are not facing and SCD of type 2, so the whole technique is useless. :) You end ups using SCD when you are interested in historical changes...

February 22, 2011 4:54 PM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement