THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Using Linear Regression MDX functions for forecasting

MDX provides several functions for computing linear regression, however those functions are anything but intuitive or easy to use. This is definitely an area in MDX, that given a chance I would've done differently now. The problem is that when we designed those functions, we unnecessarily generalized them so it would be possible to use them on any dimension with any expression for both "x" and "y" coordinates, but in the process of generalization, we made the most natural use cases - forecasting by Time too difficult and unnatural. Microsoft even published KB article 307276 which explains how to use MDX Linear Regressions family of functions. Unfortunately, while syntactically and mathematically the examples in this KB article are correct, they don't make sense from the business problem point of view, because instead of time, they use "Units Sales" as an "x" coordinate. So let's try to understand how those functions work and what they expect as parameters. I will be focusing on the LinRegPoint function, since it is one of the more useful functions, but same principals apply to others as well. Let's start with what Books Online tell us about it. I slightly modified the content of BOL, because it makes reference to other LinRegxxx functions.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/olapdmad/agmdxfunctions_8n3o.asp

LinRegPoint

Calculates the linear regression of a set and returns the value of y in the regression line y = ax + b.

Syntax

LinRegPoint(«Numeric Expression», «Set», «Y-Numeric Expression»[, «X-Numeric Expression»])

Remarks

Linear regression that uses the least-squares method calculates the equation of the best-fit line for a series of points. Let the regression line be given by the following equation, where a is called the slope and b is called the intercept:

y = ax+b

LinRegPoint uses its last three arguments like the other LinRegxxx functions use them: to calculate the regression line, i.e. it evaluates «Set» against the «Y-Numeric Expression» to get the set of values for the y-axis. It then evaluates «Set» against the «X-Numeric Expression», if present, to get the set of values for the x-axis. If the «X-Numeric Expression» is not present, the function uses the members of «Set» as values for the x-axis. Then, the function evaluates the first argument and uses the resulting number as the x value in the regression equation (y = ax + b) to calculate the y value.

Let's try to understand what this means. The idea is simple, really. In order to compute the coefficients a and b (also known as "slope" and "intercept" of the regression), the algorithm needs set of points - { (x1 y1), (x2 y2), ... (xk yk) }. LinRegxxx functions produce this set of points by saying that applying «Y-Numeric Expression» to the «Set» produces values for y1, y2 ... yk and that applying «Y-Numeric Expression» to the «Set» produces values for x1, x2 ... xk. Let's take concrete example: In the Foodmart sample database, in the cube sales, we have data for Store Sales measure in 1997, and we would like to forecast Store Sales in 1998 using linear regression. The graph for Store Sales over time in 1997 looks like following:

So let's see how to model those (x,y) coordinates in MDX. The expression for y1, y2 ... yk is trivial - we will just use [Measures].[Store Sales]. But what expression to use for the x1, x2 ... xk ? Basically, we want to put months on the axis X using following translation: 1997.Q1.January - 1, 1997.Q1.February - 2 etc. This can be achieved through various ways. One way, is to define a member property on the Month level which will do the mapping. Then the expression will become (assuming that XPosition is the name of that member property)

Val(Time.CurrentMember.Properties("XPosition"))

There is also a generic way to convert Month into its position on X axis without introducing new member properties. The MDX function Rank does exactly that - returns position of an element inside the set. So the set would be the all the months, i.e. [Time].[Month].Members, or making it a little bit more general - [Time].CurrentMember.Level.Members - this will give all the members from the current level in Time. So the «X-Numeric Expression» will look now

Rank([Time].CurrentMember, [Time].CurrentMember.Level.Members)

This is actually the most difficult part about LinReg functions - how to construct expression for the X axis. Everything else now becomes simple. Next thing that needs to be defined is «Set», i.e. over which domain the points ( xi yi ) need to computed, i.e. what would be the range of months for which we have actual Store Sales and which will be used to build linear regression's slope and intercept. It is possible to programmatically with MDX to find out the dates which had data, but this really goes outside of the scope of this article, so we will just hardcode the "training set" to be months in 1997, since this is where we have data in Foodmart's Sales cube anyway. Therefore the expression for «Set» will be

Descendants([Time].&[1997],[Time].CurrentMember.Level)
And to finish the parameters to the LinRegPoint function, we now only need to come up with the expression for the first argument. This expression is supposed to calculated X coordinates of the points which are to be forecasted - and the forecasted value y will be computed by the equation of linear regression, i.e. y = ax+b. Now, this is easy, since we already know how to compute the X coordinates, so the expression will be the same as we used for «X-Numeric Expression», i.e.
Rank([Time].CurrentMember, [Time].CurrentMember.Level.Members)

Now we simply need to put all of it together. We will define new calculated measure [Store Sales Forecast] using the following formula:

LinRegPoint(
  Rank(Time.CurrentMember, Time.CurrentMember.Level.MEMBERS),
  Descendants([Time].&[1997],[Time].CurrentMember.Level), 
  [Measures].[Store Sales], 
  Rank(Time.CurrentMember, Time.CurrentMember.Level.MEMBERS))

If we now will browse the cube and put [Store Sales] together with [Store Sales Forecasted] - we will see that while [Store Sales] doesn't have any data in 1998, [Store Sales Forecast] does ! However, there is a problem. The cells for ([Store Sales Forecast], [1997]) and ([Store Sales Forecast], [1998]) contain -1.#IND instead of data. The reason for that is simple - the set over which we compute linear regression uses [Time].CurrentMember.Level, so when we are looking at 1997 or 1998 - it will resolve to the Year level. And Descendants([Time].[1997], [Time].[Year]) will return a single member 1997 - indeed we only have one year with data. But, it is impossible to compute linear regression using a single point, at least two points are needed, and when only a single point is used, formula of least-square method will generate division by zero, which gets formatted as -1.#IND. Of course, it is possible to workaround this problem by testing whether we are at the year level, or whether set has a single element etc. But let's look at this problem differently. Why are we calculating forecasted sales at all levels of the Time dimension ? Shouldn't we forecast it at the lowest level only, and then let it aggregate to the higher levels ? Additional issue with the calculated measure that we created is that it hardcoded forecasting on Store Sales. What if we want to forecast on other measures - like Unit Sales, or even on calculated measures like Profit. The right solution to all those problems is to define forecasting calculation not as calculated member, but as cell calculation. Cell calculations are defined on the cells without creating new coordinate and they aggregate up by using measure aggregation function. So let's add to the Sales cube the Category dimension which conveniently have members such as "ACTUAL" and "FORECAST". Category dimension doesn't have join with fact table "sales_fact_1997", so we can create a view on top of it, adding constant field category_id which always joins to the "ACTUAL" member in the Category dimension, i.e.

SELECT 'ACTUAL' AS category_id, * FROM sales_fact_1997;

Now we will define cell calculation with the following properties:

Scope:

[Time].[Month].MEMBERS, {[Category].&[FORECAST]}
- i.e. it applies to the Month level and to FORECAST category only

Expression:

LinRegPoint(
  Rank(Time.CurrentMember, Time.CurrentMember.Level.MEMBERS),
  Descendants([Time].&[1997],[Time].CurrentMember.Level), 
  [Measures].[Store Sales], 
  Rank(Time.CurrentMember, Time.CurrentMember.Level.MEMBERS))

So what happens now when we position ourselves on cell ([Store Sales], [FORECAST], [1998]) - it gets aggregated from the months in 1998, each one of which is predicted using linear regression forecasting. The result graph will look like following:

Published Tuesday, December 21, 2004 7:50 PM by mosha
Filed under:
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement