THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Budget Variance - A study of MDX optimizations: evaluation modes and NON_EMPTY_BEHAVIOR

According to OLAP Survey, the number one objective in building OLAP systems is performance.  Certainly postings in newsgroups and forums confirm that finding. It is safe to say, that the most frequent question people are asking is "How do I optimize the following MDX...". Optimizing MDX is both science and art. It requires a mix of knowledge of MDX constructs, familiarity with UDM concepts and some basic understanding of how query optimizer works. But what I find the most difficult, is to understand the real problem behind the question. Most of those require deep insight into the concrete model and intimate knowledge of the business logic driving the requirements. Often these details obscure the technical question enough that either nobody is brave enough to take a shot at it (certainly, myself, if I didn't understand the issue after reading it - I am unlikely to engage) or a long thread starts which sinks in additional details and clarifications, and at the end even if the answer is both correct and improves performance, it is not easy to extract the core of the idea behind the optimization and generalize it enough to apply to other scenarios.

This is why, I decided to demonstrate some of the MDX optimization techniques using the simplest example possible. We will be talking about Budget Variance today. Budget Variance is simply difference between the Actual and, well, the Budget. The MDX expression for it is nothing more sophisticated then Actual - Budget. Yet, after we will consider all the related things, we will see how to achieve several thousand fold performance improvements. The techniques used below are fairly simple, and hopefully are applicable to the wide range of scenarios. I will also try to explain what is happening "behind the scenes" as we progress.

Budget Variance problem setting

Now, it turns out that in Accounting, Budget Variance is not always computed as Actual-Budget. For the Expense accounts it actually should be computed as Budget-Actual. This rule is so common, that in Essbase, there is a special built-in function - @VAR - which does exactly that, i.e. @VAR(Actual, Budget) will return Actual-Budget for "No Expense" accounts, and Budget-Actual for "Expense" accounts. More information about @VAR can be found here. So we would want to do the same in MDX. Additional requirement for computing Budget Variance is to take into account the fact that Budget is not entered at every cell in the cube. There are areas where Budget is not defined, even thought that Actuals do exist everywhere. At these cells we need to leave Budget Variance empty as well. If we look inside Adventure Works cube, for example, we will see that the Budget is entered only for the first day of each month. This is actually an unusual approach (although a valid one). Usually, Bugdets are defined in separate measure group, which have different granularity from the Actuals measure group, i.e. for Date dimension Budget would have granularity on Months rather then on Dates. Anyway, even with such designs missing Budgets are normal.

Now, if we open Adventure Works cube, we will discover that it already have a calculation for Budget Variance, which satisfy all of our requirements. Below is the relevant fragment of MDX Script:

Create Member CurrentCube.[Scenario].[Scenario].[Budget Variance]
 As Case
        When IsEmpty
             (
                ( 
                  [Measures].[Amount],
                  [Scenario].[Scenario].[Budget] 
                )
             ) 

        Then "Not Budgeted" 

        When [Account].[Account Type].CurrentMember Is 
             [Account].[Account Type].[Expenditures]
             Or
             [Account].[Account Type].CurrentMember Is 
             [Account].[Account Type].[Liabilities]

        Then ( [Measures].[Amount],[Scenario].[Scenario].[Budget] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Actual] )

        Else ( [Measures].[Amount],[Scenario].[Scenario].[Actual] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Budget] )
    End,

Format_String = "Currency",
Non_Empty_Behavior = { [Measures].[Amount] }

Before we can proceed, we will have to fix few minor things about this expression

1. We will replace "Not Budgeted" with NULL. Assigning strings to the values is usually not a good idea, both from performance and correctness point of view. The correctness piece is basically because if we will ever build calculation which relies explicitly or implicitly on [Budget Variance] - the chances are it won't work well when it will get fed strings. If it is desired to show string "Not Budgeted" to users, it can be relegated to FORMAT_STRING property instead.

2. The checks on [Account Type] attribute rely on the attribute overwrite rules to decode Account Type when current coordinate is set to Account. Unfortunately, this decoding doesn't happen in parent-child dimensions between the key or parent and other attributes. Therefore the check for Account Type will not work correctly. We will have to replace it to the explicit checks on .Properties("Account Type"). This is not an ideal scenario (just think what would happen if Accounts didn't have grain, but Account Type did), but unfortunately this is the only way to overcome the above mentioned limitation of parent-child

3. Finally, Non_Empty_Behavior is defined completely wrong, since, obviously, [Budget Variance] does not behave equivalent to Measures.Amount with respect to NULLs. Luckily, since it was enclosed in { }'s, AS will ignore it (at least until SP2 is out) on everything but calculated measures - and this is not a calculated measure. So no harm is done, but since it is defined incorrectly, we will remove it.

After these changes, the modified version will look like following:

Create Member CurrentCube.[Scenario].[Scenario].[Budget Variance]
AS Case
        When IsEmpty
             (
                ( 
                  [Measures].[Amount],
                  [Scenario].[Scenario].[Budget] 
                )
             ) 

        Then NULL 

        When [Account].[Accounts].CurrentMember.Properties("Account Type") = "Expenditures"
             Or
             [Account].[Accounts].CurrentMember.Properties("Account Type") = "Liabilities"

        Then ( [Measures].[Amount],[Scenario].[Scenario].[Budget] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Actual] )

        Else ( [Measures].[Amount],[Scenario].[Scenario].[Actual] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Budget] )
    End
,Format_String = "Currency";

The rule here is that Expenditures and Liabilities account types are considered expenses. The expression here uses Case operator instead of more traditional two nested IIFs, which probably makes it look a little nicer. Case operator has always been part of OLEDB for OLAP specification since the very first version, but it was first implemented only in Analysis Services 2005 version.

Now, let's measure the performance of this expression. Since it will be working very fast on a single cell, we will force computation of it on a lot of cells. But we also don't want to make the resultset huge, because then there will be a lot of time spent serializing it, sending back to the client and receiving on the client. So we will make query to return a single cell, but this cell will add up lots of others. The query to do this is following:

WITH MEMBER Measures.PerfTest AS 
  SUM(
      [Date].[Date].[Date]
    * [Department].[Department].[Department]
    * [Organization].[Organization].[Organization]
    * [Destination Currency].[Destination Currency].[Destination Currency]
    , Measures.Amount)
SELECT 
  [Scenario].[Scenario].[Budget Variance] ON 0
 ,[Account].[Accounts].[Conferences] ON 1
FROM [Adventure Works]
WHERE Measures.PerfTest

There are 1158 dates, 14 organizations and destination currencies and 7 departments - so in the crossjoin we are summing up 1,588,776 cells. On my laptop, Profiler reveals that this query takes 27.172 seconds, which corresponds to the rate of about 58471 cells per second. I think everybody would agree that for such a simple calculation this is somewhat slow. Now, just for the interest, let's rewrite the calculation to use IIF function:

Create Member CurrentCube.[Scenario].[Scenario].[Budget Variance Iif] AS
 IIF(IsEmpty(
               ( 
                  [Measures].[Amount],
                  [Scenario].[Scenario].[Budget] 
                )
               )
     , NULL
     , IIF([Account].[Accounts].CurrentMember.Properties("Account Type") = "Expenditures" 
           Or 
           [Account].[Accounts].CurrentMember.Properties("Account Type") = "Liabilities"
         ,( [Measures].[Amount],[Scenario].[Scenario].[Budget] ) 
           -
          ( [Measures].[Amount],[Scenario].[Scenario].[Actual] )

         ,( [Measures].[Amount],[Scenario].[Scenario].[Actual] ) 
           -
          ( [Measures].[Amount],[Scenario].[Scenario].[Budget] )
        )
      )
, FORMAT_STRING = "Currency";

Executing the same query, we observe the running time of 1.484 seconds. This is about 20 times faster then with Case operator, so obviously query optimizer uses very different execution plan. In order to understand what's going on we need to talk a little about how calculation engine works and about different evaluation modes that it can use to build execution plans.

Evaluation Modes

After FE analyzes the query and the calculations in the cube which can apply to it, it ends up with data structure which can be described as list of cube subspaces and MDX expressions trees which should be computed for each subspace. These subspaces are simply subcubes with single granularity - i.e. exactly same subcubes which SE uses to read data from partitions - i.e. the same ones reported in the "Query Subcube" event in Profiler (see more details about it here). Now, FE needs to apply the expression to the entire subcube. There are different algorithms used for that - and they are called evaluation modes. While there are several evaluation modes inside AS, the most important taxonomy divides them into two groups:

  • Cell by cell evaluation mode (aka naive mode)
  • Bulk evaluation mode

The cell by cell evaluation mode is straightforward one - the code iterates over all the cells in the subcube, updates current coordinate in the context and applies MDX expression to the current cell from scratch, i.e. as if the expression never saw all the preceding cells. This means, that the expression needs to prepare its data structures needed for computations etc for every single cell in the subcube. Obviously, this is the least efficient calculation mode, but it is also the simplest one from the code point of view, and it serves as a fallback when no other evaluation mode can be applied.

Bulk evaluation modes are trying to be smarter. They are carefully inspecting both the subcube and the expression tree and try to exploit the fact that the same expression must be evaluated over all cells in the subcube. For example, it can detect that the expression really depends only on some attributes, and these attributes change slowly, or even remain constant over the given subcube. One such example could be member functions such as PrevMember or semiadditive aggregation LastChild. Both of them only depend on attributes in the Time dimension, and if query had Time attributes only in slicer and not in the axes, then PrevMember and/or LastChild can be executed only once for the entire subcube ! There are many other examples, but it is clear that different MDX functions can optimize differently, therefore choice of evaluation mode depends heavily on the MDX functions inside the execution tree.

So, obviously, we will prefer the execution plans which use bulk evaluation modes over cell-by-cell evaluation modes. The best way to detect which one was used is to watch the PerfMon counter MSAS 2005:MDX\Total cells calculated. It pretty much shows number of cells which were computed using cell-by-cell evaluation mode. So if we run query using [Budget Variance] calculated member, we will see that the value of this counter is incremented by 1,589,209. On the other hand if we run query using [Budget Variance Iif], the counter is incremented by 1. Let's explain also where this 1 is coming from. Basically, this is for the single cell defined by the query before SUM unrolls the crossjoin set. Query optimizer uses cost vs. benefit argument which say that if we compute small amount of cells (like single cell), it shouldn't bother going with bulk evaluation mode, because there are some upfront costs for building bulk evaluator, and it isn't worth doing - computing this cell in cell-by-cell mode will be faster.

So now we can explain why IIF was so much faster then CASE. Simply because IIF has code written for it which supports bulk evaluation mode, and CASE doesn't. That was simply a matter of priorities. Writing good bulk evaluators is not simple, so we had to prioritize which ones to work on first. IIF is a hugely popular MDX function (in this study, that I've done, it came as the second most popular MDX function, behind only CurrentMember). Also, IIF is simpler then CASE, because it only deals with two branches, while CASE can have arbitrary number of branches. So hopefully, as CASE gains more popularity, it will get its own bulk evaluator implemented in the next version of AS. (Don't black-cross CASE completely, because while currently it is less efficient then IIF, it doesn't matter much if query is touching only a small number of cells, and CASE has a benefit of MDX code looking more readable).

Well, we are still not done by any means. 1.5 seconds sounds OK for 1.5 million of cells, but we still can do much better then that - let's see how !

Avoid run-time checks

Readers of my blog know that I always try to write my calculations to avoid IIF and try to use SCOPE instead (for example see discussions here and here). There is a very good reason for it - IIF implies doing condition checks done during run-time, i.e. dynamic checks, as opposed to SCOPE, which is resolved only once, when MDX script is evaluated, i.e. static checks. And there is also another, even more important reason for that, we will discuss it later in this article. So let's try to rewrite our expression without IIFs. The part about checking the Account Type for Expenses vs. Non Expenses is fairly straightforward:

Create [Scenario].[Scenario].[Budget Variance Ex];
SCOPE ([Measures].[Amount], [Scenario].[Scenario].[Budget Variance Ex]);
  this = [Scenario].[Scenario].[Actual] - [Scenario].[Scenario].[Budget];
  {[Account].[Account Type].[Expenditures], [Account].[Account Type].[Liabilities]} = [Scenario].[Scenario].[Budget] - [Scenario].[Scenario].[Actual];
  Format_String(this) = "Currency";
END SCOPE;

Let's note couple of things here. First we create the calculated member [Budget Variance Ex] without expression, which means it is going to evaluate to NULL everywhere. Then we SCOPE on it, and on Amount measure. Scoping on Amount allows us to drop it from the tuples in the Actual-Budget expressions, which simplifies the formula. It also enhances the meaning. The original [Budget Variance] redirected to hardcoded Measures.Amount everywhere, independent of the current measure. This means that if the current measure was [Internet Sales Amount], [Budget Variance] would work with Amount. The script above, on the other hand, makes [Budget Variance Ex] to return NULL if the current measure is not Amount - which makes it more clear to the user (although since this calculated member is only included in Finance perspective, it makes the point above a little moot). Anyway, the logic first set the value to always be Actual-Budget, and on the next line overwrites this to be Budget-Actual.

Unfortunately, we need to be reminded again, that attribute overwrite rules do not work in parent-child dimensions for decoding regular attributes from key or parent attributes. Had Account not been parent-child, then the solution above would've worked fine, but since it is parent-child, it won't work correctly. Any attempt to fix it by trying to do something like

Filter(Account.Account.Account.MEMBERS, Account.Account.CurrentMember.Properties("Account Type") = "Expenditures") =
  [Scenario].[Scenario].[Budget] - [Scenario].[Scenario].[Actual];

Won't work either - because the filter set on key or parent attribute of parent-child cannot be used in the SCOPE (it causes "arbitrary shape" error). So we are left out of options, and we have to use IIF to check for the Account Type. But there is another IIF - the one which checks whether or not Budget is empty. First it seems that it will be impossible to get rid of that IIF either - after all it is data dependent check - how can it be replaced by SCOPE ? It's true that it cannot be replaced by SCOPE, but there is a little trick that can be played here. We notice that the IIF is in the following form: IIF(IsEmpty(Budget), NULL, expr) - i.e. when Budget is NULL, we want the expression to evaluate to NULL. This can be achieved by multiplying the Budget with expression, i.e. to write instead the above IIF simply Budget*expr. Well, this has a little problem, of course, that our result is now incorrect, because it is multiplied by Budget. So to make it correct, we will divide back by Budget, i.e. Budget*expr/Budget. And since in MDX NULL/NULL is still NULL, this should work correctly.

So the final script will look the following

Create [Scenario].[Scenario].[Budget Variance Trick];
SCOPE ([Measures].[Amount], [Scenario].[Scenario].[Budget Variance Trick]);
  this = [Scenario].[Scenario].[Budget]*
     IIF(
             [Account].[Account].CurrentMember.Properties("Account Type") = "Expenditures"
               Or
             [Account].[Account].CurrentMember.Properties("Account Type") = "Liabilities"
         ,[Scenario].[Scenario].[Budget] - [Scenario].[Scenario].[Actual]
         ,[Scenario].[Scenario].[Actual] - [Scenario].[Scenario].[Budget])/[Scenario].[Scenario].[Budget];
  Format_String(this) = "Currency";
END SCOPE;

It sounds like we are doing more work then before - after all we have this extra floating point multiplication and division operations, which are somewhat expensive. So let's execute our query using [Budget Variance Trick] calculated member and measure the performance. Checking the profiler, it finishes in... WOW !!! It finishes in whooping 31 milliseconds. That is 0.031 seconds !!! In fact, I suspect that it actually takes even less then that, but we are now running on the edge of Profiler Trace sensitivity - it cannot accurately detect duration down to milliseconds. But even if it is exactly 31 milliseconds - it is a great result. It is so great, that it raises the question - how was it possible to compute 1.5 million cells so fast ? The explanation is in the next chapter.

Bulk evaluation mode and NON_EMPTY_BEHAVIOR

So, how was it possible to compute 1.5 million cells under 31 milliseconds ? The answer is - the engine didn't actually compute 1.5 million cells - it computed much less then that ! I have written in the past about execution plans built for multiplication operator in this article. Basically, the bulk evaluator of multiplication operator realizes, that if the first operand of the multiplication operator is NULL, then the result of multiplication is going to be NULL as well. Therefore it only asks for the cells where first operand is going to be non-NULL. First operand is Budget, and we already mentioned before, that Budgets are only entered for the first day of month, therefore we know that at least 97% of the data we are iterating over is going to be empty, and in the worst case we will iterate only 1/30 of the data, but probably even less then that. 1.5 seconds/30 gives us 50 milliseconds right there - that's the trick.

Now, after we recovered from all the joy, let's look again at the solution. It works fast, but there is something inelegant about it. We had to do this redundant multiplication and division just to trick the query optimizer. Not cool. And what about cases when Budget happens to be not NULL, but 0 ? Well, then we are going to get division by zero which will result in these ugly #1.INF values. Of course, it is possible to protect against this as well - but this is going too far. There must be a better way to handle this. And, indeed, there is a better way. It is possible to give query optimizer a hint, to describe exactly which cells aren't worth to even look at, because the result of expression is going to be NULL anyway. This hint is called NON_EMPTY_BEHAVIOR calculation property. Now, people who worked enough with Analysis Services 2000 heard about NON_EMPTY_BEHAVIOR. It was a property of calculated measures only, and it could be only a single physical measure, and it was used only as optimization hint for NON EMPTY clause. Looks like we have nothing of that here - no calculated measure, no real measure to point to, and, above all, there is no NON EMPTY anywhere in our example. Welcome to Analysis Services 2005, which completely redefines and extends to great lengths what NON_EMPTY_BEHAVIOR is. It is a calculation property, (much like FORMAT_STRING or BACK_COLOR) which can be set on any subset of cells in the cube, and the value is another MDX expression. I.e. the most generic syntax for it is:

NON_EMPTY_BEHAVIOR(<subcube>) = <expression>;

The meaning is, that for all cells covered by the given <subcube>, the values at these cells are guaranteed to be empty if <expression> evaluates to NULL. Note that there is no "if and only if" requirement - the cells are allowed to be NULLs even if the <expression> is not NULL, but if the <expression> is NULL, then the cell will have to be NULL as well. This is very powerful hint, and it is very important to use it correctly. Because query optimizer trusts the author of MDX script, that expressions for NON_EMPTY_BEHAVIOR are defined correctly - if they are not correct, then the cell values could get unpredictable results, depending on the execution plan chosen by the query optimizer.

Now, let's see how can we apply this to our case. We can prove mathematically, that the following is true. If we have MDX expression of the form

<subcube> = IIF( IsEmpty(<empty-expression>), NULL, <else-expression> );

then it is always correct to define

NON_EMPTY_BEHAVOIR(<subcube>) = <empty-expression>;

And in order to guarantee that FE will pick it, it's better write it this way:

SCOPE (<subcube>);
 this = IIF( IsEmpty(<empty-expression>), NULL, <else-expression> );
 NON_EMPTY_BEHAVOIR(this) = <empty-expression>;
END SCOPE;

Definitely, the pattern of IIF(IsEmpty(...), NULL, ...) is very common. And, of course, what is true for IIF function is also true for similar CASE operator. So, let's go back to our original [Budget Variance] formula, and add the appropriate NON_EMPTY_BEHAVIOR to it. The original one will become

Create Member CurrentCube.[Scenario].[Scenario].[Budget Variance]
AS Case
        When IsEmpty
             (
                ( 
                  [Measures].[Amount],
                  [Scenario].[Scenario].[Budget] 
                )
             ) 

        Then NULL 

        When [Account].[Accounts].CurrentMember.Properties("Account Type") = "Expenditures"
             Or
             [Account].[Accounts].CurrentMember.Properties("Account Type") = "Liabilities"

        Then ( [Measures].[Amount],[Scenario].[Scenario].[Budget] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Actual] )

        Else ( [Measures].[Amount],[Scenario].[Scenario].[Actual] ) 
             -
             ( [Measures].[Amount],[Scenario].[Scenario].[Budget] )
    End
,Format_String = "Currency"
,Non_Empty_Behavior = ([Measures].[Amount],[Scenario].[Scenario].[Budget]);

Or the way I actually prefer to write it

Create [Scenario].[Scenario].[Budget Variance];
SCOPE ([Measures].[Amount], [Scenario].[Scenario].[Budget Variance]);
  this =
    Case
        When IsEmpty([Scenario].[Scenario].[Budget])
        Then NULL 

        When [Account].[Accounts].CurrentMember.Properties("Account Type") = "Expenditures"
             Or
             [Account].[Accounts].CurrentMember.Properties("Account Type") = "Liabilities"

        Then [Scenario].[Scenario].[Budget] - [Scenario].[Scenario].[Actual]
        Else [Scenario].[Scenario].[Actual] - [Scenario].[Scenario].[Budget]
    End;
  Format_String(this) = "Currency";
  Non_Empty_Behavior(this) = [Scenario].[Scenario].[Budget];
END SCOPE;

Now, if we will execute our test query using either one of these definitions, the Profiler shows running time of 47 milliseconds (0.047 second). We will consider this a good result and won't optimize further.

In conclusion, in this article we used as example a simple problem of computing Budget Variance, but on the way we discussed different evaluation modes that query optimizer can choose from, and importance of directing it to the best one. The most important tuning feature is NON_EMPTY_BEHAVIOR hint, but it should be used very carefully in order to preserve correctness of results.

Published Sunday, November 05, 2006 5:05 PM by mosha
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement