THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
 in Microsoft OLAP by Mosha Pasumansky (Entire Site) Search

Performance of aggregating data from lower levels in MDX

In the "MDX Solutions" book, at pages 69-70, George Spofford discusses the task of computing expression [Units Sold] * [Last Update of Price] at the leaf-level cells and then aggregating the result up to the higher levels. George gives the following formula to do the aggregation in MDX:

```WITH MEMBER [Measures].[Dollars Sold] AS
'Sum (Descendants ([Time].CurrentMember, [Time].[Month]),
Sum (Descendants ([Geography].CurrentMember, [Geography].[Store]),
Sum (Descendants ([Products].CurrentMember, [Products].[SKU]),
([Measures].[Units Sold] * [Measures].[Last Update of Price])
)))'
. . .```

This formula works, but it doesn't give the best performance. What happens in this formula is that inside one Sum function another Sum function is called as an expression etc. We end up with stack of nested Sum calls. It is better to rewrite it in a way that expresses the intention clearer:

```WITH MEMBER [Measures].[Dollars Sold] AS
'Sum (
Descendants ([Time].CurrentMember, [Time].[Month]) *
Descendants ([Geography].CurrentMember, [Geography].[Store]) *
Descendants ([Products].CurrentMember, [Products].[SKU]),
([Measures].[Units Sold] * [Measures].[Last Update of Price])
)'
. . .```

I.e. we sum up directly from the right level instead descending to it step by step in every dimension. Performance will be better, because while the same number of cells will be iterated, Sum function will be called only once with simpler expression, rather then being called multiple times with complex expressions. The difference perhaps not so significant in Analysis Services 2000, but query optimizer in Analysis Services 2005 is capable of building much better query plans for the later expression.

Alternative way of achieving the same result is to use cell calculations:

```WITH CELL CALCULATION AggDollarsSold FOR
'({[Measures].[Dollars Sold]},
Descendants ([Time].CurrentMember, [Time].[Month]),
Descendants ([Geography].CurrentMember, [Geography].[Store]),
Descendants ([Products].CurrentMember, [Products].[SKU])
)'
AS ' ([Measures].[Units Sold] * [Measures].[Last Update of Price]) '
. . .```

In this variant, we simply declare the required calculation (multiplication) at required level, and let engine aggregate from there. Again, there is not too much difference in performance between this and previous method in AS2K, but Analysis Services 2005's query optimizer can actually do pretty smart things in such scenario, because the user's intention is expressed very clearly. Actually, even syntax is much clearer in AS2005:

```({[Measures].[Dollars Sold]},
Descendants ([Time].CurrentMember, [Time].[Month]),
Descendants ([Geography].CurrentMember, [Geography].[Store]),
Descendants ([Products].CurrentMember, [Products].[SKU]))
= [Measures].[Units Sold] * [Measures].[Last Update of Price];```

Or, if the calculation really needs to be defined on the leaf-level of the cube, then simply

`({[Measures].[Dollars Sold]}, Leaves()) = [Measures].[Units Sold] * [Measures].[Last Update of Price];`

Don't you agree that this looks pretty elegant - compute the multiplication at the leaves for Dollars Sold measure and then automatically aggregate it up ! Even in AS2K the expression can be made to be more independent of metadata. Dimensions still need to be listed by names, but instead of specifying the last level in each dimension for the Descendants function, it is possible to use LEAVES flag, i.e.

```  Descendants ([Time].CurrentMember, , LEAVES),
Descendants ([Geography].CurrentMember, , LEAVES),
Descendants ([Products].CurrentMember, , LEAVES)```

In "MDX Solutions" book, [Last Update of Price] is a calculation by itself. But how would have we implemented this scenario, if [Last Update of Price] was a regular, non-calculated measure. Well, obviously the best solution would be not to do it in MDX at all, and instead load during processing by defining Dollars Sales as a measure derived from SQL row expression - [Units Sold] * [Last Update of Price]. However this only possible if both [Units Sold] and [Last Update of Price] come from the same fact table and are at the same granularity. In Analysis Services 2005 actually there is a way to define such expressions to be done during processing even if measures are from different fact tables and at different granularities (this is done similarly to how many-to-many dimensions are done) - but this is very interesting subject by itself, and deserves dedicated article. So let's assume that for some reason we cannot perform this aggregation during processing time. Perhaps it needs to be done at granularity above the leaf-level. Or perhaps [Units Sold] and [Last Update of Price] are not measures, but instead members of Account dimension etc. Then, if there are no other calculations affecting those measures (such as custom member formulas or cell calculations), we can use the infamous NonEmptyCrossJoin function to significantly reduce the space to iterate over:

```WITH MEMBER [Measures].[Dollars Sold] AS
'Sum (
NonEmptyCrossJoin(
Descendants ([Time].CurrentMember, , LEAVES),
Descendants ([Geography].CurrentMember, , LEAVES),
Descendants ([Products].CurrentMember, , LEAVES)
),
[Measures].[Units Sold] * [Measures].[Last Update of Price]
)'
. . .```

In Analsyis Services 2000, this last solution is probably the only practical one on non-trivial cubes. It is still not an ideal one, even if NonEmptyCrossJoin can be safely used. The reason is that since it uses Sum function directly with custom expression - the cache subsystem cannot assume that one level can be derived from another one, and will have to compute each cell separately. I.e. if user first queried all the cities in Washington, and later queried state Washington - AS2K will reaggregate all the values for the Washington from leaves instead of simply summing up the cities. Use of cell calculations would give a hint to cache subsystem that intermediate aggregates can be reused, but there is no place to inject NonEmptyCrossJoin call in there.

In Analysis Services 2005, on the other hand, the ideal solution is also the most intuitive and elegant one, that we already saw before:

`({[Measures].[Dollars Sold]}, Leaves()) = [Measures].[Units Sold] * [Measures].[Last Update of Price];`

Here we get best of both worlds - intermediate results can be reused just like if the data was coming from the real fact table, i.e. cache will be fully aggregatable and filterable. And query optimizer will automatically detect whether or not it is safe to automatically apply NonEmptyCrossJoin algorithm to reduce the number of cells to iterate over, and even if it is not safe to do over entire space or entire expression, it will apply it to the part of the space and part of the expression where it is safe. I.e. if we will go back to [Last Update of Price] being calculated measure, still [Units Sold] is a real measure, so query optimizer will do NonEmptyCrossJoin on it, and then perform inner join of the result with the calculated measure (since the multiplication of two numbers is NULL if either one of them NULL).

Published Sunday, February 13, 2005 7:34 PM by mosha
Filed under: