THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

MDX: NonEmpty, Exists and evil NonEmptyCrossJoin

When NonEmptyCrossJoin was introduced in AS2000 it made a lot of people happy. Indeed, this function seemed to be almost magic. It worked wicked fast and seemed to solve all the performance problems with. While the original intention was to be used in MDX SELECT queries generated by (now discontinued) English Query, the function was finding its way as an optimization technique into many scenarios. Here are two typical optimization tips for AS2000

1. Calculation of last non empty date. Instead of

Tail(FILTER(Date.MEMBERS, NOT IsEmpty(Date.CurrentMember)), 1).Item(0).Item(0)

in AS2000 we recommended to use

Tail(NonEmptyCrossJoin(Date.MEMBERS), 1).Item(0).Item(0)

2. Custom aggregation. Instead of

SUM(CrossJoin(Descendants(Product.CurrentMember, Product.[Product Name]), Descendants(Customer.CurrentMember, Customer.[Customer Name]))))

in AS2000 we recommended to use

SUM(NonEmptyCrossJoin(Descendants(Product.CurrentMember, Product.[Product Name]), Descendants(Customer.CurrentMember, Customer.[Customer Name]))

and so on. The observed performance improvements were very significant, that some respected MDX experts wrote articles about it -

and even though NonEmptyCrossJoin is not in standard MDX, some other OLAP Servers implemented it as well (probably feeling that this function is very important) -

This function is often cited in the MDX code snippets on the online forums, newsgroups and blog postings, it is used inside Microsoft own whitepapers and in the books. As an inventor of this function, I should feel proud. But there is a catch. The catch is what does this function sacrifices in order to achieve such a spectacular performance. The way this function is implemented, is it does direct scan of the fact table. This means, that the resulting tuples are included in the final set if and only if there are corresponding records in the fact table. So all the calculations are completely ignored - whether these are calculated members, cell calculations, custom rollups, custom member formulas, unary operators, semiadditive measures etc. So while it is commonly believed that

NonEmptyCrossJoin(set) is equivalent to Filter(set AS S, NOT IsEmpty(S.Current))

as a matter of fact

NonEmptyCrossJoin(set) is really equivalent to Filter(Distinct(StripCalculatedMembers(set)) AS A, NOT IsEmpty(CalculationPassValue(S.Current, 0, ABSOLUTE)))

(And I am not even mentioning a fact that the function has such a confusing name, because, obviously NonEmptyCrossJoin(set) is not equivalent to NON EMPTY CrossJoin(set) even ignoring the calculations issue, and that the way how NonEmptyCrossJoin works in virtual cubes treats current coordinate in measures dimension differently from the rest of MDX functions etc).

So we had to deal with a lot of cases where people would get incorrect results with NonEmptyCrossJoin, but they wouldn't want to use correct MDX, because it would have worse performance. So in AS2005 we went through NECJ usage patterns, and tried to fix them one by one.

1. Use inside calculations as optimization through set reduction

The second example at the beginning of this article is representative for this category. We have put logic inside the MDX query optimizer to recognize the cases when the empty cells can be eliminated from the set automatically. So instead of using AS2000 style of formula

SUM(NonEmptyCrossJoin(Descendants(Product.CurrentMember, Product.[Product Name]), Descendants(Customer.CurrentMember, Customer.[Customer Name]))

in AS2005 it should be written as

SUM(Descendants(Product.CurrentMember, Product.[Product Name]) * Descendants(Customer.CurrentMember, Customer.[Customer Name])))

(It is somewhat ironic, but the former method, i.e. with NECJ, will in fact work much slower in AS2005 if there is more then single cell is requested by the query. The reason is that instead of calling NonEmptyCrossJoin on each and every cell, AS2005 performs so called bulk evaluation over the entire space only once. The bulk evaluation can be done by query optimizer only on certain forms of MDX expressions, and use of NECJ ruins it. Bulk evaluation has multiple other benefits, for example, it realizes that the order of cells for Sum is not important, so it uses Unorder() internally etc. I believe Irina has some explanations about bulk evaluation in chapter 13 of her book).

Of course, there are even more efficient ways in AS2005 to perform such summation (I have written about them in this blog post), but the example is still good.

2.  For computing sets, either on the query axes or inside calculations

The first example at the beginning of the article is representative of this category. If we need to get the set of non empty tuples for the logic of calculation (i.e. to get Last or First Non Empty child). This is the classic scenario where we need to use

FILTER(set AS S, NOT IsEmpty(S.Current))

as we already know, this is not what we would get from NonEmptyCrossJoin - but AS2005 introduced function NonEmpty, which can do exactly the above if used as NonEmpty(set). The NonEmpty function works with the cube, not with fact table, therefore it takes into account all calculations defined in the cube. But NonEmpty doesn't sacrifices performance. It can automatically detect when the space covered by the sets doesn't have any calculations defined on it, and then it goes directly to fact table just like NonEmptyCrossJoin would. Even if there are some coordinates with calculations on them, like calculated members, the NonEmpty function is capable of partitioning the space into subspace covered with calculations and subspace not covered with calculations, and then running different algorithms on each subspace and merging the results together. In short, the NonEmpty function delivers the performance of NonEmptyCrossJoin in the cases when NonEmptyCrossJoin would produce the same results, and it always returns correct results. In some cases, for example for FILTER(set, NOT IsEmpty(Measures.SomeMeasure) OR NOT IsEmpty(Measures.AnotherMeasure)) the query optimizer will automatically convert the expression into NonEmpty(set, {Measures.SomeMeasure, Measures.AnotherMeasure}) - because such use of Filter function is very characteristic of OWC queries. This is one of the reasons why OWC works faster in AS2005 then it did in AS2000.

Another scenario is when the client application would like to get the same results as with NON EMPTY clause. And there are few reasons why NON EMPTY cannot be always used directly. One such reason is when the client application wants to implement windowing on the result of the query, i.e. only show the cells which fit the visible part of the display, and if the user wants to do PageDown or move right - then bring the cells from the shifted window. The accepted technique for implementing it is through use of Cell Range properties - BeginRange and EndRange - the standard XMLA command properties. But in order to compute the range, the cardinality of axes must be known (for more details see OLEDB for OLAP or XMLA specifications). Of course, with it is usually not possible to know the cardinality of axes before executing them. Therefore, one could use CREATE SET statements to create the sets to be later used as axes, but this rules out use of NON EMPTY clause. So here we can use the fact that NonEmpty has two parameters - the set to eliminate empty tuple from and the filterset. So we can use

CREATE SET NonEmptyAxis0 AS NonEmpty(Axis0, Axis1*WhereSet)
CREATE SET NonEmptyAxis1 AS NonEmpty(Axis1, Axis0*WhereSet)

The sets obtained in NonEmptyAxis0 and NonEmptyAxis1 would be exactly the same as in

SELECT NON EMPTY Axis0 ON 0, NON EMPTY Axis1 ON 1 FROM cube WHERE WhereSet
Note, that it is tempting to use for the second CREATE SET

CREATE SET NonEmptyAxis1 AS NonEmpty(Axis1, NonEmptyAxis0)

It seems that since we already reduced Axis0, then using the smaller set as a filter would improve performance. In practice it usually hurts performance significantly. (The reasons lie in the heuristics that NonEmpty uses to get the space coverage and the way bulk evaluation works - both subjects outside of the scope of this article.)

3. For directly accessing fact table

Finally there are some scenarios where it is desired to bypass all the calculations and work with measure group data in fact table rather then with cube data. Those are not typical OLAP scenarios, where cube is considered as a whole unit, and measure groups are usually hidden from the user. However, in more reporting style scenarios, user is exposed to multiple measure groups as part of data model. For example, Report Builder uses such paradigm. In Report Builder the cube looks like relational schema with dimensions and measure groups showing as tables. In order to access the fact tables of measure groups efficiently, the task that NonEmptyCrossJoin was originally built for, AS2005 offers variation over Exists function. Normally Exists is used to access dimension tables (although it also supports arbitrary complex joins over dimension tables). But if Exists is used with third parameter, which is measure group name, it goes against fact table instead. To that degree, it could be formally defined that

NonEmptyCrossJoin(set1, set2, ..., setN, K) is equivalent to Exists(set1*...*setK, set(K+1)*...*setN, "measuregroupname")

where "measuregroupname" is the name of the measure group to which the current measure belongs. Unlike Exists to dimension tables, Exists to measure group doesn't support joins - these would have to be done manually, but given the fact that measure group tables are usually much bigger then dimension tables, anyway this is probably something that user doesn't want to do.

The subject of NonEmpty and Exists algorithms is very important, but also not easy to explain in a short blog entry. There are several techniques how to optimize both, especially in the presence of non-trivial calculations. Perhaps we would cover it in more detail one day.

See Also

Dave Wickert wrote two good blog entries explaining similar subjects.

Published Monday, October 09, 2006 11:25 PM by mosha
Filed under:
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement