THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Performance of IIF function in MDX

IIF is one of the most popular MDX functions, (this was proved by these surveys). Yet, it can cause significant performance degradation, which is often blamed on other parts of the system. Many times it is simple to rewrite the MDX expression to get rid of IIF altogether, and other times it is possible to slightly change the IIF to increase performance. While I and others mentioned this fact several times in blog entries, forum postings etc, people keep using and using it (and using it wrong). What is worse, after seeing CASE operator in samples and BOL, people start using CASE instead of IIF. From the performance point of view, CASE is always worse than IIF, at least IIF is possible to optimize in some case, but never CASE. So I decided to dedicate special blog to go over IIF usage and demonstrate with examples how horrible it can be and how to fix it.

To make it easier to demonstrate this on standard Adventure Works cube, the examples are somewhat artificial, but it is not hard to imagine how real-life example would fit into the same pattern.

Let's assume that we want to compute 'Normalized Cost' measure, which is the same as Standard Product Cost measure, except for when there is no promotion, i.e. when Promotion Type is No Discount, it should be Freight Cost plus Standard Product Cost. Using the IIF function, the solution would look like following:

CREATE [Normalized Cost] = 
  Iif( [Promotion].[Promotion Type].CurrentMember IS [Promotion].[Promotion Type].&[No Discount]
      ,[Measures].[Internet Standard Product Cost]+[Measures].[Internet Freight Cost]
      ,[Measures].[Internet Standard Product Cost]
  );

Let's test performance of this approach. To make the performance measurements non-trivial, we will also introduce into the query YTD-like calculation. The query that we are measuring will look the following then:

WITH 
MEMBER Measures.[Normalized Cost] AS Iif(
  [Promotion].[Promotion Type].CurrentMember IS [Promotion].[Promotion Type].&[No Discount]
  ,[Measures].[Internet Standard Product Cost]+[Measures].[Internet Freight Cost]
  ,[Measures].[Internet Standard Product Cost])
MEMBER [Ship Date].[Date].RSum AS Sum([Ship Date].[Date].[Date].MEMBERS), SOLVE_ORDER=10
SELECT 
 [Promotion].[Promotion Type].[Promotion Type].MEMBERS on 0
 ,[Product].[Subcategory].[Subcategory].MEMBERS*[Customer].[State-Province].[State-Province].MEMBERS ON 1
FROM [Adventure Works]
WHERE ([Ship Date].[Date].RSum, Measures.[Normalized Cost])

On my laptop, this query executed for 36 seconds. (Note, that we have included Promotion Type attribute on axis, in order for both formulas to participate in the query, and make it more difficult for query optimizer, although it actually doesn't matter here. The SOLVE_ORDER=10 is needed in order to prevent query optimizer to perform another optimization related to running sum which will mask inefficiencies of IIF. After all, we want to see the case where optimizer's tricks don't work, and IIF performs badly).

Now, we will use the fact, that if condition of IIF only looks at member coordinates and not at the cell values, it almost always can be rewritten using one or more SCOPE statements. In this simple, but common, case, single assignment is sufficient:

CREATE [Normalized Cost] = [Measures].[Internet Standard Product Cost];
([Promotion].[Promotion Type].&[No Discount], Measures.[Normalized Cost] =
  [Measures].[Internet Standard Product Cost]+[Measures].[Internet Freight Cost];

This is the proper way to put it inside MDX Script, but since the previous query was written using query calculated members, and there is no syntax for SCOPE or assignment statements inside query, we can for testing purposes translate SCOPE into WITH CELL CALCULATION clause. Our query will look then the following:

WITH 
MEMBER Measures.[Normalized Cost] AS [Measures].[Internet Standard Product Cost]
CELL CALCULATION ScopeEmulator 
  FOR '([Promotion].[Promotion Type].&[No Discount],measures.[Normalized Cost])' 
  AS [Measures].[Internet Freight Cost]+[Measures].[Internet Standard Product Cost]
MEMBER [Ship Date].[Date].RSum AS Sum([Ship Date].[Date].[Date].MEMBERS), SOLVE_ORDER=10
SELECT 
 [Promotion].[Promotion Type].[Promotion Type].MEMBERS on 0
 ,[Product].[Subcategory].[Subcategory].MEMBERS*[Customer].[State-Province].[State-Province].MEMBERS ON 1
FROM [Adventure Works]
WHERE ([Ship Date].[Date].RSum, Measures.[Normalized Cost])

This query is equivalent to the first one, but executes in 0.36 seconds - exactly 100 times faster ! The secret of why the second query is much faster is simple. Since SCOPE statement defines static subcubes on which formulas apply, the query optimizer is able to divide the space described by the query into two parts - one part where one formula applies, and another part, sliced by No Discount, where the second formula applies. Then query optimizer is able to apply bulk evaluation to both subspaces, and the results are computed lighting fast. (for more information about bulk evaluation, read this article). With IIF, no such division is possible, since query optimizer is not smart enough to understand that the condition inside IIF is static one. Therefore, query optimizer is afraid to choose the bulk evaluation plan, because it is afraid to compute too many unneeded cells. As we know, computing results cell by cell is much slower than doing computing all of them in bulk. Now, there are posts floating around newsgroups and forums about using magic "Cache Policy=9" in the connection string. Indeed, if it is used the first query completes under a second. But this is very dangerous path on step on. Cache Policy=9 forces query optimizer to choose bulk evaluation plan always and unconditionally. While in many cases (including this one), it is the best query plan, there are lots of cases when doing bulk evaluation is slower and less scalable. It is best to leave such decisions to the optimizer which one of dozens built-in execution plans should be applied to which subspace and which subexpression. Also, Cache Policy=9 is documented in BOL as reserved for future use, which means it is not officially supported, even though information about it leaked into the forums. It is only supported if it is recommended by the product support professional. It is our hope, that in the next version, query optimizer will become even more intelligent, and will be able to pick better execution plans automatically.

Now, as we mentioned above, IIF can only be replaced with SCOPE, if the condition considers only attribute coordinates, and not the cell values. So what if the condition needs to look at cell values ? While there is no general technique to battle with it, there are still some simple tips which can be applied. Let's consider the classic IIF of check for division by zero. I have written on this subject before, but it was focused more on correctness aspects rather than on performance aspects.

Let's say we want to compute ratio of sales to cost.

CREATE Ratio = [Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount];

But we are concerned that when Internet Sales Amount is 0, we might get division by zero. In reality, in Adventure Works, Internet Sales Amount is never 0 inside fact table. It can be NULL, but since Internet Standard Product Cost is a measure in the same measure group, and neither one of them is nullable, then we know that whenever Internet Sales Amount is NULL, then Internet Standard Product Cost is NULL as well, and NULL/NULL is defined as NULL in MDX. So in our example there is actually no need to check anything in the first place - because there never will be division by zero ! To test performance of simple division, let's run the following MDX query. It doesn't make much sense to sum up ratios, but it is an easy way to force computation of division over huge set.

WITH 
MEMBER Measures.Ratio AS [Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount]
MEMBER Measures.TestPerf AS SUM(
 ([Product].[Subcategory].[Subcategory].MEMBERS
 ,[Customer].[State-Province].[State-Province].MEMBERS
 ,[Customer].[Customer].[Customer].MEMBERS
 ,[Promotion].[Promotion Type].[Promotion Type].MEMBERS
 ,[Ship Date].[Date].[Date].MEMBERS), Measures.Ratio)
SELECT TestPerf ON 0
FROM [Adventure Works]

This query executed in only 0.1 second, and it is clear that it is because the bulk evaluation query plan was applied. Now, let's say that it was possible for Internet Sales Amount to become 0 for whatever reason (for example, 0's could happen in the fact table). Then, we would need to guard against it. But what should we return in case it could be division by zero ? This really depends on the business requirements now. Sometimes, the requirement is to return "NA" string. So the expression gets coded as following:

CREATE Ratio = 
  IIF([Measures].[Internet Sales Amount]=0
  ,"NA"
  ,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount]);

Unfortunately, this is a poor choice both from performance and from functionality point of view. If any MDX expression will ever use Ratio, it runs at risk of performing arithmetic operations between numbers and string "NA", which in best case will trigger error "Invalid data types used in the operation", and in worst case may lead to the wrong results. Performance of such expression can degrade significantly. I won't even dare to run the original query here, but even running over reduced set

WITH 
MEMBER Measures.Ratio AS 
  IIF([Measures].[Internet Sales Amount]=0
  ,"NA"
  ,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount])
MEMBER Measures.TestPerf AS SUM(
  ([Product].[Subcategory].[Subcategory].MEMBERS
  ,[Customer].[State-Province].[State-Province].MEMBERS
  ,[Promotion].[Promotion Type].[Promotion Type].MEMBERS
  ,[Ship Date].[Date].[Date].MEMBERS), Measures.Ratio)
SELECT TestPerf ON 0
FROM [Adventure Works]

it still finishes only after 64 seconds (and the result is not good - it is empty, because of mixed operations between strings and numbers). So since some of the performance can be explained by operations on the strings, how about using 0 instead of "NA" ?

WITH 
MEMBER Measures.Ratio AS 
  IIF([Measures].[Internet Sales Amount]=0
  ,0
  ,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount])
MEMBER Measures.TestPerf AS SUM(
  ([Product].[Subcategory].[Subcategory].MEMBERS
  ,[Customer].[State-Province].[State-Province].MEMBERS
  ,[Promotion].[Promotion Type].[Promotion Type].MEMBERS
  ,[Ship Date].[Date].[Date].MEMBERS), Measures.Ratio)
SELECT TestPerf ON 0
FROM [Adventure Works]

This finishes in 35 seconds, which is a little better, but still not good. And if we will use the original query

WITH 
MEMBER Measures.Ratio AS 
  IIF([Measures].[Internet Sales Amount]=0
  ,0
  ,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount])
MEMBER Measures.TestPerf AS SUM(
  ([Product].[Subcategory].[Subcategory].MEMBERS
  ,[Customer].[State-Province].[State-Province].MEMBERS
  ,[Customer].[Customer].[Customer].MEMBERS
  ,[Promotion].[Promotion Type].[Promotion Type].MEMBERS
  ,[Ship Date].[Date].[Date].MEMBERS), Measures.Ratio)
SELECT TestPerf ON 0
FROM [Adventure Works]

We will get the famous error #Error The expression contains a function that cannot operate on a set with more than 4,294,967,296 tuples.
This error happens when some operation in MDX requires iteration over set which has more than 4GB tuples. Usually this occurs in the scenario, where "bad" calculation needs to be evaluated over huge set. We notice, that the crossjoin inside SUM is indeed huge - it covers more than 4GB tuples. And because our calculated member uses IIF, the bulk evaluation mode is not applied, and instead query processor has to iterate cell by cell over the entire set. There are more than 4GB tuples to iterate, therefore iteration fails. Fortunately, there is a simple solution for this case. Remember the discussion about sparse and dense spaces that we had in this article. Constant expression 0 is dense over any space, but constant expression NULL is sparse over any space. Therefore the simple change to use NULL instead of 0, lets query optimizer to use bulk evaluation mode, since it knows that the "then" branch of IIF is always trivially sparse (always contains no data)

WITH 
MEMBER Measures.Ratio AS 
  IIF([Measures].[Internet Sales Amount]=0
  ,NULL
  ,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount])
MEMBER Measures.TestPerf AS SUM(
  ([Product].[Subcategory].[Subcategory].MEMBERS
  ,[Customer].[State-Province].[State-Province].MEMBERS
  ,[Customer].[Customer].[Customer].MEMBERS
  ,[Promotion].[Promotion Type].[Promotion Type].MEMBERS
  ,[Ship Date].[Date].[Date].MEMBERS), Measures.Ratio)
SELECT TestPerf ON 0
FROM [Adventure Works]

This query now finishes in 0.1 seconds, which is sure indication that bulk evaluation was used. But let's go back to the business requirements. What if business users don't want to see NULL, what if they do want to see "NA" instead. Well, this kind of formatting is best done by FORMAT_STRING logic, which exists exactly for that purpose - format raw cell values into something nice to display to the end user. We can use FORMAT_STRING to convert NULLs ratios into "NA", as in following example:

WITH 
MEMBER Measures.Ratio AS 
  IIF([Measures].[Internet Sales Amount]=0,NULL,[Measures].[Internet Standard Product Cost]/[Measures].[Internet Sales Amount])
  ,FORMAT_STRING='#.#;;;NA'
SELECT 
Measures.Ratio ON 0
,[Promotion].[Promotion Type].[Promotion Type].MEMBERS ON 1
FROM [Adventure Works]

In conclusion, my advice is to try to avoid IIF whenever possible, and if it is not possible to avoid, use it in such a way that bulk evaluation is possible. I also recommend to never use CASE clause for performance critical calculations.

Published Sunday, January 28, 2007 10:20 PM by mosha
Filed under: ,
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement