THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Performance of multiplication in MDX

What can be simpler then the expression A*B. Yet there is enough interesting staff going on around it to warrant an article. By investigating this simple expression we will get an insight into Analysis Services MDX query optimizer, discuss performance implications and even discover that A*B is not always the same as B*A ! Multiplication is obviously a common operation which arises in variety of problems. However, the most interesting case for the purpose of our discussion is when the operands of the multiplication come from different granularities. This often happens while doing financial allocations, but we will take even more common scenario - currency conversion. In the simplest setup of currency conversion, we will have two measure groups:

  1. Sales measure group - which records sales transactions.
    Measures: Sales
    Dimensions:
    • Date
    • Customer
    • Product
    • Seller
    • Currency
    • Promotion
    • ...etc
  2. Currency measure group - which has currency exchange rates. Assuming that rates vary by days and by countries, we will have the following
    Measures: Rate
    Dimensions:
    • Date (granularity attribute Day)
    • Seller (granularity attribute Country)

Let's assume that the currency conversion that we are performing here is many to one type, i.e. we record the currency of the transaction, but want to report in single currency (for example in US Dollars) and the currency exchange rate table have the rates between different currencies and US Dollar. Then the MDX Script fragment which will perform this conversion will look like following:

SCOPE (Leaves([Date]), [Seller].[Country].[Country]);
  [Measures].[Sales] = [Measures].[Sales]*[Measures].[Rate];
END SCOPE;

Let's notice, that granularity of Currency measure group is higher then granularity of Sales measure group, therefore there will be a lot of cells in Sales which aren't covered by Currency (i.e. when we look at Products, Customers or go below Country in Seller dimension). We assume that we can still freely multiply by [Measures].[Rate] at these cells even though rates aren't dimensioned by Product, Customer etc because the IgnoreUnrelatedDimensions property on the Currency measure group is set to true. This causes the exchange rates to be virtually copied to the unrelated cells. If this property wasn't set to true, we would need to write the following.

SCOPE (Leaves([Date]), [Seller].[Country].[Country]);
  [Measures].[Sales] = [Measures].[Sales]*ValidMeasure([Measures].[Rate]);
END SCOPE;

(P.S. It actually could be a good idea to use ValidMeasure in this scenario always, regardless of IgnoreUnrelatedDimensions setting, since this will give performance boost, since ValidMeasure function gives more metadata to the query optimizer, but performance of ValidMeasure is probably subject for another article).

Now, let's consider the following query

SELECT [Measures].[Sales] ON COLUMNS, [Product].[Product].MEMBERS ON ROWS FROM Sales

This query has granularity of [All Dates] and [All Sellers] - therefore the calculation which was done on the leaves of Date and on Seller's Country needs to be aggregated up. The most straightforward approach would be to iterate over every day and every country, perform the multiplication of Sales by Rate and sum up the results. Let's assume that we store data for 3 years - it will be about 1000 days, and we operate in 10 countries. This means that we need to iterate over 10000 cells. For our query - this needs to be done for every single product. Assuming that we have 1000 products - this will mean 10,000,000 iterations. Analysis Services 2000 works this way. However, Analysis Services 2005 query optimizer can perform such calculations much more efficiently. First, it notices, that for A*B will be NULL if either A or B is NULL. And since summing up NULLs doesn't change the sum, the query optimizer knows that it is free to skip the cells which are guaranteed to have NULLs in them. Next, it is going to use the fact, that Analysis Services storage and data caches are optimized for sparse data - i.e. they don't contain records for which there is no data - i.e. don't contain records which would have the NULL measure values (well, this last statement is not always true, but for the purpose of this discussion it is). Therefore we can represent the operands as relations (the way the relation is defined in the relational model):

Relation Sales
Product Day Country Sales
       
       
Relation Rate
Product Day Country Rate
       
       

Here attributes Product, Day and Country form a key in both relations. Therefore, the problem is really reduced to the simple join between the two relations. In SQL it probably could be written as

SELECT Product, Sum(ConvertedSales) 
FROM 
(
  SELECT Product, Day, Country, Sales*Rate AS ConvertedSales 
  FROM Sales, Rate 
  WHERE Sales.Product = Rate.Product AND Sales.Day = Rate.Day AND Sales.Country = Rate.Country
) 
GROUP BY Product

Query engine now needs to implement this join. Analysis Services chooses  hash-join algorithm for this scenario. In hash-join, we need to iterate over one relation and then do hash search in the other relation with the same key. It is clear that the performance will be better if we chose to iterate over smaller relation and do searches into the bigger relation. Let's take our example and analyze it - which relation is smaller in our case. The answer feels at first somewhat non-intuitive - because the Sales relation is the smaller one, and Rate relation is bigger one. I say it is not intuitive, because definitely the Currency measure group has fewer records then the Sales measure group. However, Currency measure group is dense one while Sales measure group is sparse one. This means, that in the Currency measure group we have record for every possible combination of Day and Country. And since through either IgnoreUnrelatedDimensions property or through ValidMeasure function these values are populated for every Product as well - therefore Rate relation is very dense and contains record for every possible combination of Product, Day and Country. Sales relation, on the other hand, is sparse, because we didn't sell every product every day in every country. It is clear, that the Sales relation will always be smaller then the Rate relation. It is obvious to us now, but how does query optimizer knows about it ? In RTM version of Analysis Services, query optimizer always chooses to iterate over relation indicated by the first operand. Let's assuming that there were 10,000 product sales by day by country. Therefore if we write the expression as [Measures].[Sales]*[Measures].[Rate] - we will iterate over Sales relation, therefore in order to compute our query only 10,000 iterations will be required - dramatic improvement over 10,000,000 that we counted with the previous approach. Expression [Measures].[Rate]*[Measures].[Sales], on the other hand, will iterate over Rate relation, which is 100% dense - and this will lead to about 10,000,000 iterations again. (In SP1, the query optimizer can analyze the metadata about the relations before applying the join, so it is more intelligent, but the better solution which uses statistics is in plans for the next version).

The lesson that we learned here is that when using A*B expression, you need to use the sparser variable as first operand and denser variable as the second operand. Analysis Services built-in Currency Conversion wizard, for example, knows about this, and carefully chooses the order of operands in the generated MDX Script. Hopefully this article demystifies a little bit how AS query engine works, and this information can now be applied to more complex expressions, such as A+B and A-B.

Published Tuesday, December 06, 2005 7:11 PM by mosha
Filed under:
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement