THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Ranking in MDX

Well, apparently I am one of the industry influencers ! At least this is what Kimball Group thinks, because they sent me their just published "The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset" book by Joy Mundy and Warren Thornthwaite. According to the message, they have sent it to industry influencers so we can voice our opinions, and, well, influence the industry. I started to go through the book, and it looks like another classic from Kimball Group, something that every DBA who does BI on Microsoft platform has to read. This post is not a book review, since I didn’t have a chance to read it deeper. However, one thing caught my attention, and I thought that the issue is common enough to write an article about it. At page 402, where authors talk about building BI applications, there is a MDX expression for calculated member to compute Employee ranks based on their sales. With some minor modifications, the expression looks like following

RANK(
  [Employee].[Employee].CurrentMember
 ,ORDER(
    [Employee].[Employee].[Employee].MEMBERS
   ,[Measures].[Reseller Sales Amount]
   ,BDESC))

It is not uncommon to see people implementing Rank using such construct, i.e. Rank over Order. Unfortunately, this construct has severe performance problem. Due to dynamic nature of the expression, it requires that the ordering of all employees should be done for every single cell where the calculated member is evaluated. Let’s try to rank all of our employees using this formula:

WITH MEMBER [Measures].[Employee Rank] AS 
 RANK([Employee].[Employee].CurrentMember, ORDER([Employee].[Employee].[Employee].MEMBERS, [Measures].[Reseller Sales Amount], BDESC))
SELECT
[Measures].[Employee Rank] ON 0 
,[Employee].[Employee].[Employee].MEMBERS on 1
FROM [Adventure Works]

It takes about 1 minute and 20 seconds on my machine to finish this query, which is way too long. Let’s try to analyze the complexity of the involved computations. We will denote the cardinality of Employee attribute as n. Since we do ordering over all employees, it will take O(n*log(n)) and since we are doing it for every single employee, the resulting complexity will be O(n2*log(n)). So let’s try to improve on it. It seems first, that the root of the inefficiency is in the fact that we are doing full sorting, when we only need to find a position of one tuple (employee) in the would be ordered set. Of course, it is possible to implement this operation linearly – by simply scanning the set, and counting the number of elements that were smaller then ours. This looks like a perfect job for the stored procedure. I have already written how MDX can be extended with stored procedures using Server ADOMD.NET when I discussed stored procedures returning sets here, so now we have even more practical example – use stored procedure for calculated members.

Below is the sample implementation of the above mentioned algorithm

using Microsoft.AnalysisServices.AdomdServer;

namespace MRank
{
    public class MRank
    {
        public int Rank(Tuple tuple, Set set, Expression exp)
        {
            int rank = 1;
            double valTuple = (double)exp.Calculate(tuple);
            foreach (Tuple t in set.Tuples)
            {
                double valCurrent = (double)exp.Calculate(t);
                if (valCurrent < valTuple)
                    rank++;
            }

            return rank;
        }
    }
}

Now we will register this assembly in the database, and send the following query:

WITH MEMBER [Measures].[Employee Rank] as 
  mrank.mrank.mrank.[rank](
   [Employee].[Employee].CurrentMember, [Employee].[Employee].[Employee].MEMBERS, -[Measures].[Reseller Sales Amount])
SELECT [Measures].[Employee Rank] ON 0 
,[Employee].[Employee].[Employee].MEMBERS ON 1
FROM [Adventure Works]

This query returns in less then 20 seconds, which is 4 times faster then before, but still looks too high. Probably we are running into overhead of stored procedures, since it has to go through managed-native transition, and not just for every invocation of stored procedures, but every time ADOMD.NET calls into server, i.e. on methods such as Expression.Calculate. Luckily, Analysis Services 2005 provides native override of MDX function Rank with 3 parameters, where the third parameter is the expression by which tuple should be ranked, and it implements linear algorithm similar to the one in the stored procedure we just developed. So now we can rewrite our query as

WITH MEMBER [Measures].[Employee Rank] AS
RANK([Employee].[Employee].CurrentMember, [Employee].[Employee].[Employee].MEMBERS, -[Measures].[Reseller Sales Amount])
SELECT [Measures].[Employee Rank] ON 0 
,[Employee].[Employee].[Employee].MEMBERS ON 1
FROM [Adventure Works]

This query now returns in about 6 seconds, which is still faster then before. Let’s do complexity analysis again. For every employee we scan all the employees, so we get O(n2) which is better then previous result of O(n2*log(n)). There are few things to notice here:

  • First, the expression of Rank was -[Measures].[Reseller Sales Amount]. The reason we needed to prefix the Reseller Sales Amount measure with unary minus, was because we wanted ranking in descending order, i.e. from higher to lower, and since we were counting the tuples which had smaller value – this would produce ranking in ascending order, therefore we had to flip the expression. Those who read my whitepaper about AS caching, might remember, that since AS doesn’t cache subexpressions, but only full coordinates, using such subexpression in Rank might hurt performance, but actually here it doesn’t, because unary minus is such a trivial operation.
  • Second, perhaps more importantly, our last two queries produced different results from the first query on employees which had exactly same Sales. This is most evident on employees who sold nothing, so their Reseller Sales Amount was NULL. In Rank(Order()) case, such employees were ranked according to their position in the attribute hierarchy, since Order in MDX is deterministic function, and it preserves the original set order when keys are equal. However, Rank with expression, simply counts number of tuples with lesser value, therefore employees with equal values get equal rank. In reality, it usually doesn’t matter which way to go, and it is possible to develop another stored procedure which will take two expressions instead of one, and the second expression will be used when values of first expression are equal.

But probably the most important issue here is that the query still feels too slow at 6 seconds. We would expect this query to return instantaneously. And indeed, there is a query plan which would make it to be instantaneous. It is quite simple actually. Let’s sort the employees first, and then build a hash index to look up employee’s position in the sorted list. The way to express this query plan in MDX, is to create a named set. Named sets in MDX are always static, i.e. they are evaluated in the context in which they were defined (note that static doesn’t mean materialized, for example CrossJoin can be perfectly static, but it is never materialized). And the hash index is built on top of the static set. Therefore our query will become now:

WITH 
SET OrderedEmployees AS ORDER([Employee].[Employee].[Employee].members, [Measures].[Reseller Sales Amount], BDESC)
MEMBER [Measures].[Employee Rank] AS RANK([Employee].[Employee].CurrentMember, OrderedEmployees)
SELECT [Measures].[Employee Rank] ON 0 
,[Employee].[Employee].[Employee].MEMBERS ON 1
from [Adventure Works]
This query indeed executes instantly, and since the sorting is done only once, and then followed by one linear scan, the complexity ends up being just O(log(n)).

There is one additional question left here. It is not always possible to define such a named set, and since this last query is actually equivalent to the first query (as taken from the book), then why isn’t query optimizer chooses query plan for the original query to be identical to the one in the last query. The answer is that in the general case Rank(Order()) is not equivalent to the Rank over static ordered set, precisely because in the first case the Order is dynamic and might return different results for different cells. However, in our example, and quite often in practice, the inner Order sets is what AS query optimizer calls “context independent”, i.e. it always produces the same result for every employee. AS query optimizer has many places where it can deduct that subexpressions are context independent and compute them only once. Unfortunately, this isn’t the case yet for Rank(Order()) constructs, but since the infrastructure is there, we can hope that future versions of AS will implement this optimization, and then this article will become obsolete :)

Update: I just found out that the Chapter 6 "Sorting and Ranking in MDX" from the new MDX Solutions 2nd Edition book was put online by Wiley (probably for promotional reasons). It contains really good complimentary information to this article, so if you found the article useful, I recommend to read the chapter as well: http://media.wiley.com/product_data/excerpt/80/04717480/0471748080.pdf

Published Tuesday, March 14, 2006 3:23 AM by mosha
Filed under: ,
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement