THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Microsoft OLAP by Mosha Pasumansky

Counting distinct values in MDX

How many elements in the set have distinct values ? This sounds like a simple question. Apparently, this is well known problem in Excel, and there is a classic solution for it. There are plenty of sites in Internet (for example here and here) which give the following solution:

=SUM(1/COUNTIF(A1:A6,A1:A6)))

(note that this formula needs to be entered with Ctrl-Shift-Enter to indicate that this is array formula, or it won’t give right result – one of the Excel quirks)

When someone showed me this solution, I just stared at it and I couldn’t understand how it worked. Only rewriting it in MDX clarified what was going on. Let’s take the following example based on Adventure Works – we want to count how different letters the product names start from (just wanted to pick something for which Product dimension didn’t have dedicated attribute, because otherwise it would’ve been too simple). The MDX which performs the same algorithm as Excel formula above in these conditions will look like following:

with 
  member measures.CountDistinctLetters as 
   Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
       1/Filter([Product].[Product].[Product].MEMBERS, 
          VBA!Left(AllProducts.Current.Name,1) = VBA!Left([Product].[Product].CurrentMember.Name,1)
         ).Count)

select CountDistinctLetters on 0
from [Adventure Works]

It is more verbose, but probably it is not simpler to understand to normal developer than the Excel’s formula. In the nutshell the idea is following: For every element in the set, we compute how many other elements produce the same value (this is COUNTIF in Excel, and Filter(…).Count in MDX), and then dividing 1 by this number we get fraction of current element. The summing up all these fractions will give the desired number.

(Unfortunately there is no easy way to debug such formulas. The MDX debugger inside MDX Studio is not capable of dealing with such expressions. I actually spent quite a time thinking about how to build decent environment for debugging MDX, and after few failed starts I had what I consider as breakthrough idea. Early experiments were very encouraging, and if this idea works, I will blog soon about it. It radically different from all other MDX debugging attempts.)

It is clear that performance of such solution is horrible. For every element in the set, we go again and again over all elements in the set just to find out how many of them have same value. The algorithmic complexity of this nested loop approach is O(n^2). Indeed the query above executes for 12 seconds. It is possible to slightly improve it, by forcing caching the results of VBA!Left through

with 
  member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
  member measures.CountDistinctLetters as 
   Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
       1/Filter([Product].[Product].[Product].MEMBERS, 
          (AllProducts.Current,Prefix) = Prefix
         ).Count)

select CountDistinctLetters on 0
from [Adventure Works]

and reducing time to 11 seconds, but obviously, different, more sane approach is needed here.

Another idea is instead of doing nested-loop algorithm, sort the elements of the set, and then scan sorted set once counting number of different elements. The only technical difficulty with that would be - while scaning sorted set, we need to compare current element with the previous one. Getting current element is easy, but how to get previous one ? There is no "Previous" function in MDX, there is only PrevMember, but it works on hierarchical ordering of the level, and here we have arbitrary set sorted by arbitrary criteria. The solution is to use CurrentOrdinal function, which returns the index of the tuple being iterated in the set, and then using CurrentOrdinal-1 as a new index, send it to Item function. In MDX this would be recorded as

with
  member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
  set ProductsSortedByPrefix as order([Product].[Product].[Product].MEMBERS, Prefix, BASC)
  member measures.CountDistinctLetters  as Sum(ProductsSortedByPrefix as y, 
    iif(y.CurrentOrdinal = 1 OR Prefix <> (Prefix, y.Item(y.CurrentOrdinal-2)), 1, 0))

select CountDistinctLetters  on 0
from [Adventure Works]

This query executes much faster now, in only 0.234 seconds. The algorithmic complexity is down to O(n*log(n)). However, it is still not perfect solution, since it is possible to solve this in just one scan of the original set, i.e. with O(n) complexity. In order to do that, we will have to write stored procedure though.

public int CountDistinctValues(Set set, Expression exp)
{
    System.Collections.Hashtable ht = new System.Collections.Hashtable();
    foreach (Tuple t in set.Tuples)
    {
        string v = exp.Calculate(t).ToString();
        if ( !ht.ContainsKey(v) )
            ht.Add(v, null);
    }

    return ht.Count;
}

Now, if we use this stored procedure in the query as following

with member CountDistinctLetters as ASSP.ASStoredProcs.Util.CountDistinctValues(
  [Product].[Product].[Product].MEMBERS,
  VBA!Left([Product].[Product].CurrentMember.Name,1))
select CountDistinctLetters on 0
from [Adventure Works] 

we get it working in only 0.062 seconds.

Published Saturday, June 14, 2008 6:13 PM by mosha
Filed under: ,
Anonymous comments are disabled
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement