THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Microsoft OLAP by Mosha Pasumansky

Counting distinct values in MDX

How many elements in the set have distinct values ? This sounds like a simple question. Apparently, this is well known problem in Excel, and there is a classic solution for it. There are plenty of sites in Internet (for example here and here) which give the following solution:

=SUM(1/COUNTIF(A1:A6,A1:A6)))

(note that this formula needs to be entered with Ctrl-Shift-Enter to indicate that this is array formula, or it won’t give right result – one of the Excel quirks)

When someone showed me this solution, I just stared at it and I couldn’t understand how it worked. Only rewriting it in MDX clarified what was going on. Let’s take the following example based on Adventure Works – we want to count how different letters the product names start from (just wanted to pick something for which Product dimension didn’t have dedicated attribute, because otherwise it would’ve been too simple). The MDX which performs the same algorithm as Excel formula above in these conditions will look like following:

with 
  member measures.CountDistinctLetters as 
   Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
       1/Filter([Product].[Product].[Product].MEMBERS, 
          VBA!Left(AllProducts.Current.Name,1) = VBA!Left([Product].[Product].CurrentMember.Name,1)
         ).Count)

select CountDistinctLetters on 0
from [Adventure Works]

It is more verbose, but probably it is not simpler to understand to normal developer than the Excel’s formula. In the nutshell the idea is following: For every element in the set, we compute how many other elements produce the same value (this is COUNTIF in Excel, and Filter(…).Count in MDX), and then dividing 1 by this number we get fraction of current element. The summing up all these fractions will give the desired number.

(Unfortunately there is no easy way to debug such formulas. The MDX debugger inside MDX Studio is not capable of dealing with such expressions. I actually spent quite a time thinking about how to build decent environment for debugging MDX, and after few failed starts I had what I consider as breakthrough idea. Early experiments were very encouraging, and if this idea works, I will blog soon about it. It radically different from all other MDX debugging attempts.)

It is clear that performance of such solution is horrible. For every element in the set, we go again and again over all elements in the set just to find out how many of them have same value. The algorithmic complexity of this nested loop approach is O(n^2). Indeed the query above executes for 12 seconds. It is possible to slightly improve it, by forcing caching the results of VBA!Left through

with 
  member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
  member measures.CountDistinctLetters as 
   Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
       1/Filter([Product].[Product].[Product].MEMBERS, 
          (AllProducts.Current,Prefix) = Prefix
         ).Count)

select CountDistinctLetters on 0
from [Adventure Works]

and reducing time to 11 seconds, but obviously, different, more sane approach is needed here.

Another idea is instead of doing nested-loop algorithm, sort the elements of the set, and then scan sorted set once counting number of different elements. The only technical difficulty with that would be - while scaning sorted set, we need to compare current element with the previous one. Getting current element is easy, but how to get previous one ? There is no "Previous" function in MDX, there is only PrevMember, but it works on hierarchical ordering of the level, and here we have arbitrary set sorted by arbitrary criteria. The solution is to use CurrentOrdinal function, which returns the index of the tuple being iterated in the set, and then using CurrentOrdinal-1 as a new index, send it to Item function. In MDX this would be recorded as

with
  member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
  set ProductsSortedByPrefix as order([Product].[Product].[Product].MEMBERS, Prefix, BASC)
  member measures.CountDistinctLetters  as Sum(ProductsSortedByPrefix as y, 
    iif(y.CurrentOrdinal = 1 OR Prefix <> (Prefix, y.Item(y.CurrentOrdinal-2)), 1, 0))

select CountDistinctLetters  on 0
from [Adventure Works]

This query executes much faster now, in only 0.234 seconds. The algorithmic complexity is down to O(n*log(n)). However, it is still not perfect solution, since it is possible to solve this in just one scan of the original set, i.e. with O(n) complexity. In order to do that, we will have to write stored procedure though.

public int CountDistinctValues(Set set, Expression exp)
{
    System.Collections.Hashtable ht = new System.Collections.Hashtable();
    foreach (Tuple t in set.Tuples)
    {
        string v = exp.Calculate(t).ToString();
        if ( !ht.ContainsKey(v) )
            ht.Add(v, null);
    }

    return ht.Count;
}

Now, if we use this stored procedure in the query as following

with member CountDistinctLetters as ASSP.ASStoredProcs.Util.CountDistinctValues(
  [Product].[Product].[Product].MEMBERS,
  VBA!Left([Product].[Product].CurrentMember.Name,1))
select CountDistinctLetters on 0
from [Adventure Works] 

we get it working in only 0.062 seconds.


Published Saturday, June 14, 2008 6:13 PM by mosha
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

SSAS-Info.com said:

Pingback. Link to this article was added to our website in the [Analysis Services Articles]/[MDX] section:

http://www.ssas-info.com/analysis-services-articles/50-mdx/949-counting-distinct-values-in-mdx

June 14, 2008 9:33 PM
 

PeterEb said:

Did you happen to also do the perf test in Excel? or is that an exercise for the reader?

June 15, 2008 2:29 AM
 

Miky Schreiber said:

I thought that this post is great when I started it, but in the end I see that it's brilliant! This is why I love programming!

June 15, 2008 4:04 PM
 

Chris Webb said:

Just for fun, here's another possible solution - faster than your initial approach, but not as fast as you final pure MDX approach:

with

member measures.Prefix as Left([Product].[Product].CurrentMember.Name,1)

member measures.CountDistinctLetters as

count({{} as myletters,

filter(

[Product].[Product].[Product].MEMBERS,

iif(

count({{Except({[Product].[Product].currentmember} as currentProduct, currentProduct)},

{filter(myletters,

([Product].[Product].currentmember,measures.Prefix) =

(currentProduct.item(0).item(0), measures.Prefix)

)}})>0, false,

count({myletters, currentProduct} as myletters)>0

)

)})

select measures.CountDistinctLetters on 0

from [adventure works]

Incidentally, isn't Left now one of the VBA functions built in to MDX? So isn't it better to use just plain Left rather than VBA!Left? Or doesn't it matter?

June 17, 2008 4:48 AM
 

Chris Webb said:

OK, here's what I think is my best attempt at a pure MDX solution, after having thought a bit about my approach and tried several variations:

with

member measures.CountDistinctLetters as

count(

filter(

{{} as myLetters,

topcount({[Product].[Product].[Product].MEMBERS} as allProds,

count([Product].[Product].[Product].MEMBERS)

, [Product].[Product].CurrentMember.Name)

},

iif(

Left([Product].[Product].CurrentMember.Name,1) = Left(myLetters.Item(0).Item(0).Name, 1)

, false, count({{[Product].[Product].CurrentMember},myletters} as myLetters)>0)

))

select

measures.CountDistinctLetters on 0

from [adventure works]

almost as good as your sproc, I think. I'm using TopCount here because, at least on the build I'm running (3054), that bug in the Order function means it runs faster than Order.

June 17, 2008 6:09 AM
 

Chris Webb said:

Right, now I have to do something useful today, but this is a tiny bit more efficient:

with

member measures.CountDistinctLetters as

sum(

{{} as myLetters,

topcount({[Product].[Product].[Product].MEMBERS},

count([Product].[Product].[Product].MEMBERS)

, [Product].[Product].CurrentMember.Name)

},

iif(

Left([Product].[Product].CurrentMember.Name,1) = Left(myLetters.Item(0).Item(0).Name, 1)

, null,

count({[Product].[Product].CurrentMember} as myLetters)

))

select

measures.CountDistinctLetters on 0

from [adventure works]

June 17, 2008 6:32 AM
 

mosha said:

Chris, these are clever tricks - but it will never be as fast as sproc function, this is math fact. O(n*log(n)) grows faster than O(n), although for smaller n's they may look comparable. BTW, my sproc version is not the fastest approach either, because it doesn't take advantage of block evaluation mode.

June 18, 2008 5:05 PM
 

Tiago Rente said:

Just curious, since we have very poor performance with DistinctCount, did you test it against DistinctCount function?

June 26, 2008 9:53 AM
 

mosha said:

Tiago - I don't think DistinctCount function can be applied to solve this problem.

June 26, 2008 12:41 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement