Microsoft OLAP by Mosha Pasumansky

Take advantage of FE caching to optimize MDX performance

We usually treat the caching system of Analysis Services as a black box, trusting that it will do the "right thing" to optimize execution. And this is how things should be in an ideal world: the caching system should be completely transparent to the end user. However, even though the MDX query optimizer and the caching system are very sophisticated, they are not perfect. In this article we will see how, with very simple MDX rewrites, we can take better advantage of caching and increase the performance of MDX calculations significantly.

(In order to follow the examples in this article, you will need the MDX Studio tool.)

As an example we will use a very simple statistical analysis of the data, where we apply the "three sigma rule" to find outliers across a combination of attributes. It is based on the fact that for data which conforms to a normal distribution, almost all of the values (99.7% of them) lie within 3 standard deviations of the mean. So values which are farther from the mean than 3 standard deviations can be considered outliers. (There are, of course, more advanced methods to find outliers in the data, but for the purpose of this article this is representative enough.)
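In formula form, a value x is flagged as an outlier when

 abs(x - mean) > 3 * stdev

where mean and stdev are the mean and standard deviation of the series; this is exactly the condition the Filter expressions in the queries below encode.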

We will start with the analysis of Internet Sales by day inside the Adventure Works cube. The MDX to find the "farther than three sigma" days seems straightforward:

with 
 member sales_avg   as Avg  ([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
 member sales_stdev as StDev([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
select { [Internet Sales Amount] } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]

If we execute this query, we get the list of 10 outlier dates. But the query takes 7.5 seconds to execute. This is horrible! 7.5 seconds to go through 1158 members seems completely unreasonable. To get a better idea about what's going on, let's take a look at some execution statistics collected by MDX Studio:

Time             : 7 sec 417 ms
Calc covers      : 1164
Cells calculated : 1344448
Sonar subcubes   : 3
SE queries       : 1159
Cache hits       : 1159
Cache misses     : 2
Cache inserts    : 2
Cache lookups    : 1161

Now we can understand what was going on here. For each one of these 1158 members, the engine recalculated both the average and the standard deviation, even though they were exactly the same at every point (this is easy to verify through the debugger in MDX Studio). Since Avg and StDev each iterate over all 1158 dates, and they are re-evaluated for every one of the 1158 dates on the axis, we end up with on the order of 1158 * 1158, about 1.3 million, cell evaluations, which lines up with the "Cells calculated" counter above. The problem is that the engine didn't figure out by itself that the average and standard deviation were the same for all dates, so we need to help it a little bit. We can redefine sales_avg as a calculation which goes up to the All member in the Date attribute. This way, regardless of which date we are looking at right now, the cell coordinate will shift to the All member, and the value will be computed only once, because all the other times it is referenced it can be answered from the FE cache. The calculation rewritten with this trick in mind looks like the following:

with 
 member sales_avg_   as Avg  ([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
 member sales_stdev_ as StDev([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
 member sales_avg as ([Date].[Date].[All Periods],sales_avg_), format_string = 'currency'
 member sales_stdev as ([Date].[Date].[All Periods],sales_stdev_), format_string = 'currency'
select { [Internet Sales Amount] } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]

If we execute it now, it is done in a mere tenth of a second. The MDX Studio stats look like the following:

Time             : 109 ms
Calc covers      : 9
Cells calculated : 4642
Sonar subcubes   : 3
SE queries       : 2
Cache hits       : 4
Cache misses     : 2
Cache inserts    : 2
Cache lookups    : 6

This is much better. Instead of an SE query per cell, we get only 2 SE queries (one for the axis and one for the avg/stdev calculation). There are also significantly fewer cells calculated, since both the average and the standard deviation are calculated only once now and served from the cache.

Let's make the problem a little more complex now. We computed statistics across all 4 years, and discovered that all the outlier dates were in 2004. This can probably be attributed to growth over time. So let's compute the statistics differently now: instead of computing them for all dates in the 4 years, let's compute them within each month, and find outliers within the month. The straightforward approach to this can be coded in MDX as follows:

with 
 member sales_avg   as Avg  ([Date].[Calendar].Parent.children, [Internet Sales Amount]), format_string = 'currency'
 member sales_stdev as StDev([Date].[Calendar].Parent.children, [Internet Sales Amount]), format_string = 'currency'
select { [Internet Sales Amount], sales_avg, sales_stdev } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]

Executing this query now finds two outliers:

                   Internet Sales Amount   sales_avg     sales_stdev
July 22, 2001      $38,241.29              $15,270.59    $6,747.86
December 8, 2001   $55,454.04              $24,371.87    $10,146.40

And the MDX Studio stats are:

Time             : 704 ms
Calc covers      : 1206
Cells calculated : 38792
Sonar subcubes   : 41
SE queries       : 1198
Cache hits       : 1198
Cache misses     : 2
Cache inserts    : 2
Cache lookups    : 1200

Again, the stats don't look good: too many SE queries, too many cells calculated. How can we apply the same technique as before? Moving the coordinate to the All member will be incorrect now, because the average and standard deviation are different for every month. Therefore we need to move the coordinate to the month level. The most natural way of doing this is simply to move the call to the Parent function from inside the Avg/StDev calls into the tuple that shifts the coordinate. Here is how the resulting query looks:

with 
 member sales_avg_   as Avg  ([Date].[Calendar].children, [Internet Sales Amount]), format_string = 'currency'
 member sales_stdev_ as StDev([Date].[Calendar].children, [Internet Sales Amount]), format_string = 'currency'
 member sales_avg    as ([Date].[Calendar].Parent,sales_avg_), format_string = 'currency'
 member sales_stdev  as ([Date].[Calendar].Parent,sales_stdev_), format_string = 'currency'
select { [Internet Sales Amount], sales_avg, sales_stdev } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]

The stats from this rewrite are:

Time             : 203 ms
Calc covers      : 162
Cells calculated : 6954
Sonar subcubes   : 117
SE queries       : 78
Cache hits       : 78
Cache misses     : 2
Cache inserts    : 2
Cache lookups    : 80

Which is a much more reasonable result.
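To summarize the technique: define the expensive aggregate as a helper calculated member, then reference it through a tuple whose coordinate does not change from cell to cell (the All member, or the appropriate ancestor), so the FE computes it once and serves every other reference from cache. Here is a minimal sketch of the template, with illustrative member names against the same Adventure Works data (this is not one of the queries above):

with
 // helper member: the expensive set aggregation
 member expensive_ as Avg([Date].[Date].[Date], [Internet Sales Amount])
 // wrapper member: pin the helper to a fixed coordinate (the All member here),
 // so the FE computes it once and answers later references from cache
 member expensive  as ([Date].[Date].[All Periods], expensive_)
select { [Internet Sales Amount] } on 0
, filter([Date].[Date].[Date], [Internet Sales Amount] > expensive) on 1
from [Adventure Works]

The same wrapper pattern applies whether the fixed coordinate is the All member (statistics over the whole series) or an ancestor such as the parent month (statistics per group).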

Published Friday, March 28, 2008 5:46 PM by mosha