Microsoft OLAP by Mosha Pasumansky

Best practices for server ADOMD.NET stored procedures

In this article we will discuss some of the best practices for writing efficient Analysis Services stored procedures using server ADOMD.NET. These days there is plenty of information about writing efficient MDX - in books, whitepapers, blogs, forums etc. Just recently Microsoft released two documents - the Performance Guide and Design Best Practices (and I have seen another one, still in the works, which goes deeper into MDX internals than anything previously published). But there are no similar resources about how to write stored procedures, which are the essential extensibility mechanism in Analysis Services. There is the Analysis Services Stored Procedures open-source project on CodePlex, which features plenty of source code for different stored procedures, but it is not easy for someone looking at it to extract best practices from the code. This article's goal is to collect some of these best practices into a single concise source.

Always traverse sets with iterators and avoid accessing tuples by index or getting the count of tuples

To demonstrate why this is important, let's pick a simple stored procedure which computes the sum of expression values over a set. It does exactly the same thing as the built-in MDX function Sum; we use it here because it is simple, yet illustrates the point well.

We will implement this stored procedure using two methods - IterSum will use the tuple iterator, and IndexSum will access tuples by index.

 

        public decimal IterSum(Set InputSet, Expression Expr)
        {
            decimal sum = 0;
            foreach ( Tuple t in InputSet.Tuples )
            {
                sum += (decimal)Expr.Calculate(t);
            }

            return sum;
        }

        public decimal IndexSum(Set InputSet, Expression Expr)
        {
            decimal sum = 0;
            int c = InputSet.Tuples.Count;
            for (int i = 0; i < c; i++ )
            {
                Tuple t = InputSet.Tuples[i];
                sum += (decimal)Expr.Calculate(t);
            }

            return sum;
        }
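
The snippets in this article show only the method bodies. Judging by the ASSP.ASStoredProcs.SetOp name used in the queries below, the surrounding skeleton is assumed to look roughly like this (the Set, Tuple and Expression types come from the server ADOMD.NET object model):

        // Assumed skeleton around the snippets in this article (not part of the original code):
        // assembly registered as ASSP, namespace ASStoredProcs, class SetOp, matching the
        // ASSP.ASStoredProcs.SetOp references in the MDX queries below.
        using Microsoft.AnalysisServices.AdomdServer;

        namespace ASStoredProcs
        {
            public class SetOp
            {
                // IterSum, IndexSum, Order, ... methods go here
            }
        }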

Now, let's test the performance of these functions. We will start with the following simple queries:

with 
set QuerySet as [Customer].[Customer].[Customer].MEMBERS
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IterSum(QuerySet, 1)
select SprocSum on 0
from [Adventure Works]

with 
set QuerySet as [Customer].[Customer].[Customer].MEMBERS
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IndexSum(QuerySet, 1)
select SprocSum on 0
from [Adventure Works]

We have chosen "1" as the expression in order not to spend time retrieving data from the cube, so that we focus purely on the performance of the stored procedure code (there is also another reason for this choice, explained later in the article). Both queries return immediately - just as everybody would expect. So far so good. Let's change the queries a little and pass the set directly instead of using a named set:

with 
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IterSum([Customer].[Customer].[Customer].MEMBERS, 1)
select SprocSum on 0
from [Adventure Works]

with 
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IndexSum([Customer].[Customer].[Customer].MEMBERS, 1)
select SprocSum on 0
from [Adventure Works]

Now the first query still returns immediately, but the second one takes 29 seconds! Let's take it even further - we will wrap the set in Filter(..., true), which semantically does pretty much nothing.

with 
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IterSum(Filter([Customer].[Customer].[Customer].MEMBERS, true), 1)
select SprocSum on 0
from [Adventure Works]

with 
member measures.SprocSum as ASSP.ASStoredProcs.SetOp.IndexSum(Filter([Customer].[Customer].[Customer].MEMBERS, true), 1)
select SprocSum on 0
from [Adventure Works]

The first query is still immediate, but the second one now rises to a whopping 2 minutes 49 seconds!

What is going on here? The short explanation is that not all sets are equal as far as their internal implementation in Analysis Services goes. Some of them support efficient access by index, others are less efficient, and yet others are completely inefficient! However, all sets have pretty reasonable iterators. For a more detailed explanation of the set architecture in AS, I recommend reading Irina Gorbach's chapter 28 of the "Microsoft SQL Server 2005 Analysis Services" book; the explanation is on pages 513-514 of the English edition and pages 577-579 of the Russian edition (by the way, if you can get the Russian edition, I recommend it over the English one - it is more complete, has deeper explanations, and fixes most of the errors in the English edition).

But the lesson is clear - always iterate over a set's tuples using the "foreach (Tuple t in myset.Tuples)" construct, and avoid both Tuples.Count and Tuples[index]. (There is also another reason: there is a bug in how tuples are created when the tuples collection is accessed by index, and this bug can lead to internal exceptions in many scenarios.) If you need the number of tuples, count them while iterating, as the Order implementation below does. But what if you need to access the tuples in random order and/or more than once? This leads us to the next best practice

Cache cell values if you need to access them more than once or if you need to get data for all tuples of the set in random order

The example we will choose here is an implementation of the Order(set, expr, BDESC) function. Sorting algorithms require on average O(n*log(n)) accesses to array values for an array of n elements, so clearly we will need to access each tuple's data more than once. On top of that, sorting algorithms touch the data in pretty much random order relative to the original positions. This is a classic scenario where caching makes sense. The code below demonstrates it.

01        private class TupleValue : System.IComparable
02        {
03            private Tuple _Tuple;
04            internal Tuple Tuple
05            {
06                get { return _Tuple; }
07                private set { _Tuple = value; }
08            }
09            private decimal _Value;
10
11            public TupleValue(Tuple t, decimal v)
12            {
13                _Tuple = t;
14                _Value = v;
15            }
16            
17            public int CompareTo(object obj)
18            {
19                if (obj is TupleValue)
20                {
21                    TupleValue tv = (TupleValue)obj;
22                    return tv._Value.CompareTo(_Value);
23                }
24                throw new System.ArgumentException("object is not a TupleValue");
25            }
26        }
27
28       public Set Order(Set InputSet, Expression SortExpression)
29       {
30           List<TupleValue> TupleValues = new List<TupleValue>();
31
32           int i = 0;
33           foreach (Tuple t in InputSet.Tuples)
34           {
35               TupleValues.Add(new TupleValue(t, (decimal)SortExpression.Calculate(t)));
36               i++;
37           }
38
39           int cTuples = i;
40
41           TupleValues.Sort();
42
43           SetBuilder sb = new SetBuilder();
44
45           for (i = 0; i < cTuples; i++)
46           {
47               sb.Add(TupleValues[i].Tuple);
48           }
49
50           return sb.ToSet();
51       }

It seems a bit long, but the interesting code is quite short. Line 33 iterates through the set using the tuple iterator, and line 35 populates the cache of TupleValues, where a TupleValue is a pair of a Tuple and the value of the expression evaluated at it. Then at line 41 we call the built-in .NET Sort method, which uses TupleValue's implementation of the IComparable interface at lines 19 through 24. Let's check how well this code executes. We will compare our stored procedure with the built-in MDX function Order, sorting a reasonably big set - 166,356 tuples. In order to measure the time spent on the server, and not the time Management Studio spends rendering all these tuples in the UI, we will wrap Order with Count to return a single value.

with 
member y as count(Order(
   [Customer].[Customer].[Customer].MEMBERS
  *[Product].[Category].[Category].MEMBERS
  *[Product].[Style].[Style].MEMBERS
, Measures.[Internet Sales Amount], BDESC))
select y on 0
from [Adventure Works]

with 
member y as count(ASSP.ASStoredProcs.SetOp.[Order](
   [Customer].[Customer].[Customer].MEMBERS
  *[Product].[Category].[Category].MEMBERS
  *[Product].[Style].[Style].MEMBERS
, Measures.[Internet Sales Amount]))
select y on 0
from [Adventure Works]

The results are somewhat surprising. The built-in MDX function Order finishes in 40 seconds, while our stored procedure, which does the same thing, finishes in only 14 seconds! We were able to beat the internals of the AS engine with our stored procedure (the reason we were able to outperform the built-in Order is a performance bug in MDX's Order; normally a stored procedure should never perform better than the corresponding MDX function).

Caching is definitely a widely used technique to improve performance, but it comes with caveats. The developer should watch the size of the cache so it does not grow beyond system limits. In the above implementation we liberally cache both the Tuple object and its value, and this can add up to quite a lot of memory over huge sets. There are, of course, techniques to deal with this issue, but we won't discuss them here, since they are well described in the literature dedicated to .NET programming.

Now, this stored procedure, even though it was faster than the built-in Order, still took a while to execute - 14 seconds. What if the user isn't willing to wait that long? How do we ensure that it can be canceled immediately? This leads us to the next best practice

Liberally inject "Context.CheckCancelled()" calls into inner loops which do not call into server ADOMD.NET methods

Context.CheckCancelled() is a method which raises the "Server: The operation has been canceled" exception if the user has canceled the command running on the session. It is very cheap - so cheap that it is practically free. It can be put inside every inner loop in the stored procedure. However, it doesn't buy anything if that inner loop already calls some server ADOMD.NET method. For example, the loop at lines 33-37 calls into several ADOMD.NET methods - it calls Next on the tuple iterator and Calculate on the expression object. Since both of these methods check for cancellation anyway, an additional call to CheckCancelled is redundant. However, inside the code of the List.Sort method there are no more calls into server ADOMD.NET, since the data is cached. We cannot modify the code of List.Sort, but we still get called back from there into the TupleValue.CompareTo method at line 17 - so it is a great candidate for a Context.CheckCancelled call.
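
As an illustration, here is a minimal sketch (the SumOfSquares helper and its cached input are hypothetical, not part of the ASSP code) of a purely CLR-side inner loop where an explicit CheckCancelled call is the only way for the server to interrupt the work:

        // Hypothetical post-processing loop over already-cached values. Nothing here calls
        // server ADOMD.NET, so without the explicit Context.CheckCancelled() call a
        // long-running pass could not be canceled by the user. Assumes
        // using System.Collections.Generic and Microsoft.AnalysisServices.AdomdServer.
        private static decimal SumOfSquares(List<decimal> cachedValues)
        {
            decimal result = 0;
            foreach (decimal v in cachedValues)
            {
                Context.CheckCancelled(); // cheap; throws the cancellation exception if the command was canceled
                result += v * v;
            }
            return result;
        }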

But how much time did we spend sorting vs. populating the cache with data? The only way to answer that is to profile our stored procedure, and this leads us to the next best practice

Integrate with AS Trace and SQL Profiler through the Context.TraceEvent method

The best way to understand where the time is spent during execution of an MDX query is, of course, by inspecting the various events in AS Trace through a tool such as SQL Profiler. The server object model provides nice integration with the trace through the Context.TraceEvent method. This method allows injecting a "User Defined" event into the trace, and provides for a custom event subclass, integer data and string data. We can write to the trace before and after our two main loops in order to find out how much time we spend in each. So the code with both CheckCancelled and TraceEvent calls added looks like this:

01        private class TupleValue : System.IComparable
02        {
03            private Tuple _Tuple;
04            internal Tuple Tuple
05            {
06                get { return _Tuple; }
07                private set { _Tuple = value; }
08            }
09            private decimal _Value;
10
11            public TupleValue(Tuple t, decimal v)
12            {
13                _Tuple = t;
14                _Value = v;
15            }
16            
17            public int CompareTo(object obj)
18            {
19                if (obj is TupleValue)
20                {
21                    Context.CheckCancelled();
22                    TupleValue tv = (TupleValue)obj;
23                    return tv._Value.CompareTo(_Value);
24                }
25                throw new System.ArgumentException("object is not a TupleValue");
26            }
27        }
28
29        public Set Order(Set InputSet, Expression SortExpression)
30        {
31            List<TupleValue> TupleValues = new List<TupleValue>();
32
33            Context.TraceEvent(100, 0, "Start getting data");
34
35            int i = 0;
36            foreach (Tuple t in InputSet.Tuples)
37            {
38                TupleValues.Add(new TupleValue(t, (decimal)SortExpression.Calculate(t)));
39                i++;
40            }
41
42            int cTuples = i;
43
44            Context.TraceEvent(100, cTuples, "Finish getting data for " + cTuples.ToString() + " tuples");
45
46            Context.TraceEvent(200, 0, "Start sorting");
47            TupleValues.Sort();
48            Context.TraceEvent(200, 0, "Finish sorting");
49
50            SetBuilder sb = new SetBuilder();
51
52            for (i = 0; i < cTuples; i++)
53            {
54                sb.Add(TupleValues[i].Tuple);
55            }
56
57            return sb.ToSet();
58        }

At line 21 we added the CheckCancelled call; at lines 33 and 44, calls to TraceEvent to track the start and finish of populating the cache, recording the number of tuples that went into the cache; and at lines 46 and 48, calls to TraceEvent to track the time spent sorting. The results can be seen in the following screenshot:

The important column here is CurrentTime - it is the one which allows us to calculate how much time was spent where. As expected, all the time was spent getting the data and populating the cache, and the sort itself was instantaneous. Tracing is a great way to get insight into execution, but for performance reasons it should be kept out of inner loops, and perhaps some of it should even be disabled in release builds.
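
One possible way to do that (a sketch, not from the ASSP project): route trace calls through a small helper marked with the Conditional attribute, so that they compile away entirely in builds without the DEBUG symbol:

        // Hypothetical helper: calls to Tracing.Trace are removed by the compiler when the
        // assembly is built without the DEBUG symbol, keeping release builds free of tracing.
        // Assumes using Microsoft.AnalysisServices.AdomdServer for Context.
        internal static class Tracing
        {
            [System.Diagnostics.Conditional("DEBUG")]
            internal static void Trace(int eventSubclass, int integerData, string textData)
            {
                Context.TraceEvent(eventSubclass, integerData, textData);
            }
        }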

Well, we are down to the last best practice, which is

Use the appropriate ImpersonationInfo on the assembly for the right balance between security and performance

The ImpersonationInfo property of the assembly tells AS what principal should be impersonated when the stored procedure code is executed. In other words, it determines the content of the thread token during stored procedure execution. The security implications of ImpersonationInfo are well understood. But what about performance implications - are there any? In fact there are. When a stored procedure calls into a server ADOMD.NET method, the real server code gets executed. This code may read from files, create new threads and do all kinds of things that the server needs to do in order to get data or perform the requested operation. Every time server code interacts with the OS, the OS checks the current thread token to determine access rights. If the current thread token remained the one under which the stored procedure was executing, there would be a high likelihood that OS-related operations would fail, because the impersonated user probably doesn't have permissions to read server data files etc. Therefore, every time there is a call from the stored procedure into ADOMD.NET, it first must revert the currently impersonated thread token to the server's, and set it back before returning control to the stored procedure. These transitions take time, although the differences vary between impersonation modes, system and network configurations, security settings etc. It is safe to say, though, that the only impersonation mode which is free of such transitions is ImpersonateServiceAccount, since it essentially tells AS to use the same thread token both inside server code and inside stored procedure code. While from a security point of view ImpersonateServiceAccount seems to be the least safe, it must be considered in combination with the assembly's PermissionSet. When PermissionSet is set to Safe, the ImpersonationInfo is not very important, because PermissionSet=Safe guarantees that the stored procedure cannot escape the server's sandbox anyway - it cannot touch the file system or network by itself, it cannot interact with the OS etc. So unless the stored procedure requires a broader PermissionSet, it is usually OK to use ImpersonationInfo=ImpersonateServiceAccount. Certainly this is true for all the examples in this article. But, of course, each case should be considered separately in order to ensure a secure environment.
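
For illustration, here is a rough sketch of registering an assembly with this combination of settings through AMO (Microsoft.AnalysisServices). The server name, database name and DLL path are placeholders, and the exact AMO member names used here are an assumption, so treat it as an outline rather than copy-paste code:

        // Rough sketch (assumed AMO API, placeholder names): register the ASSP assembly
        // with PermissionSet=Safe and ImpersonationInfo=ImpersonateServiceAccount.
        using Microsoft.AnalysisServices;

        class RegisterAssembly
        {
            static void Main()
            {
                Server server = new Server();
                server.Connect("Data Source=localhost");                         // placeholder server
                Database db = server.Databases.GetByName("Adventure Works DW");  // placeholder database

                ClrAssembly assembly = new ClrAssembly("ASSP", "ASSP");
                assembly.PermissionSet = PermissionSet.Safe;
                assembly.ImpersonationInfo =
                    new ImpersonationInfo(ImpersonationMode.ImpersonateServiceAccount);
                assembly.LoadFiles(@"C:\temp\ASSP.dll", false);                   // placeholder path

                db.Assemblies.Add(assembly);
                assembly.Update();
                server.Disconnect();
            }
        }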

 

Published Thursday, April 19, 2007 12:26 AM by mosha