I was reading the blog post "My notes from Microsoft BI Conference" by Vidas Matelis. It is a collection of notes recorded at different presentations, chalk talks etc. I felt that I agree with some of them, disagree with some, and have no opinion about some. I felt like it would be useful to write down my thoughts about these notes. Of course, when I don't agree with them - it doesn't mean that I am right - I am merely expressing my opinion on a matter, which is subjective. I also apologize upfront if any of my comments sound critical - my intent is not to critique, but to open a discussion, as an old saying goes "The truth is born in dispute". I will be citing the selected notes from Vidas's blog and then add my commentary.
- Attribute relationship is important. I have heard that at least 10 times in different sessions.
Indeed. If I were asked what is the single most important advice I could give to the cube designers - it would be to define correct attribute relationships in the dimensions. Note the accent on the word 'correct'. Lately, way too often I stumble upon cubes where people have heard how important attribute relationships are, so that they marked attributes related even if there was no true 1-to-many relationship between them. This could lead to disastrous results. If attributes are not 100% 1-to-many, the numbers in the cube could and will be wrong. They could even be non-deterministic ! Remember, it is very important to define relationships and convert hierarchies to be natural, but only if these relationships truly exist in the data.
- For MOLAP partitions no need to specify slice property, it is detected automatically.
I don't agree with this statement. It is only true for the single member slices. I.e. if we have daily partitions, it is not necessary to tell Analysis Services what day the MOLAP partition was processed for, because it will detect it automatically. But if the slice consists of more than one member - Analysis Services might not pick it up with great precision. What AS will do is to detect a range of Data IDs per attribute. Since the user has no control over how Data IDs are assigned to members, if partition has slice of only two members, it may turn out that their Data IDs are first and the last in the attribute, and the range will cover the entire space.
- If your partition is at the day or week level, no aggregations are possible for month level. Aggregations cannot cross partitions.
It is true that aggregations cannot cross partitions. But, of course, it is not true that aggregations are not possible at higher levels. What this note probably meant to say was that it isn't worth to set aggregations at the level above the one where partition is sliced by a single member.
- Each KPI formula is defined as hidden calculated measures, except when it is just reference to another measure. For better performance consider creating real calculated measures with properly designed calculations and then use these measures in KPI definitions.
Yes, KPIs may create hidden calculated measures, and not just for the KPI value, but for other properties as well - i.e. Goal, Trend etc. I have written about it here - www.sqljunkies.com/WebLog/mosha/archive/2005/12/31/cube_init.aspx. What is not clear from this note is how creating real calculated measures is going to help performance. The truth is - it won't help by itself. What this note probably meant was that if the calculated measures are created manually in the MDX Script, the cube designer has more control. For example, it is possible to specify performance related properties such as NON_EMPTY_BEHAVIOR, or non-performance properties such as FORMAT_STRING. It is possible to use this calculated measure inside SCOPE statement etc. So it gives flexibility both in functionality and in making performance optimizations.
- In multiple sessions I have heard talking about ”VisualTotals” problem. Basically best performance can be achieved for natural hierarchy totals - when children’s parent contains total value. When you use any filters, roles security based on username, subselect statements, etc, visual totals are calculated and performance will be affected.
There is some confusion going on inside this note. For example, what "roles security based on username" has to do with Visual Totals ? What's true is that Visual Totals have different forms - query based with subselects and "Default Visual Mode" connection string property, session based with named sets using VisualTotals function and CREATE SUBCUBE statement, there are Visual Totals which can be defined in the dimension security. But they are not inherently performance destroying. Different forms of Visual Totals have different semantics and different effect on performance and on caching. Putting blank statement that they are bad for performance just isn't right.
- In pre SP2 release you were able to increase YTD calculation performance by using fancy logic (Mosha blog example). After SP2 best performance can be achieved by using simple YTD() function.
I want to clarify this. The comment about "Mosha blog example" probably refers to the following blog post: www.sqljunkies.com/WebLog/mosha/archive/2006/11/17/rsum_performance.aspx. But this is exactly where I describe what a great performance SP2 delivers by using simple YTD() function ! I never fancied computing running sum using fancy logic. I got confused with either Richard Tkachuk or Chris Webb, both of whom blogged about fancy ways to compute running sum. I intentionally don't link to their posts, because I believe that the simpler the MDX expression is - the better.
- Excel 2007 uses subselects and that means visual totals. But you can run run excel in backward compatibility mode
More confusion about Visual Totals. Yes Excel 2007 uses subselects. Yes it means Visual Totals. But even when you run Excel in backward compatibility mode - you still going to get Visual Totals !!! Not through subselects, but through session property "Default Visual Mode". What this note probably meant to say, that when you run Excel in backward compatibility mode, you don't get subselects.
- 64bit machines does not run faster than 32bit. Many older 64 bit machines actually run slower than 32bit. That was fixed recently. 64 bit machines just lets you use more memory.
This is controversial statement. I am sure that in the scenarios that the presenter talked about this is true. However, there are also pieces of the engine code, which are explicitly optimized for 64bit, and where the performance of 64bit is better than 32bit, so it isn't just memory. But YMMV...
- In real life don’t just do full process on cube. Plan how you can improve processing. It is recommended that you submit in separate transactions ProcessData and ProcessIndexes statements instead of ProcessFull.
I wish there were more explanation behind this statement. Real life is complicated. Perhaps sometimes separating ProcessData and ProcessIndexes is the right thing, especially when the goal is to reduce processing window. But running queries on top of the data which doesn't have indexes could be disastrous for performance. So do the right thing for your setup. Don't just plan how you improve processing, plan how you improve overall system performance, and do what is more important in your scenario (for some people it is processing, for some it is queries, for some it is mix).
- Process partition clears cache for all partitions in affected measure group.
This statement makes it look like there are separate caches per partition. In reality data caches are per measure group (for SE) and per cube (for FE).
- If you are using crossjoin, put larger set always as first parameter of crossjoin
I was really puzzled by this one. There must be a justification for it, but I cannot think of it. However, I do agree that order of sets in crossjoin does matter a lot for performance, but it isn't the sizes of sets which are important in my opinion. It is complexity of MDX expressions behind them. The reason is very simple - CrossJoin function usually uses nested loop join algorithm. Therefore the sets with more complex MDX expressions should be the outer sets, and the simpler ones should be inner sets. This can be illustrated on the example from Adventure Works. The first query, where complex Filter is the outer parameter of CrossJoin returns instantaneously
WITH MEMBER Measures.Test AS
Filter([Customer].[Customer Geography].[Customer], LEFT([Customer].[Customer Geography].CurrentMember.Name,3) = "Mar")
SELECT Test ON 0
FROM [Adventure Works]
While changing the order of sets in CrossJoin and putting complex Filter as inner parameter like in the following query, executes for long 6 seconds:
WITH MEMBER Measures.Test AS
,Filter([Customer].[Customer Geography].[Customer], LEFT([Customer].[Customer Geography].CurrentMember.Name,3) = "Mar")
SELECT Test ON 0
FROM [Adventure Works]
- Exist function works on dimensions, not cubes, so they are faster
First, Exists function can work on either dimensions or measure groups. Second, the argument about "faster" is not clear to me. What is faster than what ?
- Fastest filter example: Exists(Customer.Members, Gender.[Male])
Here is even faster expression: Descendants( Gender_Customer_Hierarchy.Gender.[Male], Gender_Customer_Hierarchy.Customer, SELF ). I.e. if you have user natural hierarchy from Gender to Customer (it is natural because customer can have only one gender!), then using Descendants will always be better than Exists. Exists, however, is more universal, since it is usually used on top of attribute hierarchies and doesn't depend on the user hierarchies. But morale here is if you did define user hierarchies - don't be afraid to use them, since they will give boost in performance.
- If you have defined MeasureExpression, no aggregations are used.
That is no aggregation are used to derive the results at higher granularities, but aggregations can be used if the granularity is the same. Measure expressions in this respect behave just like Distinct Count and Many to Many dimensions.
- Kernel profiler could be used to show CPU usage per each Stored Procedure. This might help you find slowly performing Stored Procedures.
Another best practice is to integrate stored procedures with Trace through Context.TraceEvent method. This will give a lot of insight into performance of stored procedure. More details are in my "Best practices for server ADOMD.NET stored procedures" blog.
- Best performance can be achieved using standard aggregation functions instead of rewriting them (Example LastChild). But some of them require Enterprise Edition of SQL Server.
LastChild is not such a good example here, because performance differences between semiadditive measure LastChild and carefully written MDX are not that big. Better example is LastNonEmptyChild, where no matter how it is reimplemented in MDX, the built-in semiadditive measure will always be significantly better. But there is also a counterexample. It is possible to implement AverageOfChildren functionality in such a way that it will perform better than built-in semiadditive function. More details are in this blog post: www.sqljunkies.com/WebLog/mosha/archive/2007/05/27/counting_days_mdx.aspx. (it is not pure MDX solution since it requires creation of auxiliary measure group, but still).