|
|
|
|
-
Whenever I am asked “What is the single most important advice to write efficient MDX”, I always give same answer: The simpler MDX you write – typically the better performance you will get. The reasoning behind this is straightforward: If MDX is simple, it will be easier for the engine to understand and therefore to optimize it. I know that we all tend to think that we can do all kinds of tricks and outsmart engine, and sometimes it is true, but the engine becomes smarter and smarter with every release, and especially in AS2008. So the tricks which could’ve helped before, actually could be hurting more than helping. And now, it seems that I am not the only one saying that ! Mark Whitehorn (my coauthor of “Fast Track to MDX” book) together with Keith Burns have just published a technical whitepaper “Best Practices for Data Warehousing with SQL Server 2008”. This seems to be a whitepaper focused specifically on the improvements in SQL 2008, and it includes section named “Write simpler MDX without worrying about performance”. Nice title, and in addition it has explanation of block computation mode. I have really only couple of minor comments about it: - Block computation mode was not introduced in AS2008, it existed in AS2005 (I have plenty of examples in my blogs). AS2008 though, extended applicability of block computation mode very significantly. - Description of calculated member says that it is “running total of two consecutive years”, but the expression is not for running total, but rather for regular total, and also there is nothing specific about years in it. It will work with Month, Quarter, Day etc just as well. - I didn’t really understand the following statement “In most hierarchical structures we know that if data is present for one cell in 2003, it will be there for all cells in 2003. The trip to see if there is data for the previous period only needs to be carried out once”. I think this statement is really misleading. If data is present for one cell in 2003 – there is absolutely no guarantee that it will be present for all (even for any other) cells in 2003. The diagrams just above this statement show that in fact most of the cells don’t have data. What I think authors tried to say was that MDX function PrevMember doesn’t need to be called for every cell, because [2004].PrevMember will always result in [2003], and AS2008 is smart enough to make this evaluation only once, and then compute the entire subcube of 2003 for non-empty cells only - The example uses VISIBLE=2 property on calculated member. I wonder why…
|
-
MDX Formatting functionality debuted in MDX Studio 0.2.8, and judging by the amount of feedback I got on it – it became the most popular feature of MDX Studio ever ! Couple of people mentioned that for SQL formatting/beautifying needs, they would go online to sites like www.sqlhere.com, which often do reasonable job for MDX as well. So this got me thinking that I could put MDX formatting functionality online as well. With little refactoring of MDX Studio code I was able to hack my first ever ASP.NET application, which became MDX Studio Online Edition. This was a significant milestone, so I promoted version number to 0.3, and also promoted status from Alpha to Beta. The MDX Studio Online can be accessed at http://mdx.mosha.com In addition to “Format MDX” button, it also features “Parse” button, which does exactly the same as desktop version – it shows the MDX parse tree. Just like in desktop version, it can Format and Parse both queries (i.e. SELECT statements) and MDX Scripts (i.e. CREATE MEMBER/SET, SCOPE, assignments etc). The biggest known feature which is lacking from MDX Studio formatting – is support for comments. Currently it just ignores the comments instead of carrying them forward into formatted statements. I still didn’t figured out how to do it, but this issue tops my todo list. Update: Big thanks to Nick Medveditskov who also has written his own MDX Formatter application – he told me the trick he used to preserve comments in formatted MDX, and this functionality is now implemented in MDX Studio deployed online. Being an online application – you can use it right away, without any installation, so it can be especially handful for consultants working at customer site, where software installations are not allowed. Another great benefit of it being online application, is that everybody is always on the latest code, so all the hotfixes and version upgrades go live for everybody immediately (well the downside is that if I break something, then everybody is broken right away and there is no rollback, but this would never happen to me, of course, would it ?) I would like to thanks several people who did smoke testing for this application before it went public and provided lots of very valuable feedback and advice, people from all around the world – Vladimir Stepa (Germany), Vidas Matelis (Canada), Greg Galloway (US), Darren Gosbell (Australia), Ajit Singh (India) and Chris Webb (UK). After all of them gave green light to the current version, I feel OK now to release it to public. As usual please send your feedback about the tool to the MDX Studio forum at http://www.ssas-info.com/forum/MDXStudio
|
-
One of my most popular blog posts was “Analysis Services 2005 protocol - XMLA over TCP/IP”. Previously this information wasn’t available from anywhere else. But things changed today. Microsoft now has engaged in the “Open Specification program”. Citing from the web page: Microsoft is providing open connections to its high-volume products—Windows Vista (including the .NET Framework), Windows Server 2008, SQL Server 2008, Office 2007, Exchange Server 2007, and Office SharePoint Server 2007—as well as additional information so that software developers, business partners and competitors can better interact with these Microsoft products or invent new solutions for customers. As a developer, you now have full access to information about protocols, binary file formats, and other specifications for these products that can be used to create solutions. (Yes, these include Office file format specifications!) Since Analysis Services is part of SQL Server, the protocols for accessing Analysis Services are listed under “Microsoft SQL Server Protocols”. The most interesting one is [MS-SSAS9]: SQL Server Analysis Services Protocol Specification which documents Analysis Services 2005 protocol (and 2008 as well, since protocol didn’t change between these versions). There is also [MS-SSAS8]: SQL Server Analysis Services Protocol Specification which documents Analysis Services 2000 protocol. However, this one would be much harder to go through, since AS2000 had “smart client” architecture, where client code was doing lots of work – MDX parsing, evaluation, caching etc – therefore the protocol between client and server is at much lower level than MDX queries and it is not easy to make sense out of it.
|
-
Darren Gosbell put up a blog post about MDX Formatting, where he describes how he prefers to format MDX to look nice on the screen. So I thought – why to leave to a human something that machine can do. I decided to implement these rules (with some modifications) into MDX Studio. It became MDX Studio 0.2.8 release, available immediately for download at http://cid-74f04d1ea28ece4e.skydrive.live.com/browse.aspx/MDXStudio/v0.2.8 I only had couple of hours to work on it, so it is rough on the edges, and I will appreciate comments in the MDX Studio forum at http://www.ssas-info.com/forum/MDXStudio. But basic staff like calculated members, named sets, subselects etc should work. (One known exception is set aliases). The functionality is accessible through MDX –> Format menu. Here is example from Darren’s blog with member measures.ptd as 'sum(periodstodate([Date].[Calendar].[Month],
[Date].[Calendar].currentmember),[Measures].[Sales Amount] )',format_string = "currency"
select {[Measures].[Measures].[Sales Amount] ,measures.ptd} on 0,
{[Date].[Calendar].[Month].&[2003]&[2].children} on 1 from [Adventure Works] where
([Product].[Product Categories].[Category].&[1])
And here is how it looks after autoformatting:
WITH
MEMBER measures.ptd AS
sum(
periodstodate(
[Date].[Calendar].[Month],
[Date].[Calendar].currentmember),
[Measures].[Sales Amount])
, format_string = "currency"
SELECT
{
[Measures].[Measures].[Sales Amount],
measures.ptd
} ON 0,
[Date].[Calendar].[Month].&[2003]&[2].children ON 1
FROM [Adventure Works]
WHERE
[Product].[Product Categories].[Category].&[1]
|
-
Every year around March or April I am looking forward the OLAP Report site to see the OLAP market share numbers. Computing these numbers is not a simple task as Nigel Pendse explains on http://www.olapreport.com/market.htm, and this kind of analysis really is not available anywhere else. This year again, I was waiting for the 2007 results to see by how much Microsoft increased its market share. March and April passed, then May passed, we are well into June, and the results didn’t show up. So I contacted Nigel, and here is what he told me (publishing with permission): Mosha, Unfortunately, I don't think it's possible to calculate accurate OLAP market shares any more. Applix, Business Objects, Cognos and Hyperion were all acquired in the last year, and therefore no longer publish figures, so it's no longer possible to work out their OLAP revenues. Furthermore, in the year of acquisition, their figures are particularly distorted (sales people work very hard to close deals just before the acquisition, because they don't expect to keep the commission afterwards, so sales are high the quarter before the deals close, and weak the following quarter). Of, course, the disruption doesn't help sales in those companies, so I don't doubt that Microsoft increased its lead, its just that I don't have any good way any longer of calculating it. Generally, the new owners of those companies are not prepared to disclose product revenues. Regards Nigel So after the wave of last year consolidations, it is no longer possible to have accurate estimates. But even without precise numbers, how the future looks like ? (emphasis is mine) I may try again in the future, but at the moment, I can't think of a reliable way to calculate OLAP market shares. I think it's also less interesting than in the past, with fewer significant vendors remaining. Microsoft is clearly going to stay #1. Oracle+Hyperion will probably stay at #2, but its share is unlikely to rise. IBM (Cognos+Applix) will stay at number 3; its share may rise a bit, particularly if IBM gives TM1 the promotion that Applix could not afford to do, but I think IBM will stay in #3. SAP+BOBJ will stay in the next position. MicroStrategy may gain share, but is unlikely to overtake SAP+BOBJ. No other vendor comes close, so the top five aren't likely to change position any time soon. Anyway, the picture below shows how the market share trends looked like until 2006 (source: OLAP Report) Calculating the market shares for consolidated companies based on 2006 positions | | 2006 Market share | | Microsoft | 31.6% | | Oracle+Hyperion | 21.7% | | Cognos+Applix=IBM | 16.5% | | SAP+Business Objects | 13.1% | | Microstrategy | 7.3% | Update: Turns out the numbers in this table are not accurate, because I forgot to account that Business Objects had acquired Cartesis and that SAP had acquired OutlookSoft. Anyway, Nigel has updated his OLAP market share page with deeper analysis of 2007 consolidation effect on market shares, and he has more accurate numbers.
|
-
How many elements in the set have distinct values ? This sounds like a simple question. Apparently, this is well known problem in Excel, and there is a classic solution for it. There are plenty of sites in Internet (for example here and here) which give the following solution: =SUM(1/COUNTIF(A1:A6,A1:A6)))
(note that this formula needs to be entered with Ctrl-Shift-Enter to indicate that this is array formula, or it won’t give right result – one of the Excel quirks)
When someone showed me this solution, I just stared at it and I couldn’t understand how it worked. Only rewriting it in MDX clarified what was going on. Let’s take the following example based on Adventure Works – we want to count how different letters the product names start from (just wanted to pick something for which Product dimension didn’t have dedicated attribute, because otherwise it would’ve been too simple). The MDX which performs the same algorithm as Excel formula above in these conditions will look like following:
with
member measures.CountDistinctLetters as
Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
1/Filter([Product].[Product].[Product].MEMBERS,
VBA!Left(AllProducts.Current.Name,1) = VBA!Left([Product].[Product].CurrentMember.Name,1)
).Count)
select CountDistinctLetters on 0
from [Adventure Works]
It is more verbose, but probably it is not simpler to understand to normal developer than the Excel’s formula. In the nutshell the idea is following: For every element in the set, we compute how many other elements produce the same value (this is COUNTIF in Excel, and Filter(…).Count in MDX), and then dividing 1 by this number we get fraction of current element. The summing up all these fractions will give the desired number.
(Unfortunately there is no easy way to debug such formulas. The MDX debugger inside MDX Studio is not capable of dealing with such expressions. I actually spent quite a time thinking about how to build decent environment for debugging MDX, and after few failed starts I had what I consider as breakthrough idea. Early experiments were very encouraging, and if this idea works, I will blog soon about it. It radically different from all other MDX debugging attempts.)
It is clear that performance of such solution is horrible. For every element in the set, we go again and again over all elements in the set just to find out how many of them have same value. The algorithmic complexity of this nested loop approach is O(n^2). Indeed the query above executes for 12 seconds. It is possible to slightly improve it, by forcing caching the results of VBA!Left through
with
member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
member measures.CountDistinctLetters as
Sum([Product].[Product].[Product].MEMBERS AS AllProducts,
1/Filter([Product].[Product].[Product].MEMBERS,
(AllProducts.Current,Prefix) = Prefix
).Count)
select CountDistinctLetters on 0
from [Adventure Works]
and reducing time to 11 seconds, but obviously, different, more sane approach is needed here.
Another idea is instead of doing nested-loop algorithm, sort the elements of the set, and then scan sorted set once counting number of different elements. The only technical difficulty with that would be - while scaning sorted set, we need to compare current element with the previous one. Getting current element is easy, but how to get previous one ? There is no "Previous" function in MDX, there is only PrevMember, but it works on hierarchical ordering of the level, and here we have arbitrary set sorted by arbitrary criteria. The solution is to use CurrentOrdinal function, which returns the index of the tuple being iterated in the set, and then using CurrentOrdinal-1 as a new index, send it to Item function. In MDX this would be recorded as
with
member Prefix as VBA!Left([Product].[Product].CurrentMember.Name,1)
set ProductsSortedByPrefix as order([Product].[Product].[Product].MEMBERS, Prefix, BASC)
member measures.CountDistinctLetters as Sum(ProductsSortedByPrefix as y,
iif(y.CurrentOrdinal = 1 OR Prefix <> (Prefix, y.Item(y.CurrentOrdinal-2)), 1, 0))
select CountDistinctLetters on 0
from [Adventure Works]
This query executes much faster now, in only 0.234 seconds. The algorithmic complexity is down to O(n*log(n)). However, it is still not perfect solution, since it is possible to solve this in just one scan of the original set, i.e. with O(n) complexity. In order to do that, we will have to write stored procedure though.
public int CountDistinctValues(Set set, Expression exp)
{
System.Collections.Hashtable ht = new System.Collections.Hashtable();
foreach (Tuple t in set.Tuples)
{
string v = exp.Calculate(t).ToString();
if ( !ht.ContainsKey(v) )
ht.Add(v, null);
}
return ht.Count;
}
Now, if we use this stored procedure in the query as following
with member CountDistinctLetters as ASSP.ASStoredProcs.Util.CountDistinctValues(
[Product].[Product].[Product].MEMBERS,
VBA!Left([Product].[Product].CurrentMember.Name,1))
select CountDistinctLetters on 0
from [Adventure Works]
we get it working in only 0.062 seconds.
|
-
BI Survey 8 is up and ready to accept participants. It’s hard to believe that we are at version 8 of what started as OLAP Survey 1 back in 2000. (On the other hand, my daughter also recently turned 8, she was born in 2000 too, and I still cannot believe that she is so big now, just finished second grade, it seems only yesterday she was a little baby). Anyway, over these years the OLAP/BI Survey has grown and matured, and every year many people including myself await the results and then carefully study them. So I encourage everybody to fill it, whether you use Microsoft OLAP, or some other OLAP system, whether you are happy or dissatisfied with it. More inputs – the better. Below is the official invitation from Nigel and Co:
We would very much welcome your participation in The BI Survey, conducted annually by Nigel Pendse. This is the largest independent survey of business intelligence/OLAP users worldwide. The Survey will obtain input from a large number of organizations to better understand their buying decisions, the implementation cycle and the business success achieved. Both business and technical users, as well as vendors and consultants, are welcome.
The BI Survey is strictly independent. While Microsoft and other vendors assist by inviting users to participate in the Survey, the vendors do not sponsor the survey, nor influence the questionnaire design or survey results. You will be able to answer questions on your usage of a BI product from any vendor. Your data will only be used anonymously, and no personal details will be passed to vendors or other third parties.
As a participant, you will not only have the opportunity to ensure your experiences are included in the analyses, but you will also receive a summary of the results from the full survey. You will also have a chance of winning one of ten $50 Amazon vouchers. Click here to complete the survey on-line.
English version: http://www.eu-survey.com/BI_8/BI.asp?lan=1&lin=22
German version: http://www.eu-survey.com/BI_8/BI.asp?lan=2&lin=120
Spanish version: http://www.eu-survey.com/BI_8/BI.asp?lan=3&lin=213
|
-
My post “MDX answer to NextAnalytics challenge” triggered some very active discussion, and I would like to address some of the comments Chris Webb said: I've done a fair amount of teaching MDX and the truth is that people do find even the basics very difficult to understand, but my feeling is that the problem people have with MDX is not the language as such but thinking a) multidimensionally and b) in sets. The question is then, does Ward's new language solve either of these problems? Or has he reinvented the wheel? I agree with the a) part, but not with the b) part. Thinking multidimensional is hard and take some time to get used to. However thinking in sets is not hard at all. SQL language is mostly based on relational algebra and uses sets extensively. Practitioners don’t find sets difficult. Thomas Ivarsson said: Can you publish the entire script at the end of this blog post? My conclusion is that we need better clients that can create these analytics sets at run time. End users will never do this and it is impossible to create all sets that the business users would like to have. It will fill up a cube in no time. I am pessimistic that we will ever see such sophisticated clients. If they were in demand, we would’ve already had them. I am afraid the users will have to code such analytics either using MDX or NextAnalytics script or something else. End users don’t have any problem building such sets in Excel, so maybe Excel is our hope ? Here is the entire script: // Nextanalytics challenge
// calculate the rolling six month average
CREATE Rolling6MonthAverage;
(Rolling6MonthAverage, [Date].[Month Name].[Month Name])
= Avg([Date].[Month Name].Lag(5):[Date].[Month Name], [Measures].[Internet Sales Amount]);
// calcualte the growth of that
CREATE GrowthOfRolling6MonthAverage = Rolling6MonthAverage - (Rolling6MonthAverage, [Date].[Month Name].PrevMember);
// calculate growth
CREATE GrowthOfSales = [Measures].[Internet Sales Amount] - ([Measures].[Internet Sales Amount], [Date].[Month Name].PrevMember);
// count each "growth" calc to the row average
CREATE HowManyTimeGrowthExceededRollingAverage =
Count(
Filter(([Date].[Calendar Year].&[2003],[Date].[Month Name].[Month Name]), GrowthOfSales > GrowthOfRolling6MonthAverage));
CREATE SET ProductsThatGrewFasterThanRollingAverageMore6OrMoreTimes AS
Generate([Customer].[Customer Geography].[Country],
CrossJoin([Customer].[Customer Geography],
Filter(
[Product].[Product].[Product],
HowManyTimeGrowthExceededRollingAverage >= 6
)
)
);
SELECT {} ON 0
, ProductsThatGrewFasterThanRollingAverageMore6OrMoreTimes ON 1
FROM [Adventure Works]
// create growth from start to end of period
CREATE GrowthFromStartToEnd =
([Measures].[Internet Sales Amount], ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]))
- ([Measures].[Internet Sales Amount], OpeningPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]));
CREATE SET ProductsThatGrewMoreThan25PctFromStartToEnd AS
Generate([Customer].[Customer Geography].[Country],
CrossJoin([Customer].[Customer Geography],
Filter([Product].[Product].[Product],
GrowthFromStartToEnd > 0.25*([Measures].[Internet Sales Amount], OpeningPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]))
)
)
);
SELECT {} ON 0
, ProductsThatGrewMoreThan25PctFromStartToEnd ON 1
FROM [Adventure Works]
SELECT {} ON 0
, Intersect(
ProductsThatGrewFasterThanRollingAverageMore6OrMoreTimes,
ProductsThatGrewMoreThan25PctFromStartToEnd ) ON 1
FROM [Adventure Works]
Miky Schreiber said:
I agree with Thomas. The question is how difficult it is to the end user. While the regular end user will never write MDX or NextAnalytics script, the advanced user will. Now, the question is what will be more simple for him - MDX or NextAnalytics script.
Valid point. But I don’t think there is an absolute measure of “what is simpler” – practitioners will tend to use the tool that they are more familiar with. MDX is probably more popular than nextanalytics, but SQL is way more popular than either one of these. And Excel will easily rival SQL in familiarity. On the other hand, developers will want to use “real” programming language. And the question of “what is simpler” will raise there as well. While functional languages like F# are clearly more suited for such tasks, more people are familiar with conventional languages like C#.
Update: Ward published the entire nextanalytics script for this problem, so constrast and compare it yourself: http://www.nextanalytics.com/public/Examples/MDX-Challenge.html
And finally Ward wrote long comment, from which I will cite selectively
Ward Yaternick said:
I also don’t agree with your implicit assumption that the cube and other infrastructure parts are insignificant. To put a cube together from scratch and maintain it is a big decision and cost.
…
Therefore, I think you have conclusively proved that (1) if the cube exists and (2) someone can author MDX then MDX does deliver on complex analytics.
I do agree with that. I wasn’t trying to say that setting up a cube was trivial. However, if the cube was set up for some particular purpose, it is very easy to extend it with MDX calculations to get additional value out of it.
nextanalytics can help in these situations because the data can come directly from SQL or CSV files and the incremental IT overhead behind an analytic is very low. You can feasibly write an analytic like what you did with MDX, run it against a CSV file or a simple SQL query, and then throw it away or keep it (all contained as lines in single text file) in a folder somewhere.
This certainly looks like a more agile way of doing things – quickly throw small script together, run it and move on. I think this approach will be attractive to lots of users – from small IT shops to independent users in bigger organizations. This is at odds, however, with classic methodology of building centralized data warehouse, one version of truth and all that. It doesn’t make it less viable, of course, it is just a tradeoff that someone will have to make – tradeoff between chaos and control, or, in Eric Raymond’s words, between bazaar and cathedral.
Another situation in which a cube isn’t ideal are when ISVs and consultants want to integrate analytics into their applications but they need to minimize the number of other vendor’s licenses and reduce the complexity and cost of the installed products. This priority is the same when an IT budget is under constraint.
I think this is somewhat weak argument, because it implies that both complexity and pricing of NextAnalytics is below this of other products, like Microsoft Analysis Services. But this is definitely something that practitioners will need to evaluate and compare.
Sometimes, the analytic data processing needs to run on a unix box against something like MySQL or Oracle i.e. where Microsoft OLAP isn’t available
Just wanted to note, that even though data could be in Unix box, in Oracle or MySQL, Microsoft OLAP doesn’t have problem connecting to it. Oracle is fully supported data source, and many people were successful reading data from MySQL too.
Perhaps they need an Open Source client to make integration easier with their dashboard or portal. For this reason, they can’t use any of the popular OLAP Clients
I didn’t quite get this. There are plenty of Open Source clients which connect just fine to Microsoft Analysis Services (usually through XMLA, but not only). Since you mentioned Pentaho earlier, I wanted to note that Pentaho Analysis Views are based on JPivot, which works fine with Microsoft OLAP. There are many other open source clients for Microsoft OLAP as well.
All this lands us where nextanalytics services it’s niches. By now I hope you can see our positioning. We’re not competing with MDX especially in the conditions you wrote about.
Yes, I think I see the positioning, and I believe that both your product and your approach have good potential, regardless of what MDX can and cannot do. And I actually think that it does compete, not just with MDX, but with OLAP and with Data Warehousing in general. And this is a good thing. Competition is healthy, it will force everybody to improve their products.
Update: Nextanalytics replied at http://www.nextanalytics.com/public/Examples/MDX-Challenge.html
|
-
NextAnalytics is a young BI company founded by Ward Yaternick of OLAP@Work fame. He has a blog dedicated to his new product, and while I was reading it, I ran across the post named “Can a business intelligence product be used to answer analytic questions?”. The main premises of this post was that real-life non-trivial calculations are very difficult with traditional BI tools using SQL or MDX, but can be done in much simpler fashion using NextAnalytics product. Ward then gives very concrete example:
That’s the crux of the problem: The nature of analytics often means that the need evolves. It’s either because the analytics identified the problem or the analytic led to other questions. By their nature, analytics are iterative and sequential. For example:
-
I want to see the fastest growing products but it's more complicated than just telling me one measurement.
-
for each region, I want to see which products were growing faster than the rolling six month average, at least six times out of the last N periods.
-
next, again for each region, of those products that were just discovered, which ones grew by more than twenty five percent from the start of the period to the end of the period.
-
I want to know the intersection of those two sets of products across all regions.
-
By the way, I won’t need these questions answered until next year at this time, although I might change them a bit.
Certainly this looks like something that can be done with basic MDX. Ward, however, stated that
That’s an example of one analytic combined with another, iteratively and sequentially. Neither of which would be possible in MDX or SQL without the creation of a complex underlying data structure. Notice that this isn’t simply an OLAP “drill down” or a “slice and dice”. These are distinct questions, with the answers needing to be merged and intersected. This kind of query would take only a few hours in nextanalytics, or a few months in a BI tool.
I left a comment to the blog saying that I disagreed with Ward’s statement, and unless by “complex underlying data structure” he meant regular OLAP cube, I thought this example can be easily done in MDX. Ward took his time to reply to my comment, going into detail how this would be done with nextanalytics:
Allow me to describe the processing a bit better in a pseudocode style: // the column axis has time periods, the row axis has regions and products. for each region { // part 1 calculate the rolling six month average calcualte the growth of that count each "growth" calc to the row average keep only rows above average six or more times // part 2 remove all time periods except the first and last calculate growth remove any below 25 % // part 3 intersect part 1 and part 2 }
And then he threw a gauntlet, challenging me to do the same in MDX as easily as the pseudocode above.
Note that in part 1, the output of the rolling average becomes the input to the growth calculations. Now that I've been more clear, do you still think this is achievable in MDX? Is it fair of me to qualify your claim by saying "In a single MDX expression?" "someone with a normal amount of MDX education" and "not having to create intermediate data structures in a server or disk"? Because, if it takes any of that, then my blog's point still stands IMO. To do all that with nextanalytics is about just as many lines as what I just typed above. No external tables, no external cubes. And, the nextanalytics "code" is still readable a year later and easily modified.
Well, I decided to pick up this challenge, and write complete implementation of this problem in MDX using Adventure Works sample database. I also decided to time myself to see how much time it will actually take – to have real answer to the claim that it would take “few months in a BI tool”. This weekend it was mix of rain and clouds in Seattle, so my family didn’t go backpacking, and I thought that I would use 2 hour window between teaching my son to bike on two wheels and going to watch “Chronicles of Narnia: Prince Caspian”, to implement this. This blog presents the results of this experiment.
I used MDX Studio to develop all of the code. One nice thing about it, is that it allows mixing MDX Script type of statement, which I used to define calculated measures with SELECT statements. In order to do it, I just specified “Cube=Adventure Works” in the connection string, and everything worked fine. (Of course, I also could’ve used more traditional syntax of CREATE MEMBER statement, but I am too used to MDX Scripts now).
I went line by line over the Ward’s pseudocode writing MDX for it. I didn’t try to write the super-optimized MDX, and I didn’t do any smart tricks – I just wanted to write natural, most straightforward MDX, that average MDX developer would’ve done in my place.
calculate the rolling six month average
There are several techniques how to approach it, and I wrote both chapter in my “Fast Track to MDX” book and blog entry about it, but let’s take the simplest approach: // calculate the rolling six month average
CREATE Rolling6MonthAverage;
(Rolling6MonthAverage, [Date].[Month Name].[Month Name])
= Avg([Date].[Month Name].Lag(5):[Date].[Month Name], [Measures].[Internet Sales Amount]);
This leaves the calculated measure to be NULL at the attributes above (or unrelated to) Month, and computes rolling 6 month average for Month and all attributes to which Month is related (i.e. Day etc).
calcualte the growth of that
That’s the part where Ward made a comment of “Note that in part 1, the output of the rolling average becomes the input to the growth calculations”. I just wanted to note, that this is absolutely routine in MDX // calcualte the growth of that
CREATE GrowthOfRolling6MonthAverage = Rolling6MonthAverage - (Rolling6MonthAverage, [Date].[Month Name].PrevMember);
count each "growth" calc to the row average
The way I interpret this is we need to compare sales growth to the growth of rolling average and then count how many times sales growth exceeded growth of rolling average. For simplicity of reading, I broke this into two calculated measures, one computing sales growth, and another one doing counting. Of course, it could’ve been done in one calculated measure too, but this way it is a little more natural to read. // calculate growth
CREATE GrowthOfSales = [Measures].[Internet Sales Amount] - ([Measures].[Internet Sales Amount], [Date].[Month Name].PrevMember);
In the original posting, Ward said “for each region, I want to see which products were growing faster than the rolling six month average, at least six times out of the last N periods.” So I really had choose what N periods would be in my example – and I choose that we are interested in the months of 2003. // count each "growth" calc to the row average
CREATE HowManyTimeGrowthExceededRollingAverage =
Count(
Filter(([Date].[Calendar Year].&[2003],[Date].[Month Name].[Month Name]), GrowthOfSales > GrowthOfRolling6MonthAverage));
keep only rows above average six or more times
This is simple filter, but since we need to find products for every region, we also need to run Generate over regions. For Adventure Works, I choose Country attribute to represent regions: // keep only rows above average six or more times
CREATE SET ProductsThatGrewFasterThanRollingAverageMore6OrMoreTimes AS
Generate([Customer].[Customer Geography].[Country],
CrossJoin([Customer].[Customer Geography],
Filter(
[Product].[Product].[Product],
HowManyTimeGrowthExceededRollingAverage >= 6
)
)
);
My understanding of nextanalytics scripting language is that it computes static sets in certain context, and this is why I used CREATE SET statement in MDX. However, MDX is obviously capable of computing such a set completely dynamically, depending on the current context, slices, subcubes etc. remove all time periods except the first and last calculate growth
Aha, here we have an opportunity to replace two lines of nextanalytics script with just one line of MDX. We don’t need to remove anything to calculate growth from the start of period to its end. Assuming again that the period is months in 2003, we would just use OpeningPeriod and ClosingPeriod MDX functions. // create growth from start to end of period
CREATE GrowthFromStartToEnd =
([Measures].[Internet Sales Amount], ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]))
- ([Measures].[Internet Sales Amount], OpeningPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]));
remove any below 25 %
One way to do it is to create calculated measure which computes relative growth as percentage, but then we would need to deal with division by zero handling, so it is simpler to code the 25% rule inside the Filter itself. CREATE SET ProductsThatGrewMoreThan25PctFromStartToEnd AS
Generate([Customer].[Customer Geography].[Country],
CrossJoin([Customer].[Customer Geography],
Filter([Product].[Product].[Product],
GrowthFromStartToEnd >
0.25*([Measures].[Internet Sales Amount],
OpeningPeriod([Date].[Calendar].[Month], [Date].[Calendar].[Calendar Year].&[2002]))
)
)
);
intersect part 1 and part 2
Just to prove the point, let’s use Intersect function. However, better approach would’ve been to use the Generate(Filter()), and combine the conditions inside Filter using AND. The following query shows final set of products by regions: SELECT {} ON 0
, Intersect(
ProductsThatGrewFasterThanRollingAverageMore6OrMoreTimes,
ProductsThatGrewMoreThan25PctFromStartToEnd ) ON 1
FROM [Adventure Works]
Hopefully this exercise will convince Ward that MDX is pretty powerful, yet simple language which allows for complex analysis. But how long did it take to implement all this logic. I measured myself, and it only took 21 minutes of my time to write all of the MDX statements. However, this also included about 4 minutes to arbitrage dispute between my kids, so net spent time was 17 minutes. Factoring that I am probably faster at writing MDX than average, I estimate that it would take around 1 hour to somebody with “normal amount of MDX education”. Much better than “few months in BI tool” and also somewhat better than “few hours in nextanalytics”.
|
-
We usually treat caching system of Analysis Services as a black box, trusting that it will do the "right thing" to optimize the execution. And this is how things should be in the ideal world, caching system should be completely transparent to the end user. However, even though MDX query optimizer and caching system are very sophisticated, they are not perfect. In this article we will see how with very simple MDX rewrites, we can take better advantage of caching, and increase performance of MDX calculations significantly. (In order to follow examples in this article, you will need to use MDX Studio tool) As an example we will use very simple statistical analysis of the data, where we would apply "Three sigma rule" to find outliers across combination of attributes. It is based on the fact that for data which conforms normal distribution almost all of the values (99.7% of them) will lay within 3 standard deviations of the mean. So the values which are farther from the mean by more than 3 standard deviations can be considered outliers. (There are of course more advanced methods to find outliers in the data, but for the purpose of this article this is representative enough). We will start with the analysis of Internet Sales by day inside Adventure Works cube. The MDX to find out "farther than three sigma" days seems straightforward with
member sales_avg as Avg ([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
member sales_stdev as StDev([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
select { [Internet Sales Amount] } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]
If we execute this query we will get the list of 10 outlier dates. But the query takes 7.5 seconds to execute. This is horrible ! 7.5 seconds to go through 1158 members seem completely unreasonable. To get a better idea about what's going on, let's take a look at some execution statistics collected by MDX Studio: Time : 7 sec 417 ms
Calc covers : 1164
Cells calculated : 1344448
Sonar subcubes : 3
SE queries : 1159
Cache hits : 1159
Cache misses : 2
Cache inserts : 2
Cache lookups : 1161
Now we can understand what was going on here. For each one of these 1158 members, the engine recalculated both average and standard deviation, even though they were exactly the same at each point (it is easy to verify through debugger in MDX Studio). The problem is that the engine didn't figure out itself that average and standard deviation were the same for all dates, so we need to help it a little bit. We can redefine sales_avg as a calculation which goes up to All member in Date attribute. This way regardless at which date we are looking right now, the cell coordinate will shift to the All member, and will be computed only once, because all the other times it is referenced it can be answered from FE cache. Calculation rewritten with this trick in mind will look the following: with
member sales_avg_ as Avg ([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
member sales_stdev_ as StDev([Date].[Date].[Date], [Internet Sales Amount]), format_string = 'currency'
member sales_avg as ([Date].[Date].[All Periods],sales_avg_), format_string = 'currency'
member sales_stdev as ([Date].[Date].[All Periods],sales_stdev_), format_string = 'currency'
select { [Internet Sales Amount] } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]
If we execute it now, it is done in mere 0.1 fraction of second. The MDX Studio stats look like following Time : 109 ms
Calc covers : 9
Cells calculated : 4642
Sonar subcubes : 3
SE queries : 2
Cache hits : 4
Cache misses : 2
Cache inserts : 2
Cache lookups : 6
This is much better. Instead of SE query per cell, we got only 2 SE queries (one for axis and one for avg/stdev calculation). There are also significantly less cells calculated, since both average and standard deviation are calculated only once now and served from the cache.
Let's make the problem a little bit complex now. We computed statistics across all 4 years, and discovered that all the outlier dates were in 2004. This is probably attributed to the growth over time. So let's compute statistics differently now, instead of computing them for all dates in 4 years, let's compute them within each month, and find outliers within the month. The straightforward approach to this can be coded in MDX as following. with
member sales_avg as Avg ([Date].[Calendar].Parent.children, [Internet Sales Amount]), format_string = 'currency'
member sales_stdev as StDev([Date].[Calendar].Parent.children, [Internet Sales Amount]), format_string = 'currency'
select { [Internet Sales Amount], sales_avg, sales_stdev } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]
Executing this query now finds two outliers:
| |
Internet Sales Amount |
sales_avg |
sales_stdev |
| July 22, 2001 |
$38,241.29 |
$15,270.59 |
$6,747.86 |
| December 8, 2001 |
$55,454.04 |
$24,371.87 |
$10,146.40 |
And the MDX Studio stats will be Time : 704 ms
Calc covers : 1206
Cells calculated : 38792
Sonar subcubes : 41
SE queries : 1198
Cache hits : 1198
Cache misses : 2
Cache inserts : 2
Cache lookups : 1200
Again, the stats don't look good. Too many SE queries, too many cells calculated. How can we apply the same technique as before ? Moving coordinate to the All dates will be incorrect now, because average and standard deviations are now different for every month. Therefore we need to move coordinates to the month level. The most natural way of doing it is simple move call to the Parent function from within the Avg/Stdev calls into the tuple coordinate shift transformation. Here is how the resulting query will look like: with
member sales_avg_ as Avg ([Date].[Calendar].children, [Internet Sales Amount]), format_string = 'currency'
member sales_stdev_ as StDev([Date].[Calendar].children, [Internet Sales Amount]), format_string = 'currency'
member sales_avg as ([Date].[Calendar].Parent,sales_avg_), format_string = 'currency'
member sales_stdev as ([Date].[Calendar].Parent,sales_stdev_), format_string = 'currency'
select { [Internet Sales Amount], sales_avg, sales_stdev } on 0
, filter([Date].[Date].[Date], abs([Internet Sales Amount]-sales_avg) > 3*sales_stdev) on 1
from [Adventure Works]
The stats from this rewrite will be Time : 203 ms
Calc covers : 162
Cells calculated : 6954
Sonar subcubes : 117
SE queries : 78
Cache hits : 78
Cache misses : 2
Cache inserts : 2
Cache lookups : 80
Which is much more reasonable result.
| | |