
Microsoft OLAP by Mosha Pasumansky

  • Good Bye BI

    "Every new beginning comes from some other beginning's end" (Seneca)

    I spent a big chunk of my professional career working on what we today call BI. From Panorama, coding some of the innovative, cutting-edge (back then) algorithms, to Microsoft with the mission "Bring BI to the masses", inventing MDX and developing what became the best and most widely used OLAP server in the industry, to working on huge-scale distributed BI systems for Microsoft's online division.

    And then a year ago, I decided to do something else. So last February I joined the team working on a search engine, which shipped this summer - bing. It is a fascinating field, with so many interesting things to do, and I quickly became entirely consumed by it. This left pretty much no time to even follow what is happening in the BI industry, let alone participate in it. So I bid farewell to the BI world, and wish all the best to all of you.

     

    Some of the legacies that are still relevant:

    - This blog: http://sqlblog.com/blogs/mosha/. Most of the questions I get (and I still get them daily) can be answered by one of the posts I made in the past 5 years.

    - MDX Studio: http://www.ssas-info.com/forum/3-mdx-studio. The latest release is 0.4.14, and it can be downloaded from http://cid-74f04d1ea28ece4e.skydrive.live.com/browse.aspx/MDXStudio/v0.4.14?sa=694070447

    - Materials from the "Deep dive to MDX" presentation at SQL PASS 2008: http://cid-74f04d1ea28ece4e.skydrive.live.com/browse.aspx/Deep%20Dive%20to%20MDX.

    Thanks and good bye.

    Mosha. 

  • WolframAlpha

    Integration of BI and Search became a popular subject a couple of years ago. No doubt it was heavily influenced by the “Google revolution”. Search is a hugely successful application. It features a simple and natural UI - free-form text - yet it is extremely powerful and gives access to pretty much all of the world’s freely available information (i.e. on the Internet). That sounds very close to the mission of BI. So several BI vendors tried to jump on the Search bandwagon and integrate search-like functionality into their products. None were very successful with it – it didn’t catch on.

    Having worked on BI for quite a while, and recently having worked on Search, I grew more and more skeptical that meaningful integration of BI and Search was possible. It just seemed too difficult to do anything non-trivial. (And it didn’t help to remember the English Query fiasco, even though its last version could talk to SSAS cubes.)

    But then today I saw WolframAlpha, and I was completely awed by it. If it can do even half of the things that you see in the promotional video – it is pretty amazing. And with Stephen Wolfram’s genius behind it – I have no doubt it really can do all of that. It launches tomorrow, so we will have a chance to try it out for real; in the meantime, watch this screencast of what it promises to do: http://www.wolframalpha.com/screencast/introducingwolframalpha.html.

  • Deep dive to MDX presentation - slides

    During the SQL PASS 2008 summit, I gave a full-day preconference seminar, “Deep Dive to MDX”. The evaluation forms have now arrived, and the scores for the content, relevance and depth of the seminar are very good. The scores are not as high, however, for the time allocated (people argued that it should’ve been 2 days instead of one) and for the materials (since nobody received any materials). Both points are valid: one day turned out not to be enough, and perhaps two days wouldn’t have been enough either, since I got through less than half of the material I had prepared. Also, due to some issues, I couldn’t hand out the presentation material – slides and MDX queries/calculations. These issues are cleared up now, and I will start uploading the slides from the presentation. I don’t suppose they will be very useful for people who didn’t attend the session, since the slides really only contain the major talking points and the rest was delivered verbally, but they probably will be very useful for people who did attend.

    There was time to cover the following subjects:

    • Overall Architecture
    • Subcubes
    • Sonar
    • Storage Engine query plan
    • Storage Engine cache
    • Formula Engine query plan
      • Block mode
    • Formula Engine cache
    • Sets

    A more detailed table of contents can be found here.

    I will do a series of posts, uploading one subject at a time. The first subject is “Overall Architecture”, and it gives a high-level picture of the components discussed further in the presentation. The slides are here.

    A slightly unrelated announcement: MDX Studio 0.4.11 was also released today, with multiple UI enhancements, docked windows (Visual Studio style), etc. Versions for SSAS 2005 and SSAS 2008 are available and can be downloaded from here.

  • AS2008 MDX: subselects and CREATE SUBCUBE in non-visual mode

    There were not very many changes to MDX syntax in AS2008, and Vidas Matelis described most of them in his blog here (for his list of all changes in AS2008, check this blog entry). I just noticed that there is at least one more change which Vidas didn’t include, but which is somewhat important: the ability to define CREATE SUBCUBE and subselects in non-visual mode.

    To understand this, we need to understand what CREATE SUBCUBE does. It really does two things:

    1. It applies implicit Exists between the set in CREATE SUBCUBE and the query axes and query named sets, including top-level set functions (for more details see the blog post "Slicer and axis interaction in MDX Part 2 - Implicit Exists" – even though it talks about the WHERE clause, everything from that entry applies equally to subselects and CREATE SUBCUBE as well).

    2. It applies visual totals to the values of physical measures, even within expressions, if there are no coordinate overwrites.

    Let’s look into this more deeply. Consider the following example:

    CREATE SUBCUBE [Adventure Works] AS 
     (SELECT 
        {
           [Customer].[Customer Geography].[City].&[Redmond]&[WA]
          ,[Customer].[Customer Geography].[City].&[Seattle]&[WA]
        } ON 0 
      FROM [Adventure Works])

    We are creating a filter to include only two cities – Redmond and Seattle. Now, if we send the following query:

    WITH 
     MEMBER [Measures].[Gross Margin] AS [Measures].[Internet Gross Profit]/[Measures].[Internet Sales Amount]
       ,FORMAT_STRING = 'Percent' 
    SELECT {[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit],[Measures].[Gross Margin]} ON 0
    , [Customer].[City].MEMBERS ON 1
    FROM [Adventure Works]

                    Internet Sales Amount   Internet Gross Profit   Gross Margin
    All Customers   $153,989.23             $64,075.99              41.61%
    Redmond         $78,824.37              $33,150.36              42.06%
    Seattle         $75,164.86              $30,925.63              41.14%

    We will observe two things. First, we only get two cities on the row axis – that’s the implicit exists part. Second, the values for the physical measure Internet Sales Amount, the cube-defined calculated measure Internet Gross Profit and the query-defined calculated measure Gross Margin are all visual totals, i.e. they are all computed using data for only the two cities defined by the CREATE SUBCUBE command.

    AS2008 adds the capability to keep only the implicit exists behavior and to return real totals as opposed to visual totals. The syntax is described in the MSDN documentation, and in our example it will be as follows (note the NON VISUAL keywords):

    CREATE SUBCUBE [Adventure Works] AS NON VISUAL
     (SELECT 
        {
           [Customer].[Customer Geography].[City].&[Redmond]&[WA]
          ,[Customer].[Customer Geography].[City].&[Seattle]&[WA]
        } ON 0 
      FROM [Adventure Works])

    Now if we execute the same query, the result is different:

                    Internet Sales Amount   Internet Gross Profit   Gross Margin
    All Customers   $29,358,677.22          $12,080,883.65          41.15%
    Redmond         $78,824.37              $33,150.36              42.06%
    Seattle         $75,164.86              $30,925.63              41.14%

    The implicit exists still happens – we see only two cities – but there are no visual totals anymore. Both physical and calculated measures are now computed as if there were no CREATE SUBCUBE. The same functionality also applies to subselects (documented on MSDN here).
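
    The same non-visual behavior can also be requested directly in a subselect. Here is a sketch for our two-city filter, following the MSDN syntax referenced above (treat it as illustrative rather than definitive):

    SELECT 
      {[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit]} ON 0
     ,[Customer].[City].MEMBERS ON 1
    FROM NON VISUAL 
      (SELECT 
         {
            [Customer].[Customer Geography].[City].&[Redmond]&[WA]
           ,[Customer].[Customer Geography].[City].&[Seattle]&[WA]
         } ON 0 
       FROM [Adventure Works])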

    Why is this important? Remember that Excel 2007 uses subselects to implement its multiselect functionality. However, since subselects only supported visual-totals mode before AS2008, the Excel team was forced to remove the ability to show non-visual totals in PivotTables. This caused user dissatisfaction, since users wanted to have control over the numbers in PivotTables, just as with previous versions of Excel and Analysis Services. This problem is well documented, for example in the "Common Questions Around Excel 2007 OLAP PivotTables" blog entry by Allan Folting from the Excel team. So implementing NON VISUAL mode in AS2008 is a key enabling feature for Excel to restore this PivotTable functionality in future releases.

  • Custom filters with MDX subcubes

    Omniture Discover is not a general-purpose OLAP client tool. Rather, it is a specialized application that lets users explore data from Omniture’s Web Analytics product. But it does look and feel like an OLAP browser – which is not surprising, since the “A” in OLAP stands for “Analysis”, and the product is all about analyzing web site performance. The Omniture model feels like a typical cube too - there are dimensions (called segments), there are measures (called metrics), and there are drilldowns, pivots, aggregations, etc. However, there is something not typical about this client tool – the way it handles slicers. Usually, an OLAP client provides a UI element called a slicer or filter or page field or something similar, where the user can choose how the current view should be sliced. The slicer UI differs between applications, from a classic drill-down list to a calendar control for the Time dimension to something else, but there is always something with which to choose the current slice. Omniture Discover doesn’t have this. Instead, it uses a different paradigm. The user is expected to define “custom segments”, which really are dimension filters, but they are given names, and the list of these segments is stored in a special folder. The user can then apply these predefined filters to any report view. The definitions of the segments are very flexible; they really can be arbitrary expressions – anything from “Last 3 days” (a filter on Time) to “Users referred by www.live.com through a paid search campaign who bought a product in a specific category” (a filter on the Referer, Action and Product dimensions), etc. Dimensions used in the custom segment definition can be included in the report or omitted from it – the filter is applied in either case. The only limitation is that only one such segment can be applied to the report at a time – if the user wants to combine two custom segments, he needs to create a new one. At first I was taken aback by how unusual this approach was, but then I saw that it could come in handy sometimes.

    So this is an interesting approach – how could we add this capability to existing Analysis Services client tools? Well, the only way to extend existing client UI functionality is through actions, so it is clear we will need to use actions here, but how can we implement custom segments in MDX? Actually, MDX has a perfect solution for that – the CREATE SUBCUBE statement. In fact, this is exactly what CREATE SUBCUBE was invented for in the first place: to be able to apply arbitrary custom filters to any MDX query. CREATE SUBCUBE sets this filter in the session state, and all MDX queries executed after it are affected by it.

    Let’s take a few examples from the Adventure Works 2008 cube to demonstrate this approach in action:

    1. Filtering on a named set.

    In this example, we want to limit the analysis to the top 50 customers. Fortunately, the Adventure Works cube already has a named set, defined inside the MDX Script, which defines these customers. So our CREATE SUBCUBE will be fairly simple:

    CREATE SUBCUBE [Adventure Works] AS (SELECT [Top 50 Customers] ON 0 FROM [Adventure Works])

    The only tricky thing to remember here is how actions work in Analysis Services. The action expression is an MDX expression which, in the case of the STATEMENT action type, should return a string representing the MDX statement to be executed by the client tool. Therefore, in the action expression field we need to put a string with the statement inside it, not the statement itself. This will become clearer in example 3.

    Here is how the action definition will look:

    Name          : Top 50 Customers
    Target type   : Cells
    Target object : All cells
    Type          : Statement
    Expression    : "CREATE SUBCUBE [Adventure Works] AS (SELECT [Top 50 Customers] ON 0 FROM [Adventure Works])"
    Invocation    : Interactive
    Caption       : Top 50 Customers
    Caption is MDX: false

    This action works fine inside the cube browser – once it is executed, the current view needs to be refreshed (OWC doesn’t do it automatically after action execution) to reflect the currently set filter. The action will work in any client tool which properly supports statement actions. Unfortunately, it seems that Excel 2007 is not such a tool – I wasn’t able to get statement actions to work in it.

    2. Filtering on multiple dimensions.

    Of course, the custom filters don’t have to be on a single attribute, hierarchy or dimension – they can employ any combination of those. In this example, we want to filter on transactions performed by customers with low income but on high-priced products; perhaps this segment is susceptible to fraud.

    We have the Yearly Income attribute in the Customer dimension, and we will pick the “10000-30000” bucket as the “low income” indicator. The Product dimension has the List Price attribute, and we will take all the members with a value above $1700 as high-priced products – in MDX this is best done with the range set operator: [Product].[List Price].&[1700.99] : NULL

    CREATE SUBCUBE [Adventure Works] AS
    (
    SELECT 
       [Customer].[Yearly Income].&[0] * {[Product].[List Price].&[1700.99] : NULL}  ON 0
       FROM [Adventure Works]
    )

    The action definition would be:

    Name          : Potential Fraud
    Target type   : Cells
    Target object : All cells
    Type          : Statement
    Expression    : "CREATE SUBCUBE [Adventure Works] AS 
      (SELECT [Customer].[Yearly Income].&[0] * {[Product].[List Price].&[1700.99] : NULL} ON 0 
      FROM [Adventure Works])"
    Invocation    : Interactive
    Caption       : Potential Fraud
    Caption is MDX: false

    3. Dynamic (context-dependent) filters

    So far we have only explored static filters – i.e. they always defined the same subspace, regardless of the position in the cube. This is how all Omniture custom segments work. However, MDX is, of course, more powerful than that! The action expression is called an expression for a reason – it doesn’t have to return the same constant string all the time. Let’s say we want to build a filter which will return the last 3 months, counting back from the current month. In MDX we can do this easily using the range set operator together with Lag. The action target this time will be the members of the Month attribute – because it only makes sense to count 3 months back if we are at the granularity of Month (or below it – so we could also define a similar action over Days). The client tool will send the current Month coordinate, and it can be used inside MDX by referring to CurrentMember on the Month hierarchy. With some simple string concatenations (which unfortunately are always required to make an action dynamic) we get the following action expression:

    "CREATE SUBCUBE [Adventure Works] AS (SELECT " + 
    [Date].[Month Name].CurrentMember.UniqueName + 
    ".Lag(2):" + [Date].[Month Name].CurrentMember.UniqueName + 
    " ON 0 FROM [Adventure Works])"

    The rest of the parameters for the action are:

    Name          : Last 3 Months
    Target type   : Attribute Members
    Target object : Date.Month Name
    Type          : Statement
    Expression    : see above
    Invocation    : Interactive
    Caption       : Last 3 Months
    Caption is MDX: false
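
    For illustration, if this action were invoked on, say, June 2004, the string concatenation above would produce a statement roughly like the one below. The month member’s unique name is a placeholder – its exact form depends on how the Month Name attribute is keyed:

    CREATE SUBCUBE [Adventure Works] AS 
      (SELECT [Date].[Month Name].&[June]&[2004].Lag(2) : [Date].[Month Name].&[June]&[2004] 
       ON 0 FROM [Adventure Works])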

    4. Arbitrarily complex filters

    So far, every filter we have discussed was theoretically possible to define using the standard slicer UI in client tools. That’s because every condition we put was either on a single hierarchy or, when it was on multiple hierarchies, they were combined using AND logic, i.e. using the CrossJoin operator. But custom filters can be arbitrarily complex – they can include any condition logic: combinations of AND, OR, NOT, XOR and other logical operators. The good news is that even though it is usually impossible to set such complex filters in a standard UI, it is totally possible with a custom CREATE SUBCUBE statement. Let’s take the following example: we want to define a segment of customers with a family. Our definition of family means that either the customer is married, or he has children. The Customer dimension has the Marital Status attribute – this gives us the married-or-not information – and the Customer dimension also has the Number of Children attribute; everything which is not 0 counts as having children, which in MDX can easily be coded with the unary set operator "-". Now we need to figure out how to implement an OR condition with sets – the answer is to use the Union operator. We will union married customers no matter how many children they have, and customers with children no matter what their marital status is. Of course, there will be a big overlap between these two sets, but CREATE SUBCUBE is going to do the right thing – it guarantees that there will be no double counting even if the set used to define it has duplicates.

    CREATE SUBCUBE [Adventure Works] AS
    (SELECT 
       {([Customer].[Marital Status].&[M], [Customer].[Total Children].[All Customers])
        ,([Customer].[Marital Status].[All Customers],-{[Customer].[Total Children].&[0]})}
    ON 0 FROM [Adventure Works])

    The resulting union set has an interesting shape – it cannot be represented as a pure crossjoin of attribute hierarchies. Such sets are nicknamed “arbitrary shaped sets” in Analysis Services, because they represent shapes of arbitrary complexity. Analysis Services in general doesn’t like arbitrary shaped sets, and there are often performance issues when using them, but if the business requirement calls for them – they can be used and will work correctly, although not in the most efficient way.

    5. Clearing up

    And, of course, we also need to provide a way for the user to remove whatever filter he has set. This is done with the DROP SUBCUBE command, so we need to provision one more action to clear things up.

    Name          : Clear Segments
    Target type   : Cells
    Target object : All cells
    Type          : Statement
    Expression    : "DROP SUBCUBE [Adventure Works]"
    Invocation    : Interactive
    Caption       : Clear Segments
    Caption is MDX: false
  • Gross margin - dense vs. sparse block evaluation mode in MDX

    Gross margin (also known as gross profit margin or gross profit rate) is defined as (Revenue – Cost of Sales)/Revenue. In terms of the Adventure Works sample database, we can write this in MDX as:

    [Gross Margin] = ([Measures].[Internet Sales Amount] - [Measures].[Internet Total Product Cost]) / [Measures].[Internet Sales Amount];

    While this expression is simple enough, it might be tempting to try to optimize it. If we look at the evaluation tree for this expression, we will see 5 nodes in it: fetching values for Internet Sales Amount twice – because it appears twice in the expression – fetching values for Internet Total Product Cost, one minus operation and one division operation. Accessing Internet Sales Amount twice is not really a problem, because the second access is going to come from cache anyway, but one might think that removing this extra operation would still improve things, if only a little.

    So, perhaps we can rewrite the formula using a simple math equivalence: (a-b)/a = 1 – b/a

    The first problem with this rewrite is that it is not always correct. In math it relies on the fact that a/a = 1, but this is only true when a <> 0. The result of a/a is undefined when a = 0. In MDX, 0/0 will also result in an undefined number, usually formatted as “-1.#IND”. But the more interesting question is what happens in MDX when a is empty, i.e. has a value of NULL. In MDX NULL/NULL = NULL, therefore when both Sales and Cost are empty, i.e. there is no record in the fact table, the two formulas are going to give different results:

    (a-b)/a = NULL when a = NULL and b = NULL
    1 – b/a = 1    when a = NULL and b = NULL

    So at the very least, someone first needs to set the exact definition of Gross margin at the coordinates where no sales occurred. Is Gross margin 100% there, or is it empty as well?
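
    For example, if we decide that Gross margin should stay empty where there are no sales, a sketch of the 1 – b/a form with an explicit guard could look like this (illustrative only, not the definition used in the comparison below):

    MEMBER [Measures].[Gross Margin] AS 
      IIF
      (
        IsEmpty([Measures].[Internet Sales Amount])
       ,NULL // keep the cell empty when there are no sales at this coordinate
       ,1 - [Measures].[Internet Total Product Cost] / [Measures].[Internet Sales Amount]
      )
     ,FORMAT_STRING = 'Percent'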

    But I want to direct your attention to the performance aspect of this change. Remember, it seemed that 1 – b/a would perform a little bit better – is it really so? Let’s compare both approaches side by side in AS2005 using MDX Studio.

    // Gross margin using 1 - b/a
    WITH 
      MEMBER [Gross Margin] AS 
           1
        - 
            [Measures].[Internet Total Product Cost]
          / 
            [Measures].[Internet Sales Amount] 
        , FORMAT_STRING = 'Percent'
      MEMBER [Max Gross Margin] AS 
        Max
        (
          (
            [Customer].[Customer Geography].[Customer]
           ,[Product].[Product Categories].[Subcategory]
           ,[Date].[Calendar].[Calendar Year]
          )
         ,[Measures].[Gross Margin]
        ) 
    SELECT 
      [Measures].[Max Gross Margin] ON 0
    FROM [Adventure Works];
    Time              : 6 sec 640 ms
    Calc covers       : 6
    Cells calculated  : 1
    Sonar subcubes    : 2
    NON EMPTYs        : 0
    Autoexists        : 1
    EXISTINGs         : 0
    SE queries        : 2
    Cache hits        : 3
    Cache misses      : 1
    Cache inserts     : 1
    Cache lookups     : 4
    Memory Usage KB   : 88840
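
    // Gross margin using (a-b)/a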
    WITH 
      MEMBER [Gross Margin] AS 
          (
            [Measures].[Internet Sales Amount]
          - 
            [Measures].[Internet Total Product Cost]
          )
        / 
          [Measures].[Internet Sales Amount] 
        , FORMAT_STRING = 'Percent'
      MEMBER [Max Gross Margin] AS 
        Max
        (
          (
            [Customer].[Customer Geography].[Customer]
           ,[Product].[Product Categories].[Subcategory]
           ,[Date].[Calendar].[Calendar Year]
          )
         ,[Measures].[Gross Margin]
        ) 
    SELECT 
      [Measures].[Max Gross Margin] ON 0
    FROM [Adventure Works];
    Time              : 234 ms
    Calc covers       : 7
    Cells calculated  : 1
    Sonar subcubes    : 1
    NON EMPTYs        : 0
    Autoexists        : 0
    EXISTINGs         : 0
    SE queries        : 3
    Cache hits        : 3
    Cache misses      : 1
    Cache inserts     : 1
    Cache lookups     : 4
    Memory Usage KB   : 0

    The results are shockingly different: almost 7 seconds for the “optimized” approach, and a mere 234 ms for the original approach. Let’s try to understand why. First, let’s compare the number of SE queries. It is 3 in the (a-b)/a approach and 2 in the 1-b/a approach – so our optimization to reduce the number of SE queries fetching Internet Sales Amount indeed worked (and, as we can see from hierarchical profiling, on a cold-cache run only the first SE query went to the partitions; the other two came from cache).

    Usually, when we see such a big difference in results, we suspect that the slow query executes in cell-by-cell mode and the fast query executes in block mode. However, there is no reason to believe that the slow query went through cell-by-cell mode. Both expressions use exactly the same operators, minus and divide, and both of these are optimized for block mode. Perfmon counters from the run also give no evidence of cell-by-cell mode. But then, if both queries indeed went through block mode, why is one query so much slower than the other?

    The answer lies in the density and sparsity characteristics of the space. Let’s take a closer look at both evaluation subtrees, starting with the fast (a-b)/a query:

    Operator "/"
     |
     +-- Operator "-"
     |    |
     |    +-- SE Data (Internet Sales Amount, Customer, Subcategory, Year)
     |    |
     |    +-- SE Data (Internet Total Product Cost, Customer, Subcategory, Year)
     |
     +-- SE Data (Internet Sales Amount, Customer, Subcategory, Year)

    In this tree every leaf node is SE Data, and it is sparse – i.e. even though the query subcubes cover a big space, the SE Data node only fetches records that exist in the fact table, i.e. non-empty data. Both the “/” and “-” operators operate efficiently on sparse data, because they can iterate only over non-empty cells (the algorithm for the “*” operator was explained here; the algorithms for “/” and “-” are slightly more complicated, but similar).

    Now the evaluation subtree for the slow query:

    Operator "-"
     |
     +-- Constant (1) (Customer, Subcategory, Year)
     |
     +-- Operator "/"
          |
          +-- SE Data (Internet Total Product Cost, Customer, Subcategory, Year)
          |
          +-- SE Data (Internet Sales Amount, Customer, Subcategory, Year)

    Here we have two leaf nodes of SE Data – no problem with them – but the third leaf node is the constant 1. Since it is a constant, it evaluates to the same value over all the cells in its subcube – and this value, 1, is not NULL. Therefore, this node evaluates to a non-empty result in every single cell. So while the query plan is still using block mode, it doesn’t get much benefit out of it, because the “block” is dense – there is not a single empty cell in it – and while the algorithm says to iterate over non-empty cells, it ends up iterating over every cell, which makes it as slow as if it were in cell-by-cell mode.

  • Optimizing order of sets in MDX crossjoins

    For scalar values, a*b is the same as b*a (although we saw in the “Performance of multiplication in MDX” article that the performance of the two can differ in MDX). But if a and b are sets, then obviously the results are different – the order of tuples in the resulting set will be different. But does it matter if we then perform some other operation on top of the result, one which doesn’t depend on the order of the tuples? Semantically the answer is no, but there can be a big difference in performance. Let’s take the following example:

    WITH MEMBER [Measures].x AS
     Count(
       (
         [Customer].[City].[City]
        ,[Customer].[Gender].[Gender]
        ,[Customer].[Education].[Education]
        ,[Product].[Subcategory].[Subcategory]
        ,[Product].[Color].[Color]
        ,[Product].[Size].[Size]
       )
     )
    SELECT x ON 0
    FROM [Adventure Works]
    

    This query runs in 78 ms – no problem. But if we make an innocent-looking change – moving one set to another position:

    WITH MEMBER [Measures].x AS
     Count(
       (
         [Customer].[City].[City]
        ,[Product].[Subcategory].[Subcategory]
        ,[Customer].[Gender].[Gender]
        ,[Customer].[Education].[Education]
        ,[Product].[Color].[Color]
        ,[Product].[Size].[Size]
       )
     )
    SELECT x ON 0
    FROM [Adventure Works]

    Now this query takes 13 seconds, and also grabs a significant amount of memory. So what happened? To understand this better, let’s go back to the first query and pay closer attention to what MDX Studio told us about the execution. We notice that in the PerfMon tab it reported 2 autoexists, and the hierarchical profiler also shows 2 NON EMPTY events (each one corresponding to an autoexist operation). Why does it report 2? The query has just a single crossjoin, so there should have been only 1 autoexist. What happens here is that the crossjoin detects that the set it needs to apply autoexists to can be nicely split into two parts – the first part related to the Customer dimension, and the second part to the Product dimension. Then it can apply the following formula:

    JOIN( c1, c2, c3, p1, p2, p3 ) = CROSSJOIN( INNERJOIN(c1,c2,c3), INNERJOIN(p1,p2,p3) )

    I.e. it can run an inner join against each dimension table separately, and then do a full cross join between the results. An inner join against a single dimension table is very efficient, since even in the worst case we don’t need to go deeper than the dimension key, to which all other attributes relate – so this isn’t going to take additional memory or much time.

    However, in the second case the equation doesn’t hold anymore, since the order of dimensions inside the crossjoin is mixed. So some other, much less efficient algorithm is needed – probably a loop join, with the crossjoin now being pushed to the earlier stages. This is very inefficient and is also going to take a lot of additional memory.

    Conclusion: inside a crossjoin, cluster all the sets from the same dimension together – this will result in a great benefit to performance.

  • Optimizing MDX aggregation functions

    One of the most significant changes in Analysis Services 2008 was improved performance of MDX queries and calculations. In particular, the query optimizer can choose block (a.k.a. subspace) computation mode in the query plan more often than in AS2005, and using block/subspace computation mode usually brings performance orders of magnitude better than without it. However, even in AS2008, the query optimizer is not always capable of using block mode. The “Performance Improvements for MDX in SQL Server 2008 Analysis Services” whitepaper documents which MDX constructs are optimized and which are not, and MDX Studio automates the process of analyzing MDX, figuring out which fragments are not optimized, and offering advice on how to optimize them. In this article we will cover techniques related to the optimization of aggregation functions in MDX, i.e. Sum, Min, Max, Aggregate and (in AS2008) Avg. All examples were done on the AS2008 version of Adventure Works (some of these techniques will work with AS2005, but others won’t).

    1. Aggregating over Filter

    Scenario: we want to compute the average sales for the products which increased their sales since the same date last month. The normal approach is to use the Filter function to determine which products grew their sales, and then apply Avg to the resulting set. Applying this to every day in the year 2003, we get:

    WITH 
      MEMBER [Measures].AvgGrowingProducts AS 
        Avg
        (
          Filter
          (
            [Product].[Product].[Product].MEMBERS
           ,[Measures].[Sales Amount] > ([Measures].[Sales Amount],ParallelPeriod([Date].[Calendar].[Month]))
          )
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].AvgGrowingProducts ON 0
     ,Descendants
      (
        [Date].[Calendar].[Calendar Year].&[2003]
       ,[Date].[Calendar].[Date]
      ) ON 1
    FROM [Adventure Works];

    Running this query took more than 7 seconds on my laptop, and the culprit is clear from the perfmon counters – there were 295011 cells calculated, which is a sign of cell-by-cell iteration as opposed to working in block mode. And if we run the “Analyze” function of MDX Studio, it will tell us why this is happening: the ‘Filter’ function was used inside an aggregation function – this disables block mode. How do we get rid of Filter? I have actually written about this before – in “Optimizing Count(Filter(...)) expressions in MDX”, with respect to the Count function – but the same technique can be applied to any aggregation function, including Avg (in AS2008). The idea is simple – even though Filter reduces the set of products, in order to compute the Filter in the first place we have to check the sales of every single product anyway, so we may as well just run Avg over all products, but with a smart expression which returns NULL in case the product should not be included – and since Avg ignores NULLs, this will work correctly. Here is how the rewritten MDX looks:

    WITH 
      MEMBER [Measures].Growth AS 
        IIF
        (
          [Measures].[Sales Amount] > ([Measures].[Sales Amount] ,ParallelPeriod([Date].[Calendar].[Month]))
         ,[Measures].[Sales Amount]
         ,NULL
        ) 
       ,FORMAT_STRING = 'Currency' 
      MEMBER [Measures].AvgGrowingProducts AS 
        Avg
        (
          [Product].[Product].[Product].MEMBERS
         ,[Measures].Growth
        ) 
    SELECT 
      [Measures].AvgGrowingProducts ON 0
     ,Descendants
      (
        [Date].[Calendar].[Calendar Year].&[2003]
       ,[Date].[Calendar].[Date]
      ) ON 1
    FROM [Adventure Works];

    Now it executes in a mere 200 ms, and the number of cells calculated is 365 – exactly the same as the number of cells in the result – which is the best theoretical value we can get!

    2. Aggregating over NonEmpty

    Scenario: compute the average sales over all products. The solution is straightforward:

    WITH 
      MEMBER [Measures].AvgProductSales AS 
        Avg
        (
          [Product].[Product].[Product].MEMBERS
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].AvgProductSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    It is very efficient – only 100 ms and a single SE query. But what if someone wants to optimize it even more? Is this possible? Poking around with the MDX Studio expression debugger, we can notice that the sets over which Avg is run are pretty sparse. For example, on August 7, 2001 there were only 3 products which had sales; the rest were empty. A year later, on August 7, 2002, there were only 5 products which sold. So it might be tempting to add the NonEmpty function over the products, in the hope of reducing the size of the set before it is fed to Avg:

    WITH 
      MEMBER [Measures].AvgProductSales AS 
        Avg
        (
          NonEmpty
          (
            [Product].[Product].[Product].MEMBERS
           ,[Measures].[Sales Amount]
          )
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].AvgProductSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    This turns out to be a big mistake. The execution time jumps 18-fold to 1 sec 796 ms, and all the perfmon counters jump as well, with the number of SE queries going from 1 to 1189. I have written about this before, but it is worth reiterating. Unfortunately, this kind of wrong advice – “reduce the space with NonEmpty before applying the computation” – is a popular myth. I see it mentioned in various presentations, and it even shows up in the otherwise good “MDX Performance Hints” document. The irony is that the tip which is supposed to improve performance actually makes it worse. The reason here is twofold:

    1) Since NonEmpty is used inside a calculated member, it will be computed every time this calculated member is invoked – no block mode. Each call to NonEmpty triggers at least one SE query. This is why we see so many SE queries in perfmon – one for every cell which uses the AvgProductSales calculated member.

    2) The result of NonEmpty could be a set of any shape and form. It is very difficult for Avg to work with such unpredictable sets. In our example we got lucky: since we only had a single hierarchy inside NonEmpty, we didn’t end up with an arbitrary shaped set – otherwise performance would have been even worse.

    So the guidance is to never use NonEmpty inside calculations, but it is OK to use NonEmpty while building axes or named sets – because then it will only be called once, and not for every cell.
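
    For example, a sketch of the acceptable use in this scenario is to apply NonEmpty once, while building the rows axis, rather than inside the calculation:

    WITH 
      MEMBER [Measures].AvgProductSales AS 
        Avg
        (
          [Product].[Product].[Product].MEMBERS
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].AvgProductSales ON 0
      // NonEmpty below is evaluated a single time while the axis is built, not once per cell
     ,NonEmpty
      (
        [Date].[Date].[Date].MEMBERS
       ,[Measures].[Sales Amount]
      ) ON 1
    FROM [Adventure Works];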

    3. Aggregating over Union

    Scenario: a running sum by day. The typical and very sensible approach to this problem is to call Sum over all the preceding days. We assume that our calculation will only ever be called on the Date attribute (this is easy to ensure with a SCOPE assignment, but for simplicity I will write it here as a regular calculated member):

    WITH 
      MEMBER [Measures].RunSales AS 
        Sum
        (
          NULL : [Date].[Date].CurrentMember
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].RunSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    This works OK – about 2 seconds execution time, and in block mode. We still want to optimize it. The common optimization technique for a running sum is to do the summation at higher grains whenever possible, i.e. instead of running over days, we would like to run over months up to the current month, and only from there over the remaining days in this month. This approach makes a lot of sense, but the implementation should be done carefully. Let’s see what happens if we modify the set to be a union of months and days:

    WITH 
      MEMBER [Measures].RunSales AS 
        Sum
        (
          Union
          (
            NULL : [Date].[Calendar].Parent.PrevMember
           ,
            [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
          )
         ,[Measures].[Sales Amount]
        ) 
    SELECT 
      [Measures].RunSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    The results are disastrous! The execution time rose above 3 seconds, but more alarmingly, both the number of cells calculated and the number of SE queries jumped into the thousands. Why is that? After all, if we read the official documentation, it says that Sum over Union is in fact optimized for block mode, but the results we see here clearly show the opposite. The root cause is a little bit deeper. It is true that Sum(Union()) is optimized, but one of the other conditions for aggregation functions to work optimally is to have the input set in a good shape, i.e. not an arbitrary shaped set. Our Union combines sets of different granularities, so we end up with a mixed-grain set, which is classified as arbitrary shaped. How can we fix it? One way is to break the single Sum into two Sums, each one over a set of a uniform, single grain, i.e.:

    WITH 
      MEMBER [Measures].RunSales AS 
          Sum
          (
            NULL : [Date].[Calendar].Parent.PrevMember
           ,[Measures].[Sales Amount]
          )
        + 
          Sum
          (
            [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
           ,[Measures].[Sales Amount]
          ) 
    SELECT 
      [Measures].RunSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    Now the results are great – the runtime is about 400 ms, and the other stats look good too. It is possible to optimize it even further using techniques described in the “Take advantage of FE caching to optimize MDX performance” article. We notice that the first component of the plus is the same for all days in the month, so we can take advantage of this by caching it at the first day and reusing it for all the other days, as follows:

    WITH 
      MEMBER [Measures].RunMonthSales AS
          Sum
          (
            NULL : [Date].[Calendar].CurrentMember
           ,[Measures].[Sales Amount]
          )
      MEMBER [Measures].RunSales AS 
          ([Measures].RunMonthSales, [Date].[Calendar].Parent.PrevMember)
        + 
          Sum
          (
            [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
           ,[Measures].[Sales Amount]
          ) 
    SELECT 
      [Measures].RunSales ON 0
     ,[Date].[Date].[Date].MEMBERS ON 1
    FROM [Adventure Works];

    This gives a better runtime of 280 ms; on bigger cubes the difference will be more significant.

    4. Aggregating over Exists or CrossJoin?

    To be written when I get more time.

    5. Aggregating over Descendants or EXISTING?

    To be written when I get more time.

  • Score another point for XMLA and MDX

    While some people are wondering whether ‘XMLA is dead’ and calling to ‘Resurrect the XMLA Council’, the reality in the field is quite different. There are, of course, plenty of XMLA clients, with new ones being born, and on the server side XMLA providers are also growing in number. Today the open-source OLAP server Palo announced support for XMLA and MDX.

    Actually, they announced even more – also support for OLEDB for OLAP (which is where XMLA came from). Of course, the biggest driver seems to be compatibility with Excel PivotTables, but whatever is driving it, it should be pretty clear to everyone that XMLA is not dead; it is alive and well, even though there have been no changes in the standard since 2003 – maybe it means we did such a good job on the standard that XMLA 1.1 is enough for everything :)

  • Get most out of partition slices

    Setting the partition slice has always been an important optimization technique in Analysis Services. Every presentation talked about it and every whitepaper mentioned it; for example, the Microsoft SQL Server 2000 Analysis Services Performance Guide contains a chapter appropriately named “Define the Data Slice for Each Partition”. Here is a quote from it:

    “If the data slice value for a partition is set properly, Analysis Services can quickly eliminate irrelevant partitions from the query processing and significantly reduce the amount of physical I/O and processor time needed for many queries issued against MOLAP and HOLAP partitions <skip>

    Caution: Creating a partition without setting the data slice is not a good practice, and can result in considerable overhead being added to the system (artificially increasing response times). Without the data slice, Analysis Services cannot limit a query to the appropriate partitions and must scan each partition even if zero cells will be returned.

    The data slice enables Analysis Services to determine which partitions contain data relevant to the query”

    However, after AS2005 was released, a new tip started making it into presentations - “For MOLAP partitions no need to specify slice property, it is detected automatically”. Even though I briefly debunked it before here, this rumor just kept popping up, and it even made it as far as the official “OLAP Design Best Practices for Analysis Services 2005” document – here is the quote:

    “For MOLAP partitions, you do not have to specify the slice because the server will be able to figure out the relevant partitions after the data has been processed”

    Even worse, the official documentation on data slices goes as far as stating:

    “Data slices are applicable only to objects that use the ROLAP storage mode”

    But this is completely untrue, and many people have already discovered it. This blog entry by Jesse Orosz has a great description of 4 reasons why setting the partition slice in MOLAP is important. In my blog I want to focus on #3 from his list – automatic slice detection is not perfect.

    Here is how automatic slice detection works: while building indexes for the partition, for every attribute it determines the minimum and maximum data ID values present in this attribute (for a more detailed explanation see the section “Partition Slice” in the Microsoft SQL Server 2005 Analysis Services book, page 355). It is possible to discover exactly which values Analysis Services detected, either the hard way as described here, or the easy way as described here. In either case, it is clear that since slice autodetection operates with ranges of data IDs, it is entirely possible for a partition to contain only two members of an attribute, one with the lowest possible data ID (2) and the other with the highest possible data ID – and as a result the range will cover all possible values, i.e. it will be useless.

    On the other hand, if for a certain attribute the data in the partition contains only a single member’s data ID, then the range will be very effective – since such a range is equivalent to a slice on a single member.

    So this must be the root of all the confusion in documentation and whitepapers. Prior to AS2005, a partition slice could only be a single member. And specifying a partition slice of a single member is really not necessary in MOLAP, since it will always be automatically discovered by autoslice (well, except for the other 3 reasons that Jesse lists in his blog). But starting with AS2005, it is possible to specify MDX sets for the partition slice, and that is exactly the case where setting the partition slice manually makes sense.

    The Analysis Services Stored Procedures project on CodePlex features a PartitionHealthCheck function, which shows the overlaps in data ID ranges between partitions, but it is not clear what action the user can take upon discovering that the autodetected ranges are not optimal. The user doesn’t have direct control over the assignment of data IDs to dimension members. What this blog entry is going to explore is how to avoid relying on autoslice altogether, and put precise control over partition slices into the hands of the user.

    Almost related attributes

    The first scenario to consider is what I call “almost related attributes”, i.e. attributes which are not truly related to each other, but, on the other hand, are not completely unrelated either. Examples of such attributes are Zip code and State – the relationship is almost many-to-one, but sometimes there are Zip codes which cross State boundaries. Another example is Week and Year. A year mostly contains 52 weeks, but the weeks at the beginning and end of the year can cross into the previous or next year.

    Let’s use Adventure Works 2008 as our sample database. We have the ‘Calendar Year’ and ‘Calendar Week’ attributes. Let’s first query for a specific Calendar Year:

    SELECT 
      [Measures].[Reseller Sales Amount] ON 0
    FROM [Adventure Works]
    WHERE 
      [Date].[Calendar Year].[Calendar Year].[CY 2004];

    Running this query under the Hierarchical Profiler of MDX Studio shows that only one partition – Reseller_Sales_2004 – was read for this query. This is because that partition contains data just for the year 2004, and therefore the automatic slice detection worked just fine. But let’s query for a specific week in 2004:

    SELECT 
      [Measures].[Reseller Sales Amount] ON 0
    FROM [Adventure Works]
    WHERE 
      [Date].[Calendar Week].&[17]&[2004];

    Of course, we would again like to read just the 2004 partition, but this time the profiler shows that the Reseller_Sales_2002 and Reseller_Sales_2003 partitions were also touched. This is pretty bad – week 17 of 2004 is well inside year 2004, with no hope of crossing into 2003, let alone 2002!

    So why did automatic slice detection do such a poor job here? To answer this question, let’s look at the data ID values for the Calendar Week attribute:

    WITH MEMBER [Measures].CalendarWeekDataID AS
      DataId([Date].[Calendar Week].CurrentMember)
     ,CAPTION = 'Data ID of Calendar Week'
    SELECT [Measures].[CalendarWeekDataID] ON 0
    , [Date].[Calendar Week].[Calendar Week] ON 1
    FROM [Adventure Works];

    This query shows the data ID for every calendar week, and the problem is now obvious. The calendar weeks are not ordered nicely by year; instead, we first get Week 1 of years 2002, 2003 and 2004 (this week didn’t exist in 2001), then Week 2 for these years, etc. No wonder the range of data IDs covering week 17 of 2004 also includes weeks from 2003 and 2002. We could fix the Calendar Week attribute, but we would still be at the mercy of Analysis Services regarding how it decides to assign data IDs. Instead, we can explicitly set the slice on every partition in the ‘Reseller Sales’ measure group.

    And what better way to build this slice than with another MDX query!

     

    WITH 
      SET WeeksIn2004 AS 
        Exists(
           [Date].[Calendar Week].[Calendar Week].MEMBERS
          ,[Date].[Calendar Year].[Calendar Year].[CY 2004])
      MEMBER [Measures].PartitionSlice AS
       "{" +
       Generate(
         WeeksIn2004 AS itr, 
             [Date].[Calendar Week].CurrentMember.UniqueName 
           + IIF(itr.CurrentOrdinal = WeeksIn2004.Count, "", ","))
       + "}"
    SELECT [Measures].PartitionSlice ON 0
    FROM [Adventure Works]

    This returns a long string representing an MDX set with all the weeks in 2004. We just need to make sure that the resulting string is XML-escaped, i.e. ‘&’ is replaced with ‘&amp;’, if it is to be pasted directly into the XMLA script for ALTER partition, and we are good to go.
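
    For instance, the beginning of the escaped set, ready to be pasted into the XMLA script, would look roughly like this (member keys illustrative):

    {[Date].[Calendar Week].&amp;[1]&amp;[2004]
    ,[Date].[Calendar Week].&amp;[2]&amp;[2004]
    ,...}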

    When we apply this procedure to every partition and rerun the query:

    SELECT 
      [Measures].[Reseller Sales Amount] ON 0
    FROM [Adventure Works]
    WHERE 
      [Date].[Calendar Week].&[17]&[2004];

    Now only the 2004 partition is touched.

    Related attributes

    So far we have considered “almost related attributes”; for truly related attributes one might think there is no problem. But in fact, in Adventure Works 2008, Calendar Week is defined as related to Calendar Year. So why, when we sliced on [Date].[Calendar Week].&[17]&[2004], didn’t it decode the year? If we run coordinate decoding in MDX Studio, it does properly show ‘CY 2004’ being decoded. The problem here lies in the fact that the FE does coordinate decoding lazily, only if it really needs to. This makes sense – otherwise a lot of time would be spent decoding attributes that nobody needs. So in our query, the FE noticed that there were no calculations looking at Calendar Year’s CurrentMember, and it left it undecoded, as shown by the verbose data in the MDX Studio profiler:

    Dimension 9 [Date] (0 0 0 0 0 0 0 0 0 0 52 0 0 0 0 0 0 0 0 0 0)  
    
    [Calendar Week]:[Week 17 CY 2004]  
    [Calendar Year]:0  

    But in this case, it looks like the SE could benefit from attribute decoding on the query subcube – this could be a useful performance optimization.

    What about multiple attributes

    We have now successfully dealt with the Calendar Week attribute, but the Date dimension contains plenty of other attributes. If their data IDs are as messed up as Calendar Week’s, we need to do something about them too. Unfortunately, here we run into an unintentional, yet very unfortunate, limitation of Analysis Services. It is impossible to set a partition slice by multiple attributes if at least one of them is a set and not a single member (which is the only useful case for us!). This limitation stems from an unfortunate combination of a few different factors:

    1. Sets used in the partition slice have to comply with the limitations of SetToStr(…, CONSTRAINED) (see the documentation). There is a good reason for that – partition slices are computed when the cube is not really ready for querying, so allowing just any MDX set could trigger the FE to fetch data from the cube, which is not yet ready, with unpredictable results.

    2. SetToStr(…, CONSTRAINED) doesn’t allow a crossjoin of sets – neither via the CrossJoin function, nor the * operator, nor the tuple-like form – i.e. it is not possible to specify a crossjoin.

    3. While it is possible to specify a set of tuples for SetToStr(…, CONSTRAINED), this still doesn’t help, since the FE is not capable of building a subcube out of a set of tuples (instead it immediately resorts to the dreaded arbitrary shaped form).

    It would be useful if any one of these limitations were relaxed – because it would enable truly fine control over partition slices in non-trivial cubes.

  • Screencast of “MDX Studio to the rescue” presentation

    During the MS BI Conference 2008 I gave a presentation called “Got MDX problems? MDX Studio to the rescue!”. Given how central the theme of self-service BI was to the conference, I decided to apply some self-service BI to analyze the performance of the presentation. The conference website provided all the evaluation data through Excel Services running in SharePoint, but as a true self-service power user, I immediately exported all the data to Excel to do the analysis on my own machine. Fortunately, even with 3000 participants, 120 sessions and 8 questions in each evaluation form, the total number of rows was well within Excel’s ability to handle, so I didn’t need to install Gemini to deal with millions of rows yet. The overall score was 3.94 (out of 5), barely above the median of 3.87. By this overall score my session ranked at a mediocre 50 out of 120. Slicing the data by individual question, and comparing it to the rest of the sessions, I got some insights. The good news was that I did really well on what I consider the most important question in the evaluation: “How applicable/relevant was the content of this session to the requirements of your job”. Here, my session ranked 10th with a score of 4.45. It scored even better on “How knowledgeable was the speaker on the subject” – rank 9 with a score of 4.68. But what pulled the total numbers down was “How effective was the speaker in presenting the information”. Here I ranked 86th with a score of 3.64, below both the median and the average.

                    Overall   Speaker was knowledgeable   Content applicable to the job   Speaker presented effectively
    Highest         4.79      4.88                        5.00                            4.75
    MDX Studio      3.94      4.68                        4.45                            3.64
    Median          3.87      4.33                        4.00                            3.90
    Avg             3.80      4.23                        3.90                            3.83
    Lowest          2.61      2.50                        2.33                            2.70
    Rank            50        9                           10                              86
    Percentile      42.37%    7.63%                       8.47%                           72.88%
    Percent below   57.63%    92.37%                      91.53%                          27.12%
      

     

    So the conclusion is that the content was very valuable and useful, and I had all the knowledge, but I clearly had a problem with presenting this content and knowledge effectively. Now I needed to drill down into that area – so I went over all the comments, trying to correlate them with this low score. This revealed 3 issues that appeared repeatedly throughout the comments:

    1. Some people were not able to make it to the session because the room was too small and was filled to capacity well before the session started. A few of these people filled out the evaluation forms, giving low satisfaction scores. While I feel it is a little unfair to score a session that one didn’t attend, on the other hand I totally understand their frustration, and they didn’t have any other mechanism to give feedback besides the evaluation forms.

    2. By far the biggest complaint was that the computer screen in the room was too small, and it was difficult or impossible to see what was on it. Pretty much everybody gave this feedback – even people who gave me high scores. And the same sentiment was repeated again and again in the comments section of my blog as well.

    3. Lastly, I did not speak very clearly, so some people struggled not only to see what was on the screen, but also to understand what I was saying.

    OK, now I understood the root causes of the problem. But this isn’t enough – with all the talk about “Actionable BI”, the real value will come from taking corrective action. But how can I correct the three problems outlined above? I cannot go back in time and change the room and the screen, of course, but what I decided to do was to record a screencast of some of the less trivial operations in MDX Studio and publish it. The presentation had 4 parts, representing the MDX development cycle:

    Development –> Debugging –> Profiling –> Optimizing

    I chose to do the Debugging portion, and instead of speaking, I put notes on the screen (point 3 addressed!). The resulting recording took 7 minutes and showed the 2 most interesting examples of debugging from the session, multiple debug watches, coordinate overwrites, etc.

    The only problem is that those 7 minutes took almost 1 GB as a raw AVI file – and I don’t know of a place where I could host it. Uploading it both to MSN Video Soapbox and to YouTube produced such low quality that it was impossible to see. So I compressed it into WMV format, which is somewhat blurry but can still be followed, and uploaded it to SkyDrive. Perhaps someone has a better idea of how to publish a higher-quality screencast on the web.

    Link: MDX Studio Debug Screencast (29.2 MBs)

    The queries used in the screencast:

    1.

    WITH 
      MEMBER Measures.[Total Number Of Products] AS 
        Count(
          Descendants(
            [Product].[Product Categories].CurrentMember
           ,[Product].[Product Categories].[Product])) 
      MEMBER Measures.[Number Of Sold Products Increasing] AS 
        Count(
          Filter(
            NonEmpty(
              Descendants(
                 [Product].[Product Categories].CurrentMember
                ,[Product].[Product Categories].[Product])
             ,[Measures].[Internet Sales Amount])
           ,
                ([Date].[Calendar].CurrentMember ,[Measures].[Internet Sales Amount])
              - ([Date].[Calendar].PrevMember    ,[Measures].[Internet Sales Amount])
            > 0
          )
        ) 
    SELECT
      { [Measures].[Total Number Of Products] ,[Measures].[Number Of Sold Products Increasing]} ON 0
     ,[Date].[Calendar].[Month].MEMBERS ON 1
    FROM [Adventure Works];

    2.

    // Year-over-year delta: current Sales Amount minus the value at the previous Calendar Year member
    WITH MEMBER [Measures].YOY AS 
      [Measures].[Sales Amount] - ([Measures].[Sales Amount], [Date].[Calendar Year].PrevMember)
    SELECT [Measures].YOY ON COLUMNS
    , [Date].[Calendar Quarter].[Calendar Quarter].MEMBERS ON ROWS
    FROM [Adventure Works]
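
    One concept the screencast leans on is the coordinate overwrite: referencing a tuple inside a calculation replaces one coordinate of the current cell while leaving all the others intact. Below is a minimal sketch of the idea against the standard Adventure Works cube (not one of the screencast examples):

    // The tuple overwrites the [Date].[Calendar] coordinate with the previous member;
    // every other coordinate of the current cell stays as it is.
    WITH MEMBER [Measures].[Prev Period Sales] AS
      ( [Measures].[Internet Sales Amount]
       ,[Date].[Calendar].CurrentMember.PrevMember )
    SELECT
      { [Measures].[Internet Sales Amount], [Measures].[Prev Period Sales] } ON 0
     ,[Date].[Calendar].[Month].MEMBERS ON 1
    FROM [Adventure Works];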
    
    
  • MS BI Conference 2008 is over

    The conference is over, and I am back home. Of course, unlike some other folks who will spend 11 hours flying over the Atlantic to get home, I only spent 11 minutes driving over Lake Washington. Well, it’s time to reflect on the conference; here are some of my thoughts:

    1. No doubt, Project Gemini was the main focus and center of the conference. I overheard people talking about it all the time - in the hallways, on escalators, during lunch, etc. Both the mainstream press and the blogosphere had plenty of coverage too. Pretty much everybody who stopped to have a chat with me talked about Gemini. I think I have heard the full spectrum of opinions, from absolute excitement and “ship it now” comments to full dismissal and “you are going to create spreadmart hell” comments (even though Gemini claims exactly the opposite – to solve spreadmart hell). Anyway, I think we need some more time for the dust to settle. One thing that everybody seems to agree on is that the underlying technology is very powerful (even though almost no details were shared about it). Amir’s presentation yesterday was fantastic, especially since he pretty much had to repeat the same demo steps as in the keynote (but with 100 million rows instead of 20 million – yet performance was exactly the same: instantaneous). Unfortunately, his presentation won’t be available online, but in 6 weeks everybody will get it on DVD. I am curious to see the response from analysts – there were quite a few at the conference – especially from Nigel Pendse.

    2. The technical content was much better this year than at last year’s BI conference. Pretty much every session I attended was great. The level of content was on par with, and sometimes even better than, what we usually get at PASS and TechEd. (I didn’t go to any sessions in the “Business Value” track, nor to any keynote except the first one – I heard not-so-great opinions about those.) But the experience with the chalk talks was horrible. The rooms were way too small; people lined up well in advance to get into the chalk talk they wanted, but many of them were not able to – the doors were closed once capacity was reached. I really felt sorry for people who spent lots of money and traveled from far away only to be unable to attend the sessions they wanted. Of course, all these sessions were recorded, but the whole point of attending a conference is to get into the live presentations. So I didn’t even try to get into any of these chalk talks, to give other people a better chance.

    3. Big thanks to all the people with whom I got a chance to chat. It felt great to get all the “thanks – I read your blog” comments. Correlating the number of hits on my blog with the number of people I talked to, I think I now pretty much know every reader of my blog by name :)

    4. On a more negative side, my chalk talk about MDX Studio today didn’t go well at all. It must have been one of the worst presentations I have ever given. People started to take seats in the room an hour before the start time, and 30 minutes before the session the room was already completely full. I became very nervous, because when I saw that, I was afraid the session might not live up to expectations (after all, people skipped another session just to get a seat at mine), and this is exactly what happened. On top of it, the screen in the room was very small, and most people couldn’t see anything on it, which, of course, didn’t help a session that spent almost all its time in demos. So, while the overall evaluation score for the session was OK and above average, I got some comments like “worst session of the conference” :( I apologize for that.

  • Watch announcements from MS BI Conference in video recording

    Hopefully the last live blog post for today :) I keep getting a constant stream of questions about both project Madison and project Gemini. The Ted Kummert keynote where both announcements were made was recorded and can be viewed online – hopefully this will answer many of the questions about both announcements.

    Stephen Elop & Ted Kummert keynote recording: Low | Medium | High

    For project Madison (DATAllegro, scale-out DW) - scroll to 1 hour 4 minutes

    For project Gemini and the SQL Server “Kilimanjaro” release (self-service BI, column-oriented, in-memory) – scroll to 1 hour 16 minutes and enjoy the “Fairy Tale” story and Donald’s demo. “Fairy Tale” is great, by the way – it is pretty funny, and it resonates extremely well with anybody who has done real BI projects.

  • So what exactly is “Project Gemini”?

    In the couple of hours since the announcement here at the MS BI Conference, I have already talked with a few people and seen a few threads and comments on the Web, and (as expected) there is plenty of confusion about what “Project Gemini” is and isn’t. Here are the facts:

    1. Project Gemini is Analysis Services

    2. It is the next version of Analysis Services, which will ship before SQL11, as part of SQL Server “Kilimanjaro”

    3. It uses column-oriented processing and in-memory storage as its core technological principles

    Out of all the coverage on the Web that I have seen so far, the best explanation comes from Doug Henschen’s article “Microsoft Will Bow In-Memory Analysis, Appliance-Ready Database in 2010” in Intelligent Enterprise. Here is the key quote from the article:

    “It's essentially another storage mode for Microsoft SQL Server Analysis Services with access via MDX, so existing applications will be able to take advantage of the performance enhancements”

    Just to repeat it in other words: existing applications that work with Analysis Services today through XMLA and MDX will be able to work with Gemini, because Gemini is Analysis Services. In fact, in the “SQL 2008 and beyond” session today, Thierry demonstrated Report Builder working with Gemini seamlessly. And because column-oriented (and in-memory) technologies are used as the foundation of the storage engine, performance and scalability with respect to data volume are much better.
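
    To make that concrete, here is a minimal sketch: plain MDX of the kind clients already send to Analysis Services over XMLA or ADOMD.NET, and, if the above holds, the kind of request that should run unchanged against a Gemini model. The model and hierarchy names below are hypothetical, since no Gemini-specific details have been published:

    // Hypothetical model and hierarchy names; nothing Gemini-specific in the syntax itself.
    SELECT
      { [Measures].[Sales Amount] } ON COLUMNS
     ,[Product].[Category].MEMBERS ON ROWS
    FROM [Sales Model];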

    To read more about column-oriented processing: http://www.dbms2.com/category/database-theory-practice/columnar-database-management/

    To read more about in-memory databases: http://www.dbms2.com/category/memory-centric-data-management/in-memory-dbms/

  • The big news from the MS BI Conference is out

    So there were two sets of announcements just made during the Ted Kummert keynote.

    1. Project “Madison” – a shared-nothing MPP scale-out version of SQL Server, powered by DATAllegro technology. We saw a live demo of 24 instances running a 150 TB data warehouse, returning queries into Reporting Services in seconds. This is great news, but everybody could probably predict that this was going to happen sooner or later after the DATAllegro acquisition was announced. Well, it is sooner rather than later.

    2. SQL Server “Kilimanjaro”, to be released in the first half of 2010, driven by project “Gemini” technology. This is the big, big news I referred to earlier. While Ted in his presentation and Donald during the demo focused on the self-service part of it, I want to draw attention to the technology powering it. Just repeating what Ted and Donald said: this is Analysis Services using column-oriented and in-memory technologies, running inside Excel (first demo) and in SharePoint (second demo). This is what allows loading all these tens of millions of rows into Excel and fully manipulating them in split seconds, building pivot tables on the fly, inferring data relationships, etc. As Ted said, work on this project has been ongoing for quite some time now, since well before SQL 2008 shipped, and this is the first public showing of the technology. Many more details are coming during this session – don’t miss it tomorrow!

    Update:

    Some more coverage here:

    Richard Tkachuk: http://blogs.msdn.com/sqlcat/archive/2008/10/06/project-gemini-building-models-and-analysing-data-from-excel.aspx

    Marco Russo: http://sqlblog.com/blogs/marco_russo/archive/2008/10/06/microsoft-bi-conference-2008-announcements-on-stage.aspx
