|
|
|
|
SQLBI is a blog dedicated to building Business Intelligence solutions with SQL Server
-
A calculated column is a DAX expression which is evaluated when the PowerPivot workbook is updated. It is very useful, but there are cases where you want to delay calculation at query time. For example: - You want to make part of the calculation depending on the selection in the PivotTable (i.e. selecting a scenario for a simulation)
- You want to avoid the storage of another column in the PowerPivot workbook (it has a cost in terms of space, after all)
First of all, if you have to be worried about storage space, probably you have to look at your data source and check if the calculation can be made on source data, without requiring all the underlying columns to be imported. For example, if you have Quantity and Price and want to calculate Amount (as the product of the formers), you can make this calculation on the source data and import only Quantity and Amount (the Price can be calculated later as an average, which is what people usually want to do). That said, the point here is that calculating the Amount is pretty simple in a calculated column. 'Orders'[Amount] = 'Orders'[Quantity] * 'Orders'[Price] However, this formula is not working as calculated measure, because you will get this error message if you try to define a calculated measure with the expression above: The value for column 'Quantity' in table 'Orders' cannot be determined in the current context. To avoid this, you have to make a decision about the aggregation formula you want to use and then using the “X” version. For example, the Amount will be aggregated summing values, so it is safe writing this calculated measure: 'Orders'[Amount] = SUMX( 'Orders', 'Orders'[Quantity] * 'Orders'[Price] ) As I said at the beginning, this technique is fundamental whenever you want to delay the product operation at calculation time. In fact, you might have an expression other than a simple product, where part of the calculation depends on the current context of the calculation in the pivot table. More about this in a future post.
|
-
I recently talked about memory usage in PowerPivot and previously I wrote about difference modeling options using table normalization or not. Thus, I wondered whether table normalization is a good practice for performance and memory or not. The short answer is that yes, it might be better but not so much as you might suppose (and there are difference considerations about processing and query performance). Rule of thumb: a star schema is always a good thing! Have a good reading. - PowerPivot allows definition of relationships between tables loaded in a PowerPivot model. However, whenever these tables come from the same relational database, the same data could be loaded into a single table, by using data denormalization in the SQL query itself. From a modeling point of view, it is clear that the best way to import data into PowerPivot is creating a sort of star schema, by denormalizing entities that represent dimensions of data and by keeping the normalization of the data that follows the star schema pattern (numeric aggregatable measures in the fact table, qualitative attributes in dimension tables). But from a performance point of view, which is best for speed and memory consumption? We considered three cases isolating a single dimension normalized and denormalized: - Large dimension (240.000 items)
- Medium dimension (4.400 items)
- Small dimension (6 items)
We always used the same fact table with 6.5 millions of rows. Large Dimension Temporary SSAS database (it is located in C:\Users\USERNAME\AppData\Local\Temp and it has a name with the prefix IMBI_) is 119Mb when we use the denormalized dimension, and it is 113Mb when we use the normalized one (removing the surrogate key field pointing to large dimension, otherwise the size would have been 118Mb). However, there is a bigger difference in Excel files, because the denormalized file is smaller (58Mb) than the normalized one (66Mb). This could be caused by compression algorithms that are more efficient in that case. In the normalized case, the large dimension has a size of 19.3Mb and the fact table is 99.4b. In the fact table, the attribute holding the surrogate keys for the large dimension is 14.8Mb. Also time for full refresh of data is shorter in the normalized case, 90 seconds instead of 152 seconds required for the denormalized case. Memory required for full refresh is less for normalized case (481Mb) than for denormalized case (530Mb). In the denormalized case, it is relevant to note that the surrogate key can be removed (it is useless to analyze). The size for the fact table becomes 113Mb (like the whole cube), but in this case all the attributes that were previously in the denormalized dimension are now in the fact table itself. It is not easy to explain why the normalized case has a smaller size. In fact, the cardinality of the attributes (except the surrogate key) is the same, and considering that we don’t have the surrogate key anymore (it is copied twice in the normalized case, one in the fact table and another one in the dimension table), there should be something else. The space that would have been saved by removing the surrogate key is in fact used by other existing attributes, which uses more space to be stored, resulting in a similar size of the whole SSAS database. Trying to include the identity key for the fact table as an attribute, the processing fails with a 32-bit version of Excel. Reducing the volume to 3 millions of rows for the fact table, it works but the larger attribute is the identity key and it seems that size required for other attributes is lower. With a 4 million rows fact table, the database has a size of 147Mb including the identity key attribute, but is just 58Mb if we remove this attribute. Medium Dimension The SSAS database is 80Mb when we use denormalized dimension, and it is 76Mb when we use the normalized one. Excel file in this case is smaller for the normalized case (41Mb) than for the denormalized one (43Mb). Different performance in compression algorithms seems the only explanation, but we might check this again. Also time for full refresh favor the normalized case, requiring 97 seconds instead of 115 seconds of the denormalized case. Memory required for full refresh is less for normalized case (410Mb) than for denormalized case (485Mb). The difference in SSAS database size is not formed by attributes of the dimension that is denormalized into the fact table, but is caused by a larger size of the other attributes (which are also measures) when the fact table is processed in the denormalized case. This is really strange. The application key of the dimension we denormalized in the fact table has exactly the same cardinality as the surrogate key, which we removed from the fact table. Looking at the dimension of attributes in the SSAS cube files, it seems that SSAS looks for an attribute with the larger number of distinct values and uses it, but the algorithm should be much more complex than that because it is not what we can observe looking at SSAS database file in temporary directory. The SSAS database is 66Mb when we use denormalized dimension, and it is 65Mb when we use the normalized one. Both Excel files are around 34Mb of size. Time for full refresh is shorter for normalized case, 120 seconds instead of 143 seconds required for denormalized case. Memory required for full refresh is similar, 420Mb instead of 436Mb of committed size for normalized case against denormalized one. Even in this case, the difference seems to be caused by different algorithms used by SSAS to process data. More complex scenario Considering that there are complex dependencies among attributes, we added another test. We got a 2 million rows fact table with 7 dimensions and we tried to denormalize all of these dimensions in the fact table. The final result is still not so different, even if in this case denormalization seems to be more optimized. The SSAS database is 91Mb when we use denormalized dimension, and it is 93Mb when we use the normalized one. Similar differences are in Excel files, 55.8Mb the denormalized one and 57.3Mb the normalized one. There are much bigger differences in terms of time and memory required to process. The normalized version needs 54 seconds for a full refresh and committed memory size does not exceed 480Mb. The denormalized version requires 156 seconds for a full refresh (it is three times than normalized version!) and consumes much more memory during the process, reaching 930Mb of committed memory size (almost double the memory required for normalized version). Final Considerations There are no significative differences in storage space between normalized and denormalized dimensions. However, data normalization may strongly affect the processing phase (the action made whenever data is imported or refreshed). Dimensions containing the higher number of rows and columns are best candidates to be normalized. The complex scenario offers an example of a big difference in memory required to process data, and this is especially important using the 32 bit version of Excel .  To reduce time and memory required, it is really important having less attributes, and high cardinality attributes are particular expensive. Measures in fact tables are considered also attributes and they often have high cardinality. This is important to consider: each measure added to a model is expensive if the fact table has many rows. Calculated columns have the same consumption than imported columns. Only calculated measures don’t increase required storage space, because they are evaluated at query time and not at processing time.
|
-
I started using SQL Search (a free product from RedGate) several weeks ago, while it was still in beta. Now that the product has been released and is freely downloadable, I can say it has being really useful since the first beta. My common need is to look for the use of a column in every table, view, stored procedure and UDF… And this tool save me the time of generating the script and making a text search – not mentioning that I can restrict the scope of a search and there is also an integration with Object Explorer to select an object found in the search. Considering that the tool is free, it has the better possible ROI!
|
-
Denny Lee described how to open a trace file in PowerPivot for Excel. It is something that is very important because the SSAS engine which runs PowerPivot runs inside the Excel process and doesn’t have an external connection (thus, also SQL Profiler cannot connect directly to that engine). Of course, this will be also the preferred way to diagnose user troubles when they have a big Excel file that is too wide to be moved by email to make a first diagnostic.
|
-
PowerPivot for Excel is an Add-In which uses a local version of Analysis Services (SSAS) to process data and make calculation to respond to user queries made using Excel 2010. The SSAS engine is loaded in-memory in the process of Excel. Especially with a 32 bit of Excel, there are particular considerations about the use of the memory. When an Excel file containing PowerPivot data is open, the memory consumption is not affected by the PowerPivot volume of data. These data are loaded only when the PowerPivot window is opened or when the PivotTables and/or PivotCharts based on PowerPivot data are updated. Loading PowerPivot data into Excel requires two steps: - The SSAS backup contained into the Excel file is restored into a temporary table (a folder named with an IMBI_ prefix is created in the C:\Users\<<USERNAME>>\AppData\Local\Temp directory). This requires disk space sufficient to restore these files, which are usually 2 to 3 times the size of the Excel file.
- The SSAS database, once decompressed, is also loaded in virtual memory. Thus, as a rule of thumb, loading and browsing data of an Excel file containing PowerPivot data requires as much memory as 2 to 3 times the size of the Excel file.
The memory available for PowerPivot in a 32 bit version of Excel is just above 1Gb (it is lower than the virtual memory addressable space). For this reason, is it not so hard to get this error during the process of a large set of data: Memory error: Allocation failure : Not enough storage is available to process this command. . The operation has been cancelled. The best way to avoid this error is using a 64 bit of Excel. However, there are many reasons for not having a 64 bit of Excel even if the operating system is a 64 bit one. The main reason, in my personal case, is the use of Outlook AddIns that are not available for the 64 bit version and it is not possible to mix 32 and 64 bit of Office products on the same machine. Therefore, it is a good idea trying to optimize the PowerPivot data model in a way that makes a better use the available memory. Reduce the number of rows This is probably a useless suggestion. Reducing the number of rows, the size of the database is reduced too. If you are extracting data from a database and you don’t need the full detail during the analysis, you might import data at the cardinality really required by your analysis. For example: if a table contains all the sales transactions of a retail store, but you only need to analyze data at the month level, you might group data by month, reducing the number of rows to be loaded. Most of the times, this optimization is not possible, because the user doesn’t want to lose the granularity of analysis. Reduce the number of columns This is by far the most important suggestion. If you don’t need a column, don’t import it! PowerPivot uses a column-oriented database technology. Each column has its own storage and indexed structure. Each column increases processing time and memory required for both processing and execution. An important concept is that every column imported is relevant. Even if PowerPivot internally uses an Analysis Services engine, it is a particular type (in-memory) that doesn’t make distinctions between quantitative measures and qualitative attributes. Thus, a SalesQuantity column is considered as both a dimension attribute and a measure. Optimize column data types Not every column has the same weight for memory and processing time. A column with few distinct values will be lighter than a column with a high number of distinct values. As we said, this is important also for measures, which are considered also possible quantitative attributes. If the measure you are storing is a float and is the result of a calculation, you might consider reducing the number of digits to be imported. This will reduce the size of the dictionary, and possibly also the number of distinct values. Columns with string data types PowerPivot stores a dictionary of all the unique strings in each attribute. Thus, having a long string is not a problem by itself, nor having a few very long strings if the number of distinct values is low. The point is that the average length of a string in an attribute multiplied by the number of unique values of the attribute is the measure of the size required to store this dictionary. For this reason, importing a long description that is different for every transaction is not a good idea. Avoid high-cardinality columns Columns that contain a very high number of distinct values are very expensive for PowerPivot. For example, the Invoice ID in a Sales Transactions table is very useful for the end user when it is necessary to drill-down at a transaction level. However, you have to be aware that this single column might be the most expensive one of your PowerPivot dataset, especially during the process phase. Consider Calculated Columns Each calculated column is stored into the PowerPivot data model just as it was an imported column. This is not true for calculated measures, which are calculated at query time. If the same calculation can be defined as either a calculated column or a calculated measure, the latter is a better solution from a resource consumption point of view. However, a calculated measure loses the ability to navigate into its values as an attribute like you can do using a calculated column. If you have to store a measure as a calculated column, consider reducing the number of digits of the calculation: you can use the DAX function named ROUND just for this purpose. Normalize data Normalizing data doesn’t have a big effect on the size of the resulting database. However, it might have a strong impact on both processing time and memory required to process data. The key is to find a right balance. A full denormalized table, which is the case of a single SQL query that denormalizes all of the attributes resulting in a PowerPivot dataset made of a single table, has a long processing time and requires more memory to be processed. In fact, during the process PowerPivot holds 1-2 millions of rows in a buffer and, if the rows have are large, this restricts the memory available to store data. At the same time, a complete normalization of data, like the one of a third normal form, could be not a good idea. The number of relationship increase very much and it makes necessary having many columns in the model just for technical reason (defining relationships), but they are of no use for the end user. The resulting model is much more complex and is not faster than a balance between these two extremes. The right balance is the one offered by a typical star schema. Putting all the attributes of a single entity into one table, just like a dimension in a Kimball star schema, appears as the best tradeoff also for PowerPivot. The resulting model is also easier to navigate for the end users. This level of normalization usually offers the best processing times, some saving in storage size and a better use of memory during processing of data (the phases of import and refresh data).
|
-
The calculation for ABC analysis can be made in PowerPivot using calculated columns. In this way each row can have an attribute with the appropriate ABC class. The ABC calculation has to be made considering a particular grouping and sort order. For example, we wanted to define an ABC class of Products for internet sales transactions of AdventureWorks. We want to classify our Products in this way: - Products that make 70% of the sales are in Class A
- Products that make 20% of the sales are in Class B
- Products that make 10% of the sales are in Class C
Example with a single denormalized table In this first example we use a completely denormalized table. A different approach would be necessary using a normalized product table. WARNING: Please note that calculating ABC class on a denormalized table could have bad performances because it requires heavy calculation during data load. The normalized approach is far better in terms of performance, because the calculation only depends on the size of the related table (usually the dimension) and not on the size of the table containing the sale transactions. This is the query we use to get data from AdventureWorks DW. SELECT c.FirstName + ' ' + c.LastName AS Customer, c.EmailAddress, p.EnglishProductName, p.ModelName, s.OrderQuantity, s.SalesAmount FROM dbo.FactInternetSales s INNER JOIN dbo.DimProduct p ON p.ProductKey = s.ProductKey INNER JOIN dbo.DimCustomer c ON s.CustomerKey = c.CustomerKey There are several steps to obtain the desired ABC class, each of them is a new calculated column in the PowerPivot dataset. Before starting, consider that we will use the ProductName and the SalesAmount columns in our formulas. - SalesAmountProduct – this is the sales amount for each Product; the same value is duplicated for each row of the same product. The EARLIER function get the product name of the current row and the FILTER returns all the rows for the sales of the same product. This complexity is not required in the normalized version of the ABC calculation.
= SUMX( FILTER( 'Sales', 'Sales'[ProductName] = EARLIER('Sales'[ProductName]) ), 'Sales'[SalesAmount] )
- CumulatedProduct – this value is the cumulated value of a product, considering them ordered from the top-seller down to the worst one. In this case, the EARLIER function is used to get the sales amount for the current product and the FILTER returns all the rows of the products that sold at least the amount value of the current product.
= SUMX( FILTER( 'Sales', 'Sales'[SalesAmountProduct] >= EARLIER('Sales'[SalesAmountProduct] ) ), 'Sales'[SalesAmount] )
- SortedWeightProduct – This calculation simply transforms the CumulatedProduct calculation into a percentage. This number will be used to filter all the transactions according to the percentage limit corresponding to ABC classes (usually A = 70%, B = 20% and C = 10%).
='Sales'[CumulatedProduct] / SUM( 'Sales'[SalesAmount] )
- ABC Class Product – this is the final result of our calculation; depending of the value of SortedWeightProduct an A, B or C is displayed
=IF( 'Sales'[SortedWeightProduct] < 0.7, "A", IF( 'Sales'[SortedWeightProduct] < 0.9, "B", "C" ) ) In this way, we obtain an ABC class for each product sale. At this point, we can browse the data with a PivotTable in Excel. Just to make an example, we can look at the relevance of ABC products for each customer (in this case customers are sorted by total of sales in descending order). Example with normalized tables In this second example we use three normalized tables, importing from AdventureWorks DW the following tables: - DimCustomer
- DimProduct
- FactInternetSales
In this case, we defines these calculated columns in the DimProduct table. In this case will use the ProductKey to identify the product and SalesAmount as the measure to use for ABC classification. - SalesAmountProduct – this is the sales amount for each Product; in this case, each calculation is different for each row, because each row is a single product. We make use of the existing relationship between FactInternetSales and DimProduct (remember that we are defining a calculated column in the DimProduct table).
= SUMX( RELATEDTABLE('FactInternetSales'), 'FactInternetSales'[SalesAmount] )
- CumulatedProduct – this value is the cumulated value of a product, considering them ordered from the top-seller down to the worst one. In this case, the EARLIER function is used to get the sales amount for the current product and the FILTER returns all the rows of the products that sold at least the amount value of the current product.
= SUMX( FILTER( 'DimProduct', 'DimProduct'[SalesAmountProduct] >= EARLIER( 'DimProduct'[SalesAmountProduct] ) ), 'DimProduct'[SalesAmountProduct] )
- SortedWeightProduct – This calculation simply transforms the CumulatedProduct calculation into a percentage. This number will be used to filter all the transactions according to the percentage limit corresponding to ABC classes (usually A = 70%, B = 20% and C = 10%).
= 'DimProduct'[CumulatedProduct] / SUM( 'DimProduct'[SalesAmountProduct] )
- ABC Class Product – this is the final result of our calculation; depending of the value of SortedWeightProduct an A, B or C is displayed
= IF( 'DimProduct'[SortedWeightProduct] < 0.7, "A", IF( 'DimProduct'[SortedWeightProduct] < 0.9, "B", "C" ) ) As we said, the calculated columns are all defined in the DimProduct table. Finally, we can do the same ABC analysis as in the case of the denormalized table used as a source of data. In this case we see the distribution of ABC class of products among models. As we might expect, there is a strong relationship, which indicates we might classify Models instead of products, producing a very similar results, but with a few notable exceptions (like the Road-650 model at row 11, which indicates there are smaller sales divided between very similar products). From a business point of view, this could be an important decision to be made (classifying ABC for models instead than for products), especially if products differs only for characteristics that doesn’t affect the production lines. Working with denormalized attributes on normalized tables At this point, it could be interesting to look at how we can write the ABC classification based on the models of the product using the previous example with normalized tables. In this case, we will use a mix of the techniques we have seen in the two scenarios (denormalized and normalized tables). The set of tables is the same as the previous example and we will continue to add columns to the DimProduct table. In this case, the ModelName column will be our key to identify the granularity of ABC classification. - SalesAmountModel – this is the sales amount for each product Model; in this case, calculation is the same for all the products of the same model. The EARLIER function get the model name of the current row (a single product) and the FILTER returns all the products of the same model.
=SUMX( FILTER( 'DimProduct', 'DimProduct'[ModelName] = EARLIER( 'DimProduct'[ModelName] ) ), 'DimProduct'[SalesAmountProduct] )
- CumulatedProduct – this value is the cumulated value of a model, considering them ordered from the top-seller down to the worst one. We have to use the CALCULATE function to sum a calculated column filtering only the rows (products) of a certain model. We cannot use the SUMX on the FILTER result, because it wouldn’t contain the SalesAmountProduct column we need to sum.
= SUMX( FILTER( 'DimProduct' , 'DimProduct'[SalesAmountModel] >= EARLIER( 'DimProduct'[SalesAmountModel] )), 'DimProduct'[SalesAmountProduct] )
- SortedWeightProduct – This calculation simply transforms the CumulatedModel calculation into a percentage. This number will be used to filter all the transactions according to the percentage limit corresponding to ABC classes (usually A = 70%, B = 20% and C = 10%). Please look at the denominatore, which contains an aggregatable column (SalesAmountProduct) instead of the sum of the CumulatedModel column, which is not aggregatable!
= 'DimProduct'[CumulatedModel] / SUM( 'DimProduct'[SalesAmountProduct] )
- ABC Class Product – this is the final result of our calculation; depending of the value of SortedWeightProduct an A, B or C is displayed
= IF( 'DimProduct'[SortedWeightModel] < 0.7, "A", IF( 'DimProduct'[SortedWeightModel] < 0.9, "B", "C" ) ) This is the new DimProduct with both the ABC Class Product and ABC Class Model (even if it is not visible in this screenshot, there are several cases where these two calculated columns don’t have the same value). Using the ABC Class Model, we can see that each model only belongs to one class. Common calculations Coming back to DAX, we can see that the three cases are very similar and, just to simplify cut & paste of these calculations, we can customize just the first column, leaving standard DAX formula (without the name of the table) in the following ones. Usually, we will need to create just one ABC class in a PowerPivot workbook. For this reason, having these formulas ready to use could be useful. I use the example using SalesAmountProduct as a starting measure (we can simply search & replace this name with our own measure) and ‘Table’ to indicate the table in which we create these calculated columns with the data in PowerPivot. - SalesAmountProduct – this is the only calculation that differs for each implementation. Look at previous examples to check what DAX formula to use.
- CumulatedProduct – in this case the denominator (highlighted in bold) shall correspond to the value of SalesAmount calculated at the granularity of Table. Thus, the following formula is good only for Product if the denormalized table Is the Product one, otherwise check the bold denominator because it must be adapted to conform to Table granularity (see the previous sections to look at some examples).
= SUMX( FILTER( ‘Table’, [SalesAmountProduct] >= EARLIER( [SalesAmountProduct] ) ), [SalesAmountProduct] )
- SortedWeightProduct
= [CumulatedProduct] / SUM( [SalesAmountProduct] )
- ABC Class Product
= IF( [SortedWeightProduct] < 0.7, "A", IF( [SortedWeightProduct] < 0.9, "B", "C" ) )
|
-
Following the full disclosure of Grant Fritchey and Andy Leonard, here are my evaluations of PASS 2009 Summit evaluations (in both sessions I was on stage together with Alberto Ferrari). To make the post more readable, I anticipate the comments. The first session was about the SQLBI Methodology. We previously delivered this session at PASS Europe and I have to say we didn’t expect the room full of people. We got many many feedback after the session and we were very happy. The results seems very good and of course we would have needed more time to cover this argument appropriately. Probably it would require a workshop of one day. However, we think that in one hour the important messages have been delivered and the following part is a thorough study of our paper and an adaptation of the methodology to specific scenarios. Despite our Italian accent, we’ve been rated very high for our presentation skills. I suspect that it is because we are funny for that accent, but if it helps to digest the content better, it’s good anyway! The other session (Custom Security in SQL Server Analysis Services) was a substitution of a canceled session and it was not announced very well. The content was the same as one of the chapters of our Expert Cube Development with Microsoft SQL Server 2008 Analysis Services book. There were few people but all very interested in the topic and most of them probably solved some issues that was puzzling them! It’s strange that in our feeling the SQLBI Methodology session was much better than the security one, but if you look at the numbers both have been good, but the Security has been rated a little bit higher. Probably, one session with practical solutions to specific technical issue is rated very high if people had exactly those problems! A methodology session is probably more important, but at the same time you need time to apply the concepts and they are somewhat more vague until you don’t use them. For the curious of you, here are the raw data. BIA-411 - SQLBI Methodology There were around 200 attendees (room was full), we got 87 feedbacks. Usefulness of Session Average: 9 (10%) Good: 38 (33%) Excellent: 65 (56%) Speaker’s Presentation Skills Average: 1 (1%) Good: 29 (33%) Excellent: 57 (66%) Speaker’s Knowledge Average: 0 Good: 22 (25%) Excellent: 65 (75%) Accuracy of Session Description, etc. Poor: 1 (1%) Average: 8 (9%) Good: 30 (34%) Excellent: 48 (55%) Amount of Time for Session Very Poor: 1 (1%) Poor: 2 (2%) Average: 14 (16%) Good: 40 (46%) Excellent: 30 (34%) Quality of Materials Very Poor: 1 (1%) Poor: 0 Average: 9 (10%) Good: 35 (40%) Excellent: 42 (48%) Comments Presentation was good, still hung up on the concept of views. Good dynamics between the presenters. Portions of the presentation were very good but were offset by parts that were average. A little more time for this session and refined materials would make this good session excellent. BIA-321- Custom Security in SQL Server Analysis Services There were around 30 attendees, we got 14 feedbacks. Usefulness of Session Good: 4 (29%) Excellent: 10 (71%) Speaker’s Presentation Skills Average: 1 (7%) Good: 5 (36%) Excellent: 8 (57%) Speaker’s Knowledge Good: 2 (14%) Excellent: 12 (86%) Accuracy of Session Description, etc. Good: 5 (36%) Excellent: 9 (64%) Amount of Time for Session Good: 6 (43%) Excellent: 8 (57%) Quality of Materials Good: 6 (43%) Excellent: 8 (57%) Comments (no comments)
|
-
I know, the title is provocative. It's intentional. I'd like to get feedback about my opinion.
My point is simple (but not so short!). PowerPivot (which, you know, I like) and self-service BI are the new buzzwords for BI in Redmond. However, there are mature and solid technologies out there (read Analysis Services) which simply don't have a correspondent client technology provided by Microsoft that takes advantage of all the SSAS capabilities. I know, I'm not talking about BI for the masses, I'm talking about a more "traditional" BI, that can be still a barrier to Microsoft technologies penetration in some companies. In fact, who has a solution based on another technology might be unlikely to migrate everything just to deliver some report to more users. Right or not, this is what happens in the real world.
Now, if only I had know that MS didn't had plan to write the "final" BI client in 2004... You know, the day MS presented WPF (it was called Avalon at that time) it was clear that the new generation BI client tool had to be written using that technology. And to me it was obvious that someone were writing that killer application. Five, six years later, nothing happened. Microsoft bought ProClarity in the worst move ever. They scared potential investor in the BI client tool and they failed in producing a client tool which was independent from the Office product line. If only I had know that... yes, I would've looked for VC to produce that killer app. Today, it would be a game over.
Instead, nothing really important happened. Yes, if you have Office 2010 you can use Excel 2010 that fixed some bugs of Excel 2007 and it has cool new features and can be published with SharePoint 2010. Amazing. But, I know a lot of companies where delivering SharePoint 2010 just for BI is something that nobody will consider. Finally, Microsoft provides a complete set of technologies to those companies who want to provide BI for the masses. But our advanced users doesn't have very much. PowerPivot is very cool for all these databases they don't have in an OLAP cube. But if they already have a cube, they lack of an advanced client tool.
To tell the true, several new tools emerged in the last few years, and I've seen very very good piece of software. But, no killer app, sorry - no WPF-Silverlight powerful-animated-fast client tool out there (if someone wrote it, please ring the bell!). Maybe I'm wrong. Maybe the market is too small for "traditional BI". I don't know.
But I know that Microsoft is not playing the BI client game out of Office. Otherwise, it wouldn't make sense that Jamie McLennan left Microsoft for a newco (founded just two months ago with VC investments) that will work in the Data Mining field using Microsoft techologies. What else if not a "client tool" for Data Mining?
Today I'm much less worried than some years ago suggesting alternative (to Excel) client tools for BI, created by third parties (that usually are Microsoft partners, anyway). In fact, I still don't see the killer app. Today, we have the technology foundation for building Minority Report style BI tools. But, where is the product? I still don't see it, and I really don't believe (at this point) that I will see it from Microsoft.
|
-
The CALCULATE function in DAX is the magic key for many calculations we can do in DAX. However, it is not pretty intuitive how it works and I spent a lot of time trying to understand how it can be used. First of all, this is the syntax. CALCULATE( <expression>, <filter1>, <filter2>… ) The expression that we put in the first parameter has to be evaluated to return the result (that is a value, not a table). For this reason, the expression is usually an aggregation function like SUM, MIN, MAX, COUNTROWS and so on. This expression is evaluated in a context that is modified by the filters in the following parameters. A key point is that these filters can both enlarge and restrict the current context of evaluation. Let’s try to understand what it means by considering a few examples. The following table is the one we import in PowerPivot in a table named Demo. If we project the count of Rows in a PivotTable putting the Name on the Rows, we get the following result: Now, we might have the need to calculate a measure which is not affected by the selection of the Name or, in other words, which always calculate the context for all the names. Thus, we define this calculated measure: CountAllNames = CALCULATE( COUNTROWS('Demo'), ALL( 'Demo'[Name] ) ) We obtain the CountAllNames column in the PivotTable that always returns the number of all the rows of the Demo table, without considering the filter on the Name. However, if we add Color attribute to the rows of the PivotTable, the CountAllNames is filtered by that attribute too. In fact, for each color, we have 2 rows for a Name (see Count of Rows column) and 6 rows considering all the names (see CountAllNames column). Now, we can also add a Boolean expression as a filter of the CALCULATE function. For example, we might filter just the Car transport. CountAllNamesCar = CALCULATE( COUNTROWS('Demo'), ALL( 'Demo'[Name] ), 'Demo'[Transport] = "Car" ) In this case, we will reduce the CountAllNamesCar column for a color of Emily to 3, because the number of rows with color Green and Car transport are 3. At this point we might wonder whether the Calculate filter parameters are enlarging or restricting the context of evaluation. The rule is the following one. If the current context has a filter on a column of a PowerPivot table (which is a selection of a PivotTable, regardless it is a slicer, a report filter or a row/column selection), any reference for that column in one or more filter parameters of the Calculate function replaces the existing context. Then, the filters specified in the CALCULATE parameters are combined together like they were in an AND condition of a WHERE clause of a SQL SELECT statement. For instance, consider a filter on the Color green using a Boolean expression in the CALCULATE function: ColorGreen = CALCULATE( COUNTROWS('Demo'), 'Demo'[Color] = "Green" ) A Boolean expression used as a filter parameter in a CALCULATE function corresponds to an equivalent FILTER expression that operates on all the values of a column (for this reason, you can only a single column can be used in a Boolean expression that is used as a table filter expression): ColorGreen = CALCULATE( COUNTROWS('Demo'), FILTER( ALL('Demo'[Color]), 'Demo'[Color] = "Green" ) ) We obtain that ColorGreen column always filters by color Green and each Name has only 2 rows with color Green. Any filter expression in a CALCULATE statement overrides the existing selection of the PivotTable for the columns it contains. In the previous rule we highlighted the “reference” definition, because the FILTER that is internally used in place of the Boolean expression uses a FILTER expression that returns a set of values for the Color column. Thus, the existing selection for the color (the color is in fact specified in the rows of the PivotTable) is overridden by our filter and only green rows in the source table are considered to calculate the ColorGreen measure value. The true reason we lose the current selection on the color attribute is that the ALL( Demo[Color] ) expression returns a set of all the color values and ignores the existing selection. If we don’t want to lose the existing selection of the PivotTable (that means that we don’t want to lose the existing filters on the calculation context), we can simply use in the FILTER expression a function that doesn’t ignore the existing selection. Instead of using the ALL( Demo[Color] ) expression as the source of the filter, we can use the VALUES( Demo[Color] ) expression, which keeps existing selections and returns the values still available in the color attribute. In fact, if we use the following calculated measure: ColorGreen = CALCULATE( COUNTROWS('Demo'), FILTER( VALUES('Demo'[Color]), 'Demo'[Color] = "Green" ) ) We obtain as a result that the Color filter in the PivotTable is still active, and returns no rows for all the colors but green. We can see that the FILTER expression in a CALCULATE function always replaces the previous context for the referenced columns. However, we can save the existing context by using an expression which uses the existing context and further restricts the members we want to consider for one or more columns. And this is what we have done using VALUES formula instead of ALL as the first parameter of the FILTER call. Thanks to Marius Dumitru, the various combination of FILTER, ALL, VALUES in a CALCULATE statement can be summarized in this way. = CALCULATE( COUNTROWS('Demo'), FILTER( ALL('Demo'[Color]), 'Demo'[Color] = "Green" ) ) -- ignores/replaces existing Color filters and sets a filter on Green = CALCULATE( COUNTROWS('Demo'), FILTER( VALUES('Demo'[Color]), 'Demo'[Color] = "Green" ) ) -- keeps existing Color filters and adds a further filter on Green = CALCULATE( COUNTROWS('Demo'), FILTER( ALL('Demo'[Color]), 'Demo'[Color] = "Green" ), VALUES('Demo'[Color]) ) -- same as the previous expression (keeps existing Color filters) -- NOTE: the first filter would consider all the colors, but the second -- expression (VALUES) only consider the current selection and -- the two filters will be considered using an AND condition, thus = CALCULATE( COUNTROWS('Demo'), FILTER( ALL('Demo'), 'Demo'[Color] = "Green" ) ) -- ignores/replaces filters on all Demo columns, -- and sets a filter on all columns (Name, Transport etc., not just Color) -- with rows that meet the condition Finally, some words of caution have to be spent to the first parameter we pass to the FILTER function. If we consider this expression: ColorGreen = CALCULATE( COUNTROWS('Demo'), FILTER( 'Demo', 'Demo'[Color] = "Green" ) ) We pass the whole Demo table to the FILTER condition, which results a filter of the current context with all the columns! In this way we apply a restrictions on the color green and we get the same result as before (no rows for all the colors but green, the selection of color of the PivotTable is still applied) but, remember, the FILTER is returning ALL the rows. What does it mean? Well, consider a further selection on the PivotTable where the Transportation attribute is filtered by Bike. This is the result using the ColorGreen definition we have just defined. Now, let’s add another filter to the CALCULATE function, filtering also the rows with Trasport equals to Car. CarGreen = CALCULATE( COUNTROWS('Demo'), FILTER( 'Demo', 'Demo'[Color] = "Green"), 'Demo'[Transport] = "Car" ) which, as we have seen before, corresponds to: CarGreen = CALCULATE( COUNTROWS('Demo'), FILTER( 'Demo', 'Demo'[Color] = "Green"), FILTER( ALL( 'Demo'[Transport] ), 'Demo'[Transport] = "Car" ) ) The results is that the filter for Bike defined in the PivotTable plus the filter for Car defined in the CarGreen measure returns no rows at all in the CarGreen result. If we think about it, why this happens is not very intuitive at this point! Our filter on the Transport column is actually restricting the current selection and is not replacing it! But, wait, if we write just the filter on Transport, without the filter on Color, we have: TransportCar = CALCULATE( COUNTROWS('Demo'), 'Demo'[Transport] = "Car" ) that produces the following result, which replaces the Bike selection of the PivotTable in the TransportCar column! This last calculation (TransportCar) is simple to explain, because it has the same behavior we have seen before with the first ColorGreen calculated measure, where the filter of the color green replaced any existing color selection in the PivotTable. The difference in the CarGreen calculation is the other filter parameter, which returns all the columns from the Demo table. Consider the highlighted filter in the CarGreen formula: CarGreen = CALCULATE( COUNTROWS('Demo'), FILTER( 'Demo', 'Demo'[Color] = "Green"), FILTER( ALL( 'Demo'[Transport] ), 'Demo'[Transport] = "Car" ) ) The filter on color green returns all the columns of the current context. If we consider the corresponding rows for the cell B5 of the PivotTable (Emily, Green, Bike), this is just one row (the yellow one), and this row has the Bike value for the Transport attribute. When we apply the second filter, we have a single value for the attribute Transport, which is Car. At this point, the intersection between those two sets of Transport (one is only Bike, the other is only Car) is an empty set. Thus, the result for CarGreen measure is empty, because there are no corresponding rows for the selection made. This can be tricky, but we finally have this behavior. - The CALCULATE function applies a calculation (the first parameter) for each cell, considering the resulting context by applying the filters (the second and following parameters) to the current context.
- Each filter can have values for one or more columns.
- Each column is computed individually in the filters expressions of the CALCULATE function
- If a column value is specified in at least one filter, it replaces the selection of the current context for that column.
- If a filter expression returns more columns, each one has its own independent set of values in the final calculation context definition
- If a column is specified in more filters, the resulting values are the intersection of these set of values (for that column).
- After all the filters have been evaluated, the intersection of the column values determines the calculation context for the expression passed as the first parameter to the CALCULATE function
Despite its complexity, this calculation is pretty fast. The key point is to understand all the side effects we have when a filter returns more columns than those we specified in the filter condition itself, which is something we have to consider carefully each time we use one or more FILTER functions inside a CALCULATE expression.
|
-
Even if the site still is not updated, the Microsoft BI Conference 2010 will be joined with TechEd 2010 in New Orleans, LA. The conference will start on June 7, 2010, with a pre-conference day on June 6 (there are still no updates about BI Conference on the TechEd site too). I think this is a good way to optimize content delivery. Several (technical) sessions can really be shared across TechEd and BI Conference, and there is the opportunity to add sessions less technical and more business-oriented that wouldn’t have fit very well at TechEd.
|
-
A PivotTable based on PowerPivot data doesn’t offer a native Distinct Count aggregation. To calculate it, you need to create a calculated measure in the PivotTable. For example, if we have a simple Sales table from AdventureWorks loaded in PowerPivot and projected into a PivotTable like the following picture.  To create a new measure, we need to right-click on the field pane the menu “Add New Measure…” At this point we name the new measure Distinct Customers and write the DAX formula that calculates the number of rows returned by the DISTINCT function, which returns a list of the distinct values contained in a column (we choose EmailAddress in this example, but if we had a customer key we would have used it instead).  The DAX formula is: COUNTROWS( DISTINCT( Table[Column] ) ) Where Table is the name of the PowerPivot Table containing the column and Column is the name of the column containing the values we want to count as distinct. Having defined this calculated measure, it is computed in the context defined by the PivotTable query. For example, we can look at the number of distinct customers that have bought each product, comparing that number with the number of sales for each product (which is always a number greater than or equal to the number of distinct customers). I’d like to add a personal note. It would have been better if a shorter syntax would have been available, and much better if PivotTable would have offered the Distinct Count as a possible option in the Summarize By options that are available for each native measure (current available options are Sum, Count, Min, Max and Average). Moreover, even a more direct DAX function like a DISTINCTCOUNT( [column] ) would have been easier and intuitive for “that guy”!
|
-
I just read what I think is the best analogy for PowerPivot so far – these are the words of Amir Netz in a discussion: A good analogy is the relationship between mainsteam media (newspapers and TV) and internet blogs. In the old days (7 years back...), all of the information was provided exclusively by the mainstream media. The rise of the blogs is the rise of "Self Service Journalism". The blogs are all over the place, have huge audiances and serve a critical role in providing information to the masses. Still, the blogs are no replacement for mainstream media (MSM). They are no as credible, the don't have the resources to do in-depth expose, and they cannot provide the professional production value of the MSM. Both have a place. Both feed from each other. Both make people know more and be better informed. Replace blogs with PowerPivot and MSM with "Corporate BI" and you have a good picture of how things will play out. Now I feel myself like those journalists who also have a private blog, that apparently are undermining their core business but that instead, in the long term (they hope) are going to support it. Today, PowerPivot lacks of many features just to be able to compete with existing technologies for corporate BI. However, even in a version X where these gaps will be filled, the missing piece will always be the lack of data validation. At least, this is the theory. In fact, there is always someone who gives more credibility to an anonymous blogger than to mainstream media… but, you know, censorship would be way worse… :-)
|
-
PowerPivot doesn’t have the capability of really understand a many-to-many (M2M) relationship between two tables. In a relational world, a many-to-many relationship is materialized using a bridge table that split this relationship in two separate one-to-many relationships between the two original tables and the bridge table. Apparently, we can do the same in PowerPivot, but the behavior is not the expected one. Some workaround is possible using DAX, but there are some undesirable side effects if we only use calculated column. As we will see, it is necessary to use calculated measures to get the best results.
Consider two tables, Customers and Accounts.
Customers Table
Accounts Table
The many-to-many relationship between Customers and Accounts is expressed by the following Cross table.
Cross Table
In PowerPivot we define the same three tables (Customers, Accounts and Cross) and then we define these relationships:
As we will see, the best result is obtained using calculated measures. However, it is interesting to look at the results that can be obtained defining calculated columns, just to understand how PowerPivot works under the hood. You can skip the following section going directly to “Defining Calculated Measures” if you just want to look at the best solution. But if you want to understand how it works, it is better to start from the beginning, looking at the calculated columns before.
Defining Calculated Columns
At this point, we can try the PivotTable using the Amount measure next each Customer, but the result will sum up all the accounts without filtering only the accounts belonging to the Customer in the same resulting row.
The value of Grand Total is right, because it represents the sum or all the accounts – but this is also the same value for each customer and this is not right.
We can solve this by adding a DAX expression into a new column in the Customers table. We will call this column Amount of Accounts.
The DAX expression will sum (SUMX) all the value of the column Amounts in the Accounts table (‘Accounts’[Amount]) that are related (RELATED(‘Accounts’[Amount])) to the rows in the Cross table considered for each customer (RELATEDTABLE(Cross)). The final result is the desired value for each single customer.
As you can see, Marco has a total of 30, which is the sum of the accounts A (10) and B (20). However, the Grand Total is now 95, which is right if you sum the amount of each customer, but it is wrong if you consider that the same account (B) is considered for both customers Marco and Hal. In other words, if these were bank accounts, looking at this number we would be saying that the bank has more deposits than real.
Moreover, if you want to analyze which are the accounts accessible by each customer, dicing Customers and Accounts doesn’t filter out the accounts that don’t belong to the analyzed Customer. In the following screenshot you can see that all the accounts are present with the total amount of the accounts belonging to analyzed Customer.
We can obtain a better result with a model that is less intuitive from the point of view of the final user. We can create a RelatedAmount calculated column in the Cross table using the DAX expression RELATED(Accounts[Amount]).
In this way, we will use the implicit selection on Cross table made by the relationships we created before in the PowerPivot model, and the final result will also consider the relationship existing between the Cross and the Accounts tables to filter out Accounts which are not related to the selected Customer. Moreover, the value for each account will also be right.
However, the Grand Total is still higher than the real amount of all accounts. This work-around still is not ideal, even because the more intuitive place for the end user for having the amount for all the customers is the Customers table, and not the Cross table which should be invisible to the end user (it is a service table and shouldn’t have meaningful information for the data analyst).
Defining Calculated Measures
The best solution is defining a calculated measure in the Customers table (right click the Customers node in the PowerPivot Task Pane):
We define a formula that sums the amount of all the accounts that have at least a line in the Cross table which is related to the context of the calculation in the PivotTable (we will analyze the formula in details later).
The final result in our PivotTable is exactly what we wanted: right results for accounts and customers, exclusion of the accounts which are not owned by a customer, right total for all customers. Moreover, the measure we defined (Available Amount) can be defined wherever we want, even in the Customers table (like we did).
Now, the problem is only that the Amount value in the Accounts table might be misbelieving (look at the first example in the previous “Defining Calculated Columns” section). However, we can hide a column (even if we cannot hide a calculated measure, like the “Available Amount” we just created using the PivotTable).
At this point, we can also move the calculated measure under the Accounts table, which seems the more intuitive place to look for, especially if there aren’t other measures with similar names.
In this way, we have the cleanest presentation for the PivotTable user.
Calculated Measures Under The Hood
To understand how we created the Available Amount calculated measure, we have to split it into smaller terms (I thanks Marius Dumitru for his help in clarifying me these steps). In the following steps, we will keep the original “Accounts[Amount]” column hidden from the PivotTable.
First of all, we define these two calculated measures:
- Accounts[Amt] = Sum( [Amount] )
- Cross[x] = CountRows(Cross)
Apparently, these two formulas calculate the sum of all the rows in the corresponding tables (Accounts and Cross). But, remember, we are defining calculated measures and these formulas are executed in a particular execution context for each cell of the resulting PivotTable. Look at the following picture.
The x calculation has been filtered by the selection of Customers and Accounts: this is regular, because the Cross table is like a fact table, which is filtered by the selections made on all of the dimensions. However, the Amt calculation is filtered only by Account selection and is not affected by the Customer in the context. In fact, there are not direct relationships between Customers and Accounts tables. To get the right Amount value, we need to filter the Amt sum also by Account, and to do this, we can filter only the rows in the Account table that have at least one corresponding row in the Cross table for the current Account and Customer in the execution context. For this reason we use the FILTER function to get the accounts for the current selection of customers. These accounts will have at least one row in the Cross table, which is related to the Customer table, and this predicate will filter all the accounts we have to sum to get the desired result. The following expression implements this logic, returning the set of accounts to be summed for the current context.
( Filter(Values(Accounts[Account]), [x] > 0) )
At this point, it is sufficient to reiterate this process for each cell that have to be calculated. The same formula can be written with a shorter version that eliminates the CALCULATE syntax.
Calculate( [Amt], ( Filter(Values(Accounts[Account]), [x] > 0) ) )
[Amt] ( Filter(Values(Accounts[Account]), [x] > 0) )
Please note that the calculated measures [Amt] and [x] doesn’t need to be preceded by the name of the table – other than calculated columns, the name of the owning tables has only two purposes for calculated measures:
- Providing a “default table” that will be used when you write simple aggregation functions without specifying a table: writing Sum([C]) in a measure will get expanded to Sum(DT[C]) (or SumX(DT, [C])), where DT is the table that the measure is defined on.
- Defining where (under which table) the UI should the measures in the FieldList (like DisplayFolder does in MDX).
Thus, starting from this formula:
Calculate( [Amt], ( Filter(Values(Accounts[Account]), [x] > 0) ) )
We can obtain our [Available Amount] calculated measure by simply replacing [Amt] and [x] terms (we use the Accounts[Amount] explicit term so that the formula can be defined under any table of the model):
Calculate( Sum(Accounts[Amount]), ( Filter(Values(Accounts[Account]), CountRows(RelatedTable(Cross)) > 0) ) )
The following ones are alternative ways of writing the same calculation (please note we are using an additive measure like Amount – in case of non-additive measures, the SumX alternatives don’t work well).
Calculate(Sum([Amount]), Filter(Values(Accounts[Account]), Calculate(CountRows(Cross)) > 0))
SumX(Filter(Values(Accounts[Account]), CountRows(RelatedTable(Cross)) > 0), SumX(RelatedTable(Accounts), [Amount]))
SumX(Filter(Values(Accounts[Account]), CountRows(RelatedTable(Cross)) > 0), Calculate(Sum([Amount])))
SumX(Filter(Values(Accounts[Account]), CountRows(RelatedTable(Cross)) > 0), [Amt])
Final considerations
When you look at PowerPivot, you might think that calculated columns and calculated measures are pretty similar, just because they share the same expression language (DAX). However, they are really different and provide complementary tools to implement more advanced calculations. I don’t like the fact that calculated measures belong to the PowerPivot model but can be defined only using a PivotTable. I would prefer having also a dedicated tool into the PowerPivot window to access calculated measures definitions.
Calculated measures are flexible enough to make us implement a many-to-many relationship calculation. This is a very important feature to handle more complex models presenting data in an intuitive way. I’m just at the beginning of the exploration of these capabilities and I still hadn’t time to analyze performance and scalability of such a calculations. In the future, I’d like to compare features and performance with equivalent “traditional” SSAS models.
|
-
In the last few days Alberto wrote two very interesting posts:
- Remote Execution of SSIS Packages - this is something that many many people asked me since 2005. Now there is the complete code with the solution to this issue. Considering that SSIS 2008 R2 doesn't have such a feature and we might only hope it will appear in SSIS vNext (2011? 2012?), this is something that we can use today until the next two versions of SQL Server Integration Services!
- SqlBulkCopy Performance Analysis - this is a post that introduces an interesting paper about the SqlBulkCopy class (and, in general, bulk operations in SQL Server). How many surprises! When I read the draft and discussed with Alberto during the tests, I discovered something that was completely new to me, and I was supposed to know this things very well (having worked *a lot* with them). Don't miss this paper!
|
-
During a conversation I expressed the idea of making a DSLs to model cubes by leveraging the “Oslo” platform – my needs was having a human readable source file, easy to compare if you have two versions of the same entity (cube, dimension, or something else). Chris pointed me to InstantCube, which seems exactly what I thought! The implementation is a little bit different, because it also creates the relational model behind the cube (and it could be too much for me in several cases), but having a DSL for defining SSAS entities is definitely a good idea!
|
|
|
|
|
|