PowerPivot for Excel is an Add-In which uses a local version of Analysis Services (SSAS) to process data and make calculation to respond to user queries made using Excel 2010. The SSAS engine is loaded in-memory in the process of Excel. Especially with a 32 bit of Excel, there are particular considerations about the use of the memory.
When an Excel file containing PowerPivot data is open, the memory consumption is not affected by the PowerPivot volume of data. These data are loaded only when the PowerPivot window is opened or when the PivotTables and/or PivotCharts based on PowerPivot data are updated.
Loading PowerPivot data into Excel requires two steps:
- The SSAS backup contained into the Excel file is restored into a temporary table (a folder named with an IMBI_ prefix is created in the C:\Users\<<USERNAME>>\AppData\Local\Temp directory). This requires disk space sufficient to restore these files, which are usually 2 to 3 times the size of the Excel file.
- The SSAS database, once decompressed, is also loaded in virtual memory. Thus, as a rule of thumb, loading and browsing data of an Excel file containing PowerPivot data requires as much memory as 2 to 3 times the size of the Excel file.
The memory available for PowerPivot in a 32 bit version of Excel is just above 1Gb (it is lower than the virtual memory addressable space). For this reason, is it not so hard to get this error during the process of a large set of data:
Memory error: Allocation failure : Not enough storage is available to process this command. .
The operation has been cancelled.
The best way to avoid this error is using a 64 bit of Excel. However, there are many reasons for not having a 64 bit of Excel even if the operating system is a 64 bit one. The main reason, in my personal case, is the use of Outlook AddIns that are not available for the 64 bit version and it is not possible to mix 32 and 64 bit of Office products on the same machine. Therefore, it is a good idea trying to optimize the PowerPivot data model in a way that makes a better use the available memory.
Reduce the number of rows
This is probably a useless suggestion. Reducing the number of rows, the size of the database is reduced too. If you are extracting data from a database and you don’t need the full detail during the analysis, you might import data at the cardinality really required by your analysis. For example: if a table contains all the sales transactions of a retail store, but you only need to analyze data at the month level, you might group data by month, reducing the number of rows to be loaded. Most of the times, this optimization is not possible, because the user doesn’t want to lose the granularity of analysis.
Reduce the number of columns
This is by far the most important suggestion. If you don’t need a column, don’t import it!
PowerPivot uses a column-oriented database technology. Each column has its own storage and indexed structure. Each column increases processing time and memory required for both processing and execution.
An important concept is that every column imported is relevant. Even if PowerPivot internally uses an Analysis Services engine, it is a particular type (in-memory) that doesn’t make distinctions between quantitative measures and qualitative attributes. Thus, a SalesQuantity column is considered as both a dimension attribute and a measure.
Optimize column data types
Not every column has the same weight for memory and processing time. A column with few distinct values will be lighter than a column with a high number of distinct values. As we said, this is important also for measures, which are considered also possible quantitative attributes. If the measure you are storing is a float and is the result of a calculation, you might consider reducing the number of digits to be imported. This will reduce the size of the dictionary, and possibly also the number of distinct values.
Columns with string data types
PowerPivot stores a dictionary of all the unique strings in each attribute. Thus, having a long string is not a problem by itself, nor having a few very long strings if the number of distinct values is low. The point is that the average length of a string in an attribute multiplied by the number of unique values of the attribute is the measure of the size required to store this dictionary. For this reason, importing a long description that is different for every transaction is not a good idea.
Avoid high-cardinality columns
Columns that contain a very high number of distinct values are very expensive for PowerPivot. For example, the Invoice ID in a Sales Transactions table is very useful for the end user when it is necessary to drill-down at a transaction level. However, you have to be aware that this single column might be the most expensive one of your PowerPivot dataset, especially during the process phase.
Consider Calculated Columns
Each calculated column is stored into the PowerPivot data model just as it was an imported column. This is not true for calculated measures, which are calculated at query time. If the same calculation can be defined as either a calculated column or a calculated measure, the latter is a better solution from a resource consumption point of view. However, a calculated measure loses the ability to navigate into its values as an attribute like you can do using a calculated column.
If you have to store a measure as a calculated column, consider reducing the number of digits of the calculation: you can use the DAX function named ROUND just for this purpose.
Normalizing data doesn’t have a big effect on the size of the resulting database. However, it might have a strong impact on both processing time and memory required to process data.
The key is to find a right balance. A full denormalized table, which is the case of a single SQL query that denormalizes all of the attributes resulting in a PowerPivot dataset made of a single table, has a long processing time and requires more memory to be processed. In fact, during the process PowerPivot holds 1-2 millions of rows in a buffer and, if the rows have are large, this restricts the memory available to store data.
At the same time, a complete normalization of data, like the one of a third normal form, could be not a good idea. The number of relationship increase very much and it makes necessary having many columns in the model just for technical reason (defining relationships), but they are of no use for the end user. The resulting model is much more complex and is not faster than a balance between these two extremes.
The right balance is the one offered by a typical star schema. Putting all the attributes of a single entity into one table, just like a dimension in a Kimball star schema, appears as the best tradeoff also for PowerPivot. The resulting model is also easier to navigate for the end users.
This level of normalization usually offers the best processing times, some saving in storage size and a better use of memory during processing of data (the phases of import and refresh data).