I recently talked about memory usage in PowerPivot, and previously I wrote about different modeling options with and without table normalization. I therefore wondered whether table normalization is a good practice for performance and memory. The short answer is yes, it might be better, but not by as much as you might suppose (and there are different considerations for processing versus query performance). Rule of thumb: a star schema is always a good thing! Happy reading.
PowerPivot allows the definition of relationships between tables loaded into a PowerPivot model. However, whenever these tables come from the same relational database, the same data could be loaded into a single table by denormalizing the data in the SQL query itself.
From a modeling point of view, it is clear that the best way to import data into PowerPivot is to create a sort of star schema: denormalize the entities that represent dimensions of the data, and keep the normalization that follows the star schema pattern (numeric aggregatable measures in the fact table, qualitative attributes in dimension tables). But from a performance point of view, which is better for speed and memory consumption?
We considered three cases, isolating a single dimension in both normalized and denormalized form:
- Large dimension (240,000 items)
- Medium dimension (4,400 items)
- Small dimension (6 items)
We always used the same fact table with 6.5 million rows.
Large dimension

The temporary SSAS database (located in C:\Users\USERNAME\AppData\Local\Temp, with a name prefixed by IMBI_) is 119 MB when we use the denormalized dimension and 113 MB when we use the normalized one (after removing the surrogate key field pointing to the large dimension; otherwise the size would have been 118 MB). However, there is a bigger difference in the Excel files, because the denormalized file is smaller (58 MB) than the normalized one (66 MB). This could be caused by compression algorithms that are more efficient in that case.
In the normalized case, the large dimension has a size of 19.3 MB and the fact table is 99.4 MB. In the fact table, the attribute holding the surrogate keys for the large dimension takes 14.8 MB. The time for a full refresh of the data is also shorter in the normalized case: 90 seconds instead of the 152 seconds required for the denormalized case. The memory required for a full refresh is lower for the normalized case (481 MB) than for the denormalized case (530 MB).
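The size of the surrogate key column is close to what a simplified dictionary-encoding model predicts. This is only a back-of-envelope sketch of columnar storage (each row stores an index of ceil(log2(cardinality)) bits into a dictionary of distinct values), not the actual algorithm used by the engine:

```python
import math

def index_size_mb(num_rows, cardinality):
    """Rough size of the per-row index of a dictionary-encoded column.
    Simplified model only: the real engine also applies run-length
    and other compression on top of this."""
    bits_per_row = math.ceil(math.log2(cardinality))
    return num_rows * bits_per_row / 8 / 1_000_000  # decimal MB

# Surrogate key column in the fact table: 6.5 million rows,
# 240,000 distinct values (the large dimension).
print(f"~{index_size_mb(6_500_000, 240_000):.1f} MB")  # → ~14.6 MB
```

The estimate (18 bits per row over 6.5 million rows, about 14.6 MB) is in the same ballpark as the 14.8 MB we observed for that attribute.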
In the denormalized case, it is worth noting that the surrogate key can be removed (it is useless for analysis). The size of the fact table becomes 113 MB (the same as the whole cube), but in this case all the attributes that were previously in the denormalized dimension are now in the fact table itself.
It is not easy to explain why the normalized case has a smaller size. In fact, the cardinality of the attributes (except the surrogate key) is the same and, considering that we no longer have the surrogate key (which is stored twice in the normalized case, once in the fact table and once in the dimension table), there must be something else. The space that would have been saved by removing the surrogate key is in fact used by the other existing attributes, which use more space to be stored, resulting in a similar size for the whole SSAS database.
If we try to include the identity key of the fact table as an attribute, processing fails with the 32-bit version of Excel. Reducing the fact table to 3 million rows, it works, but the largest attribute is the identity key, and the size required by the other attributes seems lower. With a 4 million row fact table, the database has a size of 147 MB including the identity key attribute, but is just 58 MB if we remove this attribute.
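The identity key is the extreme case of a high-cardinality attribute: it has one distinct value per row, so both the dictionary and the per-row index grow with the row count. The same simplified dictionary-encoding sketch as before (an assumption, not the documented engine behavior, and the 8-byte dictionary entry size is an arbitrary illustrative choice) makes the asymmetry visible:

```python
import math

def column_cost_mb(num_rows, cardinality, avg_value_bytes=8):
    """Simplified cost of a dictionary-encoded column: a dictionary of
    distinct values plus a per-row index into it. A rough model only,
    not the actual storage engine algorithm."""
    dictionary = cardinality * avg_value_bytes
    index = num_rows * math.ceil(math.log2(cardinality)) / 8
    return (dictionary + index) / 1_000_000  # decimal MB

# A 4 million row identity key: cardinality equals the row count,
# so this single column costs tens of MB on its own...
print(f"identity key: ~{column_cost_mb(4_000_000, 4_000_000):.0f} MB")  # → ~43 MB

# ...while a low-cardinality attribute over the same rows is tiny.
print(f"small attribute: ~{column_cost_mb(4_000_000, 6):.1f} MB")  # → ~1.5 MB
```

Even under this crude model, a single near-unique column dwarfs every low-cardinality attribute in the table, which is consistent with the large size difference we measured when including the identity key.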
Medium dimension

The SSAS database is 80 MB when we use the denormalized dimension and 76 MB when we use the normalized one. In this case the Excel file is smaller for the normalized case (41 MB) than for the denormalized one (43 MB). Different performance of the compression algorithms seems the only explanation, but we might check this again. The time for a full refresh also favors the normalized case, requiring 97 seconds instead of the 115 seconds of the denormalized case. The memory required for a full refresh is lower for the normalized case (410 MB) than for the denormalized case (485 MB).
The difference in SSAS database size does not come from the attributes of the dimension that is denormalized into the fact table; it is caused by a larger size of the other attributes (which are also measures) when the fact table is processed in the denormalized case.
This is really strange. The application key of the dimension we denormalized into the fact table has exactly the same cardinality as the surrogate key we removed from the fact table. Looking at the size of the attributes in the SSAS cube files, it seems that SSAS looks for the attribute with the largest number of distinct values and uses it, but the algorithm must be more complex than that, because this is not what we observe looking at the SSAS database files in the temporary directory.
Small dimension

The SSAS database is 66 MB when we use the denormalized dimension and 65 MB when we use the normalized one. Both Excel files are around 34 MB in size. The time for a full refresh is shorter for the normalized case: 120 seconds instead of the 143 seconds required for the denormalized case. The memory required for a full refresh is similar: 420 MB of committed size for the normalized case against 436 MB for the denormalized one.
Even in this case, the difference seems to be caused by the different algorithms used by SSAS to process the data.
More complex scenario
Considering that there can be complex dependencies among attributes, we added another test. We used a fact table with 2 million rows and 7 dimensions, and we tried to denormalize all of these dimensions into the fact table. The final result is still not very different, even if in this case denormalization seems slightly more optimized for storage.
The SSAS database is 91 MB when we use the denormalized dimensions and 93 MB when we use the normalized ones. Similar differences appear in the Excel files: 55.8 MB for the denormalized one and 57.3 MB for the normalized one.
There are much bigger differences in terms of the time and memory required to process. The normalized version needs 54 seconds for a full refresh, and its committed memory size does not exceed 480 MB. The denormalized version requires 156 seconds for a full refresh (three times as long as the normalized version!) and consumes much more memory during processing, reaching 930 MB of committed memory size (almost double the memory required by the normalized version).
Conclusions

There are no significant differences in storage space between normalized and denormalized dimensions. However, the choice between the two may strongly affect the processing phase (the action performed whenever data is imported or refreshed). Dimensions containing the highest number of rows and columns are the best candidates for normalization. The complex scenario offers an example of a big difference in the memory required to process data, and this is especially important when using the 32-bit version of Excel.
To reduce the time and memory required, it is really important to have fewer attributes, and high-cardinality attributes are particularly expensive.
Measures in fact tables are also considered attributes, and they often have high cardinality. This is important to consider: each measure added to a model is expensive if the fact table has many rows. Calculated columns have the same memory consumption as imported columns. Only calculated measures do not increase the required storage space, because they are evaluated at query time and not at processing time.
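A hypothetical example (illustrative only, with simulated data rather than the dataset used in the tests above) shows why measures are so expensive: a continuous measure is nearly unique per row, so its dictionary of distinct values is almost as large as the table itself, while reducing its precision, when the business requirements allow it, collapses the cardinality:

```python
import random

# Simulate 100,000 sale amounts between 0 and 10,000
# (hypothetical data, for illustration only).
random.seed(42)
amounts = [random.uniform(0, 10_000) for _ in range(100_000)]

# Raw floats are essentially unique per row...
raw_cardinality = len(set(amounts))

# ...while rounding to whole units caps the number of distinct
# values the dictionary must store at about 10,000.
rounded_cardinality = len(set(round(a) for a in amounts))

print(f"raw values:       {raw_cardinality} distinct")
print(f"rounded to units: {rounded_cardinality} distinct")
```

The same reasoning explains why a calculated column is just as expensive as an imported one: it materializes a value per row at processing time, whereas a calculated measure leaves nothing to store.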