THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

PowerPivot: improve the performances by sorting tables

While it is well know that in UDM data should be loaded sorted, to get better file compression and map structures. The world of PowerPivot (and BISM) is yet not explored. Working with a 50 millions rows table, I tested various scenarios to check if sorting affects PowerPivot data structures too.

The fact table structure is very simple:

CREATE TABLE Fact_Transaction (
      ID_Account INT NOT NULL ,
      ID_Type INT NOT NULL ,
      ID_Date INT NOT NULL ,
      Amount MONEY NOT NULL
) 

There are 240,000 accounts, 1,000 dates and 4 types. Thus, the ID_Account is the most selective, followed by the date and the type. I tried loading data inside PowerPivot using different sorting and the final workbook size is in the next table:

Order File Size (K)
HEAP (unosrted) 498,905
Type, Date, Account 372,505
Date, Type, Account 372,209
Account, Date, Type 374,001

Thus, the compression algorithm in PowerPivot seems to prefer sorted data (which is somehow expected, even if not obvious). The big difference is between sorted and unsorted data while changing the sort order does not affect the final size in an evident way. This is definitely expected since the ratio between transactions (50 millions) and accounts (240.000) is very high, resulting in a good sorting even with the most selective column. Different distributions in real world scenarios might lead to different results and they are definitely worth trying.

What is interesting is that the query speed of the final workbook is much better with the smaller ones, when compared with the biggest, i.e. unsorted. This might indicate that it is not just a matter of workbook size but even of internal data structures that result in a better optimized format when data is fed sorted to PowerPivot. I don’t have clear number here, but the difference is evident at first glance by pivotting over the data model.

The final hint is straightforward: if you use a heap or a table with an INT IDENTITY column for your fact table, then it might be useful to check different sorting and (as always) verify the final result. It might be the case that you are wasting memory and CPU due to a poorly designed data structure or, in other words, that there is space for improvement.

Published Friday, April 15, 2011 2:51 PM by AlbertoFerrari

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Mike said:

I'm curious about your opening comment "While it is well know that in UDM data should be loaded sorted".  Can you elaborate on this please?

August 24, 2011 2:51 AM
 

AlbertoFerrari said:

@Mike,

Take a look at this old paper: http://technet.microsoft.com/en-us/library/cc917609.aspx. It is quite long, so search for "Process Buffer" and you'll get a description of how SSAS processes partitions. The key point is that it reads data in chunks, sorts them and then writes data to disk. If the data is already sorted, each chunk will be higher compressed and map files will be better organized.

The same whitepaper is available for 2008 in docx format here: http://www.microsoft.com/download/en/details.aspx?id=17303 and it is surely worth a read by anyone serious about SSAS performance tuning.

August 24, 2011 3:14 AM
 

Mike said:

Thanks Alberto.  Can you give any tips on the best sort method?  I couldn't quite follow it from those whitepapers.  Do you sort by the Dimension Keys?

Thanks

August 30, 2011 7:54 PM
 

AlbertoFerrari said:

@Mike,

I got the best results sorting by the least variant datum first. If the same value for a column appears many times in a chunk, it will be compressed better. On the other hand, if the value changes very often, then compression will be worse.

In my example, the type is quite always a constant, the date changes slowly and the account changes very fast. Using type, date, account gave e the best results.

Nevertheless, I wanted to make some definitive tests but still didn't have time to plan and write them.

August 31, 2011 3:17 AM
 

Emil Glownia said:

Many thanks for your blog post. It is very useful for me.

November 14, 2012 8:29 AM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement