
Linchi Shea

Checking out SQL Server via empirical data points

SQL Server 2008 Page Compression: Compression ratios with real-world databases

In my previous post on data compression, I looked at how rebuilding a table with page compression scales across multiple processors via the MAXDOP option. In this post, I'll focus on the compression ratios I have seen in real-world databases. If you understand how SQL Server 2008 data compression works, you know that the compression ratio you get depends on the extent to which SQL Server can find duplicate values on the data/index pages. Consequently, one real-world database can have a very high compression ratio while another may have a very low one: there is no typical compression ratio to expect.

That said, it is still interesting to report what compression ratios I have seen with real customer databases instead of 'cooked' data. Before I give the results, let me define what I mean by compression ratio:

        Compression Ratio = Uncompressed Size / Compressed Size
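
For example (hypothetical numbers, purely to illustrate the arithmetic): a table that occupies 100GB uncompressed and 40GB after a page-compression rebuild has a compression ratio of 100/40 = 2.5.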

To see what kind of compression ratios I might run into with real-world databases, I randomly selected 14 customer databases. In this blog post, these databases have been renamed to DB1, DB2, ..., DB13, and DB14 to shield their identities. For each database, the following steps were performed:

  1. Record the data size reported by sp_spaceused for the database. This is the uncompressed data size.
  2. For each user table in the database, rebuild it with page compression. This was done by generating and executing a script that contains the following ALTER TABLE statement for each table (a sketch of such a generator, which also covers Step 4, appears after this list):
              alter table <the_table> rebuild with (data_compression=page, maxdop=8);
  3. Record the data size reported by sp_spaceused for the database. This is the compressed data size.
  4. For each user table in the database, rebuild the table with no compression, which effectively decompresses the table. This was done by generating and executing a script that contains the following ALTER TABLE statement for each table:
              alter table <the_table> rebuild with (data_compression=none, maxdop=8);
  5. Record the data size reported by sp_spaceused for the database. This is the decompressed data size.
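
For reference, below is a minimal sketch of the kind of script generator used in Steps 2 and 4. It is not the exact script I ran; it simply walks sys.tables and emits one ALTER TABLE ... REBUILD statement per user table (flip @compression to 'none' for Step 4):

        declare @compression varchar(10) = 'page';  -- 'page' for Step 2, 'none' for Step 4
        select 'alter table ' + quotename(s.name) + '.' + quotename(t.name)
             + ' rebuild with (data_compression=' + @compression + ', maxdop=8);'
        from sys.tables t
        join sys.schemas s on s.schema_id = t.schema_id
        where t.is_ms_shipped = 0   -- user tables only
        order by s.name, t.name;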

The data size obtained in Step 3 was used as the database compressed size, and the data size obtained in Step 5 was used as the database uncompressed size. The data size from Step 1 was not used because it may be inflated by fragmentation, which would have produced an artificially high compression ratio.
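
For completeness: the figures in Steps 1, 3, and 5 come from running sp_spaceused with no arguments and noting the data column of its second result set. Here is a sketch of the bookkeeping, with placeholder values rather than measured ones:

        exec sp_spaceused;  -- note the 'data' column in the second result set

        -- hypothetical figures, only to show the ratio arithmetic
        declare @decompressed_kb float = 100000;  -- Step 5 figure (placeholder)
        declare @compressed_kb float = 40000;     -- Step 3 figure (placeholder)
        select @decompressed_kb / @compressed_kb as compression_ratio;  -- 2.5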

The following two charts summarize the results:

The top chart includes the three database space measures mentioned above. As you can see, fragmentation was not a significant factor for these databases, since the uncompressed size and the decompressed size were quite close for each database. As it turned out, these databases were regularly maintained, including clustered index rebuilds, so it's no surprise that there wasn't any serious fragmentation.

To highlight the results, the bottom chart plots the compression ratios for these databases. The ratios, obtained using SQL Server 2008 CTP6 page compression, varied between 1.7 and 5.6.

The space savings were substantial with these databases. When decompressed, the total space usage was ~261GB; when compressed, it was ~88GB. In other words, SQL Server 2008 page compression would save us ~173GB in storage space, an overall compression ratio of roughly 3.
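
As an aside, if you want a rough idea of the potential savings without rebuilding anything, SQL Server 2008 also ships a stored procedure, sp_estimate_data_compression_savings, that samples a table into tempdb and compresses the sample. A minimal sketch, against a hypothetical table:

        -- 'dbo.Orders' is a made-up example table, not one from DB1..DB14
        exec sp_estimate_data_compression_savings
             @schema_name = 'dbo',
             @object_name = 'Orders',
             @index_id = null,          -- null = all indexes on the table
             @partition_number = null,  -- null = all partitions
             @data_compression = 'page';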

Published Sunday, May 11, 2008 8:20 PM by Linchi Shea

Attachment(s): real_comp.gif


Comments

 

AaronBertrand said:

Thanks Linchi!

I would love to see the rebuild times associated with these (especially for the larger databases).  Any chance you recorded the times, or did you just let the scripts run overnight and not record all the metrics?

I plan to do some very similar comparisons against various customers using our schema and am curious how predictable the timing will be.

May 11, 2008 9:26 PM
 

Gaurav Bindlish said:

Linchi, good data. Are you considering running some benchmarks on these databases to see how much CPU overhead there is to perform transactions against these compressed databases?

May 12, 2008 12:29 AM
 

Nick said:

Looks good. Thanks. Including backup time and size would be good too... but these are all the things we need to do ourselves :-)

I assume that when you re-index a compressed database, the amount of transaction log required is the same (comparatively, reduced relative to the compressed size) since it's logging whole pages (which are compressed)... ahhhh there's just so much to look into.

May 12, 2008 8:29 PM
 

Linchi Shea said:

Aaron;

I ran these same tests many times to make sure I got consistent data. But I didn't think the elapsed times for these tests were particularly useful to report since they depend so much on the hardware configuration and other factors. That said, I will report some elapsed time metrics for some other, more controlled scenarios, especially for comparison purposes.

May 12, 2008 9:27 PM
 

Linchi Shea said:

Gaurav;

Yes, I will report metrics on performance impact of data compression, but not necessarily on the CPU overhead.

May 12, 2008 9:29 PM
 

Linchi Shea said:

Before I begin, let me be very clear that the results I report in this post were obtained with SQL Server

May 12, 2008 11:00 PM
 

Mike said:

Were your tests done on a SAN or on direct-attached drives? Would there be any benefit of compression on a SAN (i.e., less I/O?), or would the bottleneck still be moving data from the SAN to the server?

May 15, 2008 5:16 PM
 

Linchi Shea said:

For what is discussed in this post, it doesn't matter whether the storage is SAN or directly attached, as the key focus here is the savings in space, not improvement in performance.

May 17, 2008 12:52 PM

