
Linchi Shea

Checking out SQL Server via empirical data points

SQL Server 2008 Data Compression: Impact of Data Distribution

In this post, I continue to explore the implications of SQL Server 2008 data compression. It is particularly worth highlighting that SQL Server 2008 data compression is performed at the page level rather than the table level. In other words, when SQL Server 2008 compresses data, it does so one page at a time, and its compression algorithm considers only the data on that page.

The page-level approach to data compression has important implications, one of which is that the compression ratio you get on a table depends on how the data in the table is distributed. That is, with exactly the same set of data in the table, you can get radically different compression ratios if the data is shuffled around in the table.

To drive this point home, let me show you an extreme example with the following table:

create table customer (
     cid    int   identity,
     filler char(100)   NOT NULL
)

With this table, I tested two scenarios:

Scenario 1: Clustered index was created on column cid:

create clustered index cus_cx on customer(cid);

Scenario 2: Clustered index was created on column filler:

create clustered index cus_cx on customer(filler);

Again, with exactly the same set of data in the table, in each scenario I compressed the data and compared the table size before and after the compression. The data compression was done with the following ALTER TABLE statement:

ALTER TABLE customer REBUILD
     WITH (DATA_COMPRESSION = PAGE, MAXDOP = 8);
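As an aside, if you want to gauge the likely savings before paying for a full rebuild, SQL Server 2008 also ships sp_estimate_data_compression_savings, which compresses a sample of the table's pages and reports estimated sizes with and without the requested compression setting. A minimal sketch against the customer table used here:

EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'customer',
     @index_id         = NULL,     -- all indexes
     @partition_number = NULL,     -- all partitions
     @data_compression = 'PAGE';

Because the procedure works from a sample of pages, its estimate is subject to the same distribution sensitivity discussed in this post.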

The following chart shows the table size before and after compression for each test scenario:

[Chart (image002.gif): table size before vs. after page compression for Scenario 1 (clustered on cid) and Scenario 2 (clustered on filler)]

Clearly, when the table was clustered on the column cid, the data compression ratio was very low (slightly greater than 1), whereas when the table was clustered on the column filler, the data compression ratio was very good at ~10 (the compressed table was about 10 times smaller).
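For anyone who wants to reproduce the numbers behind the chart, the before-and-after sizes can be captured with sp_spaceused (or straight from sys.dm_db_partition_stats) around the rebuild. A minimal sketch; this was not part of the original test scripts:

EXEC sp_spaceused 'customer';    -- size before compression

ALTER TABLE customer REBUILD
     WITH (DATA_COMPRESSION = PAGE, MAXDOP = 8);

EXEC sp_spaceused 'customer';    -- size after compression

-- Or get the page counts directly from the DMV:
SELECT index_id, in_row_data_page_count, used_page_count
FROM   sys.dm_db_partition_stats
WHERE  object_id = OBJECT_ID('dbo.customer');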

Why?

First, we need to know exactly what data was in the table, i.e. the characteristics of the data. The data was populated with the following script:

declare @i int
set @i = 1

begin tran
while @i <= 8000000
begin
      -- filler = a 7-character key derived from (@i % 80), padded with 92 'a' characters;
      -- the same 80 distinct values repeat every 80 rows
      insert customer(filler)
      select left(cast(@i % 80 as varchar(8)) + '00000000', 7) + replicate('a', 92)

      -- commit in batches of 10,000 rows to keep the transaction log in check
      if (@i % 10000) = 0
      begin
           commit tran
           begin tran
      end

      select @i = @i + 1
end
commit tran

With this script, the first 80 rows have cid values 1 through 80, and the filler column takes on 80 distinct values. For the rows with cid values 81 through 160, the filler column cycles through the same 80 values again, and so on; the pattern repeats every 80 rows because of the expression @i % 80 in the script.
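As a quick sanity check (not shown in the original post), the resulting distribution can be confirmed with a couple of simple aggregates: 80 distinct filler values, each appearing 100,000 times across the 8,000,000 rows.

SELECT COUNT(*)               AS total_rows,       -- 8,000,000
       COUNT(DISTINCT filler) AS distinct_fillers  -- 80
FROM   customer;

SELECT   filler, COUNT(*) AS rows_per_value        -- 100,000 per value
FROM     customer
GROUP BY filler;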

Note that the table row was sized such that each page would contain ~80 rows. This means that if there was a clustered index on column cid, every page would have 80 distinct values in the filler column. When SQL Server tried to compress such a page, there wasn’t much to compress. Now, when the clustered index was created on the filler column instead, every page was filled with duplicate values in the filler column, making the page extremely conducive to data compression.
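If you want to verify the rows-per-page figure on your own copy of the table, sys.dm_db_index_physical_stats reports leaf-level page and record counts. A sketch; the exact numbers depend on the fill factor and on which scenario's clustered index is in place:

SELECT page_count,
       record_count,
       record_count / page_count AS rows_per_page,  -- on the order of 80 before compression
       avg_record_size_in_bytes
FROM   sys.dm_db_index_physical_stats
           (DB_ID(), OBJECT_ID('dbo.customer'), NULL, NULL, 'DETAILED')
WHERE  index_level = 0;                             -- leaf level of the clustered index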

Again, this post is meant to highlight the behavior of SQL Server 2008 data compression. The behavior presented here is neither good nor bad; it's just a natural consequence of the page-level algorithm. Oracle's data compression also uses a page/block-level algorithm and would therefore exhibit the same behavior.

Note that if all the values are distinct, no matter how you shuffle the data you won't improve its compression ratio.

Published Friday, June 20, 2008 1:59 AM by Linchi Shea

Attachment(s): image002.gif


Comments

 

Chris Culler said:

The example was crafted to show the effect of page-level compression, but I would have preferred an example that belonged in the world of real RDBs.  A lengthy column of only 80 distinct values among 8 million rows screams "normalize me!"  Doing so would achieve far better than 10:1 compression, and with 1970's "technology."

June 23, 2008 11:01 AM
 

Linchi Shea said:

Chris;

For this example, a table-level compression approach (such as the one used by DB2 v9.5) would indeed achieve a better compression ratio. But there are pros and cons to table-level versus page/block-level compression. For instance, table-level compression is sensitive to the initial sample: if the initial sample is not representative, the compression ratio can be very poor. I understand that this can be corrected with a reorg later on, but a reorg is a major operation.

Also note that the example applies to Oracle as well, as it uses a block-level compression algorithm.

June 28, 2008 1:15 PM

