In this post, I continue to explore the implications of SQL Server 2008 data compression. It is particularly worth highlighting the fact that SQL Server 2008 data compression is performed at the page level instead of the table level. In other words, when SQL Server 2008 goes about compressing data, it does it one page at a time, and its compression algorithm considers only the data on that page.
The page level approach to data compression has important implications, one of which is that what kind of compression ratio you get on a table depends on how the data in the table is distributed. That is, it is possible that with the exactly same set of data in the table, you can get radically different compression ratios if data is shuffled around in the table.
To drive this point home, let me show you an extreme example with the following table:
create table customer (
cid int identity,
filler char(100) NOT NULL,
)
With this table, I tested two scenarios:
Scenario 1: Clustered index was created on column cid:
create clustered index cus_cx on customer(cid);
Scenario 2: Clustered index was created on column filler
create clustered index cus_cx on customer(filler);
Again, with the exactly same set of data in the table, in each scenario I compressed the data, and compared the table size before the compression and after the compression. The data compression was done with ALTER TABLE statement as follows:
ALTER TABLE customer REBUILD
WITH (DATA_COMPRESSION = PAGE, MAXDOP=8);
The following chart shows the table size before and after compression for each test scenario:
Clearly, when the table was clustered on the column cid, the data compression ratio was very low (slightly greater than 1), whereas when the table was clustered on the column filler, the data compression ratio was very good at ~10 (the compressed table was about 10 times smaller).
Why?
First we need to know exactly what data was in the table, i.e. we need to know the characteristics of the data. Well, the data was populated with the following script:
declare @i int
set @i = 1
begin tran
while @i <= 8000000
begin
insert customer(filler)
select left(cast(@i % 80 as varchar(8)) + '00000000', 7) + replicate('a', 92)
if (@i % 10000) = 0
begin
commit tran
begin tran
end
select @i = @i + 1
end
commit tran
With this script, the first 80 rows would have the cid values going from 1 through 80, and the filler column would have 80 distinct values. With the cid values going from 81 through 160, the filler column would have another 80 distinct values, but as a set they were identical to the first 80 values. This pattern repeated every 80 rows because of the expression @i % 80 in the script.
Note that the table row was sized such that each page would contain ~80 rows. This means that if there was a clustered index on column cid, every page would have 80 distinct values in the filler column. When SQL Server tried to compress such a page, there wasn’t much to compress. Now, when the clustered index was created on the filler column instead, every page was filled with duplicate values in the filler column, making the page extremely conducive to data compression.
Again, this post is to help highlight the behavior of SQL Server 2008 data compression. The behavior presented here is neither good nor bad. It’s just a natural consequence of the page-level algorithm. Oracle’s data compression also adopts a page/block level algorithm, and therefore would see the same behavior.
Note that if all the values are distinct, no matter how you shuffle the data you won't improve its compression ratio.