Earlier today, MSSQLTips posted a backup compression tip by Thomas LaRock (blog | twitter). In that article, Tom states: "If you are already compressing data then you will not see much benefit from backup compression." I don't want to argue with a rock star, and I will concede that he may be right in some scenarios. Nonetheless, I tweeted that "it depends;" Thomas then asked for "an example where you have data comp and you also see a large benefit from backup comp?" My initial reaction came about for two reasons:
- I know that the benefit you get from data compression depends heavily on data skew and how many repeated patterns you can find on a single page, whereas backup compression is not limited to finding compressible data only within the same page; and,
- I see no reason why anyone should NOT use backup compression, if they have the capability available to them (in SQL Server 2008, Enterprise Edition only; in SQL Server 2008 R2, Standard Edition and higher), unless they are already extremely CPU-bound (typically these days we are predominately I/O-bound).
I've performed several tests of data compression and the additional CPU required was more than worth the savings in I/O (and ultimately duration), so I would expect the same type of benefit here. But I need to put my money where my mouth is... so here is an example that shows two different scenarios where data compression is used, and a variance in the benefit that data compression and then using backup compression gains you. I created 6 databases, all prefixed with ct (which stands for "compression test"):
CREATE DATABASE ct_none_ordered; CREATE DATABASE ct_row_ordered; CREATE DATABASE ct_page_ordered;
CREATE DATABASE ct_none_not_ordered; CREATE DATABASE ct_row_not_ordered; CREATE DATABASE ct_page_not_ordered; |
(I used a prefix mainly because it made it very easy for me to generate backup commands just for these databases.)
In each database I created a table with an IDENTITY column and a single VARCHAR(36) column. In each case, I used a different compression setting to match the database. In the databases with the _ordered suffix, I created a clustered index on the VARCHAR column:
CREATE TABLE dbo.a ( i INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED, a VARCHAR(36) ) WITH (DATA_COMPRESSION = NONE|ROW|PAGE); GO
CREATE CLUSTERED INDEX a ON dbo.a(a) WITH (DATA_COMPRESSION = NONE|ROW|PAGE); GO |
And in the the databases with the _not_ordered suffix, I made the primary key on the IDENTITY column clustered (and for as much symmetry as possible, created a non-clustered index on the GUID column):
CREATE TABLE dbo.a ( i INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, a VARCHAR(36) ) WITH (DATA_COMPRESSION = NONE|ROW|PAGE); GO
CREATE INDEX a ON dbo.a(a) WITH (DATA_COMPRESSION = NONE|ROW|PAGE); GO |
Then I created two tables in tempdb to store a seed list of VARCHAR(36) values - one with relatively ordered values, and one with randomly ordered values. The quickest way I could think of to do this would be to use NEWSEQUENTIALID() and NEWID(). Now, this isn't exactly scientific because NEWSEQUENTIALID() doesn't increment by the leading character, but there are enough repeating sequences on a page that I predicted it would accomplish the results I was after (and if it didn't, I could come up with something else). I populated each table with 100,000 rows. My plan was to then load 10,000,000 rows into each of the above 6 tables, compare the space used on disk, and then compare the backup sizes and duration.
USE tempdb; GO SET NOCOUNT ON; GO
CREATE TABLE dbo.[seq] ( i UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() ); GO
INSERT dbo.[seq] DEFAULT VALUES; GO 100000
CREATE TABLE dbo.[rand] ( i UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() ); GO
INSERT dbo.[rand] DEFAULT VALUES; GO 100000 |
Now that I had my seed values, I could start populating my real tables. I had a loop like this for each _ordered table:
INSERT dbo.a(a) SELECT a = RTRIM(i) FROM tempdb.dbo.[seq] ORDER BY a; GO 100 |
And then a loop like this for each _not_ordered table:
INSERT dbo.a(a) SELECT a = RTRIM(i) FROM tempdb.dbo.[rand] ORDER BY NEWID(); GO 100 |
Was this experiment going to be fast? Heck no! It ended up taking about an hour and a half to populate the data (60,000,000 rows), and 11 minutes, 44 seconds to perform 12 backups. Here are the results for backup sizes:
And here are the results for backup times:
As I expected, the ordered data did not compress well using only row compression, but it did compress very well when using page compression (over 60%). And even at the best data compression rate, we see backup compression rates over 70%, and 40% less time is required to perform those backups. In fact I was unable to come up with a counter-example - where backup compression did NOT improve things, or even where they remained the same. I saw benefits to using backup compression in every single case I tested, and the savings are substantial in even the worst case (45% space savings).
I will concede, though, that all of these tests were performed on a relatively idle server with no CPU or memory pressure. So there are certainly cases I can envision where the advantages gained by the improved I/O when using backup compression can be outweighed by the additional cost to CPU. But as I mentioned before, most of the systems I come across are I/O bound, and can afford to trade some CPU for better I/O.