THE SQL Server Blog Spot on the Web


Merrill Aldrich

Technical content about Microsoft data technologies. All opinions expressed are purely my own and do not reflect positions of my employer or associates.

Scandalous II: Shh! I am De-duplicating Compressed Backups

This is part II of two Scandalous posts. Watch, mouth agape, as I run with scissors, right up against prevailing wisdom! Unfollow me now, before it’s too late!

Here’s the thing. There are two really outstanding posts out there on the ‘tubez that explain in vivid detail the problems with sending compressed data into a de-duplicating appliance. And these guys are both absolutely right. Everything in their posts is correct, and I would ask that, if you haven’t, you please read them before mine:

First, Brent Ozar: 

(And, may I say, well done on the Numero Uno Google result for that post. Very nice!)

Next Denny Cherry: 

(A very respectable #3 on the Google-ometer.)

Now, I’m not kidding. These guys know their stuff, and they are right. Stop reading right now.





Still here? Ok, now come closer.



I studied this whole thing very carefully, and I do it anyway.

It’s true that de-duplication works poorly with compressed data: if you compare the de-dupe ratios for “usual” uncompressed files with the de-dupe ratios for compressed files, the compressed data looks very, very bad. But there’s more to this story, so much more that we decided to stuff the compressed files into our DDR anyway, in a limited way.

Here’s why:

Both SQL Server backups and file compression are deterministic processes. If you back up the same database twice, and it has the same data pages in it, and those pages are largely unchanged, then the backup files will be substantially the same. This is true if you compress both files with the same algorithm and settings, too – the data in the compressed files will be largely identical. It will not be like any OTHER files on your network, but the two files will be similar to one another.

If you change a small percentage of the data pages in the data file, that will still be true: a compressed backup of the database on, say, Monday will be mostly the same as a compressed backup of the same database, with modest changes, on Tuesday.
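Here is a toy sketch of that idea in Python. It is not how SQL Server or any particular appliance works internally: it assumes the backup is compressed in independent groups of pages (`PAGES_PER_BLOCK` is a made-up unit), uses `zlib` as a stand-in compressor, and treats "de-dupe" as simple hash matching on whole compressed blocks.

```python
import hashlib
import random
import zlib

PAGE_SIZE = 8192          # SQL Server data pages are 8 KB
PAGES_PER_BLOCK = 8       # hypothetical compression unit for this toy model


def make_database(num_pages, seed):
    """Build a fake database as a list of pseudo-random, compressible pages."""
    rng = random.Random(seed)
    pages = []
    for _ in range(num_pages):
        # Repeat a short random pattern so each page compresses well.
        pattern = bytes(rng.randrange(256) for _ in range(64))
        pages.append(pattern * (PAGE_SIZE // 64))
    return pages


def compressed_backup(pages):
    """Compress each group of pages independently and return the blocks."""
    blocks = []
    for i in range(0, len(pages), PAGES_PER_BLOCK):
        raw = b"".join(pages[i:i + PAGES_PER_BLOCK])
        blocks.append(zlib.compress(raw, 6))
    return blocks


def dedupe_ratio(old_blocks, new_blocks):
    """Fraction of new blocks whose hash already exists in the old backup."""
    seen = {hashlib.sha256(b).digest() for b in old_blocks}
    hits = sum(1 for b in new_blocks if hashlib.sha256(b).digest() in seen)
    return hits / len(new_blocks)


monday_pages = make_database(num_pages=800, seed=1)
tuesday_pages = list(monday_pages)

# Change 5% of the pages (40 of 800) to model a day of modest activity.
rng = random.Random(2)
for idx in rng.sample(range(len(tuesday_pages)), 40):
    tuesday_pages[idx] = bytes(PAGE_SIZE)

monday = compressed_backup(monday_pages)
tuesday = compressed_backup(tuesday_pages)
ratio = dedupe_ratio(monday, tuesday)
print(f"{ratio:.0%} of Tuesday's compressed blocks already exist in Monday's")
```

Because the same input always compresses to the same bytes, every block whose pages did not change comes out identical on both days. Only the blocks containing a changed page are net new.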

What that means is that if I have a 1 TB database, which I do, that produces a 250 GB compressed backup file, and that database receives mainly incremental changes from day to day or week to week, then each successive backup will be similar to the previous one. And if I copy them into a de-duplicating store (at least the one I have to work with) then, while the first file will be basically 100% net new data, the second will de-dupe against the first. It’s not as effective as with other types of files, but it does help. Let’s say, for argument, that I get 75% de-duplication across only the two files, instead of the normal 85%+ across many instances of other files. I am still getting 75% de-duplication, and that can be very useful.
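To put rough numbers on that, here is a small back-of-the-envelope calculation in Python. It uses the 250 GB file size and the for-argument 75% figure from above, and it assumes (purely for illustration) that every later backup de-dupes at that same ratio against what is already in the store.

```python
def stored_after(backups, file_gb, dedupe_ratio):
    """Physical GB consumed after `backups` successive backup files.

    The first file is all net new; each later file adds only the part
    that does not de-dupe against what is already stored.
    """
    if backups < 1:
        return 0.0
    net_new_per_file = file_gb * (1.0 - dedupe_ratio)
    return file_gb + (backups - 1) * net_new_per_file


for count in (1, 2, 7, 14):
    logical = count * 250
    physical = stored_after(count, file_gb=250, dedupe_ratio=0.75)
    print(f"{count:2d} backups: {logical} GB logical, {physical} GB stored")
```

With two files, that works out to 312.5 GB stored instead of 500 GB.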

Useful how? Well, we have SAN replication married to our de-duplicating store for offsite backup and disaster recovery. That means that each night I have to transmit a LOT of SQL backup data across a WAN to another site. What’s a lot? For me, that just means the pipe is small and the data is much bigger. And that process would go a lot faster if, somehow, by magic, a whole lot of the data were already at the other end of the pipe before I start.

See where I’m going with this? With de-duplicated files, as days and weeks pass, each time we replicate new files from one site to the other, a whole lot of the data is already there at the other site. We only have to transmit the net new data. Even if that’s only 50% (a very poor performance number for de-duplicated storage in most people’s minds) that’s still cutting the data in half. Which is pretty good. Plus it’s compressed, which helps every other aspect of the backup story.
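A quick sketch of the replication arithmetic, in Python. The link speed here is a hypothetical value chosen only for illustration (it is not the speed of our WAN), and the calculation ignores protocol overhead and assumes the link is fully used.

```python
def transfer_hours(file_gb, dedupe_ratio, link_mbps):
    """Hours needed to send only the net new part of a backup file.

    file_gb      -- size of the compressed backup in decimal gigabytes
    dedupe_ratio -- fraction of the file already present at the far site
    link_mbps    -- assumed usable WAN bandwidth in megabits per second
    """
    net_new_gb = file_gb * (1.0 - dedupe_ratio)
    megabits = net_new_gb * 8 * 1000      # 1 GB = 8,000 megabits
    seconds = megabits / link_mbps
    return seconds / 3600


LINK_MBPS = 100  # hypothetical link speed

for ratio in (0.0, 0.5, 0.75):
    hours = transfer_hours(250, ratio, LINK_MBPS)
    print(f"de-dupe {ratio:.0%}: {hours:.1f} hours")
```

On that assumed 100 Mbps link, the 250 GB file would take about 5.6 hours with no de-duplication and about 2.8 hours at 50%.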

So we have what I think is a good compromise, borne out by internal testing:

  1. Keeping compressed SQL Backups in de-duplicated storage indefinitely, as a replacement for tapes, is impractical. It’s just too expensive. So we keep the SQL Backups in there only for the purpose of DR, and we have a pretty aggressive purge schedule to be rid of old files. The sweet spot seems to be to keep only a week or two.
  2. We use tapes too, for archival purposes, and they have longer retention.
  3. We back up to local (DAS or SAN) disks first at the SQL Server and then copy into the de-duplicating store, so that the backup process performs well and isn’t bottlenecked at the network or at the speed the appliance can receive the files. So backups go to disk, then get copied into the de-dupe store, where they cancel against whatever is already in there, and then the store replicates them off site.
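The copy and purge steps above could be scripted roughly like this. It is a minimal sketch, not our actual job: it assumes the de-dupe store is reachable as an ordinary file share, that backups use a `.bak` extension, and the function names and paths are hypothetical. The off-site replication itself is done by the appliance and is not shown.

```python
import shutil
import time
from pathlib import Path


def copy_into_store(local_backup, store_dir):
    """Copy a finished local backup file into the de-dupe store's share."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / Path(local_backup).name
    shutil.copy2(local_backup, target)   # copy2 keeps the file's timestamps
    return target


def purge_old_backups(store_dir, retention_days, now=None):
    """Delete .bak files in the store older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    removed = []
    for path in sorted(Path(store_dir).glob("*.bak")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


# Example usage with hypothetical paths:
#   copy_into_store(r"D:\Backups\BigDB_full.bak", r"\\dedupe-store\sqlbackups")
#   purge_old_backups(r"\\dedupe-store\sqlbackups", retention_days=14)
```

Running the backup to local disk first and copying afterwards keeps the backup window independent of how fast the appliance can ingest files.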

This is not a cheap setup, but it works great. I love it. That 250 GB file I mentioned is available at my other site in a couple of hours, because it’s always mostly there already. Your mileage may vary depending on all the specifics of the technology you have, and, as I said, Brent and Denny are right.

* Professional driver on a closed course; don’t try this at home; no animals were de-duped in the production of this post.


Published Friday, April 22, 2011 11:01 PM by merrillaldrich




Julian Kuiters said:

Hi Merrill, what tools are you using to do the de-duped file transfers?

April 24, 2011 6:27 AM

merrillaldrich said:

We use a Data Domain (now EMC Data Domain) setup. Complete files get copied in at one data center, deduped, then the net new bits get replicated off site.

April 24, 2011 12:19 PM

Denny Cherry said:

At some point I need to do a follow up to my post.  We did a bunch more testing in the Quantum lab that I need to post about.  I'll try and get the followup information posted this week, but as I've got Jury duty this week that may not happen this week.

April 25, 2011 1:55 AM

merrillaldrich said:

Cool - I'm very curious to see what you all found

April 25, 2011 9:26 AM

Yassine Elouati said:

Designed such a setup for a client last year. It cuts backup time, saves space. Dedupe restores can be slower vs reading compressed files from NAS but are sure worth it.

April 27, 2011 10:27 AM

Steve G. said:

May I ask the obvious question: How do deduped backups (I'm assuming at this point you're doing full db backups) compare with the traditional differential and log backups? Wouldn't a differential backup get you close to a deduped full backup without all that expensive software and hardware?

April 27, 2011 2:20 PM

merrillaldrich said:

Steve - that is an excellent question. Diffs and log backups are great, and we use them. Here's a summary, from my point of view:

- Differential compressed backups will almost certainly be all net-new data to the dedupe process, and will not deduplicate very effectively. OTOH, neither does the "new" part of a full backup. It's basically the same underlying issue for both types of files.

- Compressed log backups probably have the same behavior, being all net-new / deduping badly.

- Uncompressed diffs or log backups may dedupe more effectively, but of course are much larger on local disk and when being transmitted over the network.

The net result - we settled on a process where we use basically the same conventions we have always used: small DBs get backed up with a full, nightly, big databases get a weekly full and nightly diffs. The fulls dedupe well, and the diffs dedupe badly (but work anyway.)

For log backups we use uncompressed .TRN files on most systems, compressed on 2008 enterprise - a decision driven only by the introduction of native compressed backups in 2008.

I don't think I would change to a long, long series of *only* diffs or log backups, like, say a month, because I am paranoid about being dependent on that one full backup at the start of the chain.

April 29, 2011 1:21 AM

merrillaldrich said:

One other observation - this is a case where we already have a de-duplication setup, and we were deciding whether or not to use it for SQL backups in addition to all the other things we use it for. It probably does not make sense to run out and buy one for SQL Server, for the reasons Brent and Denny outline. If it were only for SQL backups, it's quite expensive and the advantage is diminished.

April 29, 2011 10:05 AM

james Wood said:

We do the same - back up to dedupe storage, then replicate the deduped backup to another backup server with deduped storage in another site. I tried to add compressed .bak files to this system but to be honest, the compressed files still took ages to replicate across the WAN. So I took the technology at its word. Now I snapshot the database using VSS, mount the iSCSI LUNs directly on the backup server, and back this up to dedupe storage. 1.9TB copied in about 6 hours. The good bit... I then copy this dedupe storage to another dedupe storage across the WAN (admittedly 100 Mbps) and the same backup set takes 3hrs! That's a throughput of about 11 gigabytes a minute! As mentioned in this post, if the data blocks already exist in the destination, the blocks get skipped. So in short, from what I found, if you can afford the initial space, the dedupe works faster on uncompressed data.

October 28, 2015 9:05 AM
