THE SQL Server Blog Spot on the Web


Merrill Aldrich

Don't get burned by replication of SQL Server files

Here's a tidbit for those who might have SQL Server in their environments, maybe without knowing all the nitty-gritty low-down: if you try to use file system replication (robocopy, xcopy, repli-whatever) to maintain a DR server from your production SQL Server, you might be in for a nasty surprise.

I recently had to troubleshoot a scenario like this: the app owner came to me and said, "The DB on our disaster recovery system is in suspect mode. What should we do? (And what happened?)" My mind immediately went to disk corruption, but the issue turned out to be simpler: there was nothing wrong with the machine other than copying files onto it.

Huh?

Yes, copying files had corrupted the database. Here's how:

The (well-meaning) group involved had set up replication of files from a production server to this failover server with disaster recovery in mind. For almost all "vanilla" files, this works great: outside of SQL Server, each file is typically a separate, self-contained entity, so if the file is unlocked at some point in time, it's pretty safe to copy it off to another server.

SQL Server databases, however, are always composed of at least two separate files that work together: the data file and the log file. Those files are also locked whenever the SQL Server service is running and has them open. That creates two issues for file replication: the database files are almost always locked, so they probably won't get replicated very often, and, more importantly, both files must be copied at the same moment to end up with a working database at the other end. If the log sequence numbers (LSNs) in the log and data files don't line up, the recovery process fails and the database will not come online.
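If you suspect this has happened to you, a quick check of the database's state tells the story. A minimal sketch (the database name here is hypothetical):

```sql
-- A database that failed recovery will show up as SUSPECT or
-- RECOVERY_PENDING rather than ONLINE.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'YourDatabase';  -- substitute your database name
```

A SUSPECT result on the DR side, paired with a healthy ONLINE database on the source, is a strong hint that the files arrived mismatched rather than corrupted on disk.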

What I think happened with my app owner is this: the source and destination systems went offline for a brief period (probably patching), so the original MDF and LDF files were unlocked. The replication thingamajig managed to pick up and copy the small log file, but not the big data file, to its target server. When the target server's SQL Server service fired up, it found a mismatched log and data file, and so, of course, failed recovery and would not bring the DB online.

The takeaway - SQL Server has great technology for disaster recovery, including the simple/basic log shipping method. Copying raw files from disk, however, won't give you DR!
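For the curious, the core of the log shipping approach is just a pair of commands run on a schedule. This is only a minimal sketch; the paths, share, and database name are illustrative, and a real setup would also handle the initial full restore and monitoring:

```sql
-- On the primary: back up the transaction log to a share the DR box can read.
BACKUP LOG YourDatabase
    TO DISK = N'\\drshare\logs\YourDatabase_log.trn';

-- On the DR server (against a database previously restored WITH NORECOVERY):
RESTORE LOG YourDatabase
    FROM DISK = N'\\drshare\logs\YourDatabase_log.trn'
    WITH NORECOVERY;
```

Because the restore replays the log through SQL Server's own recovery machinery, the DR copy always lands in a transactionally consistent state, which is exactly what raw file copies can't guarantee.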

Published Monday, February 01, 2010 2:45 PM by merrillaldrich


Comments

 

Karel Vandenhove said:

You can delete the log file and then re-attach the database file.

A new log file will be created, so at least not all is lost.

Not a very good practice, though.
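A sketch of what that re-attach looks like, assuming a hypothetical path and database name; note this only succeeds if the database was shut down cleanly:

```sql
-- Attach the data file alone and ask SQL Server to rebuild the log.
CREATE DATABASE YourDatabase
    ON (FILENAME = N'D:\Data\YourDatabase.mdf')
    FOR ATTACH_REBUILD_LOG;
```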

February 2, 2010 5:53 AM
 

Gail said:

Karel, that's not always true. SQL can only recreate the log if the database was shut down cleanly. If it wasn't, attempting to attach just the mdf will result in a file activation error and the DB not attaching.

February 6, 2010 3:42 PM
 

Chad Miller said:

I think this is a situation where you get what you pay for. Using host-based replication for DR, as you've pointed out, is not a good solution; however, SAN-based replication like EMC SRDF/A is. There are big differences between host-based and storage-based replication that shouldn't be dismissed.

I've been using EMC SRDF/A for a number of years to replicate Oracle, Informix, and SQL Server, as well as web and application servers, to a DR site 1,000 miles away from our primary site. It works, and it works well. In fact, I would even say that SAN-based replication, implemented correctly (with disk consistency groups), is a far superior solution to native SQL Server solutions. We had used log shipping prior to SAN-based replication and were spending 15% of our time maintaining it--now 0%. I'm speaking strictly for the DBA group; there is a maintenance aspect for the storage team, but not as much as with log shipping. The DR solution is tested multiple times per year and we have not had issues. Storage-based replication vendors such as EMC publish very detailed whitepapers specific to database technologies.

February 7, 2010 1:02 PM
 

merrillaldrich said:

Chad - all true. If you have a "real" replication system designed for this purpose, it can be great. I was just warning people off the basic automated file copy from server to server.

February 7, 2010 6:06 PM
 

Dean Hiller said:

Won't SAN replication have real problems under high load if it gets behind? A transaction in a database can span two blocks, and the SAN is transaction-unaware: it may replicate the first block while the second block is not replicated... i.e., the database is now corrupt and ACID is lost. Or am I missing other information about SAN replication?

I have seen many QA sites test DR with "very small load"... you probably won't see the problem then. I would think you need high enough load, and if you test it enough times, the problem would manifest itself once in a while.

January 18, 2011 1:03 PM
 

merrillaldrich said:

Dean - I believe the SAN vendor solutions like SRDF have some mechanisms built in to prevent what you describe and make them safe for SQL Server and other database products.

Worth noting: as long as the log records are correctly stored in the log file, ahead of the associated data page writes, the transactional consistency of the database is safe. When the database starts up at the DR site, the log will be replayed and the database pages should be brought back into a consistent state. On the other hand, if the log doesn't make it over intact, or the data file changes replicate ahead of the log file, it does seem like one could face "bad things."

January 20, 2011 1:28 AM
 

Asheesh said:

The correct way to put it is: "block-level" replication solutions are better than "file-level" replication solutions (at least for the case described in this post).

"Block-level" solutions can be host-based, like Veritas Volume Replicator, or off-host, like SRDF or HTC.

All these block-based replication solutions have a concept of "consistency groups": a group of storage (either volumes or LUNs) whose writes are replicated in order.

This ensures there is no data corruption.

I haven't heard of any file-level replication solution that does consistency groups. (Technically speaking, it is difficult.)

March 25, 2011 9:25 AM
 

merrillaldrich said:

Asheesh - well put, you've explained it better than I did. Everyone else > what he said. :-)

March 25, 2011 9:52 AM
 

Phil said:

We use SRDF/A, but in 3 of 4 DR test exercises, including one just concluded, the MDF and LDF files were corrupt on the SAN target volume and could not be attached to the SQL DR instance. The DB is not transaction intensive, but the replication target is reached over a WAN that has had performance issues in the past. The network can be a major player in the data consistency game.

May 13, 2011 9:50 AM
 

venkat said:

Hi Chad,

When we do SAN-level replication, what is the status of the database when the disks are mounted on the DR server? I am looking for guides for SQL Server.

November 29, 2011 3:20 PM
