
Merrill Aldrich

Visualize Disaster

Or, How Mirroring Off-Site Saved my #Bacon

My company does most things right. Our management is very supportive, listens and generally funds the technology that makes sense for the best interest of the organization. We have good redundancy, HA and disaster recovery in place that fit our objectives. Still, as they say, bad things can happen to good people. This weekend we did have an outage despite our best efforts, and that’s the reason for this post. It went pretty well for my team, all things considered, but we’d put in a lot of preparation.

It’s a Matter of When and How, not If.

If you work in this industry it’s very likely that you will be hit with something like this at some point. It’ll come when you don’t expect it, in a way you probably didn’t imagine. What separates the women from the girls, so to speak, is how ready you are for the unexpected when it happens. In sports there’s a technique called “positive visualization” or “mental rehearsal,” where a skier or other athlete goes through his or her whole run beforehand, imagining each gate and how they will successfully navigate the course. As far as I have seen, it’s quite effective for creating the right mindset for a successful run, and it’s something I use in planning for failover or disaster recovery. I thought I’d take a minute, fresh from my recent hair-graying experience, to remind everyone to go through this exercise.

Imagine Your Datacenter is Gone

For this to be effective, you have to really go through it in detail, like the skier imagining every gate. You get a phone call, and find out every server in your primary datacenter is offline, and no one knows exactly why. (I am altering the scenario here a little to respect the privacy of my company, but in reality the reason doesn’t matter.) Let’s say the UPS caught fire, or the sprinklers went off, or someone forgot to put diesel in the generator, or there was an earthquake, or a meteor. Fact is, you had a highly redundant infrastructure with failover clusters and backup power and RAID and dual power supplies and redundant everything and still your whole site went down.

It happens. A lot.

First, are you in a company that has not invested in a DR strategy and is totally unprepared? If so, then I would strongly suggest going through this exercise, at a high level, with your boss. An outage like this can and does put whole companies out of business, and if you have a CIO, he or she might not be the CIO afterward.

Really Do this Detailed Thought Experiment

Let’s say, for the sake of argument, that your datacenter will be partially offline for four days (I am improvising here, and you can too, but make it realistic). Run through these visualizations, and imagine how you handle it in your environment:

  1. Do you have your data? How much data is lost? The servers are offline, and you’re not getting anything from their local drives. Not full backups, not log backups. Nada. The tape drive is there, but has no power. Is there an up-to-date copy at another site? Or recent backups at another site? If so, what’s the recovery point (i.e., how old is it) and what would that mean for your business?
    But don’t stop with the data -
  2. If you have another server at another site, can all your applications and/or users actually connect to it? Connection strings in place? How do you repoint everything so that the failover server can be used? It’s easy to gloss over this issue, but think it through in detail: will you have to have staff go and change individual machines to connect to another server? This can be a really sticky problem.
  3. Is there enough bandwidth between your users and your DR site to sustain the normal workload?
  4. So you have data, and clients that can point to it. Can they authenticate? Are all the logins in place from the old server? Remember, you can’t reach the old box, can’t query it – there is no way to take a look at this point and see what logins were there. How do you recover authentication? If you are using SQL auth, what about all those passwords? Even if you use mirroring or log shipping, the login issue can still take your system down. (A few spot-check queries for items 1, 4 and 5 follow this list.)
  5. Now, hopefully, you have connections and users authenticating. What about those SQL Agent jobs? Are the jobs all present on your failover machine? How will you know?
  6. Use SSIS in your environment? Where are the packages? Do those fail over to your DR site too? How are the DR copies kept current?
  7. Keep going like this down every subsystem you use – replication, ETL processes, reporting, etc.
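
To make the exercise concrete, here are a few T-SQL spot checks along the lines of items 1, 4 and 5. They are a sketch only, meant to be run against the standby server; adapt the names and filters to your own environment.

    -- Run against the STANDBY server, not the (unreachable) primary.

    -- Item 1: how old is the data? Latest backup recorded per database on this instance.
    SELECT d.name AS database_name,
           MAX(b.backup_finish_date) AS last_backup_on_record
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b ON b.database_name = d.name
    GROUP BY d.name
    ORDER BY d.name;

    -- Item 4: which logins exist here? Compare against your documented primary-side list.
    SELECT name, type_desc, create_date
    FROM sys.server_principals
    WHERE type IN ('S', 'U', 'G')    -- SQL logins, Windows logins, Windows groups
    ORDER BY name;

    -- Item 5: which SQL Agent jobs made it over, and are they enabled?
    SELECT name, enabled, date_modified
    FROM msdb.dbo.sysjobs
    ORDER BY name;

None of these queries proves the standby actually works, of course, but they answer the basic “is it even there?” questions quickly when you cannot reach the primary.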

The answers to all these questions will be different for you, but the important thing is that there should be real, working answers. In the middle of the night, sitting at the keyboard, is not the time to be figuring this stuff out.

I am pretty happy with what we have built at my office – though I am careful not to get complacent. We use a combination of mirrored failover clusters (that is, failover clusters in two datacenters, with database mirroring synchronizing them over our WAN) and de-duplicated backup files that are SAN-replicated offsite, to provide two channels of offsite backup. For some older systems we are still on log shipping, but we plan to move those to mirroring as soon as our ISVs and budgets permit. We use some custom SQL Agent/PowerShell jobs to propagate logins and jobs between the sites – jobs we actually had to use this weekend for the first time. I am happy to report, they worked. Whew.
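
As an aside, the core of the login-transfer technique (in the spirit of Microsoft’s well-known sp_help_revlogin script – this is a generic sketch, not our exact job) is to generate CREATE LOGIN statements on the primary that preserve each SQL login’s SID and password hash, then replay them on the partner so database users stay mapped after a failover or restore. A minimal sketch, assuming SQL Server 2008 or later for the binary-to-string CONVERT style:

    -- Generate CREATE LOGIN statements for SQL logins, preserving SID and password hash,
    -- so database users in mirrored/restored databases stay mapped (no orphaned users).
    -- Run on the primary; replay the generated statements on the partner server.
    SELECT 'CREATE LOGIN ' + QUOTENAME(name)
         + ' WITH PASSWORD = '
         + CONVERT(varchar(max), CAST(LOGINPROPERTY(name, 'PasswordHash') AS varbinary(256)), 1)
         + ' HASHED, SID = '
         + CONVERT(varchar(max), sid, 1)
         + ', CHECK_POLICY = OFF;'   -- adjust to match each login's policy settings
    FROM sys.sql_logins
    WHERE name NOT LIKE '##%'        -- skip internal certificate-based logins
      AND name <> 'sa';

Windows logins, server role memberships and server-level permissions need their own handling, but the SID-plus-hash transfer is the part that keeps SQL auth working when the primary is gone.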

On the other hand, I came from a company that was much more reckless, and effectively had no failover strategy whatsoever. They were frankly just too cheap to do it. Looking back, it terrifies me, because I know how easily they could lose their whole business.

Edit 3/11/2011

With the earthquake in Japan, this is at the forefront of many people's minds today. Please have a look at these other posts, especially Paul Randal's survey:

http://www.sqlskills.com/BLOGS/PAUL/post/Surveys-DR-plan-testing-and-considering-human-nature.aspx 

http://www.sqlha.com/blog/ 

Could be a good time to approach the boss about this important topic.

Published Monday, March 07, 2011 10:21 PM by merrillaldrich


Comments

 

Jason said:

Great article.  As the first full time dedicated DBA at the company I'm working at, I'm working on DR from the database side of things.  I have yet to find a willing partner on the application side of things.  But I do know that I have reliable data offsite that can be brought up as needed.  Most of the systems in place now are vendor-purchased, off-the-shelf systems, so in the worst case where the primary datacenter might be offline for days or weeks, we could stand up some VMs and enlist the support of our vendors to bring the applications up and pointing to the recovered databases.

I am just now working with management and the business to identify which applications need to be mirrored or log-shipped to a warm-standby system.  Once that list is created, we can then work with the vendors to get the application side set up before a disaster.  The problem is weeding out the business claim that EVERY APP NEEDS TO BE READY AT A MOMENT'S NOTICE!!!! omg!  Once we start sending the quotes for redundant hardware and warm standby for every application they use, that changes things a lot, and they start thinking about workarounds that would keep the business running for a day or two on most applications.

March 8, 2011 7:46 AM
 

hj meulekamp said:

How is the mirroring over WAN working for you? Do you have high dedicated bandwidth between the database servers, or are you tunneling via VPN over the internet?

Do you use the high safety mode? Any negative impact on raw transaction volume the server can process?

We are looking into a multiple-datacenter solution right now, and I'm a fan of mirroring, but I have not had to do it over a WAN before.

Thanks,

March 8, 2011 8:14 AM
 

merrillaldrich said:

Jason - sounds like a good start. We have a tiered system categorizing apps into four groups with different recovery point / recovery time objectives.

Hj, it's working. We do use async (knowing the risk) because of bandwidth, but we watch the send queues carefully with SCOM. It's not a dedicated connection. That whole issue is an important consideration.
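
(For anyone who wants to eyeball the same thing without SCOM: the mirroring counters are also exposed through a DMV, so a quick ad-hoc check, assuming the counter names documented for SQL Server 2008, can look like this.)

    -- Approximate unsent log (the mirroring send queue), in KB, per mirrored database.
    SELECT instance_name AS database_name,
           cntr_value    AS log_send_queue_kb
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Database Mirroring%'
      AND counter_name = 'Log Send Queue KB'
      AND instance_name <> '_Total';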

March 8, 2011 10:00 AM
 

J.D. Gonzalez said:

Fantastic article.  Quite frankly, I wish people would do this exercise at least once a quarter.  If you could practice it, even better.  I worked in a place that paid for hardware but didn't put much emphasis on DR.  They relied too much on VM and SAN replication.

Every time I brought up bandwidth and storage I was blown off.  As luck would have it, there was a failure and people were scrambling left and right.  We eventually got things recovered and finally the rest of IT started to see what I had been trying to get at.

Personally, I don't even think it's possible to have a DR plan without first visualizing and performing scenario walkthroughs.  Good post!

March 8, 2011 4:50 PM
