Or, How Mirroring Off-Site Saved my #Bacon
My company does most things right. Our management is very supportive, listens and generally funds the technology that makes sense for the best interest of the organization. We have good redundancy, HA and disaster recovery in place that fit our objectives. Still, as they say, bad things can happen to good people. This weekend we did have an outage despite our best efforts, and that’s the reason for this post. It went pretty well for my team, all things considered, but we’d put in a lot of preparation.
It’s a Matter of When and How, not If.
If you work in this industry it’s very likely that you will be hit with something like this at some point. It’ll come when you don’t expect it, in a way you probably didn’t imagine. What separates the women from the girls, so to speak, is how ready you are for the unexpected when it happens. In sports there’s a technique called “positive visualization” or “mental rehearsal” where a skier, or other athlete goes through his or her whole run beforehand, imagining each gate and how they will successfully navigate the course. As far as I have seen it’s quite effective for creating the right mindset to have a successful run, and it’s something I use in planning for failover or disaster recovery. I thought I’d take a minute, fresh from my recent hair-graying experience, to remind everyone to go through this exercise.
Imagine Your Datacenter is Gone
For this to be effective, you have to really go through it in detail, like the skier imagining every gate. You get a phone call, and find out every server in your primary datacenter is offline, and no one knows exactly why. (I am altering the scenario here a little to respect the privacy of my company, but in reality the reason doesn’t matter.) Let’s say the UPS caught fire, or the sprinklers went off, or someone forgot to put diesel in the generator, or there was an earthquake, or a meteor. Fact is, you had a highly redundant infrastructure with failover clusters and backup power and RAID and dual power supplies and redundant everything and still your whole site went down.
It happens. A lot.
First, are you in a company that has not invested in a DR strategy and is totally unprepared? If so, then I would strongly suggest going through this exercise, at a high level, with your boss. An outage like this can and does put whole companies out of business, and if you have a CIO, he or she might not be the CIO afterward.
Really Do this Detailed Thought Experiment
Let’s say, for the sake of argument, that your datacenter will be partially offline for four days (I am improvising here, and you can too, but make it realistic). Run through these visualizations, and imagine how you handle it in your environment:
- Do you have your data? How much data is lost? The servers are offline, and you’re not getting anything from their local drives. Not full backups, not log backups. Nada. Tape drive is there, but has no power. Is there an up-to-date copy at another site? Or recent backups at another site? If so, what’s the recovery point (ie. how old is it) and what would that mean for your business?
But don’t stop with the data -
- If you have another server at another site, can all your applications and/or users actually connect to it? Connection strings in place? How do you repoint everything so that the failover server can be used? It’s easy to gloss over this issue, but think it through in detail: will you have to have staff go and change individual machines to connect to another server? This can be a really sticky problem.
- Is there enough bandwidth between your users and your DR site to sustain the normal workload?
- So you have data, and clients that can point to it. Can they authenticate? Are all the logins in place from the old server? Remember, you can’t reach the old box, can’t query it – there is no way to take a look at this point and see what logins were there. How do you recover authentication? If you are using SQL auth, what about all those passwords? Even if you use mirroring or log shipping, the login issue can still take your system down.
- Now, hopefully, you have connections and users authenticating. What about those SQL Agent jobs? Are the jobs all present on your failover machine? How will you know?
- Use SSIS in your environment? Where are the packages? Do those fail over to your DR site too? How are the DR copies kept current?
- Keep going like this down every subsystem you use – replication, ETL processes, reporting, etc.
The answers to all these questions will be different for you, but the importing thing is that there should be real, working answers. In the middle of the night, sitting at the keyboard, you can’t be figuring this stuff out.
I am pretty happy with what we have built at my office – though I am careful not to get complacent. We use a combination of mirrored failover clusters (that is, failover clusters in two datacenters, with database mirroring synchronizing them over our WAN) and de-duplicated backup files that are SAN-replicated offsite to provide two channels of offsite backup. For some older systems we are still on log shipping, but plan to advance those to mirroring as soon as our ISVs and budgets permit. We use some custom SQL Agent/PowerShell jobs to propagate logins and jobs between the sites – which jobs we had to actually use this weekend for the first time. I am happy to report, they worked. Whew.
On the other hand, I came from a company that was much more reckless, and effectively had no failover strategy whatsoever. They were frankly just too cheap to do it. Looking back, it terrifies me, because I know how simple it could be for them to lose their whole business.
Edit 3/11/2011
With the earthquake in Japan, this is in the forefront of many people's minds today. Please have a look at these other posts, esp. Paul Randal's survey:
http://www.sqlskills.com/BLOGS/PAUL/post/Surveys-DR-plan-testing-and-considering-human-nature.aspx
http://www.sqlha.com/blog/
Could be a good time to approach the boss about this important topic.