Answer: Only if you want a shot at it actually working when you need it.
There are a few reasons you need to regularly test your recovery plans… I’ve got my top three.
- Backups only work if they are good.
- Your documentation is only useful if you can follow it.
- You are soft and easily crushed.
Everyone knows the mantra of “backup, backup, backup,” but you also have to test those backups for accuracy and functionality. I’m not going to beat this one endlessly, but please read an old post of mine, “Epic Fail #1,” to see how backups can fail in spectacular, unplanned ways.
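One way to test a backup for accuracy is to restore it somewhere harmless and compare what came back against the original. Here is a minimal Python sketch of that idea, not a finished tool. The function names (`sha256_of`, `compare_restore`) and the two-directory layout are my own assumptions for illustration, and it assumes the source files have not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(source_dir: Path, restored_dir: Path) -> dict:
    """Compare each file under source_dir with its restored counterpart.

    Hypothetical helper: returns relative paths that are missing from the
    restore and paths whose contents differ from the source.
    """
    missing, mismatched = [], []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            missing.append(rel.as_posix())
        elif sha256_of(src) != sha256_of(restored):
            mismatched.append(rel.as_posix())
    return {"missing": missing, "mismatched": mismatched}
```

A matching checksum only tells you the bytes came back intact. It says nothing about functionality, so you still need to start the restored application or mount the restored database to know it actually works.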
Simply put, you need good documentation. You need easy-to-locate lists of vendors, support numbers, configuration details of machines and applications, notes on how “this” interacts with “that,” which services depend on others, and step-by-step instructions for processes you don’t do often and even those you DO do every day.
When you’re under pressure to troubleshoot an issue that is causing downtime, it’s likely you’ll lose track of where to find the information you need to recover successfully. Having clean documentation will keep you calm and focused at a time you really need to have your head in the game.
Realistically, your documentation will be out of date when you use it. You won’t mean for it to be, but even if you have a great DR plan in place, I’ll bet you upgraded a system, changed vendors, or altered a process almost immediately after your update cycle. Regular review of your documents is a valuable part of testing, even if you don’t touch your lab.
My personal method is to keep a binder with hard copies of all my DR documentation handy. Whenever I change a system, I make a note on the hard copy. Quarterly, I update the electronic version and reprint it. With the binder, I always have the information on hand in case the electronic version is not accessible, and thanks to the handwritten margin notes, the binder is often more up to date than the electronic copy. Even a note declaring a section “THIS IS ENTIRELY WRONG NOW” can save someone hours of heading down the wrong path.
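If you want something to nag you when a document has slipped past its quarterly review, a few lines of Python will do it. This is only a sketch under my own assumptions: the 90-day interval stands in for “quarterly,” and the `stale_documents` function and the document names in the example are made up for illustration.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def stale_documents(last_reviewed: dict, today: date,
                    interval: timedelta = REVIEW_INTERVAL) -> list:
    """Return the names of documents last reviewed more than `interval` ago.

    `last_reviewed` maps a document name to the date it was last reviewed.
    """
    return sorted(
        name for name, reviewed in last_reviewed.items()
        if today - reviewed > interval
    )


if __name__ == "__main__":
    # Hypothetical document names and review dates.
    docs = {
        "exchange-recovery": date(2014, 1, 2),
        "vendor-contacts": date(2014, 3, 1),
    }
    for name in stale_documents(docs, today=date(2014, 4, 15)):
        print(f"Overdue for review: {name}")
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test against fixed dates.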
No one wants to contemplate their mortality. I completely understand. (Or maybe you just want to go on vacation without getting a call halfway through. Shocker, right?) But if you happen to hold the only knowledge of how something works in your data center, then you are a walking liability for your company. You aren’t securing your job by being the only person with the password to the schema admin account, for example. It only takes one run-in with a cross-town bus to create a business continuity issue for your company that didn’t even touch the data center.
This extends to your documentation. Those step-by-step instructions for recovery need to include information and tips that someone else on your team (or an outside consultant) can follow without having prior intimate knowledge of that system. Sometimes the first step is “Call Support, the number is 800-555-1212” and that’s okay.
The only way to find out what others don’t know is to test. Test with tabletop exercises, test with those backup tapes and test with that documentation. Pick a server or application and have someone who knows it best write the first draft and then hand it to someone else to try to follow. Fill in the blanks. Repeat. Repeat again.
A lot of this process requires only your time. Time you certainly won’t have when your CEO is breathing down your neck about recovering his email.
This post is part of a 15-part series on Disaster Recovery and Business Continuity planning by the US-based Microsoft IT Evangelists. For the full list of articles in this series, see the intro post located here: http://mythoughtsonit.com/2014/02/intro-to-series-disaster-recovery-planning-for-i-t-pros/
If you are ready to take things further, check out Automated Disaster Recovery Testing with Hyper-V Replica and PowerShell – http://blogs.technet.com/b/keithmayer/archive/2012/10/05/automate-disaster-recovery-plan-with-windows-server-2012-hyper-v-replica-and-powershell-3-0.aspx