Business Continuity and the Cloud

This week marks the start of TechNet on Tour, coming to twelve cities. The full-day workshops include lecture and hands-on labs where you can learn about some of the ways you can use Microsoft Azure to help with your disaster recovery planning.

But let me tell you about the first “business continuity” plan I was part of. It involved a stash of tapes and daily backups on a two-week cycle, with the Friday backups being held for a month. The nightly backup job fit on two tapes, and every morning I ejected the tapes from the machine and dropped them in my bag. They went home with me, across town, and came back every day to be swapped with the latest ones. Whenever I took a vacation, I designated an available person to perform the same task.

That was it. The tapes were rarely looked at, the data never tested and, fortunately, never needed. We were partying like it was 1999. Because it was.

Still, the scenario isn’t uncommon. There are still lots of small businesses with only a single location, and still lots of tapes out there. But now there is more data, and more urgency for that data to be recovered as quickly as possible with as little loss as possible. And there are still only 24 hours in the day. How annoying to arrive at work in the morning, only to find the overnight backup job still running.

As I moved through jobs and technologies evolved, we addressed the growing data and lack of time in many ways… Adjusting backup jobs to capture less critical or infrequently changing data only over the weekends. More jobs that only captured delta changes. Fancier multiple-tape changers, higher density tapes, local “disk to disk” backups that were later moved to tape, even early “Internet” backup solutions, often offered by the same companies that handled your physical tape and box rotation services.

We also chased that holy grail of “uptime”. Failures weren’t supposed to happen if you threw enough hardware in a room. Dual power supplies, redundant disk arrays, multiple disk controllers, UPS systems with various bypass offerings. Add more layers to protect the computers, the data.

Testing was something we wanted to do more often. But it was hard to justify additional hardware purchases to upper management. Hard to find the time to set up a comprehensive test. But we tried and often failed. And learned. Because each test or real outage is a great opportunity to learn. Outages are often perfect storms… if only we had swapped out that dying drive a day before, if only that piece of hardware was better labeled, if only that was better documented… and each time we made improvements.

I remember, after a lengthy call with a co-location facility that wanted us to sign a year-long agreement even though we only wanted space for 3 months to run a recovery test, how I wished for something I could just use for the time I needed. It’s been a little over 5 years since that phone call, but finally there is an answer and it’s “the cloud”.

Is there failure in the cloud? Of course, it’s inevitable. For all the abstractness, it’s still just running on hardware. But the cloud provides part of an answer that many businesses simply didn’t have even five years ago. Businesses that never recovered from the likes of Katrina and other natural or man-made disasters might still have a shot today.

So catch a TechNet on Tour stop if it passes through your area. Look at taking advantage of things like using the cloud as a backup target instead of tape, or replicating a VM to Azure with Azure Site Recovery. Even starting to dabble in better documentation or scripting with PowerShell to make your key systems more consistently reproducible will go a long way. Do a “table top” dry run of your existing DR plan today.

Sysadmins don’t let other sysadmins drop DLT tapes in their bags. Let’s party like it’s 2015. Because it is.