Playing IT Fast and Loose

It’s been a long time since I’ve been at work from dusk ’til dawn. I not saying that I’m the reason we have such fabulous uptime, there are a lot of factors that play into it. We’ve got a well rounded NetOps team, we try to buy decent hardware, we work to keep everything backed up and we don’t screw with things when they are working. And we’ve been lucky for a long time.

It also helps that our business model doesn’t require selling things to the public or answering to many external “customers”. Which puts us in the interesting position where its almost okay if we are down for a day or two, as long as we can get things back to pretty close to where they were before they went down. That also sets up to make some very interesting decisions come budget time. They aren’t necessarily “wrong”, but they can end up being awkward at times.

For example, we’ve been working over the last two years to virtualize our infrastructure. This makes lots of sense for us – our office space requirements are shrinking and our servers aren’t heavily utilized individually, yet we tend to need lots of individual servers due to our line of business. When our virtualization project finally got rolling, we opted to us a small array of SAN devices from Lefthand (now HP). We’ve always used Compaq/HP equipment, we’ve been very happy with the dependability of the physical hardware. Hard drives are considered consumables and we do expect failures of those from time to time, but whole systems really biting the dust? Not so much.

Because of all the factors I’ve mentioned, we made the decision to NOT mirror our SAN array. Or do any network RAID. (That’s right, you can pause for a moment while the IT gods strike me down.) We opted for using all the space we could for data and weighed that against the odds of a failure that would destroyed the data on a SAN, rendering entire RAID 0 array useless.

Early this week, we came really close. We had a motherboard fail on one of the SANs, taking down our entire VM infrastructure. This included everything except the VoIP phone system and two major applications that have not yet been virtualized. We were down for about 18 hours total, which included one business day.

Granted, we spent the majority of our downtime waiting for parts from HP and planning for the ultimate worst – restoring everything from backup. While we may think highly of HP hardware overall, we don’t think very highly of their 4-hour response windows on Sunday nights. Ultimately, over 99% of the data on the SAN survived the hardware failure and the VMs popped back into action as soon as the SAN came back online. We only had to restore one non-production server from backup after the motherboard replacement.

Today, our upper management complemented us on how we handled the issue and was pleased with how quickly we got everything working again.

Do I recommend not having redundancy on your critical systems? Nope.

But if your company management fully understands and agrees to the risks related to certain budgeting decisions, then as a IT Pro your job is to simply do the best you can with what you have and clearly define the potential results of certain failure scenarios.

Still, I’m thinking it might be a good time to hit Vegas, because Lady Luck was certainly on our side.