Playing IT Fast and Loose

It’s been a long time since I’ve been at work from dusk ’til dawn. I’m not saying that I’m the reason we have such fabulous uptime; there are a lot of factors that play into it. We’ve got a well-rounded NetOps team, we try to buy decent hardware, we work to keep everything backed up, and we don’t screw with things when they are working. And we’ve been lucky for a long time.

It also helps that our business model doesn’t require selling things to the public or answering to many external “customers”, which puts us in the interesting position where it’s almost okay if we are down for a day or two, as long as we can get things back to pretty close to where they were before they went down. That also sets us up to make some very interesting decisions come budget time. They aren’t necessarily “wrong”, but they can end up being awkward at times.

For example, we’ve been working over the last two years to virtualize our infrastructure. This makes lots of sense for us – our office space requirements are shrinking and our servers aren’t heavily utilized individually, yet we tend to need lots of individual servers due to our line of business. When our virtualization project finally got rolling, we opted to use a small array of SAN devices from Lefthand (now HP). We’ve always used Compaq/HP equipment, and we’ve been very happy with the dependability of the physical hardware. Hard drives are considered consumables and we do expect failures of those from time to time, but whole systems really biting the dust? Not so much.

Because of all the factors I’ve mentioned, we made the decision to NOT mirror our SAN array. Or do any network RAID. (That’s right, you can pause for a moment while the IT gods strike me down.) We opted for using all the space we could for data and weighed that against the odds of a failure that would destroy the data on a SAN, rendering the entire RAID 0 array useless.

Early this week, we came really close. We had a motherboard fail on one of the SANs, taking down our entire VM infrastructure. This included everything except the VoIP phone system and two major applications that have not yet been virtualized. We were down for about 18 hours total, which included one business day.

Granted, we spent the majority of our downtime waiting for parts from HP and planning for the ultimate worst – restoring everything from backup. While we may think highly of HP hardware overall, we don’t think very highly of their 4-hour response windows on Sunday nights.  Ultimately, over 99% of the data on the SAN survived the hardware failure and the VMs popped back into action as soon as the SAN came back online. We only had to restore one non-production server from backup after the motherboard replacement.

Today, our upper management complimented us on how we handled the issue and was pleased with how quickly we got everything working again.

Do I recommend not having redundancy on your critical systems? Nope.

But if your company management fully understands and agrees to the risks related to certain budgeting decisions, then as an IT Pro your job is simply to do the best you can with what you have and clearly define the potential results of certain failure scenarios.

Still, I’m thinking it might be a good time to hit Vegas, because Lady Luck was certainly on our side.

Tomorrow is World Backup Day

What have you backed up lately?

If you are a systems admin, you probably already have a backup solution in place at the office or for your clients.  Take some time tomorrow to check in on those processes to make sure you aren’t missing something important and that they are working the way you expect.

At home, check on or implement a solution for your important files and photos on your home computers. It can be as simple as purchasing a portable drive or using a cloud-based solution. I’m a SugarSync fan myself. If you want to check out SugarSync for yourself, use this referral code and get some bonus free space.

With the proper backup solution in place, your home laptop can be almost instantly replaceable with no worries.  I recently reinstalled the OS on my netbook and was able to sync all my data files right back on with SugarSync.  It’s easy and helps me sleep better at night!

Learn more about World Backup Day at http://www.worldbackupday.net/

The How and Why of an ImageRight Test Environment

Over the last few days, I’ve coordinated setting up a new test environment for ImageRight, now that we’ve upgraded to version 5.  Our previous test environment was still running version 4, which made it all but useless for current workflow development.  However, workflow development isn’t the only reason to set up an alternate ImageRight system – there are some other cool uses.

ImageRight has an interesting back-end architecture. While it’s highly dependent on Active Directory for authentication (if you use the integrated logon method), the information about which other servers the application server and the client software should interact with is completely controlled with database entries and XML setup files. Because of this, you can have different ImageRight application servers, databases and image stores all on the same network with no conflicts or sharing of information. Yet you don’t need to provide a separate Active Directory infrastructure or network subnet.

While our ultimate goal was to provide a test/dev platform for our workflow designer, we also used this exercise as an opportunity to run a “mini” disaster recovery test so I could update our recovery documentation related to this system.

To set up a test environment, you’ll need at least one server to hold all your ImageRight bits and pieces – the application server service, the database and the images themselves.  For testing, we don’t have enough storage available to restore our complete set of images, so we only copied a subset.  Our database was a complete restoration, so test users will see a message about the system being unable to locate documents that weren’t copied over. 

I recommend referring to both the “ImageRight Version 5 Installation Guide” and the “Create a Test Environment” documents available on the Vertafore website for ImageRight clients. The installation guide will give you all the prerequisites needed to run ImageRight, and the document on test environments details which XML files need to be edited to ensure that your test server is properly isolated from your production environment. Once you’ve restored your database, image stores and install share (aka “Imagewrt$”), it’s quick and easy to tweak the XML files and get ImageRight up and running.
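If you’d rather script the pointer changes than edit the XML by hand, something along these lines works. This is only a rough sketch – the config path, element text and server names below are placeholders, and the actual files to touch are the ones spelled out in the “Create a Test Environment” document:

# Rough sketch: point a restored ImageRight config file at the test server.
# The config path and server names are placeholders -- the real files and
# settings to change are listed in Vertafore's "Create a Test Environment" doc.
import xml.etree.ElementTree as ET

CONFIG_PATH = r"\\testserver\Imagewrt$\Config\ApplicationServer.xml"  # placeholder path
OLD_SERVER = "prod-ir01"   # example production application server name
NEW_SERVER = "test-ir01"   # example isolated test server name

tree = ET.parse(CONFIG_PATH)
root = tree.getroot()

# Replace any element text or attribute value that still references production.
changed = 0
for elem in root.iter():
    if elem.text and OLD_SERVER in elem.text:
        elem.text = elem.text.replace(OLD_SERVER, NEW_SERVER)
        changed += 1
    for key, value in elem.attrib.items():
        if OLD_SERVER in value:
            elem.set(key, value.replace(OLD_SERVER, NEW_SERVER))
            changed += 1

tree.write(CONFIG_PATH, encoding="utf-8", xml_declaration=True)
print("Updated %d references from %s to %s" % (changed, OLD_SERVER, NEW_SERVER))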

For our disaster recovery preparations, I updated our overall information about ImageRight, our step-by-step guide for recovery and burned a copy of our install share to a DVD so it can be included in our off-site DR kit.  While you can download a copy of the official ImageRight ISO, I prefer to keep a copy of our expanded “Imagewrt$” share instead – especially since we’ve added hotfixes to the version we are running, which could differ from the current ISO available online from Vertafore.

Because setting up the test environment was so easy, I could also see a use where some companies may want to use alternate ImageRight environments for extra-sensitive documents, like payroll or HR. I can’t speak to the additional licensing costs of having a second ImageRight setup specifically for production, but it’s certainly technically possible if using different permissions on drawers and documents doesn’t meet the business requirements for some departments.

If You Build It, Can They Come?

I’ve posted several times about working on a disaster recovery project at the office using Server 2008 Terminal Services. We’ve officially completed the testing and had some regular staffers log on and check things out. That was probably one of the most interesting parts.

One issue with end-user access was the Terminal Services ActiveX control on Windows XP SP3, which is disabled by default as part of a security update in SP3. This can usually be fixed with a registry change, which I posted about before; however, that requires local administrative privileges that not all our testing users had. There are also ActiveX version issues if the client machine is running an XP service pack earlier than SP3.
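For reference, the fix boils down to clearing the kill-bit flag for the Terminal Services ActiveX control in the registry, which is why admin rights are needed. Here’s a rough sketch of that change scripted in Python – the CLSID is deliberately a placeholder, so confirm the real one(s) against my earlier post or the Microsoft KB article before using anything like this:

# Rough sketch of the registry change that re-enables the Terminal Services
# ActiveX control on XP SP3. Requires local admin rights (which was the catch).
# The CLSID below is a PLACEHOLDER -- look up the real one(s) before running.
import winreg

TS_ACTIVEX_CLSID = "{00000000-0000-0000-0000-000000000000}"  # placeholder CLSID
KEY_PATH = (r"SOFTWARE\Microsoft\Internet Explorer\ActiveX Compatibility"
            "\\" + TS_ACTIVEX_CLSID)

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    # Setting Compatibility Flags to 0 clears the kill bit so IE will load the control.
    winreg.SetValueEx(key, "Compatibility Flags", 0, winreg.REG_DWORD, 0)

print("Compatibility Flags cleared for " + TS_ACTIVEX_CLSID)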

Administrative privileges also caused some hiccups with one of our published web apps that required a Java plug-in. At one point, the web page required a Java update that could only be installed by a server administrator and this caused logon errors for all the users until that was addressed.

In this lab setting, we had also restored our file server to a different OS. Our production file server is Windows 2000 and in the lab we used Windows 2008. This resulted in some access permission issues for some shared and “home” directories. We didn’t spend any time troubleshooting the problem this time around, but when we do look to upgrade that server or repeat this disaster recovery test we know to look into the permissions more closely.

Users also experienced trouble getting Outlook 2007 to run properly. I did not have issues when I tested my own account – there were some dialog boxes that needed to be addressed before it ran for the first time, to confirm the username and such. While the answers to those boxes seem second nature to those of us in IT, we realized that we will need to provide better documentation to ensure that users get email working right the first time.

In the end, detailed documentation proved to be the most important aspect of rolling this test environment out to end users. In the event of a disaster, it’s likely that our primary way of sharing initial access information would be by posting instructions to the Internet. Providing easy-to-follow instructions, with step-by-step screenshots that users can work through independently, is critical. After a disaster, I don’t expect my department will have a lot of time for individual hand-holding for each user who will be using remote access.

Not only did this project provide an opportunity to update the procedures we use to restore services, it showed that it’s equally important to make sure that end users have instructions so they can independently access those services once they are available.

Document Imaging Helps Organize IT

Since our implementation of ImageRight, our Network Operations team has embraced it as a way to organize our server and application documentation in a manner that makes it accessible to everyone in our team. Any support tickets, change control documents, white papers and configuration information that is stored in ImageRight is available to anyone in our group for reference.

This reduces version control issues and ensures that a common naming (or “filing”) structure is used across the board, making information easier to find. (For reference, an ImageRight “file” is a collection of documents organized together like a physical file that hangs in a file cabinet.) Plus, the ability to export individual documents or whole ImageRight “files” to a CD with an included viewer application is a great feature that I’m using as part of our Disaster Recovery preparations.

I have a single file that encompasses the contents of our network “runbook”. This file contains server lists and configuration details, IP and DNS information, network maps, application and service dependencies, storage share locations/sizes, support contact information, etc. It consists of text documents, spreadsheets, PDF files and other types of data. I keep a hard copy printed at my desk so I can jot notes when changes are needed, but ImageRight ensures I have an electronic backup that I can edit on a regular basis. Plus, I regularly export an updated copy to a CD that I add to the off-site Disaster Recovery box.

The value of ImageRight in a disaster scenario expands beyond just our configuration documents. In an office where we deal with large amounts of paper, encouraging people to make sure those documents are added to ImageRight in a timely manner will ensure faster access to work products after an event that prevents access to the office or destroys the paper originals.

Restoring ImageRight in the DR Scenario

Our document imaging system, ImageRight, is one of the key applications that we need to get running as soon as possible after a disaster. We’ve been using the system for over two years now and this is the first time we’ve had a chance to look closely at what would be necessary in a full recovery scenario. I’d been part of the installation and the upgrade of the application, so I had a good idea of how it should be installed. I also had some very general instructions from the ImageRight staff regarding recovery, but no step-by-step instructions.

The database is SQL 2005 and at this point it wasn’t the first SQL restoration in this project, so that went relatively smoothly. We had some trouble restoring the “model” and “msdb” system databases, but our DBA decided those weren’t critical to ImageRight and to let the versions from the clean installation stay.
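For my recovery notes, the database restore itself is simple enough to script once you know where the backup file landed. This is only a sketch – the server name, login, database name, logical file names and paths are all placeholders, and our DBA did the real restores with his own tools:

# Rough sketch of scripting the ImageRight database restore with pyodbc.
# Server, login, logical file names and paths are all placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=test-sql01;UID=restore_admin;PWD=********",
    autocommit=True,  # RESTORE can't run inside a transaction
)
cursor = conn.cursor()
cursor.execute("""
    RESTORE DATABASE ImageRight
    FROM DISK = N'D:\\Restores\\ImageRight_full.bak'
    WITH MOVE N'ImageRight_Data' TO N'D:\\SQLData\\ImageRight.mdf',
         MOVE N'ImageRight_Log'  TO N'D:\\SQLLogs\\ImageRight.ldf',
         REPLACE, STATS = 10
""")
while cursor.nextset():  # consume the progress messages until the restore finishes
    pass
conn.close()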

Once the database was restored, I turned to the application server. A directory known as the “Imagewrt$” share is required as it holds all the installation and configuration files. We don’t have all the same servers available in the lab, so we had to adjust the main configuration file to reflect the new location of this important share. After that, the application installation had several small hurdles that required a little experimentation and research to overcome.

First, the SQL Browser service is required to generate the connection string from the application server to the database. This service isn’t automatically started in the standard SQL installation. Second, the ImageRight Application Service won’t start until it can authenticate its DLL certificates against the http://crl.verisign.net URL. Our lab setup doesn’t have an Internet connection at the moment so this required another small workaround – temporarily changing the IE settings for the service account to not require checking the publisher’s certificate.
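Since a stopped SQL Browser service is easy to overlook, I added a quick check to my recovery notes – a rough sketch is below. (The certificate revocation piece was a manual Internet Options change for the service account, so it isn’t scripted here.)

# Rough sketch: make sure the SQL Browser service is running before starting
# the ImageRight Application Service. "SQLBrowser" is the standard service name.
import subprocess

SERVICE = "SQLBrowser"

query = subprocess.run(["sc", "query", SERVICE], capture_output=True, text=True)
if "RUNNING" not in query.stdout:
    print(SERVICE + " is not running, starting it...")
    subprocess.run(["sc", "config", SERVICE, "start=", "auto"], check=True)  # survive reboots
    subprocess.run(["sc", "start", SERVICE], check=True)
else:
    print(SERVICE + " is already running")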

Once the application service was running, I installed the desktop client software on the machine that will provide remote desktop access to the application. That installed without any issue and the basic functions of searching for and opening image files were tested successfully. We don’t have the disk space available in the lab to restore ALL the images and data, so any images older than when we upgraded to version 4.0 aren’t available for viewing. We’ll have to take note of the growth on a regular basis so that in the event of a real disaster we have a realistic idea of how much disk space is required. This isn’t the first time I’ve run short during this test, so I’m learning my current estimates aren’t accurate enough.
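To keep those estimates honest going forward, even a simple scheduled script that logs the size of the image store will do. A quick sketch, with an example path standing in for our real image store share:

# Rough sketch: log the total size of the image store so we can track growth
# over time. The share path is just an example; schedule it with Task Scheduler.
import os
import csv
from datetime import date

IMAGE_STORE = r"\\prod-ir01\Images"       # example path, not our real share
LOG_FILE = "imagestore_growth.csv"

total_bytes = 0
for dirpath, dirnames, filenames in os.walk(IMAGE_STORE):
    for name in filenames:
        try:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
        except OSError:
            pass  # skip files that vanish or can't be read mid-walk

with open(LOG_FILE, "a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), total_bytes,
                            round(total_bytes / 1024**3, 1)])

print("Image store is currently %.1f GB" % (total_bytes / 1024**3))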

Of course, it hasn’t been fully tested and there are some components I know we are using in production that might or might not be restored initially after a disaster. I’m sure I’ll get a better idea of what else might be needed after we have some staff from other departments connect and do more realistic testing. Overall, I’m pretty impressed with how easy it was to get the basic functionality restored without having to call ImageRight tech support.

Paper vs. Electronic – The Data Double Standard

One of the main enterprise applications I’m partly responsible for administering at work is our document imaging system. Two years have passed since implementation and we still have some areas of the office dragging their feet about scanning their paper. On a daily basis, I still struggle with the one big elephant in the room – the double standard that exists between electronic data and data that is on paper.

The former is the information on our Exchange server, SQL servers, financial systems, file shares and the like. The latter is the boxes and drawers of printed pages – some of which originally started out on one of those servers (or a server that existed in the past) and some of which did not. In the event of a serious disaster, it would be impossible to recreate those paper files. Even if the majority of the documents could be located and reprinted, any single group of employees would be unable to remember everything that existed in a single file, never mind hundreds of boxes or file cabinets. In the case of our office, many of those boxes contain data dating back decades, including handwritten forms and letters.

Like any good company, we have a high level plan that dictates what information systems are critical and the amount of data loss that will be tolerated in the event of an incident. This document makes it clear that our senior management understands the importance of what the servers in the data center contain. Ultimately, this drives our IT department’s regular data backup policies and procedures.

However, IT is the only department required by this plan to ensure the recovery of the data we are custodians of. What extent of data loss is acceptable for the paper data owned by every other department after a fire or earthquake? A year of documents lost? 5 years? 10 years? No one has been held accountable for answering that question, yet most of those same departments won’t accept more than a day’s loss of email.

Granted, a lot of our paper documents are stored off site and only returned to the office when needed, but there are plenty of exceptions. Some staffers don’t trust off site storage and keep their “most important” papers close by. Others in the office will tell you that the five boxes next to their cube aren’t important enough to scan, yet are referenced so often they can’t possibly be returned to storage.

And therein lies the battle we wage daily as custodians of the imaging system: simply getting everyone to understand the value of scanning documents into the system so they are included in our regular backups. Not only are they easier to organize, easier to access, more secure and subject to better audit trails, there is a significant improvement in the chance of survival when that frayed desk lamp cord goes unnoticed.

Disaster Recovery Testing – Epic Fail #1

As I’ve mentioned before, my big project for this month is disaster recovery testing. A few things have changed since our last comprehensive test of our backup practices and we are long overdue. Because of this, I expect many “failures” along the way that will need to be remedied. I expect our network documentation to be lacking, I expect to be missing current versions of software in our disaster kit. I know for a fact that we don’t have detailed recovery instructions for several new enterprise systems. This is why we test – to find and fix these shortcomings.

This week, in the beginning stages of the testing, we encountered our first “failure”. We’ve dubbed it “Epic Failure #1” and it’s all about those backup tapes.

A while back our outside auditor wanted us to password protect our tapes. We were running Symantec Backup Exec 10d at the time and were happy to comply. The password was promptly documented with our other important passwords. Our backup administrator successfully tested restores. Smiles all around.

We faithfully run backups daily. We run assorted restores every month to save lost Word documents, quickly migrate large file structures between servers, and correct data corruption issues. We’ve had good luck with the integrity of our tapes. More smiles.

Earlier this week, I loaded up the first tape I needed to restore in my DR lab. I typed the password to catalog the tape and it told me I had it wrong. I typed it again, because it’s not an easy password and perhaps I had made a mistake. The error message appeared; my smile did not.

After poking through the Backup Exec databases in production and comparing existing XML catalog files from a tape known to work with the password, we concluded that our regular daily backup jobs simply have a different password. Or at least the password hash is completely different, and that difference is repeated across the password-protected backup jobs on all our production backup media servers. Frown.

After testing a series of tapes from different points in time and from different servers, we came to the following disturbing conclusion: the migration of our Backup Exec software from 10d to 12.5, which also required us to install version 11 as part of the upgrade path, mangled the password hashes on the pre-existing job settings. Or it uses a different algorithm, or something similar with the same result.

Any tapes with backup jobs that came from the 10d version of the software use the known password without issue. And any new jobs that are created without a password (since 12.5 doesn’t support media passwords anymore) are also fine. Tapes that have the “mystery password” on them are only readable by a media server that has the tape cataloged already, in this case the server that created it. So while they are useless in a full disaster scenario, they work for any current restorations we need in production. We upgraded Backup Exec just a few months ago, so the overall damage is limited to a specific time frame.

Correcting this issue required our backup administrator to create new jobs without password protection. Backup Exec 12.5 doesn’t support that type of media protection anymore (it was removed in version 11), so there is no obvious way to remove the password from the original jobs. Once we have some fresh, reliable backups off-site, I can continue with the disaster testing. We’ll also have to look into testing the new tape encryption features in the current version of Backup Exec and see if we can use those to meet our audit requirements.

The lesson learned here was that even though the backup tapes were tested after the software upgrade, they should have been tested on a completely different media server. While our “routine” restore tasks showed our tapes had good data, it didn’t prove they would still save us in a severe disaster scenario.

Disaster Recovery – But for Real

This past week I’ve been doing the preliminary work (installing servers mostly) to get ready for our scheduled disaster recovery test. I expect that we’ll learn a lot about our existing plan and systems documentation and will be looking to make some changes that will make any need for a large recovery faster and more effective.

Meanwhile, I’m managing some real disaster recovery, but on a smaller scale. A few weeks ago I posted about the need to upgrade our ImageRight installation to resolve a bug that could cause some data loss. The ImageRight support staff worked hard to run the preliminary discovery/fixing of the image files and database, followed by performing the upgrade.

Not long after, I got an email from someone in another department asking me to “find” the annotations added to an invoice that seemed to have gone missing. She assumed that since some temporary help had worked on the document, a user error had been made and a “copy without annotations” had been introduced. Normally, I could recover the annotations by looking through the deleted pages and at previous versions of those pages.

However, what I found was a bit unexpected. I found a history of changes being made to the document, but no actual annotations visible. Curious.

So I opened a support ticket. After several remote sessions and some research, the ImageRight team was “nearly positive” (they need more testing to confirm) that the process run before our last upgrade to correct the potential data loss actually introduced a different kind of data loss. The result is that the database knows the affected annotations happened, but the physical files that represent the annotated versions had been replaced with non-annotated versions.

We do have the logs from the original process, so it was just a matter of ImageRight Support parsing that data to generate a list of files that were changed. Now we begin the task of recovering those files from tape.

Our Sr. DBA had been working on a side project that loads all our backup catalogs into a database, so we have a comprehensive reference across all backup servers to identify which tapes to recall when people ask for recoveries. That project is proving its worth this time around, since we need to locate and restore over 1,000 files. He also needs to cross-reference them to the individual documents accessible via the desktop client so we can do a visual comparison of any special cases and provide a record of which documents were affected, in a format that’s understandable to everyone else in case additional concerns come up after we repair the damage.
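To give a rough idea of how the catalog database gets used for this: hand it the list of affected files from the ImageRight logs and it tells you which tapes to recall. The table and column names below are made up – the real schema is whatever our DBA built – but the lookup is essentially this simple:

# Rough sketch of the tape lookup against a hypothetical catalog database.
# Table/column names are invented for illustration; ours differ.
import sqlite3

AFFECTED_LIST = "affected_files.txt"   # one file path per line, from the ImageRight logs

conn = sqlite3.connect("backup_catalog.db")
cursor = conn.cursor()

tapes_needed = set()
with open(AFFECTED_LIST) as f:
    for line in f:
        path = line.strip()
        if not path:
            continue
        row = cursor.execute(
            """SELECT tape_label FROM catalog_entries
               WHERE file_path = ?
               ORDER BY backup_date DESC LIMIT 1""",
            (path,),
        ).fetchone()
        if row:
            tapes_needed.add(row[0])
        else:
            print("No backup found for " + path)

print("Tapes to recall from off-site storage:")
for tape in sorted(tapes_needed):
    print("  " + tape)
conn.close()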

Our current plan is to have this resolved by the end of next weekend, but this is something that needs careful handling since we don’t want end users to have any doubt about the integrity of the system – which I will still have total confidence in once we sort this out. Thus, I’m happy to spend the extra time making sure no other issues are introduced.

Plus I need some time to find our DBA some really nice coffee for his efforts.

Dusting off the Disaster Recovery Plan

This week, I started testing our department’s disaster recovery plan. The goal is to use the contents of our existing “disaster recovery box” that we keep off-site combined with our current backup tapes to restore some key parts of our infrastructure.

Success or failure will be measured by what road bumps we encounter and, most importantly, our ability to work around them using only the resources in the box. If I have to go “outside the box” for some critical piece of software or some undocumented configuration detail, it will be a black mark in our preparations that needs to be remedied.

Our testing scenario includes the domain, Exchange, the document imaging system, the financial system, the primary file server and the time card application. We are also going to provide remote access to restored applications so staff from other departments can test out the results and give us feedback on changes that could improve the end-user experience during this type of event. As an added bonus, we’ll be able to try out Server 2008 R2 Remote Desktop Services.

In the last six months we started using VMware ESX to consolidate some of our servers in production, but none of the machines needed for this scenario are virtual yet. I will be doing “classic” restores where the OS has to be installed before restoring our data from backup tapes. However, we are using VMware to host several of the machines in the disaster lab, so I will be able to save time by cloning my first installation of Windows Server a few extra times before installing specific applications.

Depending on how this project goes, I’d like to see us take more advantage of virtualization within our disaster recovery planning and maybe start looking into backup solutions that are easier and faster than tape.