What’s in Your Runbook?

At least once a year, the time comes to re-address the documentation around the IT department regarding disaster recovery. One of the things I’ve been working on improving over the last two years is our network runbook. We keep a copy of this binder in two places – in our document management system (which can be exported to a CD) and in hard copy, because when systems are down the last thing you want to be unable to access is the documentation about how to make things work again. 

Here’s a rundown of what I have in mine so far. It’s in 10 sections:

  1. Runbook Summary – A list of all servers with their IP address, main purpose, a list of notable applications running on each and which are virtual or not. I also include a list of which servers are running which operating system, a list of key databases on servers and finally copies of some of our important passwords.
  2. Enterprise AD – A listing of all corporate domains and which servers perform what roles. I include all IP information for each server, the partitions and volumes on each and where the AD database is stored. Functional levels for the domain and forest are also documented.
  3. Primary Servers and Functions – This is similar to the Enterprise AD section, but it’s for all non-domain controllers. I list out server information for file services, database servers and their applications and backup servers. I document shares, partition and volume information (including the size), important services that should be running and where to find copies of installation media.
  4. ImageRight – Our document management system deserves its own section. In addition to the items similar to the servers in the previous section, I also include some basic recovery steps, dependencies and the boot sequence of the servers and services. Any other information for regular maintenance or activities on this system is also included here.
  5. Email / Exchange – This is another key system that deserves its own section in my office. I include all server details (like above) and also completely list out every configuration setting in Exchange 2003. This will be less of an issue with Exchange 2007 or 2010 where more of the configuration information is stored in Active Directory. However, it makes me feel better to have it written down. I also include documentation related to our third-party spam firewall and other servers related to email support.
  6. Backup Details – A listing of each backup server, what jobs it manages and what data each of those jobs captures.
  7. Telecommunications – Details about the servers and key services. I also include information regarding our auto attendants, menu trees and software keys.
  8. Networking – Maps and diagrams for VLANs, static IP address assignments and external IP addresses.
  9. Contacts & Support – Internal and external support numbers. Also include circuit numbers and other important identifying information.
  10. Disaster Recovery – Information about the location of our disaster recovery kit, hot line and website. A list of the contents of our disaster kit and knowledge base articles related to some of our DR tasks and hard copies of all our disaster recovery steps.
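
If you keep the Runbook Summary in a structured form, a printable copy is easy to regenerate whenever a server changes. Here is a minimal Python sketch of that idea. The server names, addresses and field names are made up for illustration and are not taken from my actual runbook.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class ServerEntry:
    # Hypothetical fields mirroring the Runbook Summary section.
    name: str
    ip_address: str
    purpose: str
    operating_system: str
    is_virtual: bool


def summary_csv(servers):
    """Return the server list as CSV text, sorted by name, ready to print."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ServerEntry)])
    writer.writeheader()
    for server in sorted(servers, key=lambda s: s.name):
        writer.writerow(asdict(server))
    return buffer.getvalue()


# Example data only; replace with your own inventory.
servers = [
    ServerEntry("mail01", "10.0.0.20", "Exchange", "Windows Server 2003", False),
    ServerEntry("dc01", "10.0.0.10", "Domain controller", "Windows Server 2003", True),
]

print(summary_csv(servers))
```

The CSV output can be opened in a spreadsheet and printed for the hard-copy binder, which keeps the paper and electronic copies in step.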

This binder is always in flux – I’m always adding and changing information and making notes, as well as trying to keep up with changes that other team members are making to the systems they work with most.  It will never be “done” but I’m hoping that whenever I have to reach for it, it will be good enough.

Goodbye Live Communications Server 2005

If you happen to be a regular reader of Techbunny.com, you probably know that while I’m a big user of Microsoft products, I’m still happy to remove a MS product when something from a 3rd party will meet my needs. 

In this case, it was Live Communications Server 2005 that took the hit.  We have very few users that regularly “instant message” within the office and with our recent Shoretel upgrade, the conference bridge included basic IM services that could be integrated within our VoIP desktop software.  This would reduce the need for us to manage another server VM and free up those resources for other purposes.

I was concerned that removing LCS would be a chore, but it turns out it was quite easy with less than a dozen steps.  Find them here in TechNet.  I also love the great post-removal report that was generated, as I was able to add that to my change control documentation.

While the upcoming version of Microsoft Unified Communications looks like it will have some great collaboration features, sometimes it’s easier to just go with something you might already have handy through a third party, especially if you don’t need a lot of bells and whistles.

Network Clean Up: Don’t Forget About Your LAN

“The network is slow.” 
Probably the worst complaint a Systems Administration team at any small to mid-sized office can get.  The end users often can’t pinpoint what “slow” is or when it happens, it’s seemingly random, or they report it after the fact when there is nothing to actively troubleshoot.
I am not a networking “guru” by any stretch of the imagination. Like many small offices, our NetOps team consists of several people who may have some areas they enjoy or “specialize” in, but are mostly jacks-of-all-trades, ready to jump in and sort things out whenever things need attention.  I enjoy the variety, but sometimes the ongoing project list leaves you in a situation where certain areas of your “kingdom” are left until they cry out in pain.
The LAN in my office was one of those lost souls.  Sure, I’ve got my Network+ training, I used to have a valid CCNA certification, I know the difference between a hub and a switch and I can find enough of the settings in my HP and Cisco switches to assign IP addresses for management access and use some basic features.  And then my skill set drops off there – because small networks are often “set it and forget it”.   
We think about collecting SNMP logs and monitoring traffic and all that cool stuff and then reality sets in: I wish I had the time to spend installing and learning enough about those tools so they can be really useful when someone comes knocking with a “slowness” complaint.  But I don’t.  So finally I brought in someone who actually looks at networks every day. Someone who knows the settings on network gear and can look at how they work together.  Yes, I can pull out some crossover cables and make packets move from point A to point B, but I wanted some advice from someone who really understood how it all worked.
It was eye-opening.  The switches that linked the users’ workstations to our servers were all connected, but they were naturally oversubscribed, and we weren’t taking advantage of trunking any of the ports together to pass traffic to the core switch over a larger pipe. Spanning tree was configured incorrectly on some switches and not turned on at all on others.
The end result was that while my Layer 3 setup looked fine to me, the Layer 2 traffic was actually taking an extra hop through a switch that was accidentally acting as the spanning tree root, adding unnecessary delay.  After correcting that issue and ordering up some gig modules to add trunking up to our core switch, upload/download speeds of files to servers appear to be coming close to the maximum available from the desktops.
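
To put a number on “oversubscribed,” compare the total bandwidth of the access ports with the uplink they share. Here is a rough Python sketch. The port counts and speeds are illustrative assumptions (a 24-port 100 Mbps switch with a single 100 Mbps uplink, later replaced by two trunked gigabit ports), not the actual figures from my network.

```python
def oversubscription_ratio(access_ports, access_mbps, uplink_ports, uplink_mbps):
    """Ratio of total access-port bandwidth to total uplink bandwidth."""
    return (access_ports * access_mbps) / (uplink_ports * uplink_mbps)


# Assumed example: 24 desktops at 100 Mbps sharing one 100 Mbps uplink.
before = oversubscription_ratio(24, 100, 1, 100)

# Same desktops after trunking two gigabit ports up to the core switch.
after = oversubscription_ratio(24, 100, 2, 1000)

print(f"before: {before:.1f}:1, after: {after:.1f}:1")
# prints "before: 24.0:1, after: 1.2:1"
```

A ratio near 1:1 means the uplink can carry nearly everything the desktops can send at once. Some oversubscription is normal on access switches, since not every workstation is busy at the same moment.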

Next up – increasing the speed of our internet connection by switching from frame relay to fiber from our ISP and subscribing to a bigger pipe on that end.

Don’t Forget: Today is SysAdmin Appreciation Day!

System Administrator Appreciation Day is celebrated all day on the last Friday in July, so it’s not too late for you to show your appreciation for your beloved Systems Administrator, Help Desk Tech, Network Guru, or even that person in your office who’s not “officially” a sysadmin but helps you out of a jam with your computer anyway.

I’m not going to tell you what the best gift is, but even a little gift card for coffee or lunch can go a long way.  Better yet, invite them to grab that snack face to face.  Believe it or not, sysadmins like to escape the office from time to time too!

A Shoretel Upgrade Hiccup, plus Why I Love Our DBAs

A few weeks ago, I posted about our Shoretel upgrade from version 6.1 to 10.1. Overall, the upgrade was smooth and included an upgrade of the conference bridge hardware and software to version 7. However, there was one little post-upgrade problem. I was unable to view or edit the user configuration for a subset of my users using the Shoretel Director web portal. A “data undefined” error would display in my browser and then once that box was cleared, the word undefined appeared in one of the data fields for the user. All other fields were blank and I couldn’t perform any actions like delete, save or reset.

After performing a database repair with our VAR, a ticket was opened with Shoretel directly. A Shoretel engineer looked at the issue, took copies of our database and log history from the upgrade and we were left to wait for a resolution of some sort. The users in question had fully functional phones and voicemail, as well as any other feature they had before the upgrade. Outside of a slowly growing list of tweaks I couldn’t make to those users, the system was perfectly stable.

Because the users had fully functional services, I doubted we were up against any major database corruption. While one could argue that we did an extensive upgrade in one evening (6.1 to 7.5 to 8.5 to 10.1) we didn’t deviate from the standard upgrade process that one could have done over time. While waiting for Shoretel to respond to the escalated ticket, our senior in-house DBA came across some free time and was able to take a look at the MySQL database himself.

The list of affected users spanned departments and had very little in common outright. However, I suspected they had some common component enabled and those settings were causing the new version of the Shoretel Director web portal to choke when loading the information. I’ve noticed that some fields that weren’t required in the past (like Last Name) are now required, so I was hoping it was something along those lines.

I provided my list and my hunch to our DBA who started sorting and running queries on our users table to see what could possibly be mucking up the system. It wasn’t long before he found the culprit – the password hash for the conference bridge for those users in question. For the majority of the users of the conference bridge, I used the same, relatively simple password for every person when setting up their bridge access for the first time. The stored hash for that password, as well as one other password that was used more than once in the system, was causing the problem. Our DBA nulled out the passwords and the user settings were then accessible.
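
The kind of query that surfaces a pattern like this is a simple grouping on the suspect column. Here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for the real MySQL database, and the table name, column names and sample rows are hypothetical. I don’t know what the actual Shoretel schema looks like or exactly which queries our DBA ran.

```python
import sqlite3

# In-memory stand-in database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, bridge_password_hash TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("alice", "hash_a"),
        ("bob", "hash_a"),    # same stored hash as alice
        ("carol", "hash_b"),
        ("dave", None),       # no bridge password set
    ],
)

# Find every stored hash that is shared by more than one user.
duplicates = conn.execute(
    """
    SELECT bridge_password_hash, COUNT(*) AS user_count
    FROM users
    WHERE bridge_password_hash IS NOT NULL
    GROUP BY bridge_password_hash
    HAVING COUNT(*) > 1
    ORDER BY user_count DESC
    """
).fetchall()

print(duplicates)  # [('hash_a', 2)]
```

The same GROUP BY and HAVING clauses work in MySQL. Comparing the usernames behind each duplicated value against the list of affected users is what turns a hunch into a culprit.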

We aren’t sure if it was those two particular passwords or the fact that they were duplicated that was the issue, but we did learn that sometimes knowing your data is more important than anything a vendor could do for you. Because we were familiar with our users, our DBA was able to look for patterns that made sense to us. Our ticket has been with Shoretel for several weeks – it was likely they were looking for a programmatic issue of some kind, because the database was technically sound. Not sure how long it would have taken if our DBA hadn’t had time for a side project.

As a systems administrator, I like to think I can troubleshoot most issues. But database management is an area I don’t spend a lot of time in and I’m thankful for having a great DBA resource sitting nearby. Sometimes being good at your job means recognizing those that do their job well too and making sure they know you wouldn’t be nearly as good without them.

The Observer Effect at the Helpdesk

In quantum physics there is a phenomenon known as the Observer effect, and psychology has a close cousin in the Hawthorne effect. Both refer to the changes that the act of observing makes on whatever is being observed. This effect is a regular occurrence when working in system administration, particularly on the help desk.

You don’t have to work in IT very long before someone will tell you that their computer started working “just fine” when you showed up at their desk. This is the Observer effect in action.

When a support person appears to troubleshoot an issue and asks an end user to recreate the problem, the problem will not occur. This is most often because the end user is now paying attention to what they are doing and isn’t making the same mistake they were making before.

This effect manifests itself in the opposite way as well. I’ve gotten the occasional call stating that an end user “was doing something all morning, but now it’s not working.” When asked to recreate the problem, it usually becomes apparent that the person has suddenly begun to pay too much attention to the steps they are taking – thinking too much about them and stopping too soon in the chain of mouse clicks or key strokes to finish the action. Instead of being observed by an outside party, the user has suddenly become the observer themselves and changed how they perceive what they are doing on the computer.

Either condition results in a help desk ticket; the observer effect either causes the problem or helps to solve it.

Microsoft Support – Look Again

I have to admit the first place I go to for answers to problems with Microsoft products is Google. Years ago, I learned that I was more likely to get my answer starting outside of the Microsoft Support web pages. In many cases, I’d even find knowledge base articles faster when searching the whole Internet vs. starting directly in the knowledge base portal itself. That fact alone has kept me from starting out at “support.microsoft.com” for a long time. Old habits die hard.

But I’ve been giving Microsoft Support a second look lately and it’s improved over the years.

One of the areas you should check out when supporting home or office users is the Solution Centers, which tailor content to the OS or application you select. Depending on your selection, you might find options to access Microsoft Fix it, which can lead you to some automatic diagnostics and solutions. There are automated solutions for XP, Vista, Internet Explorer, Windows Media Player and others. Windows 7 has a lot of the automated diagnostic features built in and the Fix it web page provides alternate instructions for accessing those tools.

Another area to check out is the Microsoft Answers forum, which is geared toward more consumer-level Q&A on desktop operating systems, Office products, Windows Live and Security Essentials.

Finally, if you seek more support information for enterprise applications and Windows Server, TechNet is the place to be. Check out Keith Combs’s recent post about improvements in TechNet Search. Don’t forget about the TechNet Forums and Community areas too – lots of great blogs and other resources are there, like the Fix it Blog that posts regular additions to the Fix it solutions, especially for more of the server products.

Happy Help-desking!

She’s Geeky Conference: Days 2 & 3

This weekend I enjoyed some more great sessions at the She’s Geeky unConference. Not only was this event filled with a collection of fantastic women with a variety of tech interests that I can’t even begin to list, it was a great opportunity to learn new tips and tricks for soft skills that aren’t always high on the “geekdom” list! Practicing the “elevator pitch”, improving your speaking skills and discussing how to manage transition as tech roles evolve were some of the sessions on the agenda wall today. The notes for the sessions will be posted to the She’s Geeky Wiki over the next few days and I’ll post the links to the sessions I enjoyed most when they are available.
The one thing that seemed to be missing from the weekend was other system administrators. I was excited to enjoy the experience with Jessica DeVita, the owner of UberGeekGirl, but it was a little hard to believe that out of approximately 300 registered attendees, less than 1% identified themselves as server or desktop administrators. Those that even hinted they might have done it previously didn’t even utter the word “Windows”.
Is there something about this particular area of tech that makes it even less appealing for women? Maybe that will have to be a session topic when I attend next year.

Two Links from my last 24 Hours

It’s been a busy last few days, but I don’t want to forget a couple of links that have been useful recently.

The first comes from @nelz9999, who shared a link about managing geeks in the corporate environment. A co-worker happened upon the second as we were troubleshooting a BlackBerry trackball that wasn’t working properly. This is how you get those little things clean, but be careful when dealing with those tiny magnetic rollers.

Download the Employee Separation Checklist

I heard from quite a few people about how useful my post about Employee Separations was. As a bonus, I put together a document that breaks out the items into a checklist that you can edit to meet the needs of your environment. No matter how often you are removing user accounts or performing some other similar task, a checklist helps ensure you don’t miss anything in case your work is questioned at a later time.