Happy 10th Anniversary to Active Directory

Time sure flies when you are busy keeping up with Active Directory, which has been around since its release on February 17, 2000 with Windows 2000 Server.

I remember the first time I was part of an upgrade from NT 4.0 to Windows 2000 Active Directory. I was the sole IT person in the branch office and was working on a project to upgrade my branch office to be a child domain in the headquarters’ “new” Windows 2000 Active Directory forest.

The NT 4.0 PDC in my office had a DNS suffix defined in the network settings, which, unknown to us at the time, caused my domain to end up with a disjointed namespace. Once we realized we had an issue, I got to be part of my first upgrade and my first rollback – all in the same evening.

Because I had taken my backup domain controller offline, it was pretty easy for me to bring NT 4.0 back to life. It was far more work for my colleagues at headquarters, who had to call support services for details on using NTDSUTIL to remove the remnants of the child domain controller from the AD forest so we could perform the upgrade again.

Several years and several domain controller demotions later, I’m quite comfortable with the process I first saw happen back in that little closet of a server room. Active Directory, it’s certainly been a fun 10 years!

Don’t Overlook the Metadata Cleanup

It seems inevitable that while restoring Active Directory in a disaster recovery scenario, one is going to feel rushed. Even with this being a test environment, I felt like getting AD back was something that needed to be quick so we could move on to the more user-facing applications, like Exchange.

My network has two Active Directory domains, a parent and a child domain in a single forest. The design is no longer appropriate for how our company is organized, and we’ve been slowly working to migrate servers and services to the root domain. Right now, we are down to the last 3 servers in our child domain and one remaining service account. The end is in sight, but I digress.

The scope of our disaster recovery test does not involve restoring that child domain. This is becoming an interesting exercise, because it will force us to address how to get those few services that reside in that domain working in the DR lab. This will also help us when we plan the process for moving those services in production.

Bringing back a domain controller for my root domain went by the book. I could explain away all of the random error messages: they were all related to this domain controller being unable to replicate to other DCs, which hadn’t been restored. I had recovered the DC that held the majority of the FSMO roles and seized the others. I started moving on to other tasks, but I couldn’t get past the errors about this domain controller being unable to find a global catalog. All the domain controllers in our infrastructure are global catalogs, including this one, as I hadn’t made a change to the NTDS settings once it was restored.

So I took the “tickle it” approach and unchecked/rechecked the Global Catalog option. The newly restored DC successfully relinquished its GC role and then refused to complete the process to regain the role again. It was determined to verify this status with the other domain controllers it knew about, but couldn’t contact.

I knew that for this exercise, I wasn’t bringing back any other domain controllers. And in reality, even if I was going to need additional DCs, it was far easier (and less error-prone) to just promote new machines than to bother restoring every DC in our infrastructure from tape. (However, I still back up all my domain controllers, just to be prepared.)

To solve the issue, I turned to metadata cleanup. Using NTDSUTIL, I removed the references to the other DC for the root domain, the DC for the child domain, and finally the lingering and now orphaned child domain itself. I also had to go into “AD Domains and Trusts” to delete the trust to the child domain, which wasn’t removed when the metadata was deleted. Once all these references were removed, the domain controller was able to assume the global catalog role successfully and I could comfortably move on to restoring our Exchange server.
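For anyone who hasn’t walked through a metadata cleanup before, here is a rough sketch of the order of operations, written as a small Python helper that just prints a checklist. The commands are the ones typed at the NTDSUTIL prompt as I remember them from Microsoft’s documented procedure; the server name RESTOREDDC and the `<number>` placeholders are my own assumptions for illustration, and the real numbers come from the `list` output on your own system. Treat it as a memory aid, not a script to run blindly, and check the current Microsoft documentation for your Windows version first.

```python
from typing import List


def _connect(connect_to: str) -> List[str]:
    """Commands that open the metadata cleanup menu and bind to a live DC."""
    return [
        "metadata cleanup",
        "connections",
        f"connect to server {connect_to}",
        "quit",
    ]


def server_cleanup_commands(connect_to: str) -> List[str]:
    """Checklist for removing one dead domain controller's metadata."""
    return _connect(connect_to) + [
        "select operation target",
        "list domains",
        "select domain <number>",   # number comes from the 'list domains' output
        "list sites",
        "select site <number>",
        "list servers in site",
        "select server <number>",   # the DC that is never coming back
        "quit",
        "remove selected server",
    ]


def domain_cleanup_commands(connect_to: str) -> List[str]:
    """Checklist for removing an orphaned domain once its DCs are gone."""
    return _connect(connect_to) + [
        "select operation target",
        "list domains",
        "select domain <number>",   # the orphaned child domain
        "quit",
        "remove selected domain",
    ]


# RESTOREDDC is a hypothetical name for the one DC that was restored.
server_steps = server_cleanup_commands("RESTOREDDC")
domain_steps = domain_cleanup_commands("RESTOREDDC")

print("Repeat for each DC that was not restored:")
for step in server_steps:
    print("  " + step)
print("Then, once no DCs remain for the child domain:")
for step in domain_steps:
    print("  " + step)
```

The ordering is the point: the dead DCs go first, then the domain they belonged to, and the leftover trust still has to be deleted by hand afterwards.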

And I’ve learned that just because I can explain an error, doesn’t mean I can ignore it.

AD Recycle Bin – New in Server 2008 R2

This week I continued with disaster recovery testing in our lab, the first machine restored from tape being one of our domain controllers. While checking over the health of the restored Windows 2003 Active Directory, I remembered that we are using a third-party tool in production to aid in the recovery of deleted items – Quest’s Active Directory Recovery Manager. To be honest, we haven’t had a reason to use the software since we installed it, which I suppose is a good thing. But it is a stress reliever to know that it’s there for us.

Restoring this product in our test lab isn’t part of the scope of this project, but it does have me looking forward to planning our Active Directory migration to Server 2008 R2, which includes a new, native “recycle bin” feature for deleted Active Directory objects. You can find more details about how this feature works in Ned Pyle’s post on the Ask the Directory Services Team blog, The AD Recycle Bin: Understanding, Implementing, Best Practices, and Troubleshooting.

While the native feature doesn’t have the ease of a GUI and requires your entire forest to be at the 2008 R2 functional level, it’s certainly worth becoming familiar with. Once I’m done with all this disaster testing, you can be sure this feature will be at the top of my list to test out when I’m planning that upgrade.

Check Out TechNet Events

Today I enjoyed a morning at the Microsoft office in SF attending an event in the current series of TechNet Events. Through the months of September and October, the TechNet Events team is traveling around the US providing tips, solutions and discussion about using Windows 7 and Server 2008 R2.

Today’s presentation was given by Chris Henley, who led some lively and informative discussions on three topics – Tools for migration from Windows XP to Windows 7, Securing Windows 7 in a Server 2008 R2 Environment (with BitLocker, NAP and DirectAccess) and new features in Directory Services.

I was excited to see specific information on Active Directory. If you missed the blogs about Active Directory Administrative Center back in January like I did, you’ll like some of the new features in this 2008 R2 tool, including the ability to connect to multiple domains and improved navigation views.

If there isn’t an event near you this time around, check back after the holidays when they’ll head out again for another series.

Stumbling Over AD Integration in ImageRight

Friday night, I was responsible for a maintenance upgrade to our document imaging system, ImageRight. This upgrade was required to repair a potentially serious data corruption issue that was discovered by the vendor. We weren’t affected by the corruption at the time it was discovered but some functionality had been disabled as a work-around, so we had to schedule time to perform the fix.

First off, let me say that I really like the vendor and I like how the product works overall. However, we always seem to be the client who has issues that the vendor never seems to have encountered before. It was almost refreshing when they called about the corruption issue and it wasn’t something we’d found first.

Friday morning, I had exchanged a few emails with the vendor support tech who was going to do the upgrade to firm up the planned roll-back procedures (for our change management documents) and to clarify any last minute items. He mentioned a known bug related to environments with ImageRight users that spanned multiple Active Directory domains, fondly referred to as the “AD dual domain bug” and how the upgrade shouldn’t be performed if we had an environment with those characteristics.

Yes, we have two domains. But no, we don’t have accounts that are used by ImageRight from the second domain. We confirmed those details and I mentioned in one of my reply emails that the AD bug had me a bit worried anyway. I was told my environment wasn’t going to be an issue based on their testing. (Okie–dookie then.)

So away we went with the upgrade. That was the easy part. Then came the testing.

Exactly half the program worked – literally half. The program launches two windows when it starts – one window acts as the file manager, for searching and loading image files, and the other is the image viewer. I could see and use the file manager portion, but the viewer never loaded, only returning an error that was cryptic overall but referenced Active Directory about 15 times.

Um, yeah.

So they uninstalled and reinstalled just to make sure some random DLL wasn’t left behind or something. But that didn’t solve the problem. I chilled out on hold for a while. It was already after 7pm here on the west coast, so I felt a little bad for the tech on the east coast. “Are we sure this isn’t a different manifestation of the AD bug?” I asked.

I chilled out on hold for a while longer while the support tech consulted some developers on his cell phone. No, it shouldn’t be the AD bug, since we were getting an error “too soon” in the loading of the program, but there was a hotfix for that bug that should be released on Monday. A developer was working on getting a copy for us now if we wanted to try that.

Why not? It’s already broken, might as well toss one more thing at it before we roll everything back. Sure enough the hotfix did the trick, avoiding a roll-back and saving me another late night at the office. I wasn’t surprised that it was yet another instance of something no other client they have has ever experienced.

I’m not sure how I feel about always being the “one-off” case, but it always seems to work out fine in the end. Though I’m thinking of framing my “I’m a bit worried about the AD bug” email that I sent out before we even began.

NTDS Error 2103

This week one of my domain controllers developed a curious problem. I don’t like curious problems, especially ones that rear their heads after the server reboots.

The error was an NTDS General event 2103, which indicates that the AD database “was restored using an unsupported procedure and Net Logon service has been paused”. Some research turned up KB Article 875495, which lists event 2103 and 3 other events related to a condition known as USN Rollback.

This DC is running Windows 2003 SP2, so based on the article, I should be seeing at least the more serious NTDS Replication 2095 event as well, due to a hotfix in SP1 that made the error logging somewhat more verbose. But I’m not. This makes it more curious. Am I in a rollback state or not?

KB 875495 also lists some possible causes of this state, some of which are possible in a virtual environment – the case for this DC. It points me to another KB Article, 888794, which lists out a bunch of considerations for hosting DCs as VMs. However, our environment met all the requirements, including one related to write caching on disks, as our host machine has battery-backed disk caching. So I ruled out that we actively caused a potential rollback.

Repadmin has a switch (/showutdvec) that can be used to determine USN status by displaying the up-to-dateness vector USN for all DCs that replicate a common naming context. If the direct replication partners have a higher USN for the DC in question than that DC has for itself, that’s considered evidence of a USN rollback. My DC did not have this problem, as it had a USN higher than its partners. So at this point I couldn’t confirm or deny a true USN rollback issue; however, it seemed the DC “thought” it was having this problem. Maybe I could figure out why the DC was in this limbo.
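The comparison itself is simple enough to sketch in a few lines of Python. This is only an illustration of the rule described above, not a tool: the USN values would be read by hand from the /showutdvec output on each DC, and the partner name CHILDDC02 and all of the numbers below are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RollbackFinding:
    partner: str       # replication partner that holds the higher value
    partner_usn: int   # USN the partner has recorded for the suspect DC
    local_usn: int     # USN the suspect DC reports for itself


def find_usn_rollback_evidence(
    local_usn: int, partner_views: Dict[str, int]
) -> List[RollbackFinding]:
    """Return every partner whose recorded USN for this DC is higher than
    the USN the DC has for itself. An empty list means no evidence found."""
    findings = []
    for partner, usn in sorted(partner_views.items()):
        if usn > local_usn:
            findings.append(RollbackFinding(partner, usn, local_usn))
    return findings


# Hypothetical values: the DC is ahead of its partner, as mine was.
healthy = find_usn_rollback_evidence(120500, {"CHILDDC02": 120350})

# Hypothetical values: the partner has seen a higher USN than the DC
# now reports, which is what a rollback would look like.
suspect = find_usn_rollback_evidence(98000, {"CHILDDC02": 120350})

print("healthy case findings:", healthy)
print("suspect case findings:", suspect)
```

An empty result only says the USNs show no evidence of rollback. It does not explain why the DC logged event 2103 in the first place, which is exactly the limbo I was in.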

So I returned to the original article to look for specific causes. One line reads, “Starting an AD domain controller whose AD database file was restored (copied) into place by using an imaging program such as Norton Ghost.”

Thinking back, the conversion of this DC from physical to virtual did not go as smoothly as I would have hoped. I remembered I had to resolve some issue where I was getting an error in the logs related to the directory database file not being where the OS expected it, even though the path on the server hadn’t changed during the conversion. It was odd at the time, but the posted fix seemed to clear the issue and I’d moved on.

I’m guessing that perhaps that was the start of my issues – maybe the P2V process made the OS think the database was a different copy even though it wasn’t. The result was that the server thought it was rolled back, but the USNs never reflected a problem. So I decided it was better to be safe than sorry and assume this “limbo” condition was not how I wanted to leave things.

The resolution for USN rollback is a forced removal of the domain controller from AD. Since this is a DC in a child domain that’s being phased out, very few changes happen to that domain, so I wasn’t concerned about possibly losing changes that may have been made on that DC. It was only the FSMO holder for one role, which was easily seized by the other DC.

My decision now is between bringing up a replacement DC for this domain next week or just running one DC for the time being and trying to speed up the remaining tasks that need to be done before we can remove the child domain altogether.

But that’s for another day!

Immediately = 15 Minutes

Yesterday: One of my office domain controllers, ROOTDC01, failed. Not so much that things stopped working when it failed, but it left us open to serious downtime if its partner, ROOTDC02, failed before we had replaced the first one. I decided that it didn’t make sense to bring a replacement Windows 2000 domain controller in, only to proceed with our planned domain controller upgrade project in about 4 weeks. It only made extra work. This was the (sort of) perfect opportunity to bring a shiny new DC, already running Windows 2003, into the organization. And it would also force me to finally “walk the walk” after quite a few months of “talk” (and testing!).

Earlier Today: This evening, a co-worker and I started on upgrading the schema in our organization to support this shiny new DC. This process, which happens on ROOTDC02 (the remaining DC), is relatively simple on paper and successful 99.9% of the time. But it could do major damage the other 0.1%. And since I didn’t have a 2nd DC to act as a backup, a screw-up could leave me doing a lot of disaster recovery. For many many hours.

All I really had to do was follow the step-by-step directions that I prepared for myself during the testing phases. And then, of course, second guess my directions. Wring my hands, close my eyes tightly and pace around while things were happening. And in some cases, when Microsoft documentation says “immediately” they really mean “give it 15 minutes to stew a bit.” This is when most of the pacing happens. And rapid refreshing of my replication monitor application.

Now: Everything seems to have gone nicely. No system errors that weren’t expected. No crashes, no blips. It’s only the year 2006 and I’ve finally gotten around to getting our systems up to 2003.

Monday: Bring in ROOTDC03, the new partner for ROOTDC02. We are still in a touchy spot over the weekend – but I think we’ll be fine. Once that new server is running, I can start upgrading the other DC to Windows 2003… might even finish the whole project before my deadline.