Virtualized Domain Controllers? I’ll pass, thanks.

There are a lot of good arguments for virtualizing DCs. You should have several of them for redundancy, but depending on the number of employees and the general workload, DCs tend to be underutilized, and it can be hard to justify a whole physical server for each one. But after losing a second domain controller to what was essentially basic VM maintenance, I’m not sold.

You may remember a previous post of mine from the summer of 2009 about NTDS Error 2103, when the DCs in a small child domain were virtualized. I had agreed to virtualize both DCs from that domain, as the domain was not supporting any user accounts and had fewer than half a dozen servers as members. One did not convert well, and after weighing the risks we decided to leave the remaining DC as the sole one standing for that domain. There are several “rules” to follow when virtualizing DCs, particularly not restoring snapshots of them and not putting yourself in a situation where your VM host machine needs to authenticate to DCs that can’t start up until the host has authenticated.

Fast forward about 16 months, to now. Our system administrator who handles the majority of our ESX management was migrating many of our VMs to our newly installed SAN. He reported that he shut down the DC normally, moved the VM and then started it back up a few hours later after all the server files had been copied over. The few servers that use that DC were working properly and everything looked good.

But alas, a few weeks later, the server reported a USN rollback condition. The replication and Netlogon services stopped. I checked the logs to see if I could figure out the cause, but only saw things that added to the confusion. The DC was mysteriously missing logs from between the time of the VM relocation and the time of the NTDS error. And the forest root domain controllers had logs indicating that this DC had been silent for nearly two weeks. At this point, I can only speculate about what went wrong.
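For anyone unfamiliar with the term, here is a rough sketch of the idea behind a USN rollback check. It is a simplified model, not how Active Directory actually implements it: a replication partner remembers the highest update sequence number (USN) it has seen from a DC, and if that DC later reports a lower USN under the same invocation ID, its database has gone back in time. The class, function, IDs and numbers below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DcState:
    """What a replication partner knows about a DC (simplified, hypothetical)."""
    invocation_id: str          # identifies one instance of the DC's database
    highest_committed_usn: int  # highest update sequence number reported


def is_usn_rollback(last_seen: DcState, reported: DcState) -> bool:
    """True when a DC reports a lower USN than its partner already recorded
    for the same invocation ID."""
    same_database_instance = last_seen.invocation_id == reported.invocation_id
    went_backwards = reported.highest_committed_usn < last_seen.highest_committed_usn
    return same_database_instance and went_backwards


# Made-up values for illustration only.
before_move = DcState("hypothetical-id-a", 120_500)
after_move_stale = DcState("hypothetical-id-a", 118_200)      # old disk state, same ID
after_proper_restore = DcState("hypothetical-id-b", 118_200)  # supported restore resets the ID

print(is_usn_rollback(before_move, after_move_stale))
print(is_usn_rollback(before_move, after_proper_restore))
```

The second case is why a supported restore is safe while a snapshot rollback is not: a new invocation ID tells the partners to treat the database as a fresh instance instead of trusting their old high-water mark.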

We slapped a bandage on the server by restarting netlogon so those few servers could authenticate, but without replication happening properly, the server will simply choke up again. And after the tombstone lifetime passes, the forest domain will consider it a lost cause. It’s essentially a zombie.
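To put a date on that deadline, the arithmetic is simple. This is only a sketch: the function name and dates are made up, and the tombstone lifetime must come from your own forest's configuration. Commonly cited defaults are 60 or 180 days depending on how the forest was originally built, so don't assume either one.

```python
from datetime import date, timedelta


def past_tombstone_lifetime(last_replication: date, today: date,
                            tombstone_days: int) -> bool:
    """True once a DC has gone without replicating for longer than the
    forest's tombstone lifetime."""
    return today - last_replication > timedelta(days=tombstone_days)


# Hypothetical dates; 60 days is an assumed tombstone lifetime.
last_good = date(2010, 11, 1)
print(past_tombstone_lifetime(last_good, date(2010, 12, 15), 60))
print(past_tombstone_lifetime(last_good, date(2011, 1, 15), 60))
```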

So begin our final steps to decommission that child domain. I have no interest in restoring that domain from backup, since removing it has been an operations project that has been bumped for a long time. Now our hand has been forced, and the plan is simple: change a couple of service accounts, join two servers to the forest root domain, and then NTDSUTIL that DC into nothingness.

As for our two forest root domain controllers? I’ll throw my body in front of their metal cases for a long time to come.