How to configure your virtual Domain Controllers and avoid simple mistakes with resulting big problems

So You went ahead and used virtualized Domain Controllers for Your Active Directory domain, congratulations! I am sure You will be happy with the decision. As long as You have a decent virtualization environment, this will give You peace of mind, faster recovery and cheaper redundancy.

There are, however, some special considerations You must make when You are using virtual Domain Controllers. Not to mention: please, with sugar on top, do NOT P2V/convert Your physical Domain Controllers to virtual without at least reading this article!

What areas do we need to consider on a virtual DC?

  • Time synchronization
  • Disk cache
  • Suspend/pausing virtual machine
  • Snapshots and System State backups
  • Performance

Personally I much prefer virtual Domain Controllers, having come from a lot of physical ones, but there are some considerations to be made: whether to leave some DCs physical, which features to use on the virtual ones, and which settings to use. This article attempts to uncover some of the points to consider, specifically for virtual DCs. The list is in no way meant to be complete; it is mostly the things that I personally have seen forgotten in environments I have encountered. Add Your own preferences and research to this and You should be well on Your way to living happily ever after with Your virtual DCs.

 

Let’s begin with Time Synchronization of Virtual Domain Controllers

Time in an Active Directory environment is paramount to all authentication and secure communication, for Domain Controllers, servers and clients alike. In an Active Directory environment, Kerberos is used to issue a ticket during logon. This ticket is valid for 10 hours by default and prevents constant authentication against Domain Controllers every time a user accesses resources. Instead, the Kerberos ticket is presented and verified throughout the forest. However, the encryption and security between the client and the Domain Controller issuing the ticket relies on keys derived from passwords and on setting up a secure channel. To prevent anyone from listening on the network and replaying earlier authentication packets from the client, all packets include a timestamp. If the timestamp coming from the client differs by more than the default 5 minutes from the Domain Controller’s time, the Domain Controller will discard the packet as fake.

The default maximum time difference allowed is only 5 minutes and is set in the “Maximum tolerance for computer clock synchronization” Group Policy setting for the domain.
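As a rough illustration of that check (a sketch in Python, not how Windows implements it; the function name and the example timestamps are made up), the tolerance test boils down to comparing the absolute difference between the two clocks with the configured limit:

```python
from datetime import datetime, timedelta, timezone

# Default value of "Maximum tolerance for computer clock synchronization".
MAX_CLOCK_SKEW = timedelta(minutes=5)


def within_clock_skew(client_time, dc_time, tolerance=MAX_CLOCK_SKEW):
    """Return True if the client timestamp is close enough to the DC clock."""
    return abs(client_time - dc_time) <= tolerance


# Hypothetical example: the DC clock and two clients that are 4 and 6 minutes off.
dc_now = datetime(2011, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(within_clock_skew(dc_now + timedelta(minutes=4), dc_now))  # True
print(within_clock_skew(dc_now - timedelta(minutes=6), dc_now))  # False
```

Note that it does not matter whether the client is ahead of or behind the Domain Controller; only the size of the difference counts.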

Because of this, time synchronization between Windows machines in an Active Directory environment is extremely important. A Domain Controller and a client whose clocks do not agree can prevent logon and access to network resources.

All Domain Controllers will by default have the time service (w32time) running. It functions both as a client for the DC itself and as an NTP server for domain servers and workstations to synchronize with. In a domain, all DCs will automatically synchronize time with the Domain Controller that holds the PDC FSMO role. The Domain Controller with the PDC role should then be manually configured to sync its time with a good NTP source.

 

Why time synchronization fails when the Domain Controllers are virtual

Virtual machines will by default have varying resources, CPU clocks, etc. On a busy system, they may even be denied resources for short periods of time, for example during high workloads, VMotion or backups, or they may receive more CPU resources than the operating system is even aware it could get. For example, the operating system believes it has 1 CPU of 2.4 GHz, while in reality it is running on a VMware server with 8 cores of 2.4 GHz.

This results in something usually referred to as time drift: the clock and the “ticks” it uses to keep time will sometimes run faster or slower. Personally I have seen virtual servers misconfigured for time synchronization that were off by several hours. Most time synchronization clients have a limit on how much the time may differ from the NTP source and still synchronize. Some systems are set for no more than 15 minutes, 1 hour or 15 hours, and some synchronize no matter the time difference. Such a limit may itself prevent synchronizing, because the time has drifted too much since the last sync. (info on how to change the max difference time limit here)

The time service (w32time) running on Windows Servers and Domain Controllers is more than sufficient for keeping time very accurate on a physical machine. By default it syncs every 45 minutes until 3 successful syncs have been made, then every 8 hours. (info on w32time registry settings here) The time service on Domain Controllers also functions as the time server for all clients in the domain, so do not just disable this service, even if You do not need the client functionality!

So basically we need to ensure that our virtual Domain Controllers, and especially our DC with the PDC FSMO role, are always accurately synchronized; otherwise we risk problems with authentication throughout the domain.

 

VMware tools time synchronization – important!

Be aware that a VMware timekeeping document describes a serious problem that might make this solution very inappropriate. Specifically, this document for ESX/ESXi 3.5 describes that VMware Tools is only able to correct time that is behind real time.

However, at this writing, VMware Tools clock synchronization has a serious limitation: it cannot correct the guest clock if it gets ahead of real time.

This limitation applies only to periodic clock synchronization. VMware Tools does a one-shot correction of the virtual machine clock that may set it either backward or forward in two cases: when the VMware Tools daemon starts (normally while the guest operating system is booting), and when a user toggles the periodic clock synchronization feature from off to on.

Another document, the VMware Tools Configuration Utility User’s Guide for VMware vSphere 4.1, says that it is capable of both.

If the clock on the guest falls behind the clock on the host, VMware Tools moves the clock on the guest forward to match the clock on the host. If the clock on the guest is ahead of that on the host, VMware Tools causes the clock on the guest to run more slowly until the clocks are synchronized.

Basically this means: before using VMware Tools to synchronize our virtual Domain Controllers, check that Your version is able to correct the time if the clock on the virtual machine is ahead of real time!
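The difference between the two quoted documents can be summed up in a small decision function. This is only a sketch in Python of the behavior as the quotes describe it, not VMware code; the function name and the can_slow_guest flag are made up, the flag standing for which of the two documented behaviors a given version has.

```python
def periodic_sync_action(guest_minus_host_seconds, can_slow_guest):
    """Describe what periodic sync does for a given guest clock offset.

    guest_minus_host_seconds > 0 means the guest clock is ahead of the host.
    can_slow_guest is a hypothetical flag: True for the behavior described
    in the vSphere 4.1 guide, False for the ESX/ESXi 3.5 limitation.
    """
    if guest_minus_host_seconds < 0:
        # Both documents agree: a guest that is behind is moved forward.
        return "step guest clock forward"
    if guest_minus_host_seconds > 0:
        if can_slow_guest:
            return "run guest clock slower until it matches"
        return "no periodic correction"
    return "already in sync"


print(periodic_sync_action(-30, can_slow_guest=False))
print(periodic_sync_action(30, can_slow_guest=False))
print(periodic_sync_action(30, can_slow_guest=True))
```

The middle case is the dangerous one for a Domain Controller: a guest that runs fast and is never pulled back.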

 

Solution with w32time /NoSync and VMware Tools on VMware environments (read above section first! not recommended)

  • Configure ALL VMware hosts to sync their time through NTP. This is important, since we will use their time to set the time for all virtual machines. Don’t forget to set the NTP client on the VMware host to start up automatically or with the host. Choose the “Configuration” tab, under Software select “Time Configuration”, in the top right corner select Properties and fill out the relevant information. For NTP servers have a look at www.pool.ntp.org

  • Configure the virtual Domain Controllers not to sync through the time service by using the NoSync parameter, and let the service know that the server has an authoritative time.
  • Install VMware Tools and configure it to synchronize time with the ESX/ESXi hosts.

This solution should be stable under both light and extreme loads. The largest problem here is usually keeping strict control over the VMware hosts and ensuring that any host the DCs might run on (e.g. after being moved with VMotion) is always configured to use NTP against the same source. The most likely problem is that someone will add a new ESX/ESXi host and forget to configure its NTP servers.

 

Using w32time for NTP sync on virtual Domain Controllers (recommended)

Since even 5 minutes of drift can cause problems, and virtual machines as described before have a tendency to drift in time, it is not enough to synchronize the time on the virtual Domain Controllers every 8 hours (the default after 3 syncs). So we need to configure the servers to synchronize more often, but putting a high load on internet NTP servers is usually frowned upon by the internet community.

My recommended solution is to use at least 2 physical servers locally that can run as local time servers. They in turn should synchronize their time from a trusted time source (a physical clock or NTP from the internet). Then have the virtual Domain Controller with the PDC FSMO role sync from those machines, and in turn have the other Domain Controllers sync from either the physical NTP servers or the PDC FSMO role holder, as preferred. If You have a physical Domain Controller, use it as the PDC FSMO role holder, and configure it as the main time source for all other servers.

Make sure You increase how often virtual Domain Controllers synchronize; I would suggest an interval somewhere between 15 minutes and an hour. Also ensure the servers are configured to sync time no matter how big the difference is between the server and the time source.
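To get a feeling for why the interval matters, here is a small back-of-the-envelope calculation in Python. The drift rate of 45 seconds per hour is purely an assumed number for a badly drifting virtual machine, not a measurement; the function name is made up as well.

```python
KERBEROS_TOLERANCE_SECONDS = 5 * 60  # default 5 minute tolerance


def worst_case_drift_seconds(drift_seconds_per_hour, sync_interval_minutes):
    """Drift accumulated between two syncs, assuming a constant drift rate."""
    return drift_seconds_per_hour * sync_interval_minutes / 60


# Assumed drift rate, for illustration only.
ASSUMED_DRIFT_SECONDS_PER_HOUR = 45

for interval_minutes in (15, 60, 8 * 60):
    drift = worst_case_drift_seconds(ASSUMED_DRIFT_SECONDS_PER_HOUR, interval_minutes)
    ok = drift <= KERBEROS_TOLERANCE_SECONDS
    print(f"sync every {interval_minutes} min: up to {drift} s drift, within tolerance: {ok}")
```

With this assumed drift rate, syncing every 15 or 60 minutes stays far inside the 5 minute tolerance, while the default 8 hour interval lets the clock wander 360 seconds, which is already past it.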

 

Windows Time (w32time service) registry settings to configure

These are some of the important registry settings to configure for the Windows Time service (w32time). You can see all of them here. The parameters are set under this registry key.

  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters

KeyName (KeyType), followed by each value and its description:

  • Type (REG_SZ)
      Nt5DS: Default domain sync
      NTP: NTP time sync
      NoSync: No sync; the one to use if using VMware Tools to sync
  • ReliableTimeSource (REG_DWORD)
      0: Do not mark computer as having a reliable time source
      1: Mark computer as having a reliable time source (only useful for a DC). Use this if using VMware Tools, to tell the service that it is getting a reliable time sync.
  • Period (REG_DWORD)
      0: Sync once a day
      1-300: Sync x times per day
      65532: Sync every 45 minutes until 3 successful syncs, then every 8 hours (default setting)
  • NtpServer (REG_SZ)
      FQDN/IP: Write the FQDN or IP of the NTP time server; if there are multiple servers, separate the server names with spaces
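As a small worked example of the table above, the sketch below (Python, for illustration only; it does not touch the registry) translates a desired sync interval into a Period value and collects the values one would enter for the NTP setup. The function names and the server names ntp1.example.local and ntp2.example.local are made up.

```python
import math

MINUTES_PER_DAY = 24 * 60


def period_for_interval(interval_minutes):
    """Translate a desired sync interval into a Period value (syncs per day)."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    syncs_per_day = math.ceil(MINUTES_PER_DAY / interval_minutes)
    # The table above lists 1-300 as the range for "sync x times per day".
    return max(1, min(300, syncs_per_day))


def ntp_parameters(ntp_servers, interval_minutes):
    """Values for the W32Time Parameters key when syncing through NTP."""
    return {
        "Type": "NTP",
        # Multiple servers are separated with spaces.
        "NtpServer": " ".join(ntp_servers),
        "Period": period_for_interval(interval_minutes),
    }


# Hypothetical local time servers, syncing every 15 minutes.
print(ntp_parameters(["ntp1.example.local", "ntp2.example.local"], 15))
```

A 15 minute interval comes out as Period 96, and a 60 minute interval as Period 24, both inside the range listed in the table.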

 [update january 2011] I was given a hint about a good VMware KB article, “Timekeeping best practices for Windows”. The above link from Microsoft still has more information about the available registry settings, but it is good to have multiple sources to reference.

Disk Cache on virtual servers running Active Directory

Domain Controllers will automatically disable disk write caching to ensure that database integrity is not lost in a crash or power failure. This is not limited to Domain Controllers, but applies to all services using Extensible Storage Engine databases, including WINS, DHCP and File Replication Service (FRS). Depending on the virtualization environment, this will be of no use if the disk emulation ignores the setting, or lets the operating system think the disk is being written to directly while it is actually caching.

If possible, use a SCSI emulation that supports forced unit access (FUA), so that it tells the system when data has actually been written. Ensure high availability and redundancy for the storage systems, follow best practices, and have uninterruptible power supplies for both storage and virtual hosts.

Basically, the storage of any database system should always be secured to avoid database corruption; this goes for both physical and virtual systems.

 

Why suspending or “pausing” a virtual Domain Controller is not wise

When You suspend an operating system and later resume it, the machine will not be aware of what has happened. As far as it is concerned, the elapsed time did not exist. Any connections to the machine will be lost and will usually reconnect automatically; the time will be corrected, usually automatically as well.

If the Domain Controller has been offline for too long, it will have objects on it that were supposed to have been deleted by the tombstoning process. If this happens, the Domain Controller will stop replicating with its partners. You will see an event in the logs with ID 2042, Source NTDS Replication, Description: It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.
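The condition behind that event can be sketched as a simple date comparison. This is a Python illustration only; the function name is made up, and the tombstone lifetime is a setting of the forest, so the 180 days used here is just an assumed example value that should be replaced with the one configured in Your own forest.

```python
from datetime import datetime, timedelta


def exceeds_tombstone_lifetime(last_replication, now, tombstone_lifetime_days):
    """Return True if the DC has gone longer than the tombstone lifetime
    without replicating, i.e. it should not simply be resumed."""
    return now - last_replication > timedelta(days=tombstone_lifetime_days)


# Assumed example value; check the actual setting of the forest.
ASSUMED_TOMBSTONE_LIFETIME_DAYS = 180

now = datetime(2011, 1, 15)
print(exceeds_tombstone_lifetime(datetime(2010, 12, 1), now, ASSUMED_TOMBSTONE_LIFETIME_DAYS))
print(exceeds_tombstone_lifetime(datetime(2010, 6, 1), now, ASSUMED_TOMBSTONE_LIFETIME_DAYS))
```

The first DC was suspended for about six weeks and is still inside the window; the second one was away for more than seven months and is not.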

It is better to shut down a Domain Controller properly than to leave it suspended. Even physical Domain Controllers should not be left turned off for too long, so there is not much difference there. But try to avoid suspending the Domain Controller; after all, it was never designed to be aware of this, and nasty things might happen.

 

Using Snapshots on a Domain Controller is worse than dropping Your iPhone in the toilet!

Really bad things happen if You revert to an old snapshot of a Domain Controller. It is even worse than hot-cloning Your Domain Controller; in reality You are making almost 100% sure that You will break consistency in Your Active Directory domain and lose data. Read more about cloning virtual Domain Controllers without getting fired, to get more details on why an old version of a current Domain Controller is bad for You.

To explain it quickly and simply: all Domain Controllers are aware of what replication has been done with other Domain Controllers; they even replicate this information by sharing USN (Update Sequence Number) values from other Domain Controllers. This helps the Domain Controllers know which other Domain Controllers need updating and which not to bother, thereby saving bandwidth and time.

Let’s look at an example of this going terribly wrong:

  • DC1 and DC2 are completely in sync and agree that they have synchronized to “version A”. You take a snapshot of DC2.
  • A couple of users are added, some are deleted, computer accounts change passwords, contact details on a user are changed, a couple of machines are added to the domain, and so on.
  • DC1 and DC2 sync again and are completely in sync; they both agree that they are now on “version B”.
  • DC2 is rolled back to the previous snapshot, and is now on “version A”.
  • The next time DC1 and DC2 talk together to try to sync, DC1 will not update DC2 with the changes, because it “knows” that DC2 already has this information. (It does not matter that DC2 knows the info is missing; it will still not get it.)
  • Some more changes are made; let’s say a couple of users are added on DC1.
  • DC1 and DC2 start synchronizing. This time DC1 will give the new users to DC2, because this change was made AFTER “version B”. They will now both agree on being on “version C”.

Note that DC2 will never get the changes made between “version A” and “version B”, and it will never be the wiser. OK, maybe that is not entirely true: it will figure it out at some point and start writing event log entries about it.
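The bookkeeping can be played through with a toy model. The sketch below is Python with made-up class and method names, and it is heavily simplified: each DC keeps only a counter for its own changes and, per partner, the highest USN it has already pulled from that partner. Real Active Directory replication tracks more than this. The model shows the same kind of breakage from the other side: after DC2 is reverted it reuses USNs that DC1 believes it has already seen, so changes made on DC2 after the revert never reach DC1.

```python
import copy


class ToyDC:
    """Toy domain controller: a USN counter, a change log and watermarks."""

    def __init__(self, name):
        self.name = name
        self.usn = 0         # USN of the latest local change
        self.log = []        # (usn, key, value) for changes made on this DC
        self.data = {}       # current directory contents
        self.watermark = {}  # partner name -> highest USN already pulled

    def change(self, key, value):
        """Make a change directly on this DC."""
        self.usn += 1
        self.log.append((self.usn, key, value))
        self.data[key] = value

    def pull_from(self, partner):
        """Pull only the partner's changes above the stored watermark."""
        seen = self.watermark.get(partner.name, 0)
        for usn, key, value in partner.log:
            if usn > seen:
                self.data[key] = value
        self.watermark[partner.name] = max(seen, partner.usn)


dc1, dc2 = ToyDC("DC1"), ToyDC("DC2")

dc2.change("alice.password", "v1")   # DC2 USN 1
dc1.pull_from(dc2)                   # "version A"
snapshot = copy.deepcopy(dc2)        # snapshot of DC2 is taken here

dc2.change("bob", "created")         # DC2 USN 2
dc2.change("alice.password", "v2")   # DC2 USN 3
dc1.pull_from(dc2)                   # "version B", DC1 has seen USN 3

dc2 = snapshot                       # revert: DC2 is back at USN 1
dc2.change("carol", "created")       # USN 2 is used a second time
dc2.change("alice.password", "v3")   # USN 3 is used a second time
dc1.pull_from(dc2)                   # DC1 thinks it already has USN 1-3

dc2.change("dave", "created")        # USN 4 is new to DC1
dc1.pull_from(dc2)                   # only this change goes through

print("DC1:", sorted(dc1.data.items()))
print("DC2:", sorted(dc2.data.items()))
```

At the end DC1 has bob but no carol, DC2 has carol but no bob, and the two disagree on the password for alice, even though the latest change (dave) replicated without any error.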

This is also referred to as a USN rollback, or in this case a failed USN rollback. If done intentionally, i.e. during a restore, this can be a powerful tool to roll back changes made in an Active Directory domain. But when done unintentionally like this, it can result in replication errors, lingering objects that should have been deleted, inconsistency between Domain Controllers, computers and accounts that can log on to certain DCs but not others, one password on one DC and a second password on another, and so on. All terrible results that should never be introduced to a production environment. Detecting and recovering from these problems can also be almost impossible. A Microsoft Knowledge Base article about this problem can be found here.

If this happens to You, the best action to take is to immediately demote the affected Domain Controller, since it will never have correct information again and may also receive changes that it will not replicate to other Domain Controllers. Do NOT fix replication faults on a “sick” DC. You will only allow the “sick” DC to replicate with Your “healthy” DCs!

If needed, You can most likely promote this DC again, since it will do a full replication upon promotion, but if it is possible, why not set up a new clean DC? Also, any changes made on the defective Domain Controller while it was up will be lost. If it was running for a long time and holds important information, consider exporting that information to the healthy Domain Controllers (time consuming).

 

Use System State for backups of Active Directory, NOT snapshots!

If You are considering using snapshots to restore an Active Directory server, reconsider. The above information clearly shows why. The only case where You can reintroduce a Domain Controller restored from a snapshot is when it is the ONLY Domain Controller in existence. Instead, use the Microsoft built-in tools to do a System State backup, and if needed back up the files as well. This will not only work, but will also be completely supported by Microsoft. If You have a full System State backup of an Active Directory domain, it does not matter what other problems You have or how much trouble You are in: Microsoft will always be able to restore the information needed to set up the domain again.

Also make sure You read up on “authoritative restores” for Active Directory when restoring objects or full domains.

 

Performance is an issue, but is there anything special we should consider for virtual Domain Controllers?

Most Domain Controllers run infrastructure services like AD, DNS, DHCP and WINS. Running on physical machines with even modest hardware, they will happily service loads of users and machines while using little of the machine’s capabilities. Considering that we should ALWAYS have good redundancy on these services, it seems wasteful to set up a lot of power-draining physical machines to run them, when we can set up multiple virtual machines to do the same and utilize the hardware better.

However, we should be aware that putting Domain Controllers on overloaded virtual environments, or giving them too few reserved resources, can result in bad performance and delays for clients accessing resources, logging in, etc.

So consider carefully how many resources You need and where to place these services.

  • Global Catalog services have information about Your entire forest and can be heavily utilized by Exchange servers for information.
  • The PDC FSMO role will receive more requests, both from clients and from other Domain Controllers, in case of a password mismatch and on password changes.
  • Other FSMO roles in general receive little extra load and are usually used infrequently; high availability is however important in some situations.
  • DNS/WINS servers serve clients with information from cache (read: memory), and speed is important to achieve little lag on name lookups, so ensure enough memory.
  • All Domain Controllers require readily available disk, CPU and memory resources. Ensure enough is available for good performance and consider doing regular checks.
  • High availability of Active Directory (and DNS) services is the highest priority for a functional infrastructure; without them nothing else may work.

Ensure that the virtual environment is able to load and start properly when all virtual machines have been turned off. For example, if the virtual center service requires DNS to access its databases and the ESX/ESXi hosts, it will not work if all its DNS servers are virtual and offline!
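A check like this is easy to do on paper, but it can also be sketched in a few lines of Python. Everything in the example is hypothetical: the component names, the DNS server names and the function name are made up for illustration. The idea is simply to list every component that must start before the virtual machines do, and flag those whose DNS servers are all virtual.

```python
def dns_cold_start_risks(dns_servers_for, virtual_machines):
    """Return the components whose configured DNS servers are ALL virtual."""
    risks = []
    for component, servers in sorted(dns_servers_for.items()):
        if servers and all(server in virtual_machines for server in servers):
            risks.append(component)
    return risks


# Hypothetical inventory: dc1 and dc2 are virtual, dc3 is a physical DC.
dns_servers_for = {
    "vcenter": ["dc1", "dc2"],
    "esx01": ["dc1", "dc3"],
}
virtual_machines = {"dc1", "dc2"}

print(dns_cold_start_risks(dns_servers_for, virtual_machines))  # ['vcenter']
```

In this made-up inventory esx01 can still resolve names through the physical dc3 after a full power-off, while vcenter would be stuck until someone starts a virtual DC by hand.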

 

Other stuff and resources for further information

Personally I have had very good experiences using virtual Domain Controllers, both in terms of performance and especially their ability to provide very high availability. As long as You are aware of the potential problems and how to avoid them, You should be just as happy using only virtual DCs as physical ones, and maybe even save some money and power as well.

Writing this article I also stumbled onto this good Microsoft KB that has info on the same topic as this article.

DirTeam also has some good info and specifics for Hyper-V and Domain Controllers here.

If I remember more articles or KB’s I will add them here later.

Enjoy and feel free to comment with Your own experiences and suggestions to live happily with virtual DC’s. Use this article as You may want; copy, steal, link, etc. But if You do use it, please link to this article.

 

– Article updated with some extra info about VMware tools and windows time service setup.

-Sole

13 Responses to “How to configure your virtual Domain Controllers and avoid simple mistakes with resulting big problems”

  • Sole:

    Updated article, with some extra VMware tools info, specifically info about not setting time correctly, if virtual machines time is ahead of real time.

  • Jack:

    Great article. Picked up a fault in our infrastructure with the details. Thanks sole.

  • Sole:

    Glad it could help 🙂

  • Ibrar Ashraf:

    Wow… did not see that you had updated the article. Nice and easy reading. 🙂

  • Jack:

    Great article. Have you had any experience with concerns around file system security? Any controls/precautions you know of to prevent a server admin from making copies of the virtual server image files?

  • Sole:

    I believe most companies adopt the security practice, that if you have permission to virtual disks it is equal to having access to physical servers. Then most would say if you have permission to physical access you can always gain full access to everything. So do not give people you don’t trust access to the virtual admin passwords 😛

    I do not think there is any way to prevent an admin from copying a server.

  • Sunad:

    Hi All,

    We have DHCP server on VM.
    I was strugling almost one complete day with IP address problem.
    Problme was client machines were unknowingly receiving IP address from different scope which we never had configured.
    Finally with huge google we have found “VMware does include a DHCP service that is running by default you will need to disable this.”

    This resolves 🙂
    Thanks to all & google.
    So here i am taking opportunity to share the info.

  • Bijesh:

    Simply Superb! Thanks for sharing this info

  • Ben:

    We have just had an issue where a hardware failure on a physical DC corrupted AD and pushed the corruption to all DC’s in all sites. In the space of 5 mins, all DC’s tomb-stoned themselves and all member servers disappeared from AD. Strangely all users and PC’s were OK. The only solution was an authoritative restore from tape and a rebuild of all other DC’s. All member servers then lost their trust relationship with the restored domain and had to have their machine password reset (ensure you always know a local admin password) and all 4 Exchange servers were corrupted and AD had to be completely re-configured to bring them back online. Total job took 72 hours and 4 PSS calls. A regular snapshot type backup (eg using Veeam Backup and Recovery, I am not talking about a host snapshot of a VM) would have been far quicker to restore to get basic network functionality and authentication back online and with the other DC’s basically dead in the water anyway surely would have been safe enough?

    Has anyone got any means of protection against an AD failure like this? It is pretty unlikely I know but now I have seen it happen, my client wants to ensure it cannot happen again! I have no idea how to best approach it, I think the best I can offer is that we will try to make recovery faster.

  • Jason A:

    Ben: Take a look at Recovery Manager for Active Directory Forest Edition. It would have handled your situation quite well. It actually has the ability to automate the entire rebuild of AD. http://www.quest.com/recovery-manager-for-active-directory-forest-edition/

  • Derek T:

    Hi, I am just about to move our DCs from physical to virtual and reading this article has given me plenty of food for thought. Great piece of work!

    Thanks

    DT

  • Geekay:

    Hi
    We have DC’s on a vsphere 5.0 (previously on VSphere 4.0) cluster. THe issue we see is that every now and then the DC will drop off the network. Tried Disconnecting and reconnecting the vNIC, VMotion to another host, also cannot login to DC. Only a reboot of the VM resolves the issue. Disabled auto DRS for these VMs, this did not resolve the issue. This does not happen to any other 100 or so other VMs, ONLY the Domain Controllers. It also does not happen to our DC’s on HyperV. Has anyone experienced this? What could be causing this?
