How Nutanix helped us uncover an unexpected issue with a public NTP source
*** UPDATE *** This blog identifies issues we’ve found with the UK Pool but we have since heard of issues with many regions of the ntp.org time servers. We assume that they do not respond well to aggressive polling from a single source, and it may be that requests are being silently blocked because of this. We have had no such problems with other NTP sources, however.
For many years now we have used uk.pool.ntp.org as an NTP source to synchronise time on all our systems. It is becoming increasingly important to keep all systems tightly integrated to a common time reference. This is especially the case with distributed clusters now becoming more common with the advent of Hyperconverged systems.
Recently we started to get alerts on our Nutanix clusters. It began with a single cluster, so we worked with Nutanix to investigate any configuration or networking issues that could be causing it. When two other clusters on another site started displaying the same errors, we began to suspect the time source itself, unlikely as that seemed.
The problem was intermittent, and was detected by a built-in NTP configuration check that the Nutanix system performs hourly. Errors were reported between 5% and 15% of the time, varying by node. Every detected error raised an alert, which was getting wearing, so we had to make some sort of change.
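We don’t know exactly what the Nutanix check does under the hood, but a rough equivalent is easy to sketch. The Python below is a minimal illustration only, assuming a plain SNTP client request over UDP port 123. The function names (build_request, parse_transmit_time, probe) and the 2-second timeout are made up for this example, and the offset it reports is approximate because it ignores server processing time.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request():
    """Build a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet):
    """Return the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    # The transmit timestamp is the last 8 bytes of the 48-byte header:
    # 32 bits of seconds followed by 32 bits of fractional seconds.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(server, timeout=2.0):
    """Return a rough offset (server minus local) in seconds, or None on failure."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sent = time.time()
            sock.sendto(build_request(), (server, 123))
            packet, _ = sock.recvfrom(512)
            received = time.time()
        server_time = parse_transmit_time(packet)
    except (OSError, ValueError):
        # Timeouts, DNS failures and malformed replies all count as a failed check.
        return None
    # Compare the server's time with the midpoint of our send/receive window.
    return server_time - (sent + received) / 2


# Example use (needs network access):
#   print(probe("uk.pool.ntp.org"))
#   print(probe("time.google.com"))
```

Running probe repeatedly against each source and counting the None results would give a crude failure rate to compare, although it is no substitute for a proper NTP client.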
We therefore changed all systems over to time.google.com, Google’s new (as of last December) free NTP source. We already use their Global DNS Cloud service, so, assuming the level of reliability would be the same, we decided to at least try this alternative time source.
Immediately the errors stopped, which suggests that there is a problem server in the UK ntp.org pool. As we’d been having trouble on and off for weeks, it doesn’t look like it’s going to get resolved any time soon.
For many, this intermittent failure may never even be detected, because a successful sync at some point in any 24-hour period will keep servers in check. However, a Nutanix cluster alerts if a node drifts more than 3 SECONDS away from the central offset, so we can’t afford to ignore an NTP config issue.
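To make that threshold concrete, here is a small Python sketch of a drift check. It is not how Nutanix implements it: we are assuming the "central offset" can be approximated by the median of the node offsets, and the node names and the drifting_nodes function are invented for the example.

```python
from statistics import median

# Threshold quoted above; everything else here is an assumption.
DRIFT_LIMIT_SECONDS = 3.0


def drifting_nodes(offsets, limit=DRIFT_LIMIT_SECONDS):
    """Return the nodes whose offset is more than `limit` seconds from the median.

    offsets maps a node name to its measured clock offset in seconds.
    """
    centre = median(offsets.values())
    return sorted(
        node for node, offset in offsets.items()
        if abs(offset - centre) > limit
    )


# Hypothetical readings: node-c has drifted well away from the others.
readings = {"node-a": 0.02, "node-b": -0.01, "node-c": 3.5}
print(drifting_nodes(readings))
```

With only a few seconds of headroom, a node that misses several syncs in a row has little margin before it trips the alert.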
The picture below shows the very stark contrast between the hourly checks on uk.pool.ntp.org and switching to time.google.com. The red bars are the percentage of the day that the check failed.
Not difficult to see when we made the change!
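If you don’t have a dashboard that draws this for you, the same daily figure can be worked out from a log of check results. The Python below is a sketch under our own assumptions: each hourly check is recorded as a (timestamp, passed) pair, and daily_failure_percent is a hypothetical helper, not anything provided by Nutanix.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_failure_percent(checks):
    """Return {date: percentage of that day's checks that failed}.

    checks is an iterable of (datetime, passed) pairs, one per check run.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for when, passed in checks:
        day = when.date()
        totals[day] += 1
        if not passed:
            failures[day] += 1
    return {day: 100.0 * failures[day] / totals[day] for day in sorted(totals)}


# Made-up data: 24 hourly checks in one day, three of which failed.
start = datetime(2017, 1, 1)
sample = [(start + timedelta(hours=h), h not in (3, 9, 20)) for h in range(24)]
print(daily_failure_percent(sample))
```

Three failures out of 24 hourly checks works out at 12.5% for the day, which is in the range we were seeing before the switch.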
So if you seem to be having any issues getting reliable time then don’t assume your source, even if a huge public one, is infallible.
Of course, it’s all made a lot easier if you have tools like Nutanix NCC and Prism instead of having to trawl through lines of text output 🙂