Runbook: NodeClockNotSynchronising
Alert Details
- Alert Name: NodeClockNotSynchronising
- Expression:
min_over_time(node_timex_sync_status{nanocosmosGroup=~".+", instance=~".+", environment=~".+", job=~".+"}[5m]) == 0 and node_timex_maxerror_seconds{nanocosmosGroup=~".+", instance=~".+", environment=~".+", job=~".+"} >= 16
Description
This alert fires when the node’s clock is not synchronizing properly. The expression is true when the kernel synchronization status (node_timex_sync_status) was 0, meaning unsynchronized, at least once during the past 5 minutes, and the kernel's estimated maximum clock error (node_timex_maxerror_seconds) is currently greater than or equal to 16 seconds.
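To confirm the condition directly on the node, the two metrics can be read from node_exporter. The following is a minimal sketch, not part of the alerting setup: clock_alert_state is a hypothetical helper name, and the localhost:9100 endpoint in the usage comment assumes node_exporter's default port. It checks only the current values, so it does not reproduce the 5-minute min_over_time window.

```shell
# Hypothetical helper: reads node_exporter metrics text on stdin and prints
# "firing" when the current values match the alert condition, "ok" when they
# do not, and "unknown" when either metric is missing.
clock_alert_state() {
  awk '
    $1 == "node_timex_sync_status"      { sync = $2; have_sync = 1 }
    $1 == "node_timex_maxerror_seconds" { err = $2;  have_err = 1 }
    END {
      if (!have_sync || !have_err) { print "unknown"; exit }
      if (sync + 0 == 0 && err + 0 >= 16) print "firing"; else print "ok"
    }'
}

# Usage (assumes node_exporter listens on localhost:9100):
#   curl -s http://localhost:9100/metrics | clock_alert_state
```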
Possible Causes
- NTP (Network Time Protocol) service not running or misconfigured
- Network issues affecting NTP synchronization
- High system load causing time drift
- Hardware clock issues
- Incorrect NTP server configuration
Troubleshooting Steps
1. Check NTP Service Status
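Ensure that the NTP service is running on the affected instance. The unit name ntp applies to hosts running ntpd under systemd (for example Debian or Ubuntu); hosts using chrony or systemd-timesyncd have a different unit name: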
sudo systemctl status ntp
Expected Output
You should see an output indicating that the NTP service is active and running:
● ntp.service - Network Time Service
     Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-11-13 14:00:00 UTC; 15min ago
       Docs: man:ntpd(8)
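If it is unclear which time service a host runs, the status check can be repeated over a few likely unit names. The sketch below is one way to do that under stated assumptions: the candidate list in TIME_UNITS and the helper name active_time_unit are illustrative, and the PROBE variable is a hypothetical override that lets the logic run without systemd.

```shell
# Candidate unit names are assumptions; adjust them to your distribution.
TIME_UNITS="ntp ntpsec chrony chronyd systemd-timesyncd"

# Print the first candidate unit that is reported active and return 0,
# or return 1 when none is active. By default the probe is
# `systemctl is-active --quiet <unit>`.
active_time_unit() {
  for unit in $TIME_UNITS; do
    if ${PROBE:-systemctl is-active --quiet} "$unit" 2>/dev/null; then
      echo "$unit"
      return 0
    fi
  done
  return 1
}

# Usage:
#   active_time_unit || echo "no known time service is active"
```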
2. Restart NTP Service
If the NTP service is not running, start or restart it:
sudo systemctl restart ntp
Expected Output
Check the status again to ensure the service restarted successfully:
sudo systemctl status ntp
You should see an output indicating that the service is active and running.
3. Check NTP Synchronization
Verify that the system is synchronized with NTP servers:
ntpq -p
Expected Output
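You should see a list of NTP servers and their synchronization status. The line starting with * is the server currently selected for synchronization, a reach value of 377 means the last eight polls all succeeded, and delay, offset, and jitter are in milliseconds:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*time.google.com .GOOG.           1 u   64   64  377    23.1    0.123   0.456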
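When checking many hosts, the selected peer can be pulled out of the ntpq output automatically. This is a minimal sketch: selected_peer is a hypothetical helper name, and it assumes the standard ten-column ntpq -p layout shown above, where the offset is the ninth column.

```shell
# Hypothetical helper: reads `ntpq -p` output on stdin, prints the name and
# offset (milliseconds) of the peer marked with "*", and returns 1 when no
# peer is selected.
selected_peer() {
  awk '
    /^\*/ { print substr($1, 2), $9; found = 1; exit }
    END   { if (!found) exit 1 }'
}

# Usage:
#   ntpq -p | selected_peer || echo "no peer selected, clock is not synchronized"
```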
4. Check System Load
High system load can cause time drift. Check the current system load:
uptime
Expected Output
You should see an output similar to this. Load averages well below the number of CPU cores suggest that load is not the cause:
14:15:00 up 10 days, 3:22, 2 users, load average: 0.15, 0.10, 0.05
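To judge the load average against the size of the machine, it can be compared with the CPU count. The sketch below uses "1-minute load greater than the number of CPUs" as the threshold, which is a common rule of thumb and an assumption here, not a limit defined by this runbook; load_is_high is a hypothetical helper name.

```shell
# Hypothetical helper: returns 0 when the given load average exceeds the
# given CPU count, 1 otherwise. awk is used for the floating-point compare.
load_is_high() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}

# Usage on Linux (1-minute load from /proc/loadavg, CPU count from nproc):
#   load_is_high "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "load exceeds CPU count"
```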
5. Check Hardware Clock
Ensure that the hardware clock is set correctly:
sudo hwclock --show
Expected Output
You should see the current hardware clock time:
2024-11-13 14:15:00.123456+00:00
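If the hardware clock is incorrect and the system clock has already been confirmed as synchronized (step 3), set the hardware clock to the current system time. Running this while the system clock is still wrong writes the wrong time to the hardware clock: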
sudo hwclock --systohc
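Before copying the system time over, it can help to see how far apart the two clocks are. The sketch below computes the difference in whole seconds. It assumes GNU date, which parses the timestamp format printed by hwclock --show; clock_skew_seconds is a hypothetical helper name. A difference of a second or two is expected because the two readings are taken at slightly different moments.

```shell
# Hypothetical helper: absolute difference in whole seconds between two
# timestamps. Assumes GNU date for the -d option.
clock_skew_seconds() {
  hw_epoch=$(date -d "$1" +%s) || return 1
  sys_epoch=$(date -d "$2" +%s) || return 1
  diff=$((hw_epoch - sys_epoch))
  echo "${diff#-}"   # strip a leading minus sign to get the absolute value
}

# Usage (reading the hardware clock usually requires root):
#   clock_skew_seconds "$(sudo hwclock --show)" "$(date --rfc-3339=ns)"
```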
6. Check Network Connectivity
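Ensure that the NTP servers are reachable over the network. Use tools like ping or traceroute for a basic check. Note that ping uses ICMP while NTP uses UDP port 123, so a successful ping does not prove that NTP traffic is allowed through firewalls: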
ping time.google.com
Expected Output
You should see successful replies:
PING time.google.com (216.239.35.0): 56 data bytes
64 bytes from 216.239.35.0: icmp_seq=0 ttl=54 time=23.1 ms
Additional Steps
1. Monitor Clock Synchronization
Continuously monitor clock synchronization to ensure it remains within acceptable limits. Use tools like Prometheus and Grafana to set up dashboards and alerts on node_timex_sync_status and node_timex_maxerror_seconds.
2. Configure NTP Servers
Ensure that the NTP configuration includes reliable NTP servers. Edit the NTP configuration file (usually /etc/ntp.conf), then restart the NTP service as in step 2 so that the change takes effect:
sudo nano /etc/ntp.conf
Example Configuration
server time.google.com iburst
server time1.google.com iburst
server time2.google.com iburst
server time3.google.com iburst
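To check connectivity against the servers that are actually configured, their names can be read from the configuration file. This is a minimal sketch: ntp_servers is a hypothetical helper name, and the /etc/ntp.conf path in the usage comment is the usual location mentioned above and may differ on your host.

```shell
# Hypothetical helper: reads an ntp.conf-style file on stdin and prints the
# host from every "server" or "pool" line. Commented-out lines are skipped
# because their first field is not exactly "server" or "pool".
ntp_servers() {
  awk '($1 == "server" || $1 == "pool") && NF >= 2 { print $2 }'
}

# Usage (path is an assumption; a successful ping shows ICMP reachability only):
#   ntp_servers < /etc/ntp.conf | while read -r host; do
#     ping -c 1 "$host" >/dev/null 2>&1 && echo "$host reachable" || echo "$host unreachable"
#   done
```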
By following these steps, you should be able to troubleshoot and resolve the “NodeClockNotSynchronising” alert. If the issue persists, further investigation into the specific NTP configuration and network conditions may be necessary.