Runbook: NodeClockSkewDetected

Alert Details

Alert Name: NodeClockSkewDetected
Expression: (node_timex_offset_seconds{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} > 0.05 and deriv(node_timex_offset_seconds{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[5m]) >= 0) or (node_timex_offset_seconds{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} < -0.05 and deriv(node_timex_offset_seconds{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[5m]) <= 0)

Description

This alert fires when a node's clock offset, as reported by node_timex_offset_seconds, exceeds 50 ms in either direction and is not moving back toward zero: the offset is above 0.05 seconds with a non-negative slope over the last 5 minutes, or below -0.05 seconds with a non-positive slope. Clock skew can cause problems for time-sensitive applications and distributed systems.
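To make the condition concrete, here is a minimal sketch that evaluates it for a single sample. The helper name skew_alert_fires is hypothetical and not part of Prometheus or node_exporter; it only mirrors the thresholds in the expression above.

```shell
# Sketch of the alert condition for one sample (hypothetical helper).
#   $1 = offset in seconds (value of node_timex_offset_seconds)
#   $2 = slope of the offset over the last 5 minutes (deriv(...[5m]))
# Exit status 0 means the alert condition holds.
skew_alert_fires() {
  awk -v offset="$1" -v slope="$2" 'BEGIN {
    threshold = 0.05                                  # 50 ms, as in the expression
    positive = (offset >  threshold && slope >= 0)    # large positive offset, not shrinking
    negative = (offset < -threshold && slope <= 0)    # large negative offset, not shrinking
    exit !(positive || negative)
  }'
}
```

For example, `skew_alert_fires 0.08 0.001` succeeds (the offset is too large and still growing), while `skew_alert_fires 0.08 -0.001` fails because the offset is already shrinking toward zero.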

Possible Causes

NTP (Network Time Protocol) service not running or misconfigured
Network issues affecting NTP synchronization
High system load causing time drift
Hardware clock issues

Troubleshooting Steps

1. Check NTP Service Status

Ensure that the NTP service is running on the affected instance:

sudo systemctl status ntp

Expected Output

You should see output indicating that the NTP service is active and running:

● ntp.service - Network Time Service
   Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2024-11-13 14:00:00 UTC; 15min ago
     Docs: man:ntpd(8)
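The unit name ntp is not universal. As a hedged sketch, the helper below probes a few common time-sync unit names with `systemctl is-active` and prints the first one that is running. The function name and the candidate list are assumptions; adjust the list for your distribution.

```shell
# Print the first active time-sync unit, or return 1 if none is active.
# The candidate unit names below are assumptions, not an exhaustive list.
active_time_service() {
  for unit in ntp ntpd chrony chronyd systemd-timesyncd; do
    # `systemctl is-active --quiet` exits 0 only when the unit is active
    if systemctl is-active --quiet "$unit"; then
      echo "$unit"
      return 0
    fi
  done
  return 1
}
```

Running `active_time_service || echo "no time service running"` on the affected instance shows which service the remaining steps should target.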

2. Restart NTP Service

If the NTP service is not running, start or restart it:

sudo systemctl restart ntp

Expected Output

Check the status again to ensure the service restarted successfully:

sudo systemctl status ntp

You should see output indicating that the service is active and running.

3. Check NTP Synchronization

Verify that the system is synchronized with NTP servers:

ntpq -p

Expected Output

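You should see a list of NTP servers and their synchronization status. The peer marked with * is the one currently selected for synchronization, the delay, offset, and jitter columns are in milliseconds, and a reach value of 377 means the last eight polls all succeeded: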

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*time.google.com  .GOOG.           1 u   64   64  377    23.1    0.123   0.456
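As an illustration, the offset of the selected peer can be pulled out of this output and compared with the alert threshold. This is a sketch under the assumption that the output has the standard ten columns shown above; both function names are hypothetical.

```shell
# Print the offset (in ms) of the peer marked with "*" in `ntpq -p` output.
# Reads the output on stdin; returns 1 if no peer is currently selected.
selected_peer_offset_ms() {
  awk '/^\*/ { print $9; found = 1; exit } END { exit !found }'
}

# Exit 0 if |offset| is within the limit in ms.
# The default of 50 ms matches the 0.05 s threshold of the alert.
offset_within_limit() {
  awk -v off="$1" -v limit="${2:-50}" 'BEGIN {
    if (off < 0) off = -off
    exit !(off <= limit)
  }'
}
```

A possible use is `offset_within_limit "$(ntpq -p | selected_peer_offset_ms)" && echo "offset OK"`. If no line starts with *, ntpd has not selected a peer and the host is not synchronized.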

4. Check System Load

High system load can cause time drift. Check the current system load:

uptime

Expected Output

You should see output similar to this, where the three load averages cover the last 1, 5, and 15 minutes:

 14:15:00 up 10 days,  3:22,  2 users,  load average: 0.15, 0.10, 0.05
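Whether a load average counts as high depends on the number of CPUs. The sketch below extracts the 1-minute load from the uptime output and compares it with a CPU count. Treating a load above the CPU count as suspicious is a rule of thumb assumed here, not something the alert defines, and both function names are hypothetical.

```shell
# Extract the 1-minute load average from `uptime` output read on stdin.
# Assumes the Linux wording "load average: a, b, c".
uptime_load1() {
  awk -F'load average: ' '{ split($2, parts, ", "); print parts[1] }'
}

# Exit 0 if the load exceeds the number of CPUs.
#   $1 = load average, $2 = number of CPUs
load_exceeds_cpus() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load > cpus) }'
}
```

On a Linux host this could be run as `load_exceeds_cpus "$(uptime | uptime_load1)" "$(nproc)" && echo "load is high"`.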

5. Check Hardware Clock

Ensure that the hardware clock is set correctly:

sudo hwclock --show

Expected Output

You should see the current hardware clock time:

2024-11-13 14:15:00.123456+00:00

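If the hardware clock is incorrect, set it from the current system time. Do this only once the system clock itself is synchronized, because the command copies the system time to the hardware clock: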

sudo hwclock --systohc

6. Check Network Connectivity

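Ensure that network connectivity to the NTP servers is intact. Tools like ping or traceroute confirm basic reachability, but NTP itself uses UDP port 123, so a successful ping does not prove that NTP traffic gets through: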

ping time.google.com

Expected Output

You should see successful replies:

PING time.google.com (216.239.35.0): 56 data bytes
64 bytes from 216.239.35.0: icmp_seq=0 ttl=54 time=23.1 ms

Additional Steps

1. Monitor Clock Skew

Continuously monitor clock skew to ensure it stays within acceptable limits. Use tools like Prometheus and Grafana to set up dashboards and alerts.

2. Configure NTP Servers

Ensure that the NTP configuration includes reliable NTP servers. Edit the NTP configuration file (usually /etc/ntp.conf):

sudo nano /etc/ntp.conf

Example Configuration

server time.google.com iburst
server time1.google.com iburst
server time2.google.com iburst
server time3.google.com iburst
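A quick sanity check on the configuration is to count how many time sources are actually enabled, since ntpd can only detect a bad source when it has several to compare. The following is a minimal sketch; the function name count_ntp_sources is hypothetical, and it only counts uncommented server and pool lines.

```shell
# Print how many active (uncommented) server/pool lines a configuration has.
# Reads the contents of ntp.conf on stdin.
count_ntp_sources() {
  awk '$1 == "server" || $1 == "pool" { n++ } END { print n + 0 }'
}
```

For the example configuration above, `count_ntp_sources < /etc/ntp.conf` would print 4.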

By following these steps, you should be able to troubleshoot and resolve the “NodeClockSkewDetected” alert. If the issue persists, further investigation into the specific NTP configuration and network conditions may be necessary.
