Runbook for “HighCpuLoad” Alert

1. Identify the Problem

When the “HighCpuLoad” alert fires, CPU usage on one or more instances has stayed above the alert threshold (80% in the rule shown in step 7) for at least 5 minutes.

2. Check Current CPU Load

Use the following command to check the current CPU load on the affected instance:

top -bn1 | grep "Cpu(s)"

Expected Output

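You should see output similar to this. The id field is the idle percentage, so CPU usage is roughly 100 minus id. This sample comes from a mostly idle host; an affected instance will show a much lower id value: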

%Cpu(s):  2.3 us,  1.2 sy,  0.0 ni, 95.8 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st
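To compare the reading against the alert threshold, it can help to reduce that summary line to a single busy percentage. The following is a minimal sketch, assuming the procps-ng output format shown above and a locale that uses a dot as the decimal separator (for example LC_ALL=C); the function name cpu_busy_pct is hypothetical.

```shell
# Hypothetical helper: read a top "Cpu(s)" summary line on stdin and
# print the busy percentage, computed as 100 minus the idle ("id") value.
cpu_busy_pct() {
  awk '{
    for (i = 2; i <= NF; i++) {
      # The idle figure is the number immediately before the "id" token.
      if ($i == "id," || $i == "id") {
        printf "%.1f\n", 100 - $(i - 1)
        exit
      }
    }
  }'
}
```

A possible invocation is: LC_ALL=C top -bn1 | grep "Cpu(s)" | cpu_busy_pct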

3. Identify CPU-Consuming Processes

To identify processes consuming the most CPU, use:

ps aux --sort=-%cpu | head -n 11

Expected Output

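This command lists the top 10 processes by CPU usage. Note that ps reports %CPU as CPU time divided by the time the process has been running, so a recent spike in a long-lived process may be understated: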

USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root      1234 50.0  1.0 789012 123456 ?        S    10:00   0:30 /usr/bin/python3
...
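When the list is long, filtering it by a CPU threshold can make the offenders easier to spot. This is a sketch only, assuming the standard ps aux column layout shown above (PID in column 2, %CPU in column 3, command in column 11); procs_over_cpu is a hypothetical name.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print
# "PID %CPU COMMAND" for each process at or above the given threshold.
procs_over_cpu() {
  threshold=$1
  # NR > 1 skips the header row; "+ 0" forces a numeric comparison.
  awk -v t="$threshold" 'NR > 1 && $3 + 0 >= t + 0 { print $2, $3, $11 }'
}
```

A possible invocation is: ps aux --sort=-%cpu | procs_over_cpu 25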

4. Restart CPU-Consuming Services

If a specific service is consuming too much CPU, consider restarting it. For example, if a web server is using excessive CPU:

sudo systemctl restart apache2

Expected Output

Check the status to ensure the service restarted successfully:

sudo systemctl status apache2

You should see an output indicating that the service is active and running.
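The restart and the status check can be combined so that a failed restart is reported immediately. The sketch below assumes a systemd host and sudo access; restart_and_verify is a hypothetical name, and apache2 is simply the example service used above.

```shell
# Hypothetical helper: restart a systemd service, then confirm it is active.
restart_and_verify() {
  service=$1
  sudo systemctl restart "$service" || return 1
  # is-active --quiet prints nothing and exits 0 only if the unit is active.
  if systemctl is-active --quiet "$service"; then
    echo "$service is active"
  else
    echo "$service failed to start; see: journalctl -u $service" >&2
    return 1
  fi
}
```

A possible invocation is: restart_and_verify apache2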

5. Optimize Application Performance

If an application is causing high CPU load, consider optimizing its performance. This might involve code optimization, load balancing, or scaling the application.

6. Check for Background Jobs

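Ensure there are no unnecessary background jobs running. The command below lists cron jobs for the current user only; also check other users (sudo crontab -l -u <user>) and system-wide entries in /etc/crontab and /etc/cron.d/: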

crontab -l

Expected Output

Review the list of scheduled jobs and disable any that are not needed.
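Crontab files often contain many commented-out lines, which makes it harder to see what is actually scheduled. The following is a small sketch that drops comments and blank lines; active_cron_lines is a hypothetical name, and environment assignments such as SHELL=/bin/sh will still pass through the filter.

```shell
# Hypothetical helper: read crontab text on stdin and print only the
# lines that are neither blank nor comments.
active_cron_lines() {
  grep -Ev '^[[:space:]]*(#|$)'
}
```

Possible invocations are: crontab -l | active_cron_lines, and sudo cat /etc/crontab /etc/cron.d/* | active_cron_lines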

7. Update Prometheus Configuration

If the CPU load threshold needs adjustment, update the Prometheus alert expression and reload the configuration:

Edit the rule file that defines the alert (one of the files listed under rule_files in prometheus.yml):

- alert: HighCpuLoad
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU load on {{ $labels.instance }}"
    description: "CPU load is above 80% on {{ $labels.instance }}."

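Reload the Prometheus configuration (the /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle):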

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

Prometheus should reload the configuration without errors, and the alert should be updated accordingly.
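A syntax error in a rule file causes the reload to fail, so it is worth validating the file before reloading. The sketch below assumes promtool (shipped with Prometheus) and curl are installed; reload_prometheus is a hypothetical name, and the rule file path and base URL are placeholders supplied by the caller.

```shell
# Hypothetical helper: validate a rule file, then ask Prometheus to reload.
# Usage: reload_prometheus <rules_file> <base_url>
reload_prometheus() {
  rules_file=$1
  base_url=$2
  # Refuse to reload if the rule file does not pass validation.
  if ! promtool check rules "$rules_file"; then
    echo "rule validation failed; not reloading" >&2
    return 1
  fi
  # Capture only the HTTP status code of the reload request.
  status=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$base_url/-/reload")
  if [ "$status" = "200" ]; then
    echo "reload OK"
  else
    echo "reload failed with HTTP $status" >&2
    return 1
  fi
}
```

A possible invocation is: reload_prometheus <rules_file> http://<prometheus_host>:<prometheus_port>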

Conclusion

By following these steps, you should be able to troubleshoot and resolve the “HighCpuLoad” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.