Runbook for “HighCpuLoad” Alert
1. Identify the Problem
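When the “HighCpuLoad” alert is triggered, it indicates that CPU usage on one or more instances is critically high. With the rule shown in step 7, the alert fires once non-idle CPU usage has stayed above 80% for 5 minutes. The instance label in the alert summary identifies the affected host.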
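To confirm which instances are currently firing, one option is to query the synthetic ALERTS series through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at the URL in PROM_URL (a placeholder that defaults to http://localhost:9090) and that the alert name matches the rule in step 7; the function name is illustrative only.

```shell
# PROM_URL is a placeholder; point it at your own Prometheus base URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: return the raw JSON for all firing HighCpuLoad alerts.
firing_highcpu_alerts() {
  curl -fsS -G "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=ALERTS{alertname="HighCpuLoad",alertstate="firing"}'
}

# Usage: firing_highcpu_alerts
```

Each entry in the returned result carries the labels of one firing alert, including instance.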
2. Check Current CPU Load
Use the following command to check the current CPU load on the affected instance:
top -bn1 | grep "Cpu(s)"
Expected Output
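You should see output similar to the following. The id field is the idle percentage, so overall CPU usage is roughly 100 minus id. This sample comes from a lightly loaded host; an instance that triggered the alert will show a much lower id value: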
%Cpu(s): 2.3 us, 1.2 sy, 0.0 ni, 95.8 id, 0.5 wa, 0.0 hi, 0.2 si, 0.0 st
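The idle figure can be turned into a single usage number with a small filter. This is a minimal sketch that assumes the summary line uses the layout shown above (a number followed by the id label, with a dot as the decimal separator); older top versions and some locales format the line differently. The function name cpu_busy_pct is hypothetical.

```shell
# Hypothetical helper: read a top "%Cpu(s)" summary line on stdin
# and print CPU usage as 100 minus the idle percentage.
cpu_busy_pct() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^id,?$/) {          # the field before "id" is the idle value
        printf "%.1f\n", 100 - $(i - 1)
        exit
      }
    }
  }'
}

# Usage: top -bn1 | grep "Cpu(s)" | cpu_busy_pct
```

For the sample line above, the filter prints 4.2.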
3. Identify CPU-Consuming Processes
To identify processes consuming the most CPU, use:
ps aux --sort=-%cpu | head -n 11
Expected Output
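The output is a header row followed by the processes with the highest CPU usage, in descending order. Note that ps reports %CPU as CPU time divided by the time the process has been running, so it is a lifetime average rather than an instantaneous reading: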
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1234 50.0 1.0 789012 123456 ? S 10:00 0:30 /usr/bin/python3
...
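When the list is long, it can help to keep only the processes above a chosen CPU percentage. The following is a sketch, assuming input in the three-column form produced by ps -eo pid,pcpu,comm; the function name procs_above and the default threshold of 50 are illustrative choices, not part of the original procedure.

```shell
# Hypothetical helper: read "PID %CPU COMMAND" rows on stdin (with a header)
# and print the rows whose %CPU is at or above the threshold (default 50).
procs_above() {
  threshold="${1:-50}"
  awk -v t="$threshold" 'NR > 1 && ($2 + 0) >= (t + 0) { print $0 }'
}

# Usage: ps -eo pid,pcpu,comm --sort=-pcpu | procs_above 50
```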
4. Restart CPU-Consuming Services
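If a specific service is consuming too much CPU, consider restarting it. A restart is a mitigation rather than a fix, so note the process details from step 3 first. For example, if a web server is using excessive CPU (the unit is named apache2 on Debian and Ubuntu, and httpd on RHEL-family systems):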
sudo systemctl restart apache2
Expected Output
Check the status to ensure the service restarted successfully:
sudo systemctl status apache2
You should see an output indicating that the service is active and running.
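The restart and the follow-up check can be combined so that a failed restart is reported immediately. This is a sketch under the assumption that the host uses systemd and that the caller may run sudo; the function name restart_and_verify is hypothetical.

```shell
# Hypothetical helper: restart a systemd unit, then confirm it is active.
restart_and_verify() {
  unit="$1"
  sudo systemctl restart "$unit" || return 1
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is not active after restart"
    return 1
  fi
}

# Usage: restart_and_verify apache2
```

If the unit does not come back, sudo systemctl status and the unit's journal are the next places to look.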
5. Optimize Application Performance
If an application is causing high CPU load, consider optimizing its performance. This might involve code optimization, load balancing, or scaling the application.
6. Check for Background Jobs
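Ensure there are no unnecessary background jobs running. List the current user's cron jobs with the command below; other users' crontabs (sudo crontab -l -u <user>) and system-wide entries in /etc/crontab and /etc/cron.d/ need to be checked separately: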
crontab -l
Expected Output
Review the list of scheduled jobs and disable any that are not needed.
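Crontabs often contain comments and disabled entries, which makes the active jobs harder to spot. A minimal sketch of a filter that drops comment and blank lines is shown below; the function name active_cron_entries is hypothetical, and environment assignments such as MAILTO=... are passed through unchanged.

```shell
# Hypothetical helper: read crontab text on stdin and print only the
# lines that are neither blank nor comments.
active_cron_entries() {
  grep -Ev '^[[:space:]]*(#|$)'
}

# Usage: crontab -l | active_cron_entries
```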
7. Update Prometheus Configuration
If the CPU load threshold needs adjustment, update the Prometheus alert expression and reload the configuration:
Edit the rule file that defines the alert (one of the files listed under rule_files in prometheus.yml):
- alert: HighCpuLoad
  expr: 100 - (avg without (cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU load on {{ $labels.instance }}"
    description: "CPU load is above 80% on {{ $labels.instance }}."
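Reload the Prometheus configuration. The /-/reload endpoint is only available when Prometheus was started with --web.enable-lifecycle; otherwise, send SIGHUP to the Prometheus process instead: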
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
Prometheus should reload the configuration without errors, and the alert should be updated accordingly.
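A syntax error in the rule file causes the reload to fail, so it can be worth validating the file before triggering it. The sketch below assumes promtool (shipped with Prometheus) is on the PATH and that the lifecycle endpoint is enabled; the function name and both arguments are placeholders for your own rule file path and Prometheus base URL.

```shell
# Hypothetical helper: validate a rule file, and reload Prometheus only if it passes.
validate_and_reload() {
  rules_file="$1"   # e.g. the file that contains the HighCpuLoad rule
  prom_url="$2"     # e.g. http://<prometheus_host>:<prometheus_port>
  if ! promtool check rules "$rules_file"; then
    echo "rule check failed; not reloading" >&2
    return 1
  fi
  curl -fsS -X POST "$prom_url/-/reload"
}

# Usage: validate_and_reload <path_to_rule_file> http://<prometheus_host>:<prometheus_port>
```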
Conclusion
By following these steps, you should be able to troubleshoot and resolve the “HighCpuLoad” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.