Runbook for “HighSystemLoad” Alert
1. Identify the Problem
When the “HighSystemLoad” alert fires, the 15-minute load average on one or more instances has exceeded the alert threshold relative to the number of CPUs. More processes are runnable (or, on Linux, blocked in uninterruptible I/O) than the CPUs can service, so work is queuing.
2. Check Current System Load
Use the following command to check the current system load on the affected instance:
uptime
Expected Output
You should see an output similar to this:
13:51:44 up 10 days, 3:22, 2 users, load average: 1.15, 0.75, 0.50
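The three load averages cover the last 1, 5, and 15 minutes. Compare them with the number of CPU cores (shown by nproc): a load average that stays above the core count means processes are waiting for CPU time or I/O.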
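The comparison against the core count can be scripted. The following is a minimal sketch, assuming a Linux host with /proc/loadavg and nproc available; the function names per_cpu_load and is_overloaded are hypothetical helpers, not part of any standard tool.

```shell
#!/bin/sh
# per_cpu_load LOAD CPUS: print LOAD divided by CPUS, to two decimals.
per_cpu_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# is_overloaded LOAD CPUS: succeed (exit 0) when the per-CPU load exceeds 1.0.
is_overloaded() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load / cpus > 1.0) }'
}

# Linux-specific: the third field of /proc/loadavg is the 15-minute average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load15=$(cut -d ' ' -f 3 /proc/loadavg)
    cpus=$(nproc)
    echo "load15=$load15 cpus=$cpus per_cpu=$(per_cpu_load "$load15" "$cpus")"
fi
```

For example, a load of 8 on a 4-core host gives a per-CPU load of 2.00, which is_overloaded treats as overloaded.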
3. Identify Load-Consuming Processes
To identify processes consuming the most system resources, use:
top -bn1 | head -n 20
Expected Output
In batch mode, top sorts by CPU usage by default, so the first 20 lines show the summary header followed by the heaviest CPU consumers:
top - 13:51:44 up 10 days, 3:22, 2 users, load average: 1.15, 0.75, 0.50
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 1.2 sy, 0.0 ni, 95.8 id, 0.5 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 7977000 total, 1234000 free, 3456000 used, 3287000 buff/cache
KiB Swap: 2048000 total, 1925000 free, 123000 used. 4567000 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 root 20 0 123456 78901 12345 S 5.0 1.0 0:30.12 /usr/bin/python3
...
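On Linux, the load average also counts processes in uninterruptible sleep (state D), which consume little CPU and so may not appear near the top of the list. A rough sketch using ps from procps follows; the helper names top_cpu and filter_dstate are made up for illustration.

```shell
#!/bin/sh
# Top CPU consumers, highest first (header line plus ten processes).
# The --sort option is specific to the procps version of ps found on Linux.
top_cpu() {
    ps -eo pid,user,stat,pcpu,pmem,comm --sort=-pcpu | head -n 11
}

# Read "PID STAT COMMAND" lines on stdin and keep those whose state starts
# with D (uninterruptible sleep, usually waiting on disk or network I/O).
filter_dstate() {
    awk '$2 ~ /^D/'
}

if command -v ps >/dev/null 2>&1; then
    top_cpu
    ps -eo pid,stat,comm | filter_dstate
fi
```

If the second command prints several processes while CPU usage is low, the load is more likely caused by slow I/O than by CPU contention.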
4. Restart Load-Consuming Services
If a specific service is consuming excessive resources, consider restarting it. For example, for an Apache web server (the unit is named apache2 on Debian and Ubuntu, httpd on RHEL-based systems):
sudo systemctl restart apache2
Expected Output
Check the status to ensure the service restarted successfully:
sudo systemctl status apache2
The output should include a line reading “Active: active (running)”.
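The restart and the status check can be combined so that a failed restart is noticed immediately. This is only a sketch: restart_and_verify is a hypothetical helper name, and it assumes a systemd host where the operator has sudo rights.

```shell
#!/bin/sh
# restart_and_verify UNIT: restart a systemd unit, then confirm it is active.
restart_and_verify() {
    unit="$1"
    sudo systemctl restart "$unit" || return 1
    if sudo systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
    else
        # Show the most recent log lines to help explain the failure.
        echo "$unit failed to start; recent log lines:" >&2
        sudo journalctl -u "$unit" -n 20 --no-pager >&2
        return 1
    fi
}

# Usage: restart_and_verify apache2
```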
5. Optimize Application Performance
If an application is causing high system load, consider optimizing its performance. This might involve code optimization, load balancing, or scaling the application.
6. Check for Background Jobs
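Ensure there are no unnecessary background jobs running. List the current user's cron jobs with:
crontab -l
Expected Output
Review the list of scheduled jobs and disable any that are not needed. Because crontab -l shows only the invoking user's jobs, also check other users (sudo crontab -u <user> -l) and the system-wide entries in /etc/crontab and /etc/cron.d.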
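Checking every local account by hand is tedious, so the per-user listing can be looped. The sketch below assumes local accounts listed in /etc/passwd (accounts from LDAP or similar directories are not covered) and root access through sudo; list_users and all_crontabs are hypothetical helper names.

```shell
#!/bin/sh
# list_users [FILE]: print the login names (first colon-separated field)
# from a passwd-format file, defaulting to /etc/passwd.
list_users() {
    cut -d: -f1 "${1:-/etc/passwd}"
}

# Print every local user's crontab; needs root to read other users' tables.
all_crontabs() {
    for user in $(list_users); do
        # crontab -l exits non-zero when the user has no crontab; skip those.
        if table=$(sudo crontab -u "$user" -l 2>/dev/null); then
            printf '### %s\n%s\n' "$user" "$table"
        fi
    done
}

# Usage: all_crontabs
# System-wide jobs live outside per-user crontabs:
# ls -l /etc/crontab /etc/cron.d
```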
7. Update Prometheus Configuration
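If the system load threshold needs adjustment, update the alert expression and reload the configuration.
Edit the rule file that defines the alert (a file listed under rule_files in prometheus.yml):
- alert: HighSystemLoad
  expr: node_load15 / on(instance) group_left max by (instance) (instance:node_num_cpu:sum) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High system load on {{ $labels.instance }}"
    description: "15-minute load average per CPU is above 1.0 on {{ $labels.instance }}."
The expression divides the 15-minute load average by the CPU count from the instance:node_num_cpu:sum recording rule, so a threshold of 1 corresponds to one runnable process per CPU. Raise or lower that value to adjust the alert.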
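Before changing the threshold, it can help to look at the current per-CPU load for each instance. The sketch below runs an instant query through the Prometheus HTTP API with curl; query_prometheus is a hypothetical helper name, and the host and port in the usage line are placeholders.

```shell
#!/bin/sh
# query_prometheus BASE_URL EXPR: run an instant query through the HTTP API.
# -G sends the URL-encoded data as a query string instead of a POST body;
# -f makes curl exit non-zero on an HTTP error status.
query_prometheus() {
    curl -fsS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage (host and port are placeholders):
# query_prometheus "http://<prometheus_host>:<prometheus_port>" \
#   'node_load15 / on(instance) group_left max by (instance) (instance:node_num_cpu:sum)'
```

The response is JSON with one result per instance; values that regularly sit near the chosen threshold suggest the alert would be noisy.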
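Reload the Prometheus configuration. The /-/reload endpoint is available only when Prometheus runs with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process: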
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
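A successful reload returns HTTP 200, and the updated rule then appears on the Alerts page of the Prometheus web UI. If the reload fails, Prometheus keeps the previous configuration and reports the error in the response and in its log.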
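A syntax error in the rule file makes the reload fail, so it may be worth validating the file first. The following sketch assumes promtool (shipped with Prometheus) is on the PATH; validate_and_reload is a hypothetical helper name, and the rule file path and URL in the usage line are placeholders.

```shell
#!/bin/sh
# validate_and_reload RULE_FILE BASE_URL: check the rule file with promtool
# and request a reload only if the check passes.
validate_and_reload() {
    rule_file="$1"
    base_url="$2"
    if ! promtool check rules "$rule_file"; then
        echo "rule file invalid; not reloading" >&2
        return 1
    fi
    # -f makes curl exit non-zero on an HTTP error status.
    if ! curl -fsS -X POST "$base_url/-/reload"; then
        echo "reload request failed" >&2
        return 1
    fi
    echo "reloaded"
}

# Usage (placeholders):
# validate_and_reload /path/to/rules.yml "http://<prometheus_host>:<prometheus_port>"
```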
Conclusion
By following these steps, you should be able to troubleshoot and resolve the “HighSystemLoad” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.