Runbook for “HighSystemLoad” Alert

1. Identify the Problem

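When the “HighSystemLoad” alert fires, the load average on one or more instances has stayed above the alert threshold. On Linux, the load average counts processes that are running, waiting for a CPU, or blocked in uninterruptible sleep (usually disk I/O), so a high value means more work is queued than the CPUs can service.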

2. Check Current System Load

Use the following command to check the current system load on the affected instance:

uptime

Expected Output

You should see an output similar to this:

 13:51:44 up 10 days,  3:22,  2 users,  load average: 1.15, 0.75, 0.50

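The three load averages cover the last 1, 5, and 15 minutes. Interpret them relative to the number of CPU cores: a load of 1.15 is busy for a single-core machine but light for an eight-core one.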
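Because the raw numbers only mean something relative to the core count, it can help to normalize them. The following is a minimal sketch, assuming a Linux host where /proc/loadavg and nproc are available; the helper name load_per_core is made up for this example.

```shell
#!/bin/sh
# Divide a load average by the number of CPU cores, rounded to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Only run the live check where the Linux-specific sources exist.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    # /proc/loadavg starts with the 1-, 5-, and 15-minute load averages.
    read -r load1 load5 load15 _ < /proc/loadavg
    cores=$(nproc)
    echo "cores: $cores"
    echo "15-minute load per core: $(load_per_core "$load15" "$cores")"
fi
```

A per-core value that stays above 1.0 suggests that work is queuing faster than the CPUs can service it.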

3. Identify Load-Consuming Processes

To identify processes consuming the most system resources, use:

top -bn1 | head -n 20

Expected Output

This command prints the summary header followed by the first processes in top's default sort order, which is by CPU usage:

top - 13:51:44 up 10 days,  3:22,  2 users,  load average: 1.15, 0.75, 0.50
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  1.2 sy,  0.0 ni, 95.8 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  7977000 total,  1234000 free,  3456000 used,  2345000 buff/cache
KiB Swap:  2048000 total,  2035000 free,    123000 used.  4567000 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1234 root      20   0  123456  78901  12345 S   5.0  1.0   0:30.12 /usr/bin/python3
...
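On Linux, processes blocked in uninterruptible sleep (state D, usually waiting on disk or network I/O) also count toward the load average, so load can be high while CPU usage looks low. The sketch below filters ps-style output down to those processes; the helper name blocked_processes is an assumption for illustration, and the ps invocation in the comment assumes the procps version of ps.

```shell
#!/bin/sh
# Keep the header row plus any row whose state column starts with D
# (uninterruptible sleep). Expects rows of the form: PID STAT COMMAND
blocked_processes() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical invocation on a live system:
#   ps -eo pid,stat,comm | blocked_processes
```

If this returns many rows while the `wa` value in the top header is high, the load is more likely I/O-bound than CPU-bound.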

4. Restart Load-Consuming Services

If a specific service is consuming too many resources, consider restarting it. For example, if a web server is using excessive resources:

sudo systemctl restart apache2

Expected Output

Check the status to ensure the service restarted successfully:

sudo systemctl status apache2

The output should show the service as active (running).
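The restart and the follow-up check can be combined so that a failed restart is noticed immediately. This is a sketch only: the function name restart_and_verify and the SYSTEMCTL override variable are assumptions made for this example, not part of systemd.

```shell
#!/bin/sh
# SYSTEMCTL can be overridden (for example, for a dry run);
# by default it calls the real command through sudo.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

# Restart a service, then confirm that systemd reports it as active.
restart_and_verify() {
    service="$1"
    if ! $SYSTEMCTL restart "$service"; then
        echo "restart of $service failed" >&2
        return 1
    fi
    # is-active --quiet prints nothing and exits 0 only when the unit is active.
    if $SYSTEMCTL is-active --quiet "$service"; then
        echo "$service is active"
    else
        echo "$service is not active after restart" >&2
        return 1
    fi
}

# Example: restart_and_verify apache2
```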

5. Optimize Application Performance

If an application is causing high system load, consider optimizing its performance. This might involve code optimization, load balancing, or scaling the application.

6. Check for Background Jobs

Ensure there are no unnecessary background jobs running. List the current user's cron jobs with:

crontab -l

Expected Output

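Review the list of scheduled jobs and disable any that are not needed. Note that crontab -l shows only the invoking user's crontab; system-wide jobs are defined in /etc/crontab and /etc/cron.d/.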
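To review every user's crontab rather than only your own, something like the following may help. It is a sketch that assumes root privileges and local accounts listed in /etc/passwd; the function names active_cron_entries and list_all_user_crons are invented for this example.

```shell
#!/bin/sh
# Drop comment lines and blank lines, leaving only active crontab entries.
active_cron_entries() {
    grep -Ev '^[[:space:]]*(#|$)'
}

# Print the active entries of each local user's crontab (needs root).
list_all_user_crons() {
    for user in $(cut -d: -f1 /etc/passwd); do
        entries=$(sudo crontab -l -u "$user" 2>/dev/null | active_cron_entries)
        if [ -n "$entries" ]; then
            echo "== $user =="
            echo "$entries"
        fi
    done
}

# Example: list_all_user_crons
```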

7. Update Prometheus Configuration

If the system load threshold needs adjustment, update the Prometheus alert expression and reload the configuration.

Edit the alert rule, which lives in a rules file listed under rule_files in the Prometheus configuration file (usually prometheus.yml):

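- alert: HighSystemLoad
  # Fires when the 15-minute load per CPU core stays above 1 for 5 minutes.
  # Requires the recording rule instance:node_num_cpu:sum to be defined.
  expr: node_load15 / on(instance) group_left max by (instance) (instance:node_num_cpu:sum) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High system load on {{ $labels.instance }}"
    description: "15-minute load per CPU core is above 1.0 on {{ $labels.instance }}."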

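Reload the Prometheus configuration (this endpoint is only available when Prometheus is started with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process):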

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

Prometheus should reload the configuration without errors, and the alert should be updated accordingly.
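Validating the rules file before reloading avoids pushing a broken rule to a running server. The sketch below uses promtool, which ships with Prometheus, and the reload endpoint shown above. The function name validate_and_reload, the PROMTOOL and CURL override variables, and the example path and URL in the comments are assumptions for illustration.

```shell
#!/bin/sh
# PROMTOOL and CURL can be overridden for a dry run; they default to the real tools.
PROMTOOL="${PROMTOOL:-promtool}"
CURL="${CURL:-curl}"

# Validate a rules file, then ask Prometheus to reload.
# The reload is skipped if validation fails.
validate_and_reload() {
    rules_file="$1"   # hypothetical path, e.g. /etc/prometheus/rules/node.yml
    base_url="$2"     # e.g. http://localhost:9090
    if ! $PROMTOOL check rules "$rules_file"; then
        echo "rule validation failed; not reloading" >&2
        return 1
    fi
    # -f makes curl exit non-zero when the server returns an HTTP error status.
    if $CURL -fsS -X POST "$base_url/-/reload"; then
        echo "reload requested"
    else
        echo "reload request failed" >&2
        return 1
    fi
}

# Example: validate_and_reload /etc/prometheus/rules/node.yml http://localhost:9090
```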

Conclusion

By following these steps, you should be able to troubleshoot and resolve the “HighSystemLoad” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.