Runbook for “HighCpuSteal” Alert

1. Identify the Problem

When the “HighCpuSteal” alert fires, CPU steal time on one or more instances has stayed above the alert threshold. Steal time is the share of time in which a virtual CPU was ready to run but had to wait because the hypervisor was running other work, such as other virtual machines, on the physical CPU.

2. Check Current CPU Steal Time

Use the following command to check the current CPU steal time on the affected instance:

top -bn1 | grep "Cpu(s)"

Expected Output

You should see an output similar to this:

%Cpu(s):  2.3 us,  1.2 sy,  0.0 ni, 95.8 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st

The st value (the last field) is the CPU steal time, expressed as a percentage of total CPU time.
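
For scripting, or for a quick comparison against the alert threshold, the steal value can be extracted from that summary line. The following is a minimal sketch and not part of the original procedure: the function name steal_from_top is hypothetical, and it assumes the output layout shown above, where the value and the "st" label are separate fields. Older top versions that print "0.0%st" would need a different pattern.

```shell
# Print the steal percentage from a top "Cpu(s)" summary line read on stdin.
# Assumes the value and the "st" label are separate whitespace-delimited fields.
steal_from_top() {
  awk '/Cpu\(s\)/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^st,?$/) print $(i - 1)
  }'
}

# Usage on a live system:
#   top -bn1 | steal_from_top
```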

3. Check CPU Steal Time Over a Period

To get a more detailed view of CPU steal time over a period, use sar (part of the sysstat package):

sar -u 1 10

Expected Output

This command samples CPU usage once per second for 10 seconds. The %steal column shows the steal time for each sample:

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:00:02 AM     all      2.30      0.00      1.20      0.50      0.20     95.80
...
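
On hosts where sar is not installed, a comparable figure can be derived from two readings of the aggregate cpu line in /proc/stat. This is a minimal sketch under stated assumptions: the function name steal_pct is hypothetical, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal), in which steal is the eighth counter.

```shell
# Compute the steal percentage between two aggregate "cpu" lines from /proc/stat.
# $1 = earlier sample, $2 = later sample.
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
      # guest and guest_nice (fields 10, 11) are already counted in user and nice.
      total = 0
      for (i = 2; i <= 9; i++) total += $i
      steal = $9
    }
    NR == 1 { t0 = total; s0 = steal }
    NR == 2 {
      dt = total - t0
      if (dt > 0) printf "%.2f\n", 100 * (steal - s0) / dt
      else        print "0.00"
    }'
}

# Usage on a live system (5-second window):
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   steal_pct "$a" "$b"
```

Because the counters are cumulative since boot, only the difference between two samples reflects current steal; a single reading would give the average since boot.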

4. Check for Overcommitment

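High CPU steal time often indicates that the physical host is overcommitted, meaning more vCPUs are allocated across its guests than the host can serve at once. Host load is not visible from inside the guest, but the number of virtual CPUs (vCPUs) assigned to your instance is useful when comparing against the expected instance size or when raising the issue with your provider. Check it with: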

lscpu | grep "^CPU(s):"

Expected Output

You should see the number of CPUs assigned to your instance:

CPU(s):              4

5. Migrate to a Less Loaded Host

If possible, migrate the instance to a less loaded host through your cloud provider’s management console or the command-line tools for your environment. On some providers, a full stop and start (rather than a reboot) places the instance on a different host.

6. Optimize Workload Distribution

Distribute workloads more evenly across instances to reduce the load on any single instance. This might involve scaling out your application or using load balancing.

7. Update Prometheus Configuration

If the CPU steal time threshold needs adjustment, update the Prometheus alert expression and reload the configuration:

Edit the alerting rule in the rule file that defines it (one of the files listed under rule_files in prometheus.yml):

- alert: HighCpuSteal
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[10m])) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU steal time on {{ $labels.instance }}"
    description: "CPU steal time is above 5% on {{ $labels.instance }}."

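Reload the Prometheus configuration. The HTTP reload endpoint is available only when Prometheus is started with the --web.enable-lifecycle flag; otherwise, send SIGHUP to the Prometheus process instead: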

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

Prometheus should reload the configuration without errors (a successful request returns HTTP 200), and the alert should then use the updated expression.
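
A malformed rule file causes the reload to fail, so it can be worth validating the file first. The sketch below is an illustration rather than a prescribed procedure: the function name check_and_reload and its two arguments (rule file path and Prometheus base URL) are hypothetical, and it assumes promtool is on the PATH and that Prometheus was started with --web.enable-lifecycle.

```shell
# Validate a rule file, then reload Prometheus only if validation succeeds.
# $1 = path to the rule file, $2 = Prometheus base URL (both supplied by the caller).
check_and_reload() {
  if ! promtool check rules "$1"; then
    echo "rule validation failed; not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero on HTTP errors, -sS hides progress but keeps errors.
  curl -fsS -X POST "$2/-/reload"
}

# Usage (placeholders as in the command above):
#   check_and_reload /path/to/rules.yml http://<prometheus_host>:<prometheus_port>
```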

Conclusion

By following these steps, you should be able to troubleshoot and resolve the “HighCpuSteal” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.