Runbook for “HighCpuSteal” Alert
1. Identify the Problem
When the “HighCpuSteal” alert fires, CPU steal time on one or more instances is above the alert threshold. Steal time is the share of time during which a virtual CPU was ready to run but had to wait because the hypervisor was using the physical CPU for other guests or for its own tasks.
2. Check Current CPU Steal Time
Use the following command to check the current CPU steal time on the affected instance:
top -bn1 | grep "Cpu(s)"
Expected Output
You should see an output similar to this:
%Cpu(s): 2.3 us, 1.2 sy, 0.0 ni, 95.8 id, 0.5 wa, 0.0 hi, 0.2 si, 0.0 st
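The st value at the end of the line is the CPU steal time, as a percentage of total CPU time. The first iteration of top may reflect averages since boot rather than the current load, so confirm the reading with the interval-based check in the next step.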
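To use the st value in a script, it can be extracted from the summary line. The following is a minimal sketch, not part of the original procedure; the function name steal_from_top is hypothetical, and it assumes the summary line uses either the "0.0 st" layout shown above or the older "0.0%st" layout.

```shell
# Print the steal percentage from a top "Cpu(s)" summary line read on stdin.
# Accepts both the "0.0 st" layout and the older "0.0%st" layout.
steal_from_top() {
    grep -oE '[0-9]+([.,][0-9]+)?[ %]*st' | head -n 1 | sed -E 's/[ %]*st$//'
}

# Usage (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C top -bn1 | grep "Cpu(s)" | steal_from_top
```

Fed the sample line above, the function prints 0.0.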
3. Identify CPU Steal Time Trends
To see how CPU steal time behaves over a short period, use sar (part of the sysstat package):
sar -u 1 10
Expected Output
This command reports CPU usage once per second for 10 seconds. The %steal column shows the steal time for each sample:
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:00:02 AM all 2.30 0.00 1.20 0.50 0.20 95.80
...
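To compare the samples against the alert threshold, the %steal column can be averaged. The following is a sketch under stated assumptions: the function name avg_steal_from_sar is hypothetical, and it assumes sar prints a header row containing %steal, as in the sample above. It locates the column by name, so it works with both 12-hour and 24-hour timestamps, and it skips the summary line that sar prints at the end.

```shell
# Average the %steal column of "sar -u" output read on stdin.
# The column is located by its header name, so the timestamp format
# (with or without AM/PM) does not matter.
avg_steal_from_sar() {
    awk '
        /%steal/ {
            for (i = 1; i <= NF; i++) if ($i == "%steal") col = i
            next
        }
        # Skip the trailing "Average:" line; its columns are shifted.
        col && $1 != "Average:" && NF >= col && $col ~ /^[0-9.]+$/ {
            sum += $col; n++
        }
        END { if (n) printf "%.2f\n", sum / n; else exit 1 }
    '
}

# Usage:
#   LC_ALL=C sar -u 1 10 | avg_steal_from_sar
```

The function exits with a non-zero status if no samples were found, for example when sysstat is not installed.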
4. Check for Overcommitment
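High CPU steal time often indicates that the physical host is overcommitted, meaning that more virtual CPUs (vCPUs) are scheduled on it than its physical cores can serve at once. The host's load is not visible from inside the guest, but the number of vCPUs assigned to your instance is useful context when comparing instances or reporting the problem to your provider. Check it with: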
lscpu | grep "^CPU(s):"
Expected Output
You should see the number of CPUs assigned to your instance:
CPU(s): 4
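The vCPU count alone does not show how steal is spread across those vCPUs. The following is a rough sketch, assuming a Linux guest where /proc/stat is available; the function name steal_between and the snapshot paths are hypothetical. It compares two snapshots of /proc/stat and prints the steal percentage for the aggregate cpu line and for each vCPU.

```shell
# Print "<cpu> <steal%>" for each cpu line, given two /proc/stat snapshots
# (earlier file first). The counters after the cpu name are:
# user nice system idle iowait irq softirq steal guest guest_nice.
steal_between() {
    awk '
        /^cpu/ {
            total = 0
            # Sum user..steal; guest time is already counted in user/nice.
            for (i = 2; i <= 9; i++) total += $i
            if (FNR == NR) { t0[$1] = total; s0[$1] = $9; next }
            dt = total - t0[$1]
            if (dt > 0) printf "%s %.1f\n", $1, 100 * ($9 - s0[$1]) / dt
        }
    ' "$1" "$2"
}

# Usage:
#   cat /proc/stat > /tmp/stat.0; sleep 5; cat /proc/stat > /tmp/stat.1
#   steal_between /tmp/stat.0 /tmp/stat.1
```

Similar steal on every vCPU is consistent with host-wide contention, while steal concentrated on a few vCPUs may point to contention on specific physical cores.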
5. Migrate to a Less Loaded Host
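If possible, migrate the instance to a less loaded host through your cloud provider’s management console or the command-line tools for your environment. On many public clouds, stopping and then starting the instance (rather than rebooting it) typically places it on a different host.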
6. Optimize Workload Distribution
Distribute workloads more evenly across instances to reduce the load on any single instance. This might involve scaling out your application or using load balancing.
7. Update Prometheus Configuration
If the CPU steal time threshold needs adjustment, update the alert expression and reload the Prometheus configuration.
Edit the rule file that defines the alert (a file listed under rule_files in prometheus.yml, not prometheus.yml itself):
- alert: HighCpuSteal
  expr: avg without (cluster) (rate(node_cpu_seconds_total{mode="steal"}[10m])) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU steal time on {{ $labels.instance }}"
    description: "CPU steal time is above 5% on {{ $labels.instance }}."
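Before reloading, validate the rule file with promtool check rules <rule_file>. Then reload the Prometheus configuration. The HTTP endpoint below works only if Prometheus was started with --web.enable-lifecycle (sending SIGHUP to the Prometheus process is the alternative):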
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
On success the request returns HTTP 200 and the updated rule appears on the Prometheus Alerts page. If the reload fails, check the Prometheus logs for configuration errors.
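Because curl prints nothing useful when the reload succeeds, it can help to check the HTTP status explicitly. The following is a minimal sketch; the function name reload_prometheus is hypothetical, and it assumes the reload endpoint is enabled and reachable at the base URL you pass in.

```shell
# Hypothetical helper: POST to the reload endpoint and report the result.
# $1 is the Prometheus base URL, e.g. http://<prometheus_host>:<prometheus_port>
reload_prometheus() {
    code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$1/-/reload")
    if [ "$code" = "200" ]; then
        echo "reload ok"
    else
        echo "reload failed (HTTP $code)" >&2
        return 1
    fi
}

# Usage:
#   reload_prometheus "http://<prometheus_host>:<prometheus_port>"
```

The non-zero return value on failure lets the helper be used in a deployment script that should stop when the reload is rejected.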
Conclusion
By following these steps, you should be able to troubleshoot and resolve the “HighCpuSteal” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.