KubeCPUQuotaOvercommit
Description
This alert is less critical than KubeCPUOvercommit because it is calculated from pod CPU limits rather than pod requests.
This alert fires when the total CPU limits enforced by Kubernetes exceed the allocatable CPU on one or more nodes, resulting in CFS quota overcommitment.
CPU quota overcommitment can result in CPU throttling, increased latency, degraded performance, and potential instability for workloads, especially under peak load conditions.
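The exact trigger condition depends on the alert rule shipped with your monitoring stack, but a query in this spirit, assuming default kube-state-metrics metric names, compares the CPU limits summed per node against that node's allocatable CPU:

  # ratio of committed CPU limits to allocatable CPU, per node
  sum by (node) (kube_pod_container_resource_limits{resource="cpu"})
    / sum by (node) (kube_node_status_allocatable{resource="cpu"})
    > 1

A ratio above 1 means the node's CFS quotas are overcommitted. Production rules usually also filter out completed pods and require the condition to hold for a sustained period before firing.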
Possible Causes:
- CPU limits are set too high across multiple workloads
- Excessive number of CPU-limited pods scheduled on the same node
- Node allocatable CPU reduced due to system or kubelet reservations
- Misconfigured resource limits and requests
- Cluster scaling lagging behind workload growth
- Batch workloads or sudden replica increases
- Autoscaler not reacting quickly enough
Severity estimation
Typically Medium to High, though it can range from Low to Critical depending on workload criticality:
- Low: Occasional throttling with minimal user impact
- Medium: Sustained throttling causing latency or degraded throughput
- High: Critical services affected, requests failing or delayed
- Critical: Multiple services degraded, cluster instability observed
Severity increases with:
- Level of overcommitment
- Number of affected workloads
- Duration of sustained throttling
- Importance of affected workloads
Troubleshooting steps
- Confirm CPU quota throttling
  - Command / Action: Inspect container CPU throttling metrics in Prometheus
    container_cpu_cfs_throttled_seconds_total
  - Expected result: Low or near-zero throttling
  - Additional info: Sustained high throttling confirms CPU quota overcommitment; see the sample ratio query below
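  - Example: the ratio of throttled CFS periods to total periods is usually more telling than raw throttled seconds. A sketch using standard cAdvisor counters, where the 0.25 threshold is only an illustrative starting point:
    # fraction of CFS periods in which each pod was throttled over 5m
    sum by (namespace, pod) (increase(container_cpu_cfs_throttled_periods_total[5m]))
      / sum by (namespace, pod) (increase(container_cpu_cfs_periods_total[5m]))
      > 0.25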
- Check CPU limits on affected pods
  - Command / Action: Review CPU limits and requests
    kubectl describe pod <pod-name> -n <namespace>
  - Expected result: CPU limits match realistic workload requirements
  - Additional info: Overly high limits contribute directly to quota overcommitment
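  - Example: to survey requests and limits across a whole namespace at a glance (the column names are arbitrary labels, not kubectl keywords):
    kubectl get pods -n <namespace> -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu'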
- Compare total CPU limits vs node allocatable
  - Command / Action: Inspect node CPU capacity
    kubectl describe node <node-name>
  - Expected result: Total CPU limits do not exceed node allocatable capacity significantly
  - Additional info: Significant overcommitment increases throttling risk
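  - Example: the Allocated resources section of the node description already sums requests and limits as a percentage of allocatable; a cpu Limits figure well above 100% confirms overcommitment:
    kubectl describe node <node-name> | grep -A 8 'Allocated resources'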
- Identify heavily throttled pods
  - Command / Action: Correlate throttling metrics with pods
    kubectl top pod -n <namespace>
  - Expected result: Throttling evenly distributed or minimal
  - Additional info: “Noisy neighbor” pods may dominate CPU usage
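  - Example: a PromQL query to surface the most throttled pods cluster-wide (the 5m window and top-10 cut-off are arbitrary choices):
    topk(10, sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m])))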
- Adjust CPU limits
  - Command / Action: Reduce excessive CPU limits for pods or deployments
    kubectl set resources deployment <deployment-name> --limits=cpu=<value> -n <namespace>
  - Expected result: Throttling rate decreases
  - Additional info: Validate changes under load
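  - Example: a hypothetical invocation with concrete values; the deployment name, namespace, and 500m figure are placeholders to be derived from observed usage:
    kubectl set resources deployment web-frontend --limits=cpu=500m -n production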
- Tune CPU requests
  - Command / Action: Align CPU requests with actual workload usage
    kubectl set resources deployment <deployment-name> --requests=cpu=<value> -n <namespace>
  - Expected result: Better pod placement and reduced contention
  - Additional info: Requests affect scheduling; limits affect throttling
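  - Example: comparing observed usage to configured requests helps size them; a rough sketch assuming cAdvisor and kube-state-metrics metrics, where a ratio far below 1 suggests requests are inflated:
    # actual CPU usage divided by requested CPU, per pod
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})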
- Scale workloads or cluster
  - Command / Action: Add replicas or nodes to distribute CPU load
    kubectl scale deployment <deployment-name> --replicas=<n> -n <namespace>
  - Expected result: CPU pressure per pod is reduced
  - Additional info: Horizontal scaling mitigates quota contention
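  - Example: beyond a one-off scale, an HPA can add replicas automatically as CPU climbs; the thresholds below are illustrative:
    kubectl autoscale deployment <deployment-name> --cpu-percent=70 --min=2 --max=10 -n <namespace>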
- Review autoscaler configuration
  - Command / Action: Check HPA and Cluster Autoscaler settings
    kubectl get hpa -A
  - Expected result: Autoscaling reacts appropriately to increased CPU load
  - Additional info: Delayed scaling can worsen CPU quota overcommitment
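  - Example: inspect a specific HPA's recent scaling events, and, if the Cluster Autoscaler is installed, its logs (the label selector depends on your installation; app=cluster-autoscaler is a common default):
    kubectl describe hpa <hpa-name> -n <namespace>
    kubectl -n kube-system logs -l app=cluster-autoscaler --tail=100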
Additional resources
- Kubernetes CPU quotas and CFS
- Kubernetes resource management
- Prometheus container CPU metrics
- Related alert: CPUThrottlingHigh
- Related alert: KubeCPUOvercommit