Alert Runbooks

KubeCPUOvercommit

Description

This alert fires when total CPU requests or limits across pods exceed the allocatable CPU capacity of one or more Kubernetes nodes.
CPU overcommitment increases the risk of CPU contention, throttling, degraded performance, and unstable workloads, especially during traffic spikes.


Possible Causes:


Severity estimation

Medium to High severity, depending on workload impact.

Severity increases with:


Troubleshooting steps

  1. Identify affected nodes

    • Command / Action:
      • Check the node's allocatable CPU against the CPU requests and limits already allocated on it
      • kubectl describe node <node-name>

    • Expected result:
      • Requested CPU is below allocatable CPU
    • Additional info:
      • Focus on nodes reporting high pod density or CPU pressure
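
The check in this step can be scripted. The sketch below assumes the usual layout of `kubectl describe node` output, where an "Allocated resources:" section contains a `cpu` row with requests and limits; the function name `node_cpu_allocation` is hypothetical and the parsing may need adjusting if your kubectl version formats the table differently.

```shell
# Print the CPU row of the "Allocated resources" table from
# `kubectl describe node` output read on stdin.
# Output: requests=<value> (<pct>) limits=<value> (<pct>)
node_cpu_allocation() {
  awk '
    # Only look at rows that come after the section header.
    /^Allocated resources:/ { in_section = 1; next }
    in_section && $1 == "cpu" {
      print "requests=" $2 " " $3 " limits=" $4 " " $5
      exit
    }
  '
}
```

Usage: kubectl describe node <node-name> | node_cpu_allocation. A requests percentage close to or above 100% marks a node that cannot accept more CPU requests.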

  2. Review CPU requests per namespace

    • Command / Action:
      • List each pod's CPU requests alongside its namespace (the output is per pod, not yet aggregated)
      • kubectl get pods -A -o custom-columns='NS:.metadata.namespace,CPU:.spec.containers[*].resources.requests.cpu'

    • Expected result:
      • Requests align with actual workload needs
    • Additional info:
      • Overestimated requests increase overcommitment risk
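
The command in this step prints one row per pod, so the per-namespace totals still have to be summed. A minimal sketch of that aggregation follows. It assumes the command is run with `--no-headers`, that each input row is `<namespace> <comma-separated requests>`, and that quantities are either millicores (`250m`) or cores (`1`, `0.5`). The function name `sum_cpu_requests_by_namespace` is hypothetical, and init containers and pod phase are not taken into account.

```shell
# Sum CPU requests per namespace from rows of the form
#   <namespace> <cpu>[,<cpu>...]
# read on stdin. Prints "<namespace> <total>m", largest total first.
sum_cpu_requests_by_namespace() {
  awk '
    # Convert one CPU quantity ("250m", "1", "0.5") to millicores.
    # Assumption: only the "m" suffix or plain cores are used.
    function to_millicores(q) {
      if (q == "" || q == "<none>") return 0
      if (q ~ /m$/) return substr(q, 1, length(q) - 1) + 0
      return q * 1000
    }
    {
      total[$1] += 0                      # keep namespaces without requests
      n = split($2, parts, ",")
      for (i = 1; i <= n; i++) total[$1] += to_millicores(parts[i])
    }
    END {
      for (ns in total) printf "%s %.0fm\n", ns, total[ns]
    }
  ' | sort -k2,2 -n -r
}
```

Usage: kubectl get pods -A --no-headers -o custom-columns='NS:.metadata.namespace,CPU:.spec.containers[*].resources.requests.cpu' | sum_cpu_requests_by_namespace. Adding --field-selector=status.phase=Running to the kubectl command excludes completed pods from the totals.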

  3. Compare requests vs actual usage

    • Command / Action:
      • Inspect real CPU usage (requires the Metrics API, typically provided by metrics-server)
      • kubectl top pod -A

    • Expected result:
      • CPU usage roughly matches requests
    • Additional info:
      • Large gaps indicate inefficient request sizing
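
To make the comparison concrete, the two outputs can be joined per pod. The sketch below is one way to do it, under these assumptions: the first file holds rows of `<namespace> <pod> <comma-separated requests>`, the second file holds `kubectl top pod -A --no-headers` output (`<namespace> <pod> <cpu> <memory>`), and the requests file is not empty. The function name `cpu_usage_vs_requests` is hypothetical.

```shell
# Print CPU usage next to the CPU request for every pod.
#   $1: file with rows "<namespace> <pod> <cpu>[,<cpu>...]"
#   $2: file with `kubectl top pod -A --no-headers` output
cpu_usage_vs_requests() {
  awk '
    # Convert one CPU quantity ("250m", "1", "0.5") to millicores.
    function to_millicores(q) {
      if (q == "" || q == "<none>") return 0
      if (q ~ /m$/) return substr(q, 1, length(q) - 1) + 0
      return q * 1000
    }
    # First file: remember the total CPU request of each pod.
    NR == FNR {
      n = split($3, parts, ",")
      for (i = 1; i <= n; i++) req[$1 "/" $2] += to_millicores(parts[i])
      next
    }
    # Second file: report usage as a percentage of the request.
    {
      key = $1 "/" $2
      used = to_millicores($3)
      if (req[key] > 0) {
        printf "%s request=%.0fm used=%.0fm (%.0f%%)\n", key, req[key], used, 100 * used / req[key]
      } else {
        printf "%s request=none used=%.0fm\n", key, used
      }
    }
  ' "$1" "$2"
}
```

Usage: save kubectl get pods -A --no-headers -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu' to requests.txt and kubectl top pod -A --no-headers to usage.txt, then run cpu_usage_vs_requests requests.txt usage.txt. A single sample of kubectl top is a snapshot, so low percentages are a hint to investigate rather than proof of oversized requests.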

  4. Check CPU limits and throttling

    • Command / Action:
      • Review CPU limits for affected pods
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • CPU limits are reasonable and consistent
    • Additional info:
      • Tight limits amplify the impact of overcommitment
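
`kubectl describe pod` shows the configured limits but not whether the container is actually being throttled. One way to check is to read the container's cgroup counters. The sketch below assumes cgroup v2, where `/sys/fs/cgroup/cpu.stat` exposes `nr_periods` and `nr_throttled`, and a container image that ships `cat`; on cgroup v1 the file is usually `/sys/fs/cgroup/cpu/cpu.stat`. The function name `throttle_ratio` is hypothetical.

```shell
# Summarise CFS throttling from the contents of a cgroup cpu.stat file
# read on stdin (lines of the form "<counter> <value>").
throttle_ratio() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) {
        printf "throttled %d of %d periods (%.1f%%)\n", throttled, periods, 100 * throttled / periods
      } else {
        print "no CFS periods recorded (no CPU limit set, or counters unavailable)"
      }
    }
  '
}
```

Usage: kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat | throttle_ratio. The counters are cumulative since the container started, so comparing two readings taken a few minutes apart gives a better picture of current throttling.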

  5. Identify noisy neighbors

    • Command / Action:
      • Detect pods with high CPU consumption, sorted so the heaviest consumers appear first
      • kubectl top pod -n <namespace> --sort-by=cpu

    • Expected result:
      • CPU usage is evenly distributed
    • Additional info:
      • A few CPU-hungry pods can starve others
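
Whether usage is "evenly distributed" is easier to judge as a share of the namespace total. The following sketch computes that share from `kubectl top pod -n <namespace> --no-headers` output, assuming rows of `<pod> <cpu> <memory>` with CPU reported in millicores (for example `120m`). The function name `cpu_share_by_pod` is hypothetical.

```shell
# Print each pod's CPU usage and its share of the total,
# heaviest consumer first. Reads "<pod> <cpu>m <memory>" rows on stdin.
cpu_share_by_pod() {
  awk '
    {
      cpu = $2
      sub(/m$/, "", cpu)          # strip the millicore suffix
      used[$1] = cpu + 0
      total += cpu
    }
    END {
      for (pod in used) {
        share = (total > 0) ? 100 * used[pod] / total : 0
        printf "%s %dm %.1f%%\n", pod, used[pod], share
      }
    }
  ' | sort -k2,2 -n -r
}
```

Usage: kubectl top pod -n <namespace> --no-headers | cpu_share_by_pod. One or two pods holding most of the share are candidates for their own limits, dedicated nodes, or anti-affinity rules.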

  6. Reduce CPU requests where possible

    • Command / Action:
      • Tune CPU requests to realistic values
      • kubectl set resources deployment <deployment-name> --requests=cpu=<value> -n <namespace>

    • Expected result:
      • Lower total requested CPU on nodes
    • Additional info:
      • Always validate changes under load

  7. Scale the cluster

    • Command / Action:
      • Add nodes or enable autoscaling, then confirm the new nodes have joined and are Ready
      • kubectl get nodes

    • Expected result:
      • CPU pressure is reduced across nodes
    • Additional info:
      • Ensure Cluster Autoscaler is properly configured
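
To estimate how much capacity to add, it can help to compare total requested CPU with what remains allocatable if the largest node is lost. That "survive one node failure" criterion is an assumption here, not something this runbook defines, so adjust it to match your alert rule. The sketch reads rows of `<node> <allocatable-cpu>` on stdin and takes the total requested millicores as its argument; the function name `cpu_headroom` is hypothetical.

```shell
# Report CPU headroom after losing the largest node.
#   $1:    total requested CPU in millicores (for example 9000)
#   stdin: rows "<node> <allocatable-cpu>" ("4" cores or "3920m")
cpu_headroom() {
  awk -v requested="$1" '
    # Convert one CPU quantity ("3920m", "4") to millicores.
    function to_millicores(q) {
      if (q ~ /m$/) return substr(q, 1, length(q) - 1) + 0
      return q * 1000
    }
    {
      m = to_millicores($2)
      total += m
      if (m > largest) largest = m
    }
    END {
      printf "allocatable=%.0fm largest_node=%.0fm requested=%.0fm headroom_after_node_loss=%.0fm\n", total, largest, requested, total - largest - requested
    }
  '
}
```

Usage: kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu' | cpu_headroom <total-requested-millicores>. A negative headroom value is roughly the amount of CPU, in millicores, that new nodes or reduced requests would need to cover.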

  8. Reschedule workloads

    • Command / Action:
      • Evict or rebalance pods
      • kubectl drain <node-name> --ignore-daemonsets

    • Expected result:
      • Pods redistribute to less loaded nodes
    • Additional info:
      • Use carefully in production environments

Additional resources