Alert Runbooks

KubeCPUQuotaOvercommit

Description

This alert is less critical than KubeCPUOvercommit because it is calculated from pod CPU limits rather than pod requests.

This alert fires when the total CPU limits enforced by Kubernetes exceed the available CPU allocatable on one or more nodes, causing CFS quota overcommitment.

CPU quota overcommitment can result in CPU throttling, increased latency, degraded performance, and potential instability for workloads, especially under peak load conditions.
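For reference, this alert is typically backed by a PromQL expression that compares the sum of CPU limits across all pods with the total allocatable CPU in the cluster. A sketch of such a rule, assuming kube-state-metrics is installed (it exports `kube_pod_container_resource_limits` and `kube_node_status_allocatable`); the exact threshold varies between installations:

```promql
# Ratio of total CPU limits to total allocatable CPU across the cluster.
# A sustained value well above 1 indicates CPU quota overcommitment.
sum(kube_pod_container_resource_limits{resource="cpu"})
  /
sum(kube_node_status_allocatable{resource="cpu"})
```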


Possible causes:

  • CPU limits set far higher than actual workload usage
  • Many pods with large CPU limits packed onto the same nodes
  • “Noisy neighbor” pods consuming disproportionate CPU
  • Autoscaling (HPA or Cluster Autoscaler) reacting too slowly to load increases


Severity estimation

Medium to High, depending on workload criticality.

Severity increases with:

  • Sustained, widespread CPU throttling rather than brief spikes
  • Latency-sensitive or customer-facing workloads on the affected nodes
  • A large gap between total CPU limits and node allocatable capacity

Troubleshooting steps

  1. Confirm CPU quota throttling

    • Command / Action:
      • Inspect container CPU throttling metrics in Prometheus
      • container_cpu_cfs_throttled_seconds_total

    • Expected result:
      • Low or near-zero throttling
    • Additional info:
      • Sustained high throttling confirms CPU quota overcommitment
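For this check, a rate over the counter above is more useful than its raw value, and the throttled-period ratio is a common companion query. A sketch, assuming cAdvisor metrics are scraped by Prometheus:

```promql
# Seconds of CFS throttling per second, by pod.
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m]))

# Fraction of CPU periods in which containers were throttled.
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  /
sum by (namespace, pod) (rate(container_cpu_cfs_periods_total[5m]))
```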

  2. Check CPU limits on affected pods

    • Command / Action:
      • Review CPU limits and requests
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • CPU limits match realistic workload requirements
    • Additional info:
      • Overly high limits contribute directly to quota overcommitment
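Alongside `kubectl describe`, per-pod CPU limits for a namespace can be pulled from Prometheus in a single query (assuming kube-state-metrics; replace `<namespace>` accordingly):

```promql
# CPU limits per pod in one namespace (kube-state-metrics).
sum by (pod) (kube_pod_container_resource_limits{resource="cpu", namespace="<namespace>"})
```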

  3. Compare total CPU limits vs node allocatable

    • Command / Action:
      • Inspect node CPU capacity
      • kubectl describe node <node-name>

    • Expected result:
      • Total CPU limits do not exceed node allocatable capacity significantly
    • Additional info:
      • Significant overcommitment increases throttling risk
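`kubectl describe node` shows this one node at a time; the same comparison across all nodes can be done in one Prometheus query, assuming kube-state-metrics (which attaches a `node` label to `kube_pod_container_resource_limits`):

```promql
# Total CPU limits scheduled on each node, relative to its allocatable CPU.
# Values above 1 mean the node's CPU quota is overcommitted.
sum by (node) (kube_pod_container_resource_limits{resource="cpu"})
  / on (node)
kube_node_status_allocatable{resource="cpu"}
```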

  4. Identify heavily throttled pods

    • Command / Action:
      • Correlate throttling metrics with pods
      • kubectl top pod -n <namespace>

    • Expected result:
      • Throttling evenly distributed or minimal
    • Additional info:
      • “Noisy neighbor” pods may dominate CPU usage
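To surface the noisiest pods directly, `topk` over the throttling rate is a convenient sketch (assuming cAdvisor metrics in Prometheus):

```promql
# Top 10 most-throttled pods across the cluster.
topk(10, sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m])))
```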

  5. Adjust CPU limits

    • Command / Action:
      • Reduce excessive CPU limits for pods or deployments
      • kubectl set resources deployment <deployment-name> --limits=cpu=<value> -n <namespace>

    • Expected result:
      • Throttling rate decreases
    • Additional info:
      • Validate changes under load

  6. Tune CPU requests

    • Command / Action:
      • Align CPU requests with actual workload usage
      • kubectl set resources deployment <deployment-name> --requests=cpu=<value> -n <namespace>

    • Expected result:
      • Better pod placement and reduced contention
    • Additional info:
      • Requests affect scheduling; limits affect throttling
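The same adjustments can be made declaratively in the deployment manifest instead of via `kubectl set resources`. A sketch of a container `resources` block with illustrative placeholder values:

```yaml
# Container resources for an example deployment; values are illustrative.
# Requests drive scheduling decisions; limits drive CFS quota enforcement.
resources:
  requests:
    cpu: "250m"    # what the scheduler reserves for the pod
  limits:
    cpu: "500m"    # ceiling at which CFS throttling begins
```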

  7. Scale workloads or cluster

    • Command / Action:
      • Add replicas or nodes to distribute CPU load
      • kubectl scale deployment <deployment-name> --replicas=<n> -n <namespace>

    • Expected result:
      • CPU pressure per pod is reduced
    • Additional info:
      • Horizontal scaling mitigates quota contention

  8. Review autoscaler configuration

    • Command / Action:
      • Check HPA and Cluster Autoscaler settings
      • kubectl get hpa -A

    • Expected result:
      • Autoscaling reacts appropriately to increased CPU load
    • Additional info:
      • Delayed scaling can worsen CPU quota overcommitment
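If no HPA exists for an affected deployment, a minimal manifest like the following is a starting point; the name, replica bounds, and 70% utilization target are illustrative, not recommendations:

```yaml
# Minimal HPA targeting average CPU utilization; values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <deployment-name>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```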

Additional resources