Alert Runbooks

KubeClientErrors

Description

This alert fires when clients of the Kubernetes API server see a high error rate on their requests, typically more than 1% of requests returning HTTP 5xx responses over a sustained period.

A high client error rate indicates that components such as controllers, schedulers, operators, or custom workloads are failing to communicate with the API server, which can lead to reconciliation failures, missed deployments, and cluster control plane degradation.


Possible Causes:


Severity estimation

Medium to High severity, depending on which clients are affected:

Severity increases with:


Troubleshooting steps

  1. Identify which clients are producing errors

    • Command / Action:
      • Check alert labels for the job or instance to identify the affected client, then query error rates in Prometheus
      • sum by (verb, code) (rate(rest_client_requests_total{job="<job>", code=~"5.."}[5m]))

    • Expected result:
      • Specific HTTP verbs and status codes are identified, narrowing down the type of failure
    • Additional info:
      • Common error codes: 500 (server error), 503 (unavailable), 504 (gateway timeout)
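
The 1% threshold in the description is a ratio of 5xx responses to all requests from a client. A minimal sketch of that ratio follows, written as a small query builder. The function name `client_error_ratio_query` is hypothetical, and the expression is an assumption about how the ratio could be computed; the rule deployed in your cluster may group or filter differently.

```shell
# Hypothetical helper: prints a PromQL expression for the share of a job's
# client requests that ended in a 5xx response over the last 5 minutes.
# Assumption: this mirrors the alert's intent, not necessarily its exact rule.
client_error_ratio_query() {
  job="$1"
  printf 'sum(rate(rest_client_requests_total{job="%s", code=~"5.."}[5m])) / sum(rate(rest_client_requests_total{job="%s"}[5m]))\n' "$job" "$job"
}

# Example: print the expression for a job named "kubelet", then paste it
# into the Prometheus expression browser.
client_error_ratio_query kubelet
```

A result above 0.01 corresponds to more than 1% of that job's requests failing with a 5xx code.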

  2. Check API server health and availability

    • Command / Action:
      • Verify the API server pods are running and healthy
      • kubectl get pods -n kube-system | grep kube-apiserver

      • kubectl logs <apiserver-pod> -n kube-system --tail=50

    • Expected result:
      • API server pods are Running with no error logs indicating crashes or overload
    • Additional info:
      • Check the API server logs for repeated "timeout", "etcd", or "too many requests" messages

  3. Check API server request latency and load

    • Command / Action:
      • Inspect API server metrics for high latency or request rate
      • apiserver_request_duration_seconds_bucket

      • apiserver_current_inflight_requests

    • Expected result:
      • Request latency is within normal bounds; inflight requests are not saturated
    • Additional info:
      • A saturated API server will start returning 429 or 503 errors to clients
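
The two metrics above are raw series, so they usually need an aggregation before they are readable. The sketch below wraps the Prometheus HTTP API in a small helper and defines one query for each metric. It assumes Prometheus is reachable at `$PROM_URL` (for example through a port-forward); the helper name `prom_query`, the variable names, and the default URL are placeholders, not part of this runbook.

```shell
# Assumption: Prometheus is reachable at $PROM_URL; the default is a placeholder.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant query against the Prometheus HTTP API
# and print the raw JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# 99th percentile API server request latency per verb, excluding long-lived
# WATCH and CONNECT requests that would otherwise dominate the histogram.
P99_LATENCY='histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])))'

# In-flight requests, split into read-only and mutating.
INFLIGHT='sum by (request_kind) (apiserver_current_inflight_requests)'

# Usage (not run automatically):
#   prom_query "$P99_LATENCY"
#   prom_query "$INFLIGHT"
```

Compare the in-flight values with the API server's configured in-flight limits to judge whether it is saturated.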

  4. Check for throttled or rate-limited clients

    • Command / Action:
      • Look for 429 (Too Many Requests) responses in client metrics
      • sum by (job) (rate(rest_client_requests_total{code="429"}[5m]))

    • Expected result:
      • No or very low rate of 429 responses
    • Additional info:
      • A client generating too many requests may need its reconciliation interval or list/watch frequency reduced

  5. Identify deprecated or removed API usage

    • Command / Action:
      • Check API server logs or metrics for requests to deprecated API versions
      • apiserver_requested_deprecated_apis

      • kubectl logs <apiserver-pod> -n kube-system | grep -iE "deprecated|removed"

    • Expected result:
      • No requests to removed API versions that would result in 404/410 errors
    • Additional info:
      • After a Kubernetes upgrade, operators using removed APIs will fail; update the operator or its CRDs

  6. Check admission webhook availability

    • Command / Action:
      • Verify that admission webhooks are reachable and not timing out
      • kubectl get validatingwebhookconfigurations

      • kubectl get mutatingwebhookconfigurations

    • Expected result:
      • Webhook endpoints are reachable and responding within their timeout
    • Additional info:
      • When a webhook with failurePolicy: Fail is unreachable or times out, the API server rejects every matching request, typically with a 500 error
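
The plain `kubectl get` output does not show the failure policy or timeout, which are the fields that decide whether a broken webhook blocks requests. A minimal sketch that prints them for both webhook kinds follows; the function name `list_webhooks` is hypothetical, and the column selection is one reasonable choice rather than a required format.

```shell
# Hypothetical helper: list every webhook configuration with its failure
# policy, timeout, and backing Service so that blocking (Fail) webhooks
# stand out. Webhooks configured with a URL instead of a Service show <none>.
list_webhooks() {
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$kind" -o custom-columns='NAME:.metadata.name,POLICY:.webhooks[*].failurePolicy,TIMEOUT:.webhooks[*].timeoutSeconds,SERVICE:.webhooks[*].clientConfig.service.name,NAMESPACE:.webhooks[*].clientConfig.service.namespace'
  done
}

# Usage (not run automatically):
#   list_webhooks
# Then confirm the backing Service has ready endpoints:
#   kubectl get endpoints <service> -n <namespace>
```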

  7. Verify client certificate validity

    • Command / Action:
      • Check if the client certificates used to authenticate with the API server are valid and not expired
      • kubectl get csr

      • openssl x509 -in <cert-file> -noout -dates

    • Expected result:
      • Certificates are valid and not expired
    • Additional info:
      • Expired client certificates will cause all requests from that component to fail with 401 Unauthorized
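
Reading the dates by eye works for one certificate, but a scripted check is easier to repeat across several files. The sketch below relies on `openssl x509 -checkend`, which exits non-zero when the certificate will have expired within the given number of seconds. The function name `cert_expiry_check` and the 7-day default are assumptions made for illustration.

```shell
# Hypothetical helper: warn when a PEM certificate expires within N days.
# A non-zero exit from openssl also covers an unreadable or invalid file.
cert_expiry_check() {
  cert_file="$1"
  days="${2:-7}"
  if openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null; then
    echo "OK: $cert_file is valid for at least $days more days"
  else
    echo "WARNING: $cert_file expires within $days days, has expired, or could not be read"
    return 1
  fi
}

# Usage (not run automatically):
#   cert_expiry_check <cert-file> 30
```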

  8. Check network connectivity between clients and the API server

    • Command / Action:
    • Expected result:
      • Returns ok; connectivity is confirmed
    • Additional info:
      • Network policies, firewall rules, or CNI issues may block access to the API server from certain pods or nodes
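
No command is listed for this step. One possible check, consistent with the expected "ok" response, is to call the API server health endpoint from inside an affected client pod. The sketch below is an assumption, not the runbook's original command: it presumes the pod image includes `curl`, that `kubernetes.default.svc` resolves from the pod, and that the cluster allows unauthenticated access to `/healthz`. The function name `check_apiserver_from_pod` is hypothetical.

```shell
# Hypothetical helper: call the API server health endpoint from inside a
# client pod. -k skips TLS verification because only reachability is tested;
# --max-time keeps a blocked connection from hanging the check.
check_apiserver_from_pod() {
  pod="$1"
  namespace="$2"
  kubectl exec "$pod" -n "$namespace" -- \
    curl -sk --max-time 5 https://kubernetes.default.svc/healthz
}

# Usage (not run automatically):
#   check_apiserver_from_pod <client-pod> <namespace>
```

A timeout or connection error here, while the same endpoint answers from other pods or nodes, points to a network policy, firewall, or CNI problem rather than an API server fault.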

  9. Restart the affected client if errors persist after the root cause is resolved

    • Command / Action:
      • If the root cause is fixed but the client is still in an error state, restart it
      • kubectl rollout restart deployment <client-deployment> -n <namespace>

    • Expected result:
      • The client reconnects to the API server successfully and its error rate returns to normal levels
    • Additional info:
      • Some clients cache stale connections or credentials and require a restart to recover

Additional resources