KubeClientErrors

Description

This alert fires when a Kubernetes API server client sees a sustained high error rate when communicating with the API server, typically more than 1% of its requests returning HTTP 5xx responses.

A high client error rate indicates that components such as controllers, schedulers, operators, or custom workloads are failing to communicate with the API server, which can lead to reconciliation failures, missed deployments, and cluster control plane degradation.

Possible Causes:

Kubernetes API server is overloaded or throttling requests
API server is restarting or temporarily unavailable
Network issues between the client and the API server
Client using an outdated or incompatible API version (deprecated/removed APIs)
Misconfigured RBAC causing authorization errors
Faulty operator or controller generating excessive API requests
API server admission webhooks timing out or rejecting requests
Expired or invalid client certificates

Severity estimation

Typically Medium to High, but the severity can range from Low to Critical depending on which clients are affected:

Low: Non-critical operators or monitoring agents experiencing errors; cluster operations unaffected
Medium: Core controllers (deployment, replicaset) experiencing errors; reconciliation delayed
High: Multiple critical components failing to reach the API server; deployments and scaling operations impaired
Critical: API server effectively unreachable for most clients; cluster control plane non-functional

Severity increases with:

Error rate (higher % of failed requests)
Criticality of the affected client (scheduler, controller-manager vs. a custom operator)
Duration of the error condition
Number of distinct clients affected

Troubleshooting steps

Identify which clients are producing errors
- Command / Action:
  - Check alert labels for the job or instance to identify the affected client, then query error rates in Prometheus
  - sum by (verb, code) (rate(rest_client_requests_total{job="<job>", code=~"5.."}[5m]))
- Expected result:
  - Specific HTTP verbs and status codes are identified, narrowing down the type of failure
- additional info:
  - Common error codes: 500 (server error), 503 (unavailable), 504 (gateway timeout)
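
The per-code breakdown above can be turned into an error ratio by dividing the 5xx rate by the total request rate for the same client. The following is a minimal sketch: the helper name `client_error_ratio_query`, the grouping by `instance`, and the default 5m window are illustrative assumptions, and the printed PromQL may differ from the expression in your actual alert rule.

```shell
# Hypothetical helper: print a PromQL expression for the share of 5xx
# responses seen by one client job, grouped by instance.
client_error_ratio_query() {
  job="$1"
  window="${2:-5m}"   # assumed default; match it to your alert rule
  printf 'sum by (instance) (rate(rest_client_requests_total{job="%s", code=~"5.."}[%s])) / sum by (instance) (rate(rest_client_requests_total{job="%s"}[%s]))\n' \
    "$job" "$window" "$job" "$window"
}

# Example: print the query for a job named "kubelet"
client_error_ratio_query kubelet
```

Paste the printed expression into the Prometheus expression browser. A result above 0.01 corresponds to the 1% threshold mentioned in the description.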

Check API server health and availability
- Command / Action:
  - Verify the API server pods are running and healthy
  - kubectl get pods -n kube-system | grep kube-apiserver
  - kubectl logs <apiserver-pod> -n kube-system --tail=50
- Expected result:
  - API server pods are Running with no error logs indicating crashes or overload
- additional info:
  - Check for repeated timeout, etcd, or too many requests messages in API server logs
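
When the log output is long, counting how often each of these messages appears gives a quicker picture than reading the lines one by one. The following is a rough sketch. The helper name `count_log_signals` is hypothetical, and the three search strings are taken from the note above and matched as plain substrings, so a phrase such as "timed out" is not counted as "timeout".

```shell
# Hypothetical helper: read API server log text on stdin and count the
# lines that mention each signal (case-insensitive substring match).
count_log_signals() {
  logs=$(cat)
  for pattern in "timeout" "etcd" "too many requests"; do
    # grep -c prints 0 and exits non-zero when nothing matches
    count=$(printf '%s\n' "$logs" | grep -ciF -- "$pattern" || true)
    printf '%s=%s\n' "$pattern" "$count"
  done
}

# Usage against a live cluster (pod name is a placeholder):
#   kubectl logs <apiserver-pod> -n kube-system --tail=500 | count_log_signals
```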

Check API server request latency and load
- Command / Action:
  - Inspect API server metrics for high latency or request rate
  - apiserver_request_duration_seconds_bucket
  - apiserver_current_inflight_requests
- Expected result:
  - Request latency is within normal bounds; inflight requests are not saturated
- additional info:
  - A saturated API server will start returning 429 or 503 errors to clients

Check for throttled or rate-limited clients
- Command / Action:
  - Look for 429 (Too Many Requests) responses in client metrics
  - sum by (job) (rate(rest_client_requests_total{code="429"}[5m]))
- Expected result:
  - No or very low rate of 429 responses
- additional info:
  - A client generating too many requests may need its reconciliation interval or list/watch frequency reduced

Identify deprecated or removed API usage
- Command / Action:
  - Check API server logs or metrics for requests to deprecated API versions
  - apiserver_requested_deprecated_apis
  - kubectl logs <apiserver-pod> -n kube-system | grep -iE "deprecated|removed"
- Expected result:
  - No requests to removed API versions that would result in 404/410 errors
- additional info:
  - After a Kubernetes upgrade, operators using removed APIs will fail; update the operator or its CRDs
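
If Prometheus is not at hand, the `apiserver_requested_deprecated_apis` metric can also be read from the API server's `/metrics` endpoint. The following is a small sketch, assuming the series carry labels such as `group`, `version`, `resource`, and `removed_release` (check the output of your own cluster). The helper name `deprecated_api_series` is hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# keep only deprecated-API series whose value is 1 (requested at least once).
deprecated_api_series() {
  awk 'index($0, "apiserver_requested_deprecated_apis{") == 1 && $NF == 1'
}

# Usage against a live cluster (requires permission to read /metrics):
#   kubectl get --raw /metrics | deprecated_api_series
```

Each line printed names one deprecated API that a client has requested. The `removed_release` label, where present, indicates the Kubernetes release in which that API stops being served.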

Check admission webhook availability
- Command / Action:
  - Verify that admission webhooks are reachable and not timing out
  - kubectl get validatingwebhookconfigurations
  - kubectl get mutatingwebhookconfigurations
- Expected result:
  - Webhook endpoints are reachable and responding within their timeout
- additional info:
  - A failing webhook with failurePolicy: Fail will cause all matching API requests to return 500 errors

Verify client certificate validity
- Command / Action:
  - Check if the client certificates used to authenticate with the API server are valid and not expired
  - kubectl get csr
  - openssl x509 -in <cert-file> -noout -dates
- Expected result:
  - Certificates are valid and not expired
- additional info:
  - Expired client certificates will cause all requests from that component to fail with 401 Unauthorized
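
Reading the dates by eye works for a single file. For scripted checks, `openssl x509 -checkend` reports through its exit status whether a certificate will still be valid a given number of seconds from now. The following is a minimal sketch. The helper name `cert_valid_for` and the 7-day default are illustrative assumptions.

```shell
# Hypothetical helper: succeed only if the PEM certificate in $1 is still
# valid N seconds from now (default 604800 = 7 days, an arbitrary choice).
cert_valid_for() {
  cert_file="$1"
  seconds="${2:-604800}"
  openssl x509 -in "$cert_file" -noout -checkend "$seconds" >/dev/null
}

# Usage (certificate path is a placeholder):
#   cert_valid_for <cert-file> 86400 || echo "certificate expires within 24h"
```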

Check network connectivity between clients and the API server
- Command / Action:
  - Verify network connectivity from the affected pod/node to the API server endpoint
  - kubectl exec -it <pod-name> -n <namespace> -- curl -k https://kubernetes.default.svc/healthz
- Expected result:
  - Returns ok; connectivity is confirmed
- additional info:
  - Network policies, firewall rules, or CNI issues may block access to the API server from certain pods or nodes
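
The curl check above does not always return ok even when the network path is fine. For example, a cluster that rejects unauthenticated requests may answer with 401 or 403, which still proves that the API server was reached. The sketch below maps the HTTP status code to a rough interpretation. The helper name `classify_healthz_code` and the wording of the messages are illustrative assumptions.

```shell
# Hypothetical helper: interpret the HTTP status code that curl reports
# for a request to the API server's /healthz endpoint.
classify_healthz_code() {
  case "$1" in
    200)     echo "reachable and healthy" ;;
    401|403) echo "reachable, but the request was not authorized" ;;
    5??)     echo "reachable, but the API server returned a server error" ;;
    000)     echo "no HTTP response: DNS, routing, or TLS problem" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Usage from inside the affected pod (requires curl in the image):
#   code=$(curl -sk -o /dev/null -w '%{http_code}' https://kubernetes.default.svc/healthz)
#   classify_healthz_code "$code"
```

curl reports `000` as the status code when it receives no HTTP response at all, which is the case that points to network policies, firewall rules, or CNI issues.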

Restart the affected client if errors persist after root cause is resolved
- Command / Action:
  - If the root cause is fixed but the client is still in an error state, restart it
  - kubectl rollout restart deployment <client-deployment> -n <namespace>
- Expected result:
  - The client reconnects to the API server successfully; error rate drops to zero
- additional info:
  - Some clients cache stale connections or credentials and require a restart to recover
