KubeClientErrors
Description
This alert fires when clients of the Kubernetes API server see a sustained high rate of failed requests, typically more than 1% of requests returning HTTP 5xx responses.
A high client error rate indicates that components such as controllers, schedulers, operators, or custom workloads are failing to communicate with the API server, which can lead to reconciliation failures, missed deployments, and cluster control plane degradation.
Possible Causes:
- Kubernetes API server is overloaded or throttling requests
- API server is restarting or temporarily unavailable
- Network issues between the client and the API server
- Client using an outdated or incompatible API version (deprecated/removed APIs)
- Misconfigured RBAC causing authorization errors
- Faulty operator or controller generating excessive API requests
- API server admission webhooks timing out or rejecting requests
- Expired or invalid client certificates
Severity estimation
Severity ranges from Low to Critical, depending on which clients are affected:
- Low: Non-critical operators or monitoring agents experiencing errors; cluster operations unaffected
- Medium: Core controllers (deployment, replicaset) experiencing errors; reconciliation delayed
- High: Multiple critical components failing to reach the API server; deployments and scaling operations impaired
- Critical: API server effectively unreachable for most clients; cluster control plane non-functional
Severity increases with:
- Error rate (higher % of failed requests)
- Criticality of the affected client (scheduler, controller-manager vs. a custom operator)
- Duration of the error condition
- Number of distinct clients affected
Troubleshooting steps
- Identify which clients are producing errors
  - Command / Action:
    - Check the alert labels for the `job` or `instance` to identify the affected client, then query error rates in Prometheus
    - `sum by (verb, code) (rate(rest_client_requests_total{job="<job>", code=~"5.."}[5m]))`
  - Expected result:
    - Specific HTTP verbs and status codes are identified, narrowing down the type of failure
  - Additional info:
    - Common error codes: `500` (server error), `503` (unavailable), `504` (gateway timeout)
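The Prometheus query in this step can also be run from a shell. The sketch below is one possible way to do that, assuming Prometheus is reachable at the address in `PROM_URL` (a placeholder, not something this runbook defines) and that `curl` is installed. The helper names are hypothetical. `client_error_ratio_query` only builds the PromQL string for the share of 5xx responses seen by one job, and `prom_query` sends an expression to the standard `/api/v1/query` endpoint.

```shell
# Assumption: Prometheus address; replace with your own endpoint.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build a PromQL expression for the fraction of requests from one client
# job that returned HTTP 5xx over the given window (default 5m).
client_error_ratio_query() {
  job="$1"
  window="${2:-5m}"
  printf 'sum(rate(rest_client_requests_total{job="%s",code=~"5.."}[%s])) / sum(rate(rest_client_requests_total{job="%s"}[%s]))' \
    "$job" "$window" "$job" "$window"
}

# Send an instant query to the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl --silent --show-error --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query "$(client_error_ratio_query kube-controller-manager)"` would return the current ratio for that job; a value above `0.01` corresponds to the 1% threshold mentioned in the description.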
- Check API server health and availability
  - Command / Action:
    - Verify the API server pods are running and healthy
    - `kubectl get pods -n kube-system | grep kube-apiserver`
    - `kubectl logs <apiserver-pod> -n kube-system --tail=50`
  - Expected result:
    - API server pods are Running with no error logs indicating crashes or overload
  - Additional info:
    - Check for repeated `timeout`, `etcd`, or `too many requests` messages in API server logs
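Scanning API server logs by eye for the three message types named in this step is error-prone. As a rough sketch, the hypothetical helper below reads log lines on standard input and counts how many mention each pattern, ignoring case. It assumes only a POSIX `awk`; the function name and output format are illustrative.

```shell
# Count log lines that mention each overload-related pattern.
# Reads log text on stdin and prints one summary line.
summarize_apiserver_log() {
  awk '
    { line = tolower($0) }                      # case-insensitive matching
    line ~ /timeout/           { timeouts++ }
    line ~ /etcd/              { etcd++ }
    line ~ /too many requests/ { throttled++ }
    END {
      printf "timeout=%d etcd=%d too_many_requests=%d\n", timeouts, etcd, throttled
    }
  '
}
```

It could be fed from the log command in this step, for example `kubectl logs <apiserver-pod> -n kube-system --tail=500 | summarize_apiserver_log`. A single line can count toward more than one pattern.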
- Check API server request latency and load
  - Command / Action:
    - Inspect API server metrics for high latency or request rate
    - `apiserver_request_duration_seconds_bucket`
    - `apiserver_current_inflight_requests`
  - Expected result:
    - Request latency is within normal bounds; inflight requests are not saturated
  - Additional info:
    - A saturated API server will start returning 429 or 503 errors to clients
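The two metrics in this step are raw series and usually need to be aggregated before they are readable. The sketch below shows one common way to do that, with hypothetical helper names that only print PromQL strings. It assumes the histogram carries a `verb` label and the inflight gauge carries a `request_kind` label, as in recent Kubernetes releases; check your own metrics if the results come back empty.

```shell
# Build a PromQL expression for an API server latency quantile per verb.
# Long-running verbs (WATCH, CONNECT) are excluded because their duration
# does not reflect request latency.
apiserver_latency_query() {
  quantile="${1:-0.99}"
  window="${2:-5m}"
  printf 'histogram_quantile(%s, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[%s])))' \
    "$quantile" "$window"
}

# Expression for current in-flight requests, split by request kind.
apiserver_inflight_query() {
  printf 'sum by (request_kind) (apiserver_current_inflight_requests)'
}
```

The printed expressions can be pasted into the Prometheus expression browser; `apiserver_latency_query 0.99 5m` yields a per-verb p99 over five minutes.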
- Check for throttled or rate-limited clients
  - Command / Action:
    - Look for 429 (Too Many Requests) responses in client metrics
    - `sum by (job) (rate(rest_client_requests_total{code="429"}[5m]))`
  - Expected result:
    - No or very low rate of 429 responses
  - Additional info:
    - A client generating too many requests may need its reconciliation interval or list/watch frequency reduced
- Identify deprecated or removed API usage
  - Command / Action:
    - Check API server logs or metrics for requests to deprecated API versions
    - `apiserver_requested_deprecated_apis`
    - `kubectl logs <apiserver-pod> -n kube-system | grep -iE "deprecated|removed"`
  - Expected result:
    - No requests to removed API versions that would result in 404/410 errors
  - Additional info:
    - After a Kubernetes upgrade, operators using removed APIs will fail; update the operator or its CRDs
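When Prometheus is not at hand, the deprecated-API gauge from this step can be read straight from the API server metrics endpoint. The following is a minimal sketch with a hypothetical helper name. It assumes the metrics arrive in Prometheus text format on standard input and keeps only `apiserver_requested_deprecated_apis` series whose value is 1.

```shell
# Read API server metrics in Prometheus text format on stdin and print the
# apiserver_requested_deprecated_apis series whose value is 1, i.e. the
# deprecated group/version/resource combinations that have been requested.
list_deprecated_api_requests() {
  awk '/^apiserver_requested_deprecated_apis[{]/ && $NF == 1 { print }'
}
```

A possible invocation is `kubectl get --raw /metrics | list_deprecated_api_requests`, which requires permission to read the metrics endpoint. The `removed_release` label on each series indicates the release in which that API is scheduled to disappear.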
- Check admission webhook availability
  - Command / Action:
    - Verify that admission webhooks are reachable and not timing out
    - `kubectl get validatingwebhookconfigurations`
    - `kubectl get mutatingwebhookconfigurations`
  - Expected result:
    - Webhook endpoints are reachable and responding within their timeout
  - Additional info:
    - A failing webhook with `failurePolicy: Fail` will cause all matching API requests to return 500 errors
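The default `kubectl get` output does not show the failure policy, which is the field that matters most here. As a sketch, the hypothetical helper below lists both kinds of webhook configuration with the `failurePolicy` and `timeoutSeconds` of their webhooks, using the `custom-columns` output format. It assumes `kubectl` is already configured for the affected cluster.

```shell
# Print each webhook configuration with the failure policy and timeout of
# its webhooks. Requires kubectl access to the cluster.
list_webhook_policies() {
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    echo "== $kind"
    kubectl get "$kind" \
      -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT_SECONDS:.webhooks[*].timeoutSeconds'
  done
}
```

Running `list_webhook_policies` prints one row per configuration; a configuration with several webhooks shows comma-separated values in the policy and timeout columns. Entries with `Fail` are the ones to check first when their backing service is unhealthy.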
- Verify client certificate validity
  - Command / Action:
    - Check if the client certificates used to authenticate with the API server are valid and not expired
    - `kubectl get csr`
    - `openssl x509 -in <cert-file> -noout -dates`
  - Expected result:
    - Certificates are valid and not expired
  - Additional info:
    - Expired client certificates will cause all requests from that component to fail with 401 Unauthorized
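Reading the `-dates` output works for a single file, but a scripted check is easier to repeat across components. The sketch below wraps the `-checkend` option of `openssl x509` in a hypothetical helper that reports whether a PEM certificate expires within a given number of days. The function name, the seven-day default, and the exit-code convention are assumptions made for illustration.

```shell
# Report whether a PEM certificate expires within the next N days (default 7).
# Exit status: 0 = expires within the window, is already expired, or cannot
#                  be parsed as a certificate,
#              1 = valid beyond the window,
#              2 = file cannot be read.
cert_expires_within() {
  cert_file="$1"
  days="${2:-7}"
  if [ ! -r "$cert_file" ]; then
    echo "cannot read certificate: $cert_file" >&2
    return 2
  fi
  # -checkend succeeds when the certificate is still valid after N seconds.
  if openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null; then
    return 1
  fi
  return 0
}
```

For instance, `cert_expires_within <cert-file> 30 && echo "renew soon"` would print the reminder only when the certificate has fewer than 30 days left.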
- Check network connectivity between clients and the API server
  - Command / Action:
    - Verify network connectivity from the affected pod/node to the API server endpoint
    - `kubectl exec -it <pod-name> -n <namespace> -- curl -k https://kubernetes.default.svc/healthz`
  - Expected result:
    - Returns `ok`; connectivity is confirmed
  - Additional info:
    - Network policies, firewall rules, or CNI issues may block access to the API server from certain pods or nodes
- Restart the affected client if errors persist after root cause is resolved
  - Command / Action:
    - If the root cause is fixed but the client is still in an error state, restart it
    - `kubectl rollout restart deployment <client-deployment> -n <namespace>`
  - Expected result:
    - The client reconnects to the API server successfully; error rate drops to zero
  - Additional info:
    - Some clients cache stale connections or credentials and require a restart to recover
Additional resources
- Kubernetes API Server
- Kubernetes API deprecation policy
- Admission Webhooks
- Kubernetes API flow control
- Related alert: KubeVersionMismatch