AggregatedAPIErrors
Description
This alert fires when a Kubernetes aggregated API is responding with errors (for example 5xx responses, timeouts, or repeated failures) while still being registered in the cluster.
Unlike KubeAggregatedAPIDown, this alert indicates the API is reachable but unhealthy, leading to partial failures in features that depend on it.
Commonly affected components:
- Metrics Server (metrics.k8s.io)
- Horizontal Pod Autoscaler (HPA)
- Custom Resource APIs
- Controllers and operators using aggregated APIs
Possible Causes:
- Aggregated API server under heavy load
- Resource exhaustion (CPU, memory)
- Backend dependency failures (databases, external APIs)
- TLS handshake or certificate validation issues
- Network latency or packet loss
- Bugs or crashes inside the aggregated API server
- Misconfigured request limits or timeouts
- Partial rollout or mixed API versions
Severity estimation
Typically Medium to High, depending on the error rate and the importance of the affected API:
- Low if errors are sporadic and retries succeed
- Medium if error rate is sustained and affects autoscaling or controllers
- High if core functionality is degraded
- Critical if errors block scaling, reconciliation, or cluster operations
Severity increases with:
- Error frequency and duration
- Number of dependent workloads
- Whether errors are user-facing or control-plane related
Troubleshooting steps
- Identify failing aggregated API
  - Command / Action:
    - List APIService objects and observe conditions
    - kubectl get apiservice
  - Expected result:
    - AVAILABLE=True but errors observed in metrics/logs
  - Additional info:
    - This alert usually fires even when APIService is marked available
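Because the APIService can report AVAILABLE=True while requests still fail, it may help to look at the aggregator's own error counter. The sketch below assumes the kube-apiserver exposes the `aggregator_unavailable_apiservice_total` counter on `/metrics` and that your account may read that endpoint. The function name is hypothetical.

```shell
# Hypothetical helper: reads Prometheus-format metrics on stdin and prints the
# aggregator_unavailable_apiservice_total series whose counter is above zero.
nonzero_unavailable_counters() {
  awk '/^aggregator_unavailable_apiservice_total[{ ]/ && $NF + 0 > 0 { print }'
}

# Usage (requires permission to read the API server metrics endpoint):
#   kubectl get --raw /metrics | nonzero_unavailable_counters
```

Each printed series names one APIService, which narrows down where to look in the following steps.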
- Inspect APIService details
  - Command / Action:
    - Review conditions and recent transitions
    - kubectl describe apiservice <apiservice-name>
  - Expected result:
    - No frequent flapping or warning conditions
  - Additional info:
    - TLS or backend issues may appear intermittently
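For a more compact view than the describe output, the conditions can be printed one per line with their last transition time, which makes flapping easier to spot. This is a minimal sketch. The function name is hypothetical and the APIService name in the usage line is only an example.

```shell
# Hypothetical helper: prints one line per APIService condition as
# type, status, reason, and last transition time (tab-separated).
apiservice_conditions() {
  kubectl get apiservice "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.lastTransitionTime}{"\n"}{end}'
}

# Usage: apiservice_conditions v1beta1.metrics.k8s.io
```

A recent transition time on the Available condition suggests the service is flapping between states.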
- Check aggregated API pod status
  - Command / Action:
    - Inspect pod health and restarts
    - kubectl get pods -n <namespace>
  - Expected result:
    - Pods are Running with low restart counts
  - Additional info:
    - Frequent restarts indicate instability
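When the namespace holds many pods, a small filter can surface the unstable ones. The sketch below assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE). The function name and the default threshold of 3 restarts are illustrative assumptions, not recommended values.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin and
# prints pods that are not Running or whose restart count exceeds the limit.
unstable_pods() {
  awk -v max="${1:-3}" '$3 != "Running" || $4 + 0 > max { print $1, $3, $4 }'
}

# Usage: kubectl get pods -n <namespace> --no-headers | unstable_pods 3
```

Pods that finished normally (status Completed) are also printed by this filter, so review the output rather than treating every line as a fault.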
- Review aggregated API logs
  - Command / Action:
    - Look for HTTP 5xx errors, panics, or timeouts
    - kubectl logs -n <namespace> <pod-name>
  - Expected result:
    - Requests handled without repeated errors
  - Additional info:
    - Stack traces or timeout messages indicate internal failures
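Long logs can be narrowed to the lines this step cares about. The pattern below is only a guess at common wording for panics, timeouts, and 5xx status codes, and should be adapted to the log format of the server in question. The function name is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and keeps those that look like
# panics, timeouts, or HTTP 5xx responses. The pattern is an assumption about
# log wording, not a format guaranteed by any particular server.
suspicious_log_lines() {
  # `|| true` keeps the exit status zero when nothing matches.
  grep -Ei 'panic|timeout|timed out|(status|code)[^0-9]{0,3}5[0-9]{2}' || true
}

# Usage: kubectl logs -n <namespace> <pod-name> --since=15m | suspicious_log_lines
```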
- Check resource usage
  - Command / Action:
    - Inspect CPU and memory usage
    - kubectl top pod -n <namespace>
  - Expected result:
    - Resource usage within limits
  - Additional info:
    - Resource starvation commonly causes API errors
- Verify Service and Endpoints
  - Command / Action:
    - Ensure traffic is correctly routed
    - kubectl get svc -n <namespace>
    - kubectl get endpoints -n <namespace>
  - Expected result:
    - Endpoints match running pods
  - Additional info:
    - Stale or missing endpoints can cause intermittent failures
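The check that endpoints match running pods can be scripted by comparing the two sets of IP addresses. This is a sketch under assumptions: the helper name is hypothetical, and the service name, namespace, and label selector in the usage lines are placeholders to fill in.

```shell
# Hypothetical helper: succeeds when two whitespace-separated IP lists contain
# the same addresses, regardless of order.
same_ip_set() {
  # $1 and $2 are left unquoted on purpose so the shell splits them into words.
  ips_a=$(printf '%s\n' $1 | sort -u)
  ips_b=$(printf '%s\n' $2 | sort -u)
  [ "$ips_a" = "$ips_b" ]
}

# Usage (names in angle brackets are placeholders):
#   ep_ips=$(kubectl get endpoints <service-name> -n <namespace> -o jsonpath='{.subsets[*].addresses[*].ip}')
#   pod_ips=$(kubectl get pods -n <namespace> -l <label-selector> -o jsonpath='{.items[*].status.podIP}')
#   same_ip_set "$ep_ips" "$pod_ips" || echo "endpoints and pod IPs differ"
```

A mismatch is not always a routing fault: pods that are not yet ready have an IP but are not listed among the ready endpoint addresses.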
- Check networking and latency
  - Command / Action:
    - Validate network stability
    - ping <host>
    - curl https://<host>:<port>/healthz
  - Expected result:
    - Low latency and consistent responses
  - Additional info:
    - Network issues can cause request timeouts
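A single curl call can miss intermittent failures, so repeating the probe and recording the status code and timing of each attempt may be more informative. The sketch below is illustrative: the function name, the default of 5 attempts, and the 5-second timeout are assumptions, and the URL must be reachable from wherever the probe runs.

```shell
# Hypothetical helper: calls a health URL several times and prints the HTTP
# status code and total request time (seconds) for each attempt.
probe_healthz() {
  url=$1
  attempts=${2:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -k skips certificate verification; drop it to also test TLS validation.
    curl -sk -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# Usage: probe_healthz https://<host>:<port>/healthz 10
```

Widely varying times or occasional non-200 codes across attempts point toward the network or load issues this step is meant to uncover.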
- Restart or scale the aggregated API
  - Command / Action:
    - Restart or scale replicas to reduce load
    - kubectl rollout restart deployment <deployment-name> -n <namespace>
    - kubectl scale deployment <deployment-name> --replicas=<count> -n <namespace>
  - Expected result:
    - Error rate drops and alert resolves
  - Additional info:
    - Scaling improves availability for high-traffic APIs
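After a restart it is worth waiting for the rollout to finish before judging whether the error rate has dropped. The sketch below assumes the aggregated API runs as a Deployment. The function name and the default timeout of 120s are hypothetical choices.

```shell
# Hypothetical helper: restarts a deployment, then blocks until the rollout
# completes or the timeout expires. Arguments: name, namespace, [timeout].
restart_and_wait() {
  kubectl rollout restart "deployment/$1" -n "$2" &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"
}

# Usage: restart_and_wait <deployment-name> <namespace> 180s
```

A non-zero exit status means the restart or the rollout did not complete in time, in which case the pod status and log steps above are the next place to look.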
Additional resources
- Kubernetes Aggregated API documentation
- Kubernetes APIService reference
- Metrics Server troubleshooting
- Related alert: KubeAggregatedAPIDown
- Related alert: TargetDown