AggregatedAPIErrors

Description

This alert fires when a Kubernetes aggregated API is responding with errors (for example 5xx responses, timeouts, or repeated failures) while still being registered in the cluster.
Unlike KubeAggregatedAPIDown, this alert indicates the API is reachable but unhealthy, leading to partial failures in features that depend on it.

Commonly affected components:

Metrics Server (metrics.k8s.io)
Horizontal Pod Autoscaler (HPA)
Custom Resource APIs
Controllers and operators using aggregated APIs

Possible causes:

Aggregated API server under heavy load
Resource exhaustion (CPU, memory)
Backend dependency failures (databases, external APIs)
TLS handshake or certificate validation issues
Network latency or packet loss
Bugs or crashes inside the aggregated API server
Misconfigured request limits or timeouts
Partial rollout or mixed API versions

Severity estimation

Typically Medium to High, depending on the error rate and the importance of the affected API. As a guide:

Low if errors are sporadic and retries succeed
Medium if error rate is sustained and affects autoscaling or controllers
High if core functionality is degraded
Critical if errors block scaling, reconciliation, or cluster operations

Severity increases with:

Error frequency and duration
Number of dependent workloads
Whether errors are user-facing or control-plane related

Troubleshooting steps

Identify failing aggregated API
- Command / Action:
  - List APIService objects and observe conditions
  - kubectl get apiservice
- Expected result:
  - AVAILABLE shows True for the affected APIService, even though errors appear in metrics or logs
- Additional info:
  - This alert usually fires while the APIService is still marked as available
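
The listing above mixes built-in and aggregated APIs. A minimal sketch for narrowing it to aggregated APIs only follows. The helper name aggregated_apiservices is hypothetical, and it assumes the default table layout (NAME, SERVICE, AVAILABLE, AGE), where built-in APIs show Local in the SERVICE column.

```shell
# Hypothetical helper: keep only aggregated (non-Local) APIServices.
# Reads the default `kubectl get apiservice` table from stdin and prints
# the name, backing service, and AVAILABLE value of each remaining entry.
aggregated_apiservices() {
  awk 'NR > 1 && $2 != "Local" { print $1, $2, $3 }'
}

# Usage (requires cluster access):
#   kubectl get apiservice | aggregated_apiservices
```

Each remaining line names a namespace/service pair, which identifies the namespace to use in the steps that follow.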

Inspect APIService details
- Command / Action:
  - Review conditions and recent transitions
  - kubectl describe apiservice
- Expected result:
  - No frequent flapping or warning conditions
- Additional info:
  - TLS or backend issues may appear intermittently

Check aggregated API pod status
- Command / Action:
  - Inspect pod health and restarts
  - kubectl get pods -n <namespace>
- Expected result:
  - Pods are Running with low restart counts
- Additional info:
  - Frequent restarts indicate instability
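
To surface unstable pods quickly, the restart column can be filtered. This is a rough sketch: the helper name pods_with_restarts and the default threshold of 3 are assumptions, and it expects the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: list pods whose restart count exceeds a threshold.
# Reads `kubectl get pods --no-headers` output from stdin.
# $1 is the threshold (assumed default: 3).
pods_with_restarts() {
  awk -v max="${1:-3}" '$4 + 0 > max + 0 { print $1, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> --no-headers | pods_with_restarts 5
```

A value such as "12 (2m ago)" in the RESTARTS column is handled, because only the leading number is compared.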

Review aggregated API logs
- Command / Action:
  - Look for HTTP 5xx errors, panics, or timeouts
  - kubectl logs -n <namespace> <pod-name>
- Expected result:
  - Requests handled without repeated errors
- Additional info:
  - Stack traces or timeout messages indicate internal failures
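
A quick count of suspicious log lines can help decide whether errors are sporadic or sustained. The sketch below is illustrative only: the helper name count_error_lines is hypothetical, and the default pattern is an assumption that should be adapted to the log format of the specific aggregated API server.

```shell
# Hypothetical helper: count log lines matching common failure keywords.
# Reads log text from stdin; $1 optionally overrides the pattern
# (extended regex, matched case-insensitively).
count_error_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the helper's exit status clean in that case.
  grep -Eic "${1:-panic|timeout|timed out|deadline exceeded}" || true
}

# Usage (requires cluster access):
#   kubectl logs -n <namespace> <pod-name> --since=1h | count_error_lines
```

Running it against two different time windows with --since gives a rough sense of whether the error count is growing.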

Check resource usage
- Command / Action:
  - Inspect CPU and memory usage
  - kubectl top pod -n <namespace>
- Expected result:
  - Resource usage within limits
- Additional info:
  - Resource starvation commonly causes API errors
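
Usage can be compared against thresholds with a small filter. This is a sketch under assumptions: the helper name pods_over_usage is hypothetical, the thresholds must be taken from the pod's configured limits, and the input is assumed to report CPU in millicores (for example 250m) and memory in Mi (for example 180Mi).

```shell
# Hypothetical helper: flag pods above CPU or memory thresholds.
# Reads `kubectl top pod --no-headers` output from stdin
# (columns: NAME CPU(cores) MEMORY(bytes)).
# $1 = CPU threshold in millicores, $2 = memory threshold in Mi.
pods_over_usage() {
  awk -v cpu="$1" -v mem="$2" '
    {
      c = $2; m = $3
      sub(/m$/, "", c)   # strip the millicore suffix
      sub(/Mi$/, "", m)  # strip the Mi suffix
    }
    c + 0 > cpu + 0 || m + 0 > mem + 0 { print $1, $2, $3 }'
}

# Usage (requires cluster access and a working Metrics Server):
#   kubectl top pod -n <namespace> --no-headers | pods_over_usage 500 400
```

If the failing aggregated API is the Metrics Server itself, kubectl top may also fail, in which case node-level or monitoring-stack metrics are needed instead.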

Verify Service and Endpoints
- Command / Action:
  - Ensure traffic is correctly routed
  - kubectl get svc -n <namespace>
  - kubectl get endpoints -n <namespace>
- Expected result:
  - Endpoints match running pods
- Additional info:
  - Stale or missing endpoints can cause intermittent failures

Check networking and latency
- Command / Action:
  - Validate network stability
  - ping <host>
  - curl https://<host>:<port>/healthz
- Expected result:
  - Low latency and consistent responses
- Additional info:
  - Network issues can cause request timeouts

Restart or scale the aggregated API
- Command / Action:
  - Restart or scale replicas to reduce load
  - kubectl rollout restart deployment <deployment-name> -n <namespace>
  - kubectl scale deployment <deployment-name> --replicas=<count> -n <namespace>
- Expected result:
  - Error rate drops and alert resolves
- Additional info:
  - Scaling improves availability for high-traffic APIs

Additional resources

Kubernetes Aggregated API documentation
Kubernetes APIService reference
Metrics Server troubleshooting
Related alert: KubeAggregatedAPIDown
Related alert: TargetDown