Alert Runbooks

AggregatedAPIErrors

Description

This alert fires when a Kubernetes aggregated API is responding with errors (for example 5xx responses, timeouts, or repeated failures) while still being registered in the cluster.
Unlike KubeAggregatedAPIDown, this alert indicates the API is reachable but unhealthy, leading to partial failures in features that depend on it.

Commonly affected components:


Possible Causes:


Severity estimation

Medium to High severity, depending on error rate and API importance.

Severity increases with:


Troubleshooting steps

  1. Identify failing aggregated API

    • Command / Action:
      • List APIService objects and observe conditions
      • kubectl get apiservice

    • Expected result:
      • AVAILABLE=True, but errors are observed in metrics or logs
    • Additional info:
      • This alert usually fires even when the APIService is marked as available
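
To narrow the list to aggregated APIs only, the output of the command above can be filtered on the SERVICE column, which reads Local for built-in API groups. This is a minimal sketch, assuming the default column layout of kubectl get apiservice (NAME, SERVICE, AVAILABLE, AGE); the function name aggregated_apiservices is illustrative and not part of kubectl.

```shell
# Reads `kubectl get apiservice --no-headers` output on stdin and keeps only
# aggregated APIs, i.e. rows whose SERVICE column is not "Local".
# Prints: NAME SERVICE AVAILABLE
aggregated_apiservices() {
  awk '$2 != "Local" { print $1, $2, $3 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get apiservice --no-headers | aggregated_apiservices
```

Each remaining row names the namespace/service that backs the API, which is where the later steps should look.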

  2. Inspect APIService details

    • Command / Action:
      • Review conditions and recent transitions
      • kubectl describe apiservice <apiservice-name>

    • Expected result:
      • No frequent flapping or warning conditions
    • Additional info:
      • TLS or backend issues may appear intermittently

  3. Check aggregated API pod status

    • Command / Action:
      • Inspect pod health and restarts
      • kubectl get pods -n <namespace>

    • Expected result:
      • Pods are Running with low restart counts
    • Additional info:
      • Frequent restarts indicate instability
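
The pod check can be reduced to a filter that prints only pods that are not Running or that exceed a restart threshold. This is a sketch under assumptions: it expects the default kubectl get pods columns (NAME, READY, STATUS, RESTARTS, AGE), and both the function name unstable_pods and the default threshold of 3 are illustrative.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints pods that
# are not in the Running state or have more restarts than the threshold.
# $1: maximum acceptable restart count (hypothetical default: 3)
unstable_pods() {
  max_restarts="${1:-3}"
  awk -v max="$max_restarts" \
    '$3 != "Running" || $4 + 0 > max { print $1, $3, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -n <namespace> --no-headers | unstable_pods 3
```

Pods in a Completed state are also printed, since the filter only treats Running as healthy.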

  4. Review aggregated API logs

    • Command / Action:
      • Look for HTTP 5xx errors, panics, or timeouts
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Requests handled without repeated errors
    • Additional info:
      • Stack traces or timeout messages indicate internal failures
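
A quick way to judge whether errors are repeated is to count matching log lines over a recent window. The sketch below is only a starting point: the search patterns are assumptions about typical log wording, not a format guaranteed by any particular aggregated API server, and the function name count_log_errors is illustrative.

```shell
# Reads log lines on stdin and prints how many contain a common failure
# marker. The pattern list is an assumption; adjust it to the log format of
# the API server being investigated.
count_log_errors() {
  # grep -c exits non-zero when the count is 0, so the status is ignored.
  grep -Eic 'panic|timeout|timed out|deadline exceeded|internal server error' || true
}

# Usage against a live cluster:
#   kubectl logs <pod-name> -n <namespace> --since=15m | count_log_errors
```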

  5. Check resource usage

    • Command / Action:
      • Inspect CPU and memory usage
      • kubectl top pod -n <namespace>

    • Expected result:
      • Resource usage within limits
    • Additional info:
      • Resource starvation commonly causes API errors
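
kubectl top reports usage but not the configured limits, so a threshold has to be supplied separately. The following sketch flags pods above given CPU and memory values, assuming the usual output units (millicores with an m suffix, memory with an Mi suffix). The function name high_usage_pods and the default thresholds are hypothetical; the real limits should be taken from the pod spec.

```shell
# Reads `kubectl top pod --no-headers` output on stdin and prints pods whose
# CPU or memory usage exceeds the given thresholds.
# $1: CPU threshold in millicores (hypothetical default: 500)
# $2: memory threshold in Mi (hypothetical default: 512)
high_usage_pods() {
  max_cpu_m="${1:-500}"
  max_mem_mi="${2:-512}"
  awk -v cpu="$max_cpu_m" -v mem="$max_mem_mi" '{
    c = $2; sub(/m$/, "", c)    # strip the millicore suffix
    m = $3; sub(/Mi$/, "", m)   # strip the mebibyte suffix
    if (c + 0 > cpu || m + 0 > mem) print $1, $2, $3
  }'
}

# Usage against a live cluster (requires metrics-server):
#   kubectl top pod -n <namespace> --no-headers | high_usage_pods 500 512
```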

  6. Verify Service and Endpoints

    • Command / Action:
      • Ensure traffic is correctly routed
      • kubectl get svc -n <namespace>
      • kubectl get endpoints -n <namespace>

    • Expected result:
      • Endpoints match running pods
    • Additional info:
      • Stale or missing endpoints can cause intermittent failures
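
Whether endpoints match running pods can be checked by comparing the two IP lists directly. This sketch assumes both lists are passed as space-separated strings, which is what the jsonpath queries in the usage comment produce; the function name pods_missing_from_endpoints and the placeholders <service-name> and <label-selector> are illustrative.

```shell
# Prints every pod IP that does not appear in the endpoint IP list.
# $1: space-separated endpoint IPs
# $2: space-separated pod IPs
pods_missing_from_endpoints() {
  endpoint_ips="$1"
  pod_ips="$2"
  for ip in $pod_ips; do
    case " $endpoint_ips " in
      *" $ip "*) ;;          # pod IP is registered as an endpoint
      *) echo "$ip" ;;       # pod IP is missing from the endpoints
    esac
  done
}

# Usage against a live cluster:
#   eps=$(kubectl get endpoints <service-name> -n <namespace> \
#     -o jsonpath='{.subsets[*].addresses[*].ip}')
#   pods=$(kubectl get pods -n <namespace> -l <label-selector> \
#     -o jsonpath='{.items[*].status.podIP}')
#   pods_missing_from_endpoints "$eps" "$pods"
```

Swapping the two arguments lists stale endpoint IPs that no longer belong to any pod.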

  7. Check networking and latency

    • Command / Action:
      • Validate network stability
      • ping <host>
      • curl https://<host>:<port>/healthz

    • Expected result:
      • Low latency and consistent responses
    • Additional info:
      • Network issues can cause request timeouts
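
A single request says little about consistency, so it can help to probe the health endpoint several times and record the status code and total time of each attempt. This is a sketch under assumptions: the function name probe_health, the default of 5 probes, and the 5-second timeout are illustrative, and -k is used only because the serving certificate may not be trusted by the machine running the check.

```shell
# Probes a URL repeatedly and prints "<http_code> <seconds>" per attempt.
# $1: URL to probe
# $2: number of probes (hypothetical default: 5)
# $3: pause between probes in seconds (hypothetical default: 1)
probe_health() {
  url="$1"
  count="${2:-5}"
  interval="${3:-1}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # -s: silent, -k: skip TLS verification, -o /dev/null: discard the body.
    # A failed or timed-out request is reported by curl as code 000.
    curl -sk -o /dev/null --max-time 5 \
      -w '%{http_code} %{time_total}\n' "$url" || true
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage:
#   probe_health "https://<host>:<port>/healthz" 10 1
```

Mixed status codes or widely varying times across attempts point to an intermittent backend or network problem rather than a hard outage.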

  8. Restart or scale the aggregated API

    • Command / Action:
      • Restart or scale replicas to reduce load
      • kubectl rollout restart deployment <deployment-name> -n <namespace>
      • kubectl scale deployment <deployment-name> --replicas=<count> -n <namespace>

    • Expected result:
      • Error rate drops and alert resolves
    • Additional info:
      • Scaling improves availability for high-traffic APIs
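
When restarting, it is useful to wait for the rollout to finish before checking whether the error rate has dropped. A minimal sketch, assuming the aggregated API runs as a Deployment; the function name restart_and_wait and the default timeout of 120s are illustrative.

```shell
# Restarts a Deployment and blocks until the rollout completes or times out.
# $1: deployment name
# $2: namespace
# $3: rollout timeout (hypothetical default: 120s)
restart_and_wait() {
  deployment="$1"
  namespace="$2"
  timeout="${3:-120s}"
  # The status check only runs if the restart request was accepted.
  kubectl rollout restart "deployment/$deployment" -n "$namespace" &&
    kubectl rollout status "deployment/$deployment" -n "$namespace" \
      --timeout="$timeout"
}

# Usage against a live cluster:
#   restart_and_wait <deployment-name> <namespace> 180s
```

A non-zero exit status means the restart was rejected or the rollout did not complete within the timeout.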

Additional resources