KubePodNotEnoughHealthyPods

Description

This alert fires when a workload (such as a Deployment, StatefulSet, or ReplicaSet) does not have enough healthy pods compared to the expected or configured number.
It indicates that one or more pods are not Ready, unavailable, or failing, which may reduce service capacity or cause partial or full outages.

Possible causes

Pods failing readiness or liveness probes
Application crashes (CrashLoopBackOff)
Insufficient cluster resources (CPU, memory, disk)
Scheduling issues (taints, node selectors, affinity rules)
Image pull failures (ImagePullBackOff, ErrImagePull)
Node failures or nodes in NotReady state
Ongoing rollout or deployment update
Misconfigured health checks
Dependency failures (databases, APIs, external services)

Severity estimation

Severity ranges from Low to Critical, depending on workload criticality and impact:

Low if reduced health occurs briefly during a rollout
Medium if some pods are unhealthy but redundancy exists
High if user-facing services are affected
Critical if healthy pod count drops below minimum required capacity or reaches zero

Severity increases with:

Duration of unhealthy state
Number of affected pods
Criticality of the service

Troubleshooting steps

Identify affected workload
- Command / Action:
  - Check which workload is missing healthy pods
  - kubectl get deployment,statefulset,replicaset -n <namespace>
- Expected result:
  - Healthy workloads show matching desired and ready pod counts
- additional info:
  - Focus on workloads where the ready pod count is lower than the desired count (for example, a Deployment showing READY 1/3)
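
On a busy namespace the full listing can be long. The following is a minimal sketch, assuming awk is available, that prints only Deployments whose ready replica count is below the desired count. The function names filter_short_workloads and short_deployments are hypothetical helpers, not part of kubectl.

```shell
# Hypothetical helper: reads "NAME DESIRED READY" rows (header included)
# and prints only the rows where READY is lower than DESIRED.
filter_short_workloads() {
  awk 'NR > 1 {
    # custom-columns prints <none> when readyReplicas is absent (zero ready pods)
    ready = ($3 == "<none>") ? 0 : $3
    if (ready + 0 < $2 + 0) print $1, "ready=" ready, "desired=" $2
  }'
}

# Hypothetical wrapper: $1 is the namespace.
short_deployments() {
  kubectl get deployment -n "$1" \
    -o custom-columns='NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas' \
    | filter_short_workloads
}
```

Usage: short_deployments <namespace>. The same approach should work for StatefulSets by changing the resource type, since they expose the same two fields.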

Inspect pods
- Command / Action:
  - List pods and check their status
  - kubectl get pods -n <namespace>
- Expected result:
  - Pods are Running and Ready
- additional info:
  - Investigate pods that are Pending, in CrashLoopBackOff, or Running but not Ready (for example, READY 0/1)
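
To surface only the pods worth investigating, the listing can be filtered. This is a sketch under the assumption that the default kubectl get pods column order (NAME, READY, STATUS) applies; filter_unhealthy_pods and unhealthy_pods are hypothetical names.

```shell
# Hypothetical helper: reads "kubectl get pods --no-headers" rows and prints
# pods that are not Running, or Running with fewer ready containers than total.
filter_unhealthy_pods() {
  awk '{
    split($2, r, "/")                 # READY column looks like 1/2
    if ($3 == "Completed") next       # finished Job pods are not unhealthy
    if ($3 != "Running" || r[1] + 0 < r[2] + 0) print $1, $2, $3
  }'
}

# Hypothetical wrapper: $1 is the namespace.
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers | filter_unhealthy_pods
}
```

Usage: unhealthy_pods <namespace>.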

Describe unhealthy pods
- Command / Action:
  - Inspect pod details and events
  - kubectl describe pod <pod-name> -n <namespace>
- Expected result:
  - Events show normal scheduling and startup
- additional info:
  - Look for probe failures, image issues, or resource errors
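
When the describe output is noisy, the events can also be queried directly. A minimal sketch, assuming the pod's events are still retained by the cluster; pod_warnings is a hypothetical helper name.

```shell
# Hypothetical helper: $1 is the namespace, $2 is the pod name.
# Lists only Warning events that reference the pod, oldest first.
pod_warnings() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,type=Warning" \
    --sort-by=.lastTimestamp
}
```

Usage: pod_warnings <namespace> <pod-name>. Typical reasons to look for include Unhealthy (probe failures), Failed or BackOff (image pulls and restarts), and FailedScheduling (resource or placement constraints).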

Check container logs
- Command / Action:
  - Review logs for failing containers
  - kubectl logs <pod-name> -n <namespace>
- Expected result:
  - Application runs without repeated errors
- additional info:
  - For restarts, check previous logs with --previous
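
For multi-container pods it helps to know which container restarted before pulling its previous logs. The sketch below assumes awk is available; restarted_containers and previous_logs are hypothetical helper names, and the 50-line tail is an arbitrary choice.

```shell
# Hypothetical helper: reads "CONTAINER<TAB>RESTARTS<TAB>REASON" rows and
# prints the names of containers that have restarted at least once.
restarted_containers() {
  awk -F '\t' '$2 + 0 > 0 { print $1 }'
}

# Hypothetical wrapper: $1 is the namespace, $2 is the pod name.
# Prints the tail of the previous instance's logs for each restarted container.
previous_logs() {
  ns=$1
  pod=$2
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
    | restarted_containers \
    | while read -r c; do
        echo "== $c (previous instance) =="
        kubectl logs "$pod" -n "$ns" -c "$c" --previous --tail=50
      done
}
```

Usage: previous_logs <namespace> <pod-name>. The third column of the jsonpath output (for example OOMKilled or Error) is often a quicker hint than the logs themselves.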

Verify readiness and liveness probes
- Command / Action:
  - Review probe configuration in the workload spec
  - kubectl get <resource> <name> -n <namespace> -o yaml
- Expected result:
  - Probes reflect realistic startup and response times
- additional info:
  - Overly strict probes can cause healthy apps to appear unhealthy
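
To judge whether a probe is too strict, it can help to extract just the probe settings and estimate how long a container is given before it is marked failed. This is a rough sketch: show_probes and probe_budget are hypothetical helpers, the example targets a Deployment, and the estimate ignores timeoutSeconds and kubelet jitter.

```shell
# Hypothetical helper: $1 is the namespace, $2 is the Deployment name.
# Prints each container name followed by its readiness and liveness probe config.
show_probes() {
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{"  readiness: "}{.readinessProbe}{"\n"}{"  liveness: "}{.livenessProbe}{"\n"}{end}'
}

# Hypothetical helper: rough number of seconds before a probe is considered failed.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_budget() {
  echo $(( $1 + $2 * $3 ))
}
```

Usage: probe_budget 10 5 3 prints 25, meaning an application that needs more than roughly 25 seconds to start would likely fail this probe. When a field is not set, Kubernetes defaults apply (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3).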

Check node health
- Command / Action:
  - Ensure nodes are healthy and schedulable
  - kubectl get nodes
- Expected result:
  - Nodes are in Ready state
- additional info:
  - Node pressure or failures can affect pod health
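
On larger clusters, a filter makes problem nodes easier to spot. A minimal sketch, assuming the default kubectl get nodes column order (NAME, STATUS); filter_unready_nodes and unready_nodes are hypothetical names.

```shell
# Hypothetical helper: reads "kubectl get nodes --no-headers" rows and prints
# nodes whose status is anything other than plain Ready. This includes
# cordoned nodes, which show as Ready,SchedulingDisabled.
filter_unready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Hypothetical wrapper around the cluster-wide node listing.
unready_nodes() {
  kubectl get nodes --no-headers | filter_unready_nodes
}
```

For any node this prints, kubectl describe node <node-name> shows its conditions (such as MemoryPressure or DiskPressure) and recent events.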

Review recent changes
- Command / Action:
  - Check recent deployments or configuration changes
  - kubectl rollout history deployment <deployment-name> -n <namespace>
- Expected result:
  - Recent changes are expected and valid
- additional info:
  - Consider rollback if a recent change caused the issue
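
The history listing only shows revision numbers and change causes. To see what a given revision actually contained, the pod template of that revision can be printed. A small sketch; show_revision is a hypothetical helper name.

```shell
# Hypothetical helper: $1 is the namespace, $2 is the Deployment name,
# $3 is the revision number taken from the rollout history listing.
show_revision() {
  kubectl rollout history "deployment/$2" -n "$1" --revision="$3"
}
```

Usage: show_revision <namespace> <deployment-name> <revision>. Comparing the output for the current and previous revisions usually reveals the image tag, environment, or probe change that preceded the alert.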

Scale or roll back if required
- Command / Action:
  - Scale workload or roll back to a stable version
  - kubectl scale deployment <deployment-name> --replicas=<n> -n <namespace>
  - kubectl rollout undo deployment <deployment-name> -n <namespace>
- Expected result:
  - Healthy pod count meets or exceeds required minimum
- additional info:
  - Always address root cause before scaling permanently
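
After a rollback it is worth waiting for the rollout to finish rather than assuming it succeeded. The sketch below chains the two steps; rollback_and_wait is a hypothetical helper name, and the 120s default timeout is an arbitrary assumption to adjust per workload.

```shell
# Hypothetical helper: $1 is the namespace, $2 is the Deployment name,
# $3 is an optional timeout (defaults to 120s).
# Rolls back to the previous revision, then blocks until the rollout
# completes or the timeout expires (non-zero exit status on failure).
rollback_and_wait() {
  ns=$1
  name=$2
  kubectl rollout undo "deployment/$name" -n "$ns" || return 1
  kubectl rollout status "deployment/$name" -n "$ns" --timeout="${3:-120s}"
}
```

Usage: rollback_and_wait <namespace> <deployment-name>. To return to a specific revision instead of the previous one, kubectl rollout undo also accepts --to-revision=<n>.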

Additional resources

Kubernetes Pods documentation
Kubernetes Pod lifecycle and troubleshooting
Kubernetes Deployment documentation
Related alert: KubeDeploymentReplicasMismatch
Related alert: KubePodCrashLooping