KubePodCrashLooping

Description

This alert fires when a Kubernetes Pod is repeatedly crashing and restarting, entering a CrashLoopBackOff state.
It indicates that the container starts but fails shortly after, preventing the application from running normally and potentially impacting service availability.

Possible causes

Application runtime errors or unhandled exceptions
Incorrect command or entrypoint configuration
Missing or invalid environment variables or secrets
Dependency failures (databases, APIs, external services)
Failing liveness or startup probes
Insufficient resources: a memory limit that is too low causes OOMKills, and CPU throttling can make probes time out
Configuration or image changes introduced during a recent deployment
File system or permission issues

Severity estimation

Low to Critical, depending on impact and scope.

Low if the pod is non-critical or has redundancy
Medium if some replicas are crash looping but service remains available
High if crash looping affects user-facing or critical services
Critical if all replicas of a workload are crash looping

Severity increases with:

Duration of the crash loop
Number of affected pods
Criticality of the application

Troubleshooting steps

Identify crash looping pods
- Command / Action:
  - List pods and check restart counts
  - kubectl get pods -n <namespace>
- Expected result:
  - Crash looping pods show CrashLoopBackOff in the STATUS column and a high count in the RESTARTS column
- Additional info:
  - Focus on pods with frequent restarts
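
When many pods are listed, the same check can be scripted against the JSON form of the command. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get pods -n <namespace> -o json`; the function name `crashlooping_pods` is hypothetical, and init containers (reported under `initContainerStatuses`) are not inspected.

```python
from typing import List, Tuple


def crashlooping_pods(pod_list: dict) -> List[Tuple[str, int]]:
    """Return (pod name, restart count) for pods with a container
    currently waiting with reason CrashLoopBackOff."""
    hits = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        looping = [
            s for s in statuses
            if s.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
        ]
        if looping:
            # Report the highest restart count among the looping containers.
            restarts = max(s.get("restartCount", 0) for s in looping)
            hits.append((pod["metadata"]["name"], restarts))
    # Most-restarted pods first, so the worst offenders are at the top.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

One way to feed it is to pipe the kubectl output into a script that calls `json.load(sys.stdin)` and passes the result to the function.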

Describe the Pod
- Command / Action:
  - Inspect pod events and restart reasons
  - kubectl describe pod <pod-name> -n <namespace>
- Expected result:
  - Events indicate the reason for container restarts
- Additional info:
  - Look for probe failures, OOMKilled, or config errors
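
The Events section of the describe output can also be pulled as structured data. This sketch assumes the input is the parsed output of `kubectl get events -n <namespace> -o json` and keeps only Warning events that belong to one pod; the helper name `warning_events` is hypothetical.

```python
from typing import List


def warning_events(event_list: dict, pod_name: str) -> List[str]:
    """Summarise Warning events for one pod as 'Reason xCount: message'."""
    lines = []
    for ev in event_list.get("items", []):
        obj = ev.get("involvedObject", {})
        # Skip events that belong to other objects.
        if obj.get("kind") != "Pod" or obj.get("name") != pod_name:
            continue
        # Normal events (Pulled, Created, Started) are noise here.
        if ev.get("type") != "Warning":
            continue
        reason = ev.get("reason", "?")
        count = ev.get("count", 1)  # count may be absent; treat as one occurrence
        lines.append(f"{reason} x{count}: {ev.get('message', '')}")
    return lines
```

Reasons such as BackOff and Unhealthy in the result point at restart back-off and probe failures respectively.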

Check previous container logs
- Command / Action:
  - Review logs from the last failed container instance
  - kubectl logs <pod-name> -n <namespace> --previous
- Expected result:
  - Logs reveal the error that caused the crash
- Additional info:
  - Current logs may be empty if the container crashes quickly
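
For pods with several containers, or when only the tail of the log is needed, the command can be assembled programmatically. This is a sketch under the assumption that `kubectl` is on the PATH and already points at the right cluster; both function names are hypothetical, and the default of 100 lines is an arbitrary choice.

```python
import subprocess
from typing import List, Optional


def previous_logs_command(pod: str, namespace: str,
                          container: Optional[str] = None,
                          tail: int = 100) -> List[str]:
    """Build the argv for fetching logs of the previous container instance."""
    cmd = ["kubectl", "logs", pod, "-n", namespace, "--previous", f"--tail={tail}"]
    if container:
        # Needed when the pod runs more than one container.
        cmd += ["-c", container]
    return cmd


def fetch_previous_logs(pod: str, namespace: str,
                        container: Optional[str] = None,
                        tail: int = 100) -> str:
    """Run the command and return its output (stderr if stdout is empty)."""
    result = subprocess.run(previous_logs_command(pod, namespace, container, tail),
                            capture_output=True, text=True, check=False)
    return result.stdout or result.stderr
```

Passing the arguments as a list avoids shell quoting problems with unusual pod or namespace names.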

Verify resource limits
- Command / Action:
  - Check CPU and memory limits
  - kubectl get pod <pod-name> -n <namespace> -o yaml
- Expected result:
  - Resource limits are sufficient for the workload
- Additional info:
  - A last termination reason of OOMKilled (exit code 137) means the container exceeded its memory limit or the node ran out of memory
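
Reading limits and termination reasons out of the full YAML by eye is error prone. The sketch below assumes the pod is fetched with `-o json` instead of `-o yaml` and pairs each container's configured limits with its last termination state; the name `limits_and_last_exit` is hypothetical.

```python
from typing import List


def limits_and_last_exit(pod: dict) -> List[dict]:
    """Pair each container's resource limits with its last termination state."""
    limits = {
        c["name"]: c.get("resources", {}).get("limits", {})
        for c in pod.get("spec", {}).get("containers", [])
    }
    report = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        name = status["name"]
        # lastState.terminated is only present after at least one restart.
        term = status.get("lastState", {}).get("terminated", {})
        report.append({
            "container": name,
            "memory_limit": limits.get(name, {}).get("memory"),
            "cpu_limit": limits.get(name, {}).get("cpu"),
            "last_reason": term.get("reason"),
            "exit_code": term.get("exitCode"),
            "oom_killed": term.get("reason") == "OOMKilled",
        })
    return report
```

A container with `oom_killed` set and a small `memory_limit` is a candidate for a higher limit or a memory leak investigation.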

Check probes configuration
- Command / Action:
  - Review liveness and startup probes
  - kubectl get <resource> <name> -n <namespace> -o yaml
- Expected result:
  - Probes allow enough startup and recovery time
- Additional info:
  - Overly aggressive probes can cause crash loops
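
Whether a probe is too aggressive can be estimated from its settings. As a rough sketch, the time a container gets before the kubelet restarts it is approximately `initialDelaySeconds + failureThreshold * periodSeconds`; the code applies the Kubernetes defaults (0, 3 and 10) when a field is omitted. The function names are hypothetical, and per-attempt `timeoutSeconds` is ignored for simplicity.

```python
from typing import Optional


def probe_window_seconds(probe: dict) -> int:
    """Approximate seconds of consecutive failures tolerated before a restart."""
    initial_delay = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)
    failures = probe.get("failureThreshold", 3)
    return initial_delay + period * failures


def startup_budget(container: dict) -> Optional[int]:
    """Time the container has to become healthy, or None if no probe applies.

    The startup probe takes precedence because the liveness probe does not
    run until the startup probe has succeeded.
    """
    probe = container.get("startupProbe") or container.get("livenessProbe")
    return probe_window_seconds(probe) if probe else None
```

Comparing the returned budget with the application's observed startup time shows whether the probe needs a larger `failureThreshold` or `periodSeconds`.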

Verify configuration and secrets
- Command / Action:
  - Check ConfigMaps and Secrets used by the pod
  - kubectl describe pod <pod-name> -n <namespace>
- Expected result:
  - Required configuration is present and mounted correctly
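- Additional info:
  - A missing Secret or ConfigMap object usually blocks container creation (for example CreateContainerConfigError); a missing or wrong value inside one more often causes an immediate container exit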
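
To know which objects to verify, the names of all referenced Secrets and ConfigMaps can be collected from the pod spec. This sketch assumes the pod is available as parsed JSON from `kubectl get pod <pod-name> -n <namespace> -o json`; the name `referenced_config` is hypothetical, and projected volumes are not covered.

```python
from typing import Dict, List


def referenced_config(pod: dict) -> Dict[str, List[str]]:
    """Collect Secret and ConfigMap names referenced by env, envFrom and volumes."""
    secrets, configmaps = set(), set()
    spec = pod.get("spec", {})
    for c in spec.get("containers", []) + spec.get("initContainers", []):
        # Single keys injected as environment variables.
        for env in c.get("env", []):
            source = env.get("valueFrom", {})
            if "secretKeyRef" in source:
                secrets.add(source["secretKeyRef"].get("name"))
            if "configMapKeyRef" in source:
                configmaps.add(source["configMapKeyRef"].get("name"))
        # Whole objects injected as environment variables.
        for source in c.get("envFrom", []):
            if "secretRef" in source:
                secrets.add(source["secretRef"].get("name"))
            if "configMapRef" in source:
                configmaps.add(source["configMapRef"].get("name"))
    # Objects mounted as files.
    for vol in spec.get("volumes", []):
        if "secret" in vol:
            secrets.add(vol["secret"].get("secretName"))
        if "configMap" in vol:
            configmaps.add(vol["configMap"].get("name"))
    return {
        "secrets": sorted(n for n in secrets if n),
        "configmaps": sorted(n for n in configmaps if n),
    }
```

Each returned name can then be checked with `kubectl get secret <name> -n <namespace>` or `kubectl get configmap <name> -n <namespace>`.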

Check recent changes
- Command / Action:
  - Review recent deployments or configuration updates
  - kubectl rollout history deployment <deployment-name> -n <namespace>
- Expected result:
  - Recent changes are expected and valid
- Additional info:
  - Roll back if a recent change introduced the crash loop
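
The rollout history lists revisions but not what changed between them. One way to spot an image change is to compare the ReplicaSets that a Deployment keeps per revision. This is a sketch, assuming the input is the parsed output of `kubectl get replicasets -n <namespace> -l <selector> -o json` filtered to a single Deployment; it relies on the `deployment.kubernetes.io/revision` annotation, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

REVISION_ANNOTATION = "deployment.kubernetes.io/revision"


def images_by_revision(rs_list: dict) -> List[Tuple[int, List[str]]]:
    """Return (revision, container images) pairs sorted by revision number."""
    revisions = []
    for rs in rs_list.get("items", []):
        rev = rs.get("metadata", {}).get("annotations", {}).get(REVISION_ANNOTATION)
        if rev is None:
            continue  # ReplicaSet not managed by a Deployment
        containers = rs["spec"]["template"]["spec"]["containers"]
        # The annotation is a string; convert so that 10 sorts after 2.
        revisions.append((int(rev), [c["image"] for c in containers]))
    return sorted(revisions)


def latest_image_change(rs_list: dict) -> Optional[Tuple[List[str], List[str]]]:
    """Return (previous images, current images) if the newest revision changed them."""
    revisions = images_by_revision(rs_list)
    if len(revisions) < 2:
        return None
    old, new = revisions[-2][1], revisions[-1][1]
    return (old, new) if old != new else None
```

A result of None means the images are unchanged, so the cause is more likely in configuration, environment or probes than in the image itself.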

Roll back or fix and redeploy
- Command / Action:
  - Roll back to last known good version or apply a fix
  - kubectl rollout undo deployment <deployment-name> -n <namespace>
- Expected result:
  - Pods stabilize and remain in Running state
- Additional info:
  - Avoid repeated restarts without addressing the root cause
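
A pod that shows Running immediately after a rollback may still be between two crashes. A simple way to confirm stabilization is to compare two snapshots taken a few minutes apart. The sketch below assumes both inputs are parsed outputs of `kubectl get pods -n <namespace> -l <selector> -o json`; the name `is_stable` and the choice of criteria (all containers ready, no new restarts) are assumptions rather than an official definition.

```python
def _restart_totals(pod_list: dict) -> dict:
    """Map pod name to the sum of its containers' restart counts."""
    return {
        pod["metadata"]["name"]: sum(
            s.get("restartCount", 0)
            for s in pod.get("status", {}).get("containerStatuses", [])
        )
        for pod in pod_list.get("items", [])
    }


def is_stable(earlier: dict, later: dict) -> bool:
    """True if every pod in the later snapshot is ready and has not restarted
    since the earlier snapshot."""
    before = _restart_totals(earlier)
    after = _restart_totals(later)
    pods = later.get("items", [])
    if not pods:
        return False  # nothing running is not a healthy state
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses or not all(s.get("ready") for s in statuses):
            return False
        name = pod["metadata"]["name"]
        # Pods missing from the earlier snapshot are compared with themselves.
        if after[name] > before.get(name, after[name]):
            return False
    return True
```

Pods created between the two snapshots pass the restart comparison by construction, so a third snapshot is worth taking if pod names changed.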

Additional resources

Kubernetes Pods documentation
Kubernetes Pod lifecycle and troubleshooting
Kubernetes debugging applications
Related alert: KubePodNotEnoughHealthyPods
Related alert: KubeDeploymentReplicasMismatch