Alert Runbooks

KubePodCrashLooping

Description

This alert fires when a Kubernetes Pod is repeatedly crashing and restarting, entering a CrashLoopBackOff state.
It indicates that the container starts but fails shortly after, preventing the application from running normally and potentially impacting service availability.


Possible Causes:


Severity estimation

Medium to High severity, depending on impact and scope.

Severity increases with:


Troubleshooting steps

  1. Identify crash looping pods

    • Command / Action:
      • List pods and check restart counts
      • kubectl get pods -n <namespace>

    • Expected result:
      • Crash looping pods show CrashLoopBackOff and high restart counts
    • additional info:
      • Focus on pods with frequent restarts
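
In a busy namespace the listing can be long. A minimal sketch of a filter that keeps only crash-looping pods and their restart counts follows; the function name `crashlooping_pods` is hypothetical, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE), which changes if `-A` is used.

```shell
# Keep only pods whose STATUS column is CrashLoopBackOff.
# Reads the `kubectl get pods` table on stdin, so it also works on saved output.
# Assumes default columns: NAME READY STATUS RESTARTS AGE.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage against a live cluster (my-namespace is a placeholder):
#   kubectl get pods -n my-namespace | crashlooping_pods
```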

  2. Describe the Pod

    • Command / Action:
      • Inspect pod events and restart reasons
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Events indicate the reason for container restarts
    • additional info:
      • Look for probe failures, OOMKilled, or config errors
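
When the describe output is long, the last termination reason and exit code can also be read directly from the pod status. The sketch below is one way to do that; the function names are hypothetical, and the exit-code hints are general conventions (128 plus the signal number) rather than anything specific to this alert.

```shell
# Print container name, last termination reason, and exit code, one per line.
# $1 = pod name, $2 = namespace
last_termination() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Rough hint for common exit codes; anything else is application-specific.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check why the process stops" ;;
    137) echo "SIGKILL (128+9); often OOMKilled" ;;
    139) echo "SIGSEGV (128+11)" ;;
    143) echo "SIGTERM (128+15)" ;;
    *)   echo "application-specific error; check logs" ;;
  esac
}
```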

  3. Check previous container logs

    • Command / Action:
      • Review logs from the last failed container instance
      • kubectl logs <pod-name> -n <namespace> --previous

    • Expected result:
      • Logs reveal the error that caused the crash
    • additional info:
      • Current logs may be empty if the container crashes quickly
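
For multi-container pods, or when the previous log is large, it can help to name the container and limit the output. A small sketch, assuming a hypothetical wrapper named `previous_logs` and an arbitrary limit of 100 lines:

```shell
# Show the tail of the previous (crashed) container instance's log.
# $1 = pod name, $2 = namespace, $3 = optional container name
previous_logs() {
  if [ -n "${3:-}" ]; then
    kubectl logs "$1" -n "$2" -c "$3" --previous --tail=100
  else
    kubectl logs "$1" -n "$2" --previous --tail=100
  fi
}
```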

  4. Verify resource limits

    • Command / Action:
      • Check CPU and memory limits
      • kubectl get pod <pod-name> -n <namespace> -o yaml

    • Expected result:
      • Resource limits are sufficient for the workload
    • additional info:
      • OOMKilled events indicate insufficient memory
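
The full YAML is verbose. As a sketch, the same information can be narrowed to the resource fields with a JSONPath query; the function name `container_limits` is hypothetical, and empty columns simply mean the field is not set.

```shell
# Print container name, memory request, memory limit, and CPU limit per container.
# $1 = pod name, $2 = namespace
container_limits() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\t"}{.resources.limits.cpu}{"\n"}{end}'
}
```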

  5. Check probes configuration

    • Command / Action:
      • Review liveness and startup probes
      • kubectl get <resource> <name> -n <namespace> -o yaml

    • Expected result:
      • Probes allow enough startup and recovery time
    • additional info:
      • Overly aggressive probes can cause crash loops
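
To judge whether a probe is too aggressive, it helps to work out how long the container has before repeated failures trigger a restart. The sketch below uses the approximation initialDelaySeconds + periodSeconds × failureThreshold and ignores timeoutSeconds; the function name `probe_budget` is hypothetical.

```shell
# Approximate seconds before a continuously failing probe restarts the container.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_budget() {
  echo $(( $1 + $2 * $3 ))
}

# A startup probe with periodSeconds=10 and failureThreshold=30 allows about 300 s.
probe_budget 0 10 30
```

If the application's observed startup time is close to or above this figure, raising failureThreshold or periodSeconds on the startup probe is usually the first thing to try.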

  6. Verify configuration and secrets

    • Command / Action:
      • Check ConfigMaps and Secrets used by the pod
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Required configuration is present and mounted correctly
    • additional info:
      • Missing secrets often cause immediate container exits
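
Once the referenced names are known from the describe output, their existence can be checked in one pass. This is a sketch only: `report_missing` is a hypothetical helper, and it checks that each object exists, not that its keys or contents are correct.

```shell
# Report which of the given Secrets or ConfigMaps do not exist in a namespace.
# $1 = namespace, $2 = kind (secret or configmap), remaining args = object names
report_missing() {
  ns="$1"; kind="$2"; shift 2
  for name in "$@"; do
    kubectl get "$kind" "$name" -n "$ns" >/dev/null 2>&1 || echo "missing $kind: $name"
  done
}

# Usage (names are placeholders):
#   report_missing my-namespace secret api-key db-creds
# Secret names mounted as volumes can be listed with:
#   kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.volumes[*].secret.secretName}'
# (this does not cover secrets referenced through env or envFrom)
```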

  7. Check recent changes

    • Command / Action:
      • Review recent deployments or configuration updates
      • kubectl rollout history deployment <deployment-name> -n <namespace>

    • Expected result:
      • Recent changes are expected and valid
    • additional info:
      • Roll back if a recent change introduced the crash loop
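
The history listing only shows revision numbers and change causes. To see what a given revision actually contained, the pod template for that revision can be printed; the wrapper name `revision_detail` below is hypothetical.

```shell
# Show the pod template recorded for one revision of a deployment.
# $1 = deployment name, $2 = namespace, $3 = revision number
revision_detail() {
  kubectl rollout history "deployment/$1" -n "$2" --revision="$3"
}
```

Comparing the output for the current and previous revision (for example with `diff`) shows which image, environment variable, or probe setting changed.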

  8. Roll back or fix and redeploy

    • Command / Action:
      • Roll back to last known good version or apply a fix
      • kubectl rollout undo deployment <deployment-name> -n <namespace>

    • Expected result:
      • Pods stabilize and remain in Running state
    • additional info:
      • Avoid repeated restarts without addressing the root cause
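
A rollback is only useful if the result is verified. As a sketch, the undo can be chained with a wait on the rollout so the command fails visibly when pods do not stabilize; `rollback_and_wait` is a hypothetical wrapper and the 120-second timeout is an arbitrary choice.

```shell
# Roll back a deployment, then block until the rollout completes or times out.
# $1 = deployment name, $2 = namespace
rollback_and_wait() {
  kubectl rollout undo "deployment/$1" -n "$2" &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}
```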

Additional resources