Alert Runbooks

KubeJobFailed

Description

This alert fires when a Kubernetes Job has entered a failed state.
It indicates that one or more Job pods have terminated unsuccessfully and the Job has exceeded its retry limit (backoffLimit), meaning the task did not complete as intended.

This may result in missed batch processing, failed data pipelines, or incomplete maintenance tasks.
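
For reference, the Prometheus rule behind this alert (as shipped with the kubernetes-mixin / kube-prometheus-stack; the exact expression, duration, and labels vary by installation) looks roughly like this:

  - alert: KubeJobFailed
    expr: kube_job_failed{job="kube-state-metrics"} > 0
    for: 15m
    labels:
      severity: warning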


Possible Causes:

  • Application errors inside the Job's container (non-zero exit codes)
  • Misconfigured Job spec (wrong image, command, environment variables, or missing config/secrets)
  • backoffLimit set too low, so the Job fails after only a few pod retries
  • activeDeadlineSeconds exceeded before the Job could finish


Severity estimation

Medium to High severity, depending on the purpose of the failed Job.

Severity increases if:

  • The Job is part of a critical data pipeline or batch process with downstream dependencies
  • The Job performs maintenance tasks (for example backups or cleanups) that must not be skipped
  • The Job will not run again automatically (for example it is not managed by a CronJob), so the failure will not self-correct


Troubleshooting steps

  1. Check Job status

    • Command / Action:
      • Inspect Job completion and failure counters
      • kubectl get job <job-name> -n <namespace>

    • Expected result:
      • The Job shows its completions met and no failed pods
      • COMPLETIONS=1/1

    • additional info:
      • A non-zero failure counter indicates the Job has failed pods; it is tracked in the Job's .status.failed field (see the optional check below)
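
    • Optional check:
      • Read the Job's failure counter directly; .status.failed is part of the batch/v1 Job status
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.failed}'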

  2. Describe the Job

    • Command / Action:
      • Review Job events and pod history
      • kubectl describe job <job-name> -n <namespace>

    • Expected result:
      • Events show successful pod execution
    • additional info:
      • Look for messages about backoff limits or deadline exceeded
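    • Optional check:
      • The failure reason is also recorded as a condition on the Job (typically BackoffLimitExceeded or DeadlineExceeded)
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'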

  3. Inspect Job pods

    • Command / Action:
      • List pods created by the Job
      • kubectl get pods -n <namespace> --selector=job-name=<job-name>

    • Expected result:
      • Pods complete successfully
    • additional info:
      • Failed pods may still exist for log inspection
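    • Optional check:
      • List only the pods that ended in a Failed phase (label and field selectors can be combined)
      • kubectl get pods -n <namespace> --selector=job-name=<job-name> --field-selector=status.phase=Failed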

  4. Describe failed pods

    • Command / Action:
      • Inspect pod status and events
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pod exits with code 0
    • additional info:
      • Non-zero exit codes indicate application or configuration errors
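    • Optional check:
      • Print the exit code of the terminated container(s) directly from the pod status
      • kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'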

  5. Check container logs

    • Command / Action:
      • Review logs from the failed container
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Logs show successful task execution
    • additional info:
      • Logs are often the primary source of failure root cause
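    • Optional check:
      • Logs can also be fetched through the Job itself (kubectl picks one of its pods), and --previous shows the prior container run if the pod restarted
      • kubectl logs job/<job-name> -n <namespace>
      • kubectl logs <pod-name> -n <namespace> --previous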

  6. Verify Job configuration

    • Command / Action:
      • Review Job spec for retries and deadlines
      • kubectl get job <job-name> -n <namespace> -o yaml

    • Expected result:
      • backoffLimit and activeDeadlineSeconds are appropriate
    • additional info:
      • Too-low retry limits may cause premature Job failure
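    • Example (illustrative values; backoffLimit and activeDeadlineSeconds are standard batch/v1 Job spec fields, the rest is placeholder):
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: <job-name>
      spec:
        backoffLimit: 4              # retries before the Job is marked failed (default is 6)
        activeDeadlineSeconds: 600   # hard time limit for the whole Job, in seconds
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: main
                image: <image>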

  7. Rerun or recreate the Job

    • Command / Action:
      • Delete and recreate the Job after fixing the issue
      • kubectl delete job <job-name> -n <namespace>

      • kubectl apply -f <job-manifest>.yaml

    • Expected result:
      • Job completes successfully
    • additional info:
      • Jobs cannot be restarted once failed; they must be recreated
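    • Optional:
      • If the Job was created by a CronJob, a one-off rerun can be created from it instead of re-applying a manifest
      • kubectl create job <new-job-name> --from=cronjob/<cronjob-name> -n <namespace>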

Additional resources