Alert Runbooks

KubeJobFailed

Description

This alert fires when a Kubernetes Job has entered a failed state.
It indicates that one or more Job pods have terminated unsuccessfully and the Job has exceeded its retry limit (backoffLimit), meaning the task did not complete as intended.

This may result in missed batch processing, failed data pipelines, or incomplete maintenance tasks.
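
For reference, the Prometheus rule behind this alert (as shipped with the kubernetes-mixin / kube-prometheus-stack; the exact expression, duration, and labels vary by installation) looks roughly like this:

  - alert: KubeJobFailed
    expr: kube_job_failed{job="kube-state-metrics"} > 0
    for: 15m
    labels:
      severity: warning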


Possible Causes:

  • Application errors inside the Job's container (non-zero exit codes)
  • Misconfigured Job spec (wrong image, command, environment variables, or missing config/secrets)
  • backoffLimit set too low, so the Job fails after only a few pod retries
  • activeDeadlineSeconds exceeded before the Job could finish


Severity estimation

Medium to High severity, depending on the purpose of the failed Job.

Severity increases if:

  • The Job is part of a critical data pipeline or batch process with downstream dependencies
  • The Job performs maintenance tasks (for example backups or cleanups) that must not be skipped
  • The Job will not run again automatically (for example it is not managed by a CronJob), so the failure will not self-correct


Troubleshooting steps

  1. Check Job status

    • Command / Action:
      • Inspect Job completion and failure counters
      • kubectl get job <job-name> -n <namespace>

    • Expected result:
      • The Job shows its completions met and no failed pods
      • COMPLETIONS=1/1

    • additional info:
      • A non-zero failure counter indicates the Job has failed pods; it is tracked in the Job's .status.failed field (see the optional check below)
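
    • Optional check:
      • Read the Job's failure counter directly; .status.failed is part of the batch/v1 Job status
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.failed}'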

  2. Describe the Job

    • Command / Action:
      • Review Job events and pod history
      • kubectl describe job <job-name> -n <namespace>

    • Expected result:
      • Events show successful pod execution
    • additional info:
      • Look for messages about backoff limits or deadline exceeded
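    • Optional check:
      • The failure reason is also recorded as a condition on the Job (typically BackoffLimitExceeded or DeadlineExceeded)
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'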

  3. Inspect Job pods

    • Command / Action:
      • List pods created by the Job
      • kubectl get pods -n <namespace> --selector=job-name=<job-name>

    • Expected result:
      • Pods complete successfully
    • additional info:
      • Failed pods may still exist for log inspection
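    • Optional check:
      • List only the pods that ended in a Failed phase (label and field selectors can be combined)
      • kubectl get pods -n <namespace> --selector=job-name=<job-name> --field-selector=status.phase=Failed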

  4. Describe failed pods

    • Command / Action:
      • Inspect pod status and events
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pod exits with code 0
    • additional info:
      • Non-zero exit codes indicate application or configuration errors
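    • Optional check:
      • Print the exit code of the terminated container(s) directly from the pod status
      • kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'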

  5. Check container logs

    • Command / Action:
      • Review logs from the failed container
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Logs show successful task execution
    • additional info:
      • Logs are often the primary source of failure root cause
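    • Optional check:
      • Logs can also be fetched through the Job itself (kubectl picks one of its pods), and --previous shows the prior container run if the pod restarted
      • kubectl logs job/<job-name> -n <namespace>
      • kubectl logs <pod-name> -n <namespace> --previous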

  6. Verify Job configuration

    • Command / Action:
      • Review Job spec for retries and deadlines
      • kubectl get job <job-name> -n <namespace> -o yaml

    • Expected result:
      • backoffLimit and activeDeadlineSeconds are appropriate
    • additional info:
      • Too-low retry limits may cause premature Job failure
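    • Example (illustrative values; backoffLimit and activeDeadlineSeconds are standard batch/v1 Job spec fields, the rest is placeholder):
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: <job-name>
      spec:
        backoffLimit: 4              # retries before the Job is marked failed (default is 6)
        activeDeadlineSeconds: 600   # hard time limit for the whole Job, in seconds
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: main
                image: <image>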

  7. Rerun or recreate the Job

    • Command / Action:
      • Delete and recreate the Job after fixing the issue
      • kubectl delete job <job-name> -n <namespace>

      • kubectl apply -f <job-manifest>.yaml

    • Expected result:
      • Job completes successfully
    • additional info:
      • Jobs cannot be restarted once failed; they must be recreated
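    • Optional:
      • If the Job was created by a CronJob, a one-off rerun can be created from it instead of re-applying a manifest
      • kubectl create job <new-job-name> --from=cronjob/<cronjob-name> -n <namespace>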

Additional resources