KubeJobFailed
Description
This alert fires when a Kubernetes Job has entered a failed state.
It indicates that one or more Job pods have terminated unsuccessfully and the Job has exceeded its retry limit (backoffLimit), meaning the task did not complete as intended.
This may result in missed batch processing, failed data pipelines, or incomplete maintenance tasks.
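For reference, this alert typically originates from the kubernetes-mixin / kube-prometheus rule set, where it is defined roughly as below. The exact expression, duration, and severity vary by distribution, so treat this as a sketch rather than your cluster's actual rule:

  - alert: KubeJobFailed
    annotations:
      description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
    expr: kube_job_failed{job="kube-state-metrics"} > 0
    for: 15m
    labels:
      severity: warning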
Possible Causes:
- Application errors causing the Job container to exit with a non-zero status
- Incorrect command or arguments in the Job spec
- Image pull failures (ImagePullBackOff, ErrImagePull)
- Insufficient resources (CPU, memory, ephemeral storage)
- Failing init containers
- Dependency failures (databases, APIs, external services unavailable)
- Node failures or pod eviction
- Misconfigured backoffLimit or an exceeded activeDeadlineSeconds (see the example manifest below)
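For context, both retry controls live in the Job spec. A minimal, hypothetical manifest showing where they sit:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: example-job            # hypothetical name
  spec:
    backoffLimit: 4              # pod retries before the Job is marked failed (default: 6)
    activeDeadlineSeconds: 600   # wall-clock limit for the Job as a whole
    template:
      spec:
        restartPolicy: Never     # Jobs require Never or OnFailure
        containers:
          - name: task
            image: busybox:1.36
            command: ["sh", "-c", "do-work"]   # placeholder command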
Severity estimation
Medium to High severity, depending on the Job's purpose:
- Low if the Job is non-critical or can be retried manually
- Medium if the Job supports internal processes or periodic tasks
- High if the Job is part of critical workflows (backups, migrations, billing)
- Critical if repeated Job failures block production operations or data integrity
Severity increases if:
- The Job runs on a schedule (CronJob) and fails repeatedly
- The Job failure impacts downstream systems
- No alerting or retry mechanism exists beyond this alert
Troubleshooting steps
- Check Job status
  - Command / Action: Inspect Job completion and failure counters
  - kubectl get job <job-name> -n <namespace>
  - Expected result: Job shows COMPLETIONS met (e.g. COMPLETIONS=1/1) and no failed pods recorded in its status
  - Additional info: A non-zero .status.failed count indicates the Job has failed; the default columns do not include failures, so check the status directly
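  - Example for a failed Job (hypothetical names; output abridged):
    kubectl get job etl-run -n batch
    NAME      COMPLETIONS   DURATION   AGE
    etl-run   0/1           12m        12m
    kubectl get job etl-run -n batch -o jsonpath='{.status.failed}'
    5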
- Describe the Job
  - Command / Action: Review Job events and pod history
  - kubectl describe job <job-name> -n <namespace>
  - Expected result: Events show successful pod execution
  - Additional info: Look for messages about backoff limits or deadlines being exceeded
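  - Example Events section on a failed Job (abridged; names hypothetical):
    Events:
      Type     Reason                Age   From            Message
      ----     ------                ----  ----            -------
      Normal   SuccessfulCreate      14m   job-controller  Created pod: etl-run-x7k2p
      Warning  BackoffLimitExceeded  12m   job-controller  Job has reached the specified backoff limit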
- Inspect Job pods
  - Command / Action: List pods created by the Job
  - kubectl get pods -n <namespace> --selector=job-name=<job-name>
  - Expected result: Pods complete successfully
  - Additional info: Failed pods may still exist for log inspection
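  - Example output for a failing Job (hypothetical names):
    kubectl get pods -n batch --selector=job-name=etl-run
    NAME            READY   STATUS   RESTARTS   AGE
    etl-run-x7k2p   0/1     Error    0          14m
    etl-run-m9qzt   0/1     Error    0          13m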
- Describe failed pods
  - Command / Action: Inspect pod status and events
  - kubectl describe pod <pod-name> -n <namespace>
  - Expected result: Pod exits with code 0
  - Additional info: Non-zero exit codes indicate application or configuration errors
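  - Example container state from kubectl describe pod (abridged; values hypothetical):
    State:          Terminated
      Reason:       Error
      Exit Code:    1
  - Rule of thumb: exit codes 1-2 usually indicate application errors, while 137 means the container was killed (commonly OOM)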
- Check container logs
  - Command / Action: Review logs from the failed container
  - kubectl logs <pod-name> -n <namespace>
  - Expected result: Logs show successful task execution
  - Additional info: Logs are often the primary source for the failure's root cause
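  - If the Job created several pods, logs can also be fetched by label, and a restarted container's previous attempt can be read with --previous (both are standard kubectl flags):
    kubectl logs -n <namespace> -l job-name=<job-name> --tail=100
    kubectl logs <pod-name> -n <namespace> --previous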
- Verify Job configuration
  - Command / Action: Review the Job spec for retries and deadlines
  - kubectl get job <job-name> -n <namespace> -o yaml
  - Expected result: backoffLimit and activeDeadlineSeconds are appropriate
  - Additional info: Too-low retry limits may cause premature Job failure
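  - To extract just those two fields (standard kubectl jsonpath; values shown are hypothetical):
    kubectl get job <job-name> -n <namespace> -o jsonpath='{.spec.backoffLimit} {.spec.activeDeadlineSeconds}'
    6 600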
- Rerun or recreate the Job
  - Command / Action: Delete and recreate the Job after fixing the underlying issue
  - kubectl delete job <job-name> -n <namespace>
  - kubectl apply -f <job-manifest>.yaml
  - Expected result: Job completes successfully
  - Additional info: Jobs cannot be restarted once failed; they must be recreated
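  - If the Job is owned by a CronJob, a one-off run can be created from the CronJob's template instead of re-applying a manifest (the new Job name is arbitrary):
    kubectl create job --from=cronjob/<cronjob-name> <cronjob-name>-manual -n <namespace>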
Additional resources
- Kubernetes Job documentation
- Kubernetes Pod lifecycle and troubleshooting
- Related alert: KubeJobNotCompleted