Alert Runbooks

KubeJobNotCompleted

Description

This alert fires when a Kubernetes Job has not completed within the expected time window.
It indicates that the Job is still running, retrying, or blocked and has not reached a successful completion state, which may delay batch processing, maintenance tasks, or dependent workflows.


Possible Causes:

  • The Job's pods are failing and retrying repeatedly without reaching successful completion
  • Pods are stuck in Pending due to scheduling constraints, insufficient resources, or volume mount issues
  • The workload inside the container is hung or not making forward progress
  • activeDeadlineSeconds is missing or set too high, allowing the Job to run indefinitely
  • Retry settings (backoffLimit) allow long retry loops on persistent errors

Severity estimation

Medium severity by default, increasing with duration and criticality.

Severity increases if:

  • The Job has remained incomplete well beyond its expected runtime
  • The Job blocks critical batch processing, maintenance tasks, or dependent workflows
  • The Job's pods are retrying repeatedly or stuck with no forward progress

Troubleshooting steps

  1. Check Job status

    • Command / Action:
      • Inspect Job completion and active pod counts
      • kubectl get job <job-name> -n <namespace>

    • Expected result:
      • Job shows COMPLETIONS met
      • COMPLETIONS=1/1

    • Additional info:
      • ACTIVE > 0 for a long time indicates the Job is not completing
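
    • Optional check (an added sketch, not part of the original runbook; uses kubectl's JSONPath output):
      • The active pod count can also be read directly from the Job status
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.active}'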

  2. Describe the Job

    • Command / Action:
      • Review Job events and progress
      • kubectl describe job <job-name> -n <namespace>

    • Expected result:
      • Events show normal pod execution
    • Additional info:
      • Look for repeated retries or deadline warnings
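
    • Optional check (an added sketch, not part of the original runbook):
      • Job conditions can be read directly from the status; a Failed condition with reason DeadlineExceeded means activeDeadlineSeconds was exceeded
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions}'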

  3. Inspect Job pods

    • Command / Action:
      • List pods created by the Job
      • kubectl get pods -n <namespace> --selector=job-name=<job-name>

    • Expected result:
      • Pods are Running or Completed
    • Additional info:
      • Pods stuck in Pending or restarting indicate issues
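
    • Optional check (an added sketch; uses the standard job-name label applied by the Job controller):
      • The same pods can be listed with the short label-selector flag and watched for state changes
      • kubectl get pods -n <namespace> -l job-name=<job-name> --watch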

  4. Describe running or stuck pods

    • Command / Action:
      • Inspect pod details and events
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pod is progressing without repeated failures
    • Additional info:
      • Check for resource limits, scheduling, or volume mount issues
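
    • Optional check (an added sketch, not part of the original runbook; <pod-name> is the pod found in the previous step):
      • Recent events for a single pod can also be listed and sorted by time
      • kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp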

  5. Check container logs

    • Command / Action:
      • Review logs from running or restarting containers
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Logs show forward progress
    • Additional info:
      • If logs are not advancing, the Job may be stuck
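
    • Optional check (standard kubectl log flags; add -c <container-name> for multi-container pods):
      • For restarting containers, logs from the previous attempt usually show the failure; -f follows live output
      • kubectl logs <pod-name> -n <namespace> --previous
      • kubectl logs <pod-name> -n <namespace> -f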

  6. Verify Job deadlines and retries

    • Command / Action:
      • Review Job configuration
      • kubectl get job <job-name> -n <namespace> -o yaml

    • Expected result:
      • activeDeadlineSeconds and retry settings are appropriate
    • Additional info:
      • Missing or too-high deadlines can cause Jobs to run indefinitely
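
    • Example (an added sketch with illustrative values; adjust to the Job's expected runtime):
      • In the Job spec, activeDeadlineSeconds caps total runtime and backoffLimit caps pod retries, for example:

        apiVersion: batch/v1
        kind: Job
        metadata:
          name: <job-name>
        spec:
          activeDeadlineSeconds: 3600   # fail the Job if it runs longer than 1 hour
          backoffLimit: 3               # retry failed pods at most 3 times
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: <job-container>
                  image: <job-image>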

  7. Terminate and recreate the Job if necessary

    • Command / Action:
      • Stop the Job and recreate it after fixing the issue
      • kubectl delete job <job-name> -n <namespace>

      • kubectl apply -f <job-manifest>.yaml

    • Expected result:
      • Job completes successfully
    • Additional info:
      • Ensure the underlying cause is resolved before rerunning
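
    • Optional shortcut (an added sketch; only applies if the Job is owned by a CronJob, and <cronjob-name> is a placeholder):
      • Because most of a Job's spec is immutable after creation, delete-and-recreate as above is the usual path; a fresh run can also be created directly from the parent CronJob template
      • kubectl create job <job-name>-manual -n <namespace> --from=cronjob/<cronjob-name>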

Additional resources