Alert Runbooks

KubeJobNotCompleted

Description

This alert fires when a Kubernetes Job has not completed within the expected time window.
It indicates that the Job is still running, retrying, or blocked and has not reached a successful completion state, which may delay batch processing, maintenance tasks, or dependent workflows.


Possible Causes:

  • The Job's pods are failing and retrying repeatedly without reaching successful completion
  • Pods are stuck in Pending due to scheduling constraints, insufficient resources, or volume mount issues
  • The workload inside the container is hung or not making forward progress
  • activeDeadlineSeconds is missing or set too high, allowing the Job to run indefinitely
  • Retry settings (backoffLimit) allow long retry loops on persistent errors

Severity estimation

Medium severity by default, increasing with duration and criticality.

Severity increases if:

  • The Job has remained incomplete well beyond its expected runtime
  • The Job blocks critical batch processing, maintenance tasks, or dependent workflows
  • The Job's pods are retrying repeatedly or stuck with no forward progress

Troubleshooting steps

  1. Check Job status

    • Command / Action:
      • Inspect Job completion and active pod counts
      • kubectl get job <job-name> -n <namespace>

    • Expected result:
      • Job shows COMPLETIONS met
      • COMPLETIONS=1/1

    • Additional info:
      • ACTIVE > 0 for a long time indicates the Job is not completing
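
    • Optional check (an added sketch, not part of the original runbook; uses kubectl's JSONPath output):
      • The active pod count can also be read directly from the Job status
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.active}'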

  2. Describe the Job

    • Command / Action:
      • Review Job events and progress
      • kubectl describe job <job-name> -n <namespace>

    • Expected result:
      • Events show normal pod execution
    • Additional info:
      • Look for repeated retries or deadline warnings
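
    • Optional check (an added sketch, not part of the original runbook):
      • Job conditions can be read directly from the status; a Failed condition with reason DeadlineExceeded means activeDeadlineSeconds was exceeded
      • kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions}'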

  3. Inspect Job pods

    • Command / Action:
      • List pods created by the Job
      • kubectl get pods -n <namespace> --selector=job-name=<job-name>

    • Expected result:
      • Pods are Running or Completed
    • Additional info:
      • Pods stuck in Pending or restarting indicate issues
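
    • Optional check (an added sketch; uses the standard job-name label applied by the Job controller):
      • The same pods can be listed with the short label-selector flag and watched for state changes
      • kubectl get pods -n <namespace> -l job-name=<job-name> --watch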

  4. Describe running or stuck pods

    • Command / Action:
      • Inspect pod details and events
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pod is progressing without repeated failures
    • Additional info:
      • Check for resource limits, scheduling, or volume mount issues
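
    • Optional check (an added sketch, not part of the original runbook; <pod-name> is the pod found in the previous step):
      • Recent events for a single pod can also be listed and sorted by time
      • kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp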

  5. Check container logs

    • Command / Action:
      • Review logs from running or restarting containers
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Logs show forward progress
    • Additional info:
      • If logs are not advancing, the Job may be stuck
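
    • Optional check (standard kubectl log flags; add -c <container-name> for multi-container pods):
      • For restarting containers, logs from the previous attempt usually show the failure; -f follows live output
      • kubectl logs <pod-name> -n <namespace> --previous
      • kubectl logs <pod-name> -n <namespace> -f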

  6. Verify Job deadlines and retries

    • Command / Action:
      • Review Job configuration
      • kubectl get job <job-name> -n <namespace> -o yaml

    • Expected result:
      • activeDeadlineSeconds and retry settings are appropriate
    • Additional info:
      • Missing or too-high deadlines can cause Jobs to run indefinitely
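
    • Example (an added sketch with illustrative values; adjust to the Job's expected runtime):
      • In the Job spec, activeDeadlineSeconds caps total runtime and backoffLimit caps pod retries, for example:

        apiVersion: batch/v1
        kind: Job
        metadata:
          name: <job-name>
        spec:
          activeDeadlineSeconds: 3600   # fail the Job if it runs longer than 1 hour
          backoffLimit: 3               # retry failed pods at most 3 times
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: <job-container>
                  image: <job-image>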

  7. Terminate and recreate the Job if necessary

    • Command / Action:
      • Stop the Job and recreate it after fixing the issue
      • kubectl delete job <job-name> -n <namespace>

      • kubectl apply -f <job-manifest>.yaml

    • Expected result:
      • Job completes successfully
    • Additional info:
      • Ensure the underlying cause is resolved before rerunning
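
    • Optional shortcut (an added sketch; only applies if the Job is owned by a CronJob, and <cronjob-name> is a placeholder):
      • Because most of a Job's spec is immutable after creation, delete-and-recreate as above is the usual path; a fresh run can also be created directly from the parent CronJob template
      • kubectl create job <job-name>-manual -n <namespace> --from=cronjob/<cronjob-name>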

Additional resources