Alert Runbooks

KubeDaemonSetRolloutStuck

Description

This alert fires when a Kubernetes DaemonSet rollout is not progressing and remains stuck in an updating state for longer than expected.
It indicates that the DaemonSet controller is unable to successfully create or update pods on all eligible nodes, leaving the cluster in a partially updated or degraded state.

DaemonSets are often used for critical node-level components (networking, logging, monitoring, security), so a stuck rollout can have widespread impact.
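
To quickly see which DaemonSets in the cluster are behind on their rollout, the status fields can be compared directly (an illustrative one-liner using standard kubectl output options):

  • kubectl get daemonsets --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,UPDATED:.status.updatedNumberScheduled,READY:.status.numberReady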


Possible causes

  • Pods cannot be scheduled on some nodes (taints without matching tolerations, node selectors, or insufficient resources)
  • Nodes are NotReady or under DiskPressure / MemoryPressure
  • Image pull failures (wrong tag or missing registry credentials)
  • New pods crash-loop or never become Ready (configuration, volume mount, or permission errors)
  • An overly strict update strategy (e.g. maxUnavailable) prevents the rollout from progressing

Severity estimation

Medium to High severity, depending on the DaemonSet function.

Severity increases with:

  • The criticality of the DaemonSet (e.g. CNI/networking, security, or monitoring agents)
  • The number of nodes left running an outdated pod or no pod at all
  • How long the rollout has been stuck

Troubleshooting steps

  1. Check DaemonSet status

    • Command / Action:
      • Inspect desired vs updated pod counts
      • kubectl get daemonset <daemonset-name> -n <namespace>

    • Expected result:
      • DESIRED, CURRENT, READY, and UP-TO-DATE counts all match
    • Additional info:
      • UP-TO-DATE < DESIRED indicates a stuck rollout
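    • Example (illustrative; substitute your DaemonSet name and namespace):
      • kubectl rollout status daemonset/<daemonset-name> -n <namespace>
      • kubectl get daemonset <daemonset-name> -n <namespace> -o jsonpath='{.status.desiredNumberScheduled} {.status.updatedNumberScheduled}{"\n"}'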

  2. Describe the DaemonSet

    • Command / Action:
      • Review events and update progress
      • kubectl describe daemonset <daemonset-name> -n <namespace>

    • Expected result:
      • Events show successful pod scheduling and updates
    • Additional info:
      • Look for scheduling, image, or permission errors
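    • Example (illustrative; pulls only the events related to this DaemonSet):
      • kubectl get events -n <namespace> --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<daemonset-name> --sort-by=.lastTimestamp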

  3. Identify missing or unhealthy pods

    • Command / Action:
      • List DaemonSet pods and node placement
      • kubectl get pods -n <namespace> -l <daemonset-label> -o wide

    • Expected result:
      • One Running and Ready pod per eligible node
    • Additional info:
      • Compare against kubectl get nodes
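    • Example (a rough count comparison; assumes a POSIX shell):
      • kubectl get nodes --no-headers | wc -l
      • kubectl get pods -n <namespace> -l <daemonset-label> --field-selector=status.phase=Running --no-headers | wc -l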

  4. Describe problematic pods

    • Command / Action:
      • Inspect pod events and status
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pods start successfully and become Ready
    • Additional info:
      • Common issues include scheduling failures or volume mount errors
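    • Example (illustrative; narrows the listing to pods that are not Running):
      • kubectl get pods -n <namespace> -l <daemonset-label> --field-selector=status.phase!=Running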

  5. Check container logs

    • Command / Action:
      • Review logs for failing containers
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Application starts without repeated errors
    • Additional info:
      • For crash loops, check previous logs with --previous
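    • Example (illustrative; <container-name> only applies to multi-container pods):
      • kubectl logs <pod-name> -n <namespace> --previous
      • kubectl logs <pod-name> -n <namespace> -c <container-name> --tail=100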

  6. Verify node health

    • Command / Action:
      • Check node readiness and pressure conditions
      • kubectl get nodes
      • kubectl describe node <node-name>

    • Expected result:
      • Nodes are Ready with no DiskPressure or MemoryPressure
    • Additional info:
      • Node issues often block DaemonSet scheduling
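    • Example (illustrative; surfaces taints the DaemonSet may not tolerate):
      • kubectl describe node <node-name> | grep -i -A3 taints
      • kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints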

  7. Review update strategy

    • Command / Action:
      • Inspect update strategy configuration
      • kubectl get daemonset <daemonset-name> -n <namespace> -o yaml

    • Expected result:
      • maxUnavailable allows progress even with degraded nodes
    • Additional info:
      • Overly strict strategies can stall rollouts
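    • Example (the patch is a sketch only; review the values before applying):
      • kubectl get daemonset <daemonset-name> -n <namespace> -o jsonpath='{.spec.updateStrategy}{"\n"}'
      • kubectl patch daemonset <daemonset-name> -n <namespace> --type merge -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":"10%"}}}}'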

  8. Force recovery if required

    • Command / Action:
      • Delete stuck pods or roll back configuration
      • kubectl delete pod <pod-name> -n <namespace>

    • Expected result:
      • Pods are recreated and rollout resumes
    • Additional info:
      • Ensure root cause is resolved before forcing restarts
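    • Example (illustrative; use with care, and only after confirming the root cause):
      • kubectl rollout undo daemonset/<daemonset-name> -n <namespace>
      • kubectl rollout restart daemonset/<daemonset-name> -n <namespace>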

Additional resources