Alert Runbooks

KubeDaemonSetRolloutStuck

Description

This alert fires when a Kubernetes DaemonSet rollout is not progressing and remains stuck in an updating state for longer than expected.
It indicates that the DaemonSet controller is unable to successfully create or update pods on all eligible nodes, leaving the cluster in a partially updated or degraded state.

DaemonSets are often used for critical node-level components (networking, logging, monitoring, security), so a stuck rollout can have widespread impact.
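
To quickly see which DaemonSets in the cluster are behind on their rollout, the status fields can be compared directly (an illustrative one-liner using standard kubectl output options):

  • kubectl get daemonsets --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,UPDATED:.status.updatedNumberScheduled,READY:.status.numberReady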


Possible causes

  • Pods cannot be scheduled on some nodes (taints without matching tolerations, node selectors, or insufficient resources)
  • Nodes are NotReady or under DiskPressure / MemoryPressure
  • Image pull failures (wrong tag or missing registry credentials)
  • New pods crash-loop or never become Ready (configuration, volume mount, or permission errors)
  • An overly strict update strategy (e.g. maxUnavailable) prevents the rollout from progressing

Severity estimation

Medium to High severity, depending on the DaemonSet function.

Severity increases with:

  • The criticality of the DaemonSet (e.g. CNI/networking, security, or monitoring agents)
  • The number of nodes left running an outdated pod or no pod at all
  • How long the rollout has been stuck

Troubleshooting steps

  1. Check DaemonSet status

    • Command / Action:
      • Inspect desired vs updated pod counts
      • kubectl get daemonset <daemonset-name> -n <namespace>

    • Expected result:
      • DESIRED, CURRENT, READY, and UP-TO-DATE counts all match
    • Additional info:
      • UP-TO-DATE < DESIRED indicates a stuck rollout
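    • Example (illustrative; substitute your DaemonSet name and namespace):
      • kubectl rollout status daemonset/<daemonset-name> -n <namespace>
      • kubectl get daemonset <daemonset-name> -n <namespace> -o jsonpath='{.status.desiredNumberScheduled} {.status.updatedNumberScheduled}{"\n"}'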

  2. Describe the DaemonSet

    • Command / Action:
      • Review events and update progress
      • kubectl describe daemonset <daemonset-name> -n <namespace>

    • Expected result:
      • Events show successful pod scheduling and updates
    • Additional info:
      • Look for scheduling, image, or permission errors
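    • Example (illustrative; pulls only the events related to this DaemonSet):
      • kubectl get events -n <namespace> --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<daemonset-name> --sort-by=.lastTimestamp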

  3. Identify missing or unhealthy pods

    • Command / Action:
      • List DaemonSet pods and node placement
      • kubectl get pods -n <namespace> -l <daemonset-label> -o wide

    • Expected result:
      • One Running and Ready pod per eligible node
    • Additional info:
      • Compare against kubectl get nodes
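    • Example (a rough count comparison; assumes a POSIX shell):
      • kubectl get nodes --no-headers | wc -l
      • kubectl get pods -n <namespace> -l <daemonset-label> --field-selector=status.phase=Running --no-headers | wc -l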

  4. Describe problematic pods

    • Command / Action:
      • Inspect pod events and status
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pods start successfully and become Ready
    • Additional info:
      • Common issues include scheduling failures or volume mount errors
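    • Example (illustrative; narrows the listing to pods that are not Running):
      • kubectl get pods -n <namespace> -l <daemonset-label> --field-selector=status.phase!=Running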

  5. Check container logs

    • Command / Action:
      • Review logs for failing containers
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Application starts without repeated errors
    • Additional info:
      • For crash loops, check previous logs with --previous
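    • Example (illustrative; <container-name> only applies to multi-container pods):
      • kubectl logs <pod-name> -n <namespace> --previous
      • kubectl logs <pod-name> -n <namespace> -c <container-name> --tail=100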

  6. Verify node health

    • Command / Action:
      • Check node readiness and pressure conditions
      • kubectl get nodes
      • kubectl describe node <node-name>

    • Expected result:
      • Nodes are Ready with no DiskPressure or MemoryPressure
    • Additional info:
      • Node issues often block DaemonSet scheduling
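    • Example (illustrative; surfaces taints the DaemonSet may not tolerate):
      • kubectl describe node <node-name> | grep -i -A3 taints
      • kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints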

  7. Review update strategy

    • Command / Action:
      • Inspect update strategy configuration
      • kubectl get daemonset <daemonset-name> -n <namespace> -o yaml

    • Expected result:
      • maxUnavailable allows progress even with degraded nodes
    • Additional info:
      • Overly strict strategies can stall rollouts
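    • Example (the patch is a sketch only; review the values before applying):
      • kubectl get daemonset <daemonset-name> -n <namespace> -o jsonpath='{.spec.updateStrategy}{"\n"}'
      • kubectl patch daemonset <daemonset-name> -n <namespace> --type merge -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":"10%"}}}}'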

  8. Force recovery if required

    • Command / Action:
      • Delete stuck pods or roll back configuration
      • kubectl delete pod <pod-name> -n <namespace>

    • Expected result:
      • Pods are recreated and rollout resumes
    • Additional info:
      • Ensure root cause is resolved before forcing restarts
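    • Example (illustrative; use with care, and only after confirming the root cause):
      • kubectl rollout undo daemonset/<daemonset-name> -n <namespace>
      • kubectl rollout restart daemonset/<daemonset-name> -n <namespace>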

Additional resources