Alert Runbooks

KubeNodeReadinessFlapping

Description

A Kubernetes node is repeatedly transitioning between Ready and NotReady states (flapping).

Node readiness flapping indicates instability where a node alternates between Ready and NotReady conditions. This causes repeated pod evictions and rescheduling, leading to service disruption, increased cluster load, and unpredictable workload behavior. Flapping typically indicates an underlying intermittent issue rather than a permanent failure.


Possible Causes:

  • Intermittent network connectivity or high latency between the kubelet and the API server
  • Kubelet crashes or frequent restarts (OOM kills, misconfiguration)
  • Resource pressure spikes (CPU, memory, disk I/O) on the node
  • Container runtime instability or slow responses
  • Kubelet status/lease timing that is too aggressive for the network conditions
  • DNS resolution failures preventing API server connectivity
  • Underlying cloud provider or hardware problems

Severity estimation

High severity - Flapping indicates persistent instability affecting reliability.

Impact assessment:

  • Repeated pod evictions and rescheduling on the affected node
  • Service disruption and unpredictable behavior for workloads scheduled there (StatefulSets are particularly sensitive)
  • Increased load on the control plane and on the remaining nodes that absorb the displaced pods

Troubleshooting steps

  1. Identify flapping nodes and readiness history

    • Command / Action:
      • Check which nodes are flapping
      • kubectl get nodes -w # Watch for state changes

      • kubectl get nodes

      • Check recent node events: kubectl get events -A --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp' | tail -50
    • Expected result:
      • Identify nodes transitioning between Ready/NotReady
      • Review frequency and pattern of transitions
    • additional info:
      • Flapping is typically defined as multiple transitions within a short period (e.g., 2+ times in 15 minutes)
      • Look for patterns: periodic flapping, time-based, load-based
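
    To quantify how often a node has flapped, the transition count can be pulled straight from Prometheus. A minimal sketch, assuming kube-state-metrics is installed and Prometheus is reachable at the placeholder URL http://prometheus:9090 (adjust both for your environment); the threshold of 2 transitions in 15 minutes matches the definition above:

      # Count Ready-condition transitions per node over the last 15 minutes
      curl -sG 'http://prometheus:9090/api/v1/query' \
        --data-urlencode 'query=sum(changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) by (node) > 2' \
        | jq '.data.result'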

  2. Check node conditions during transitions

    • Command / Action:
      • Monitor node conditions in real-time
      • kubectl describe node <node-name> | grep -A 15 Conditions

      • kubectl get node <node-name> -o json | jq '.status.conditions'

      • Watch for condition changes: watch -n 5 "kubectl describe node <node-name> | grep -A 15 Conditions"
    • Expected result:
      • Identify which conditions are flipping (Ready, MemoryPressure, DiskPressure, PIDPressure)
      • Understand transition reasons from condition messages
    • additional info:
      • Condition messages provide clues about why node flaps
      • Look for “NodeStatusUnknown” or network-related messages
      • Check LastHeartbeatTime and LastTransitionTime
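
    A compact way to see every condition's status, reason, and timestamps side by side, building on the jq command above (a sketch):

      # Summarize each node condition with its reason and transition/heartbeat times
      kubectl get node <node-name> -o json \
        | jq '.status.conditions[] | {type, status, reason, lastHeartbeatTime, lastTransitionTime}'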

  3. Review kubelet logs for patterns

    • Command / Action:
      • SSH to the node and examine kubelet logs
      • journalctl -u kubelet -n 500 --no-pager

      • Look for errors around transition times
      • journalctl -u kubelet --since "30 minutes ago" | grep -iE "error|fail|timeout"

    • Expected result:
      • Identify recurring errors or patterns
      • Find root cause of intermittent failures
    • additional info:
      • Look for “Failed to update node status” messages
      • Check for API server connectivity errors
      • Look for certificate authentication issues
      • Watch for container runtime communication failures
      • Check for resource pressure warnings
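
    Correlating log lines with a specific transition is usually more productive than scanning the whole log. A sketch with placeholder timestamps, taken from the LastTransitionTime observed in step 2:

      # Narrow the kubelet log to a window around one observed transition
      journalctl -u kubelet --since "2024-05-01 12:00:00" --until "2024-05-01 12:10:00" \
        | grep -iE "node status|lease|timeout|error"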

  4. Check kubelet restart history

    • Command / Action:
      • Verify if kubelet is restarting frequently
      • systemctl status kubelet

      • journalctl -u kubelet | grep -iE "started|stopped|restarted"

      • Check kubelet uptime
    • Expected result:
      • Kubelet should be stable without frequent restarts
      • No crash loops or repeated failures
    • additional info:
      • Frequent kubelet restarts cause node flapping
      • Check for OOM kills: journalctl -u kubelet | grep -i "out of memory" (also check the kernel log: journalctl -k | grep -i oom)
      • Review kubelet configuration for misconfigurations
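
    To put a number on kubelet stability, systemd's own accounting can be queried directly (a sketch; the NRestarts property requires a reasonably recent systemd):

      # Restart count and the time the service last became active
      systemctl show kubelet -p NRestarts -p ActiveEnterTimestamp -p ExecMainStartTimestamp
      # Was the kubelet killed rather than cleanly restarted?
      journalctl -u kubelet | grep -iE "main process exited|killed"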

  5. Test network connectivity stability

    • Command / Action:
      • Test continuous connectivity to API server
      • SSH to the node
      • ping -c 100 <api-server-ip> # Check for packet loss

      • mtr -r -c 50 <api-server-ip> # Trace route with loss statistics

      • Test API endpoint: while true; do curl -s -o /dev/null -w "%{http_code}\n" -k https://<api-server>:6443/healthz; sleep 2; done
    • Expected result:
      • Stable network connectivity with no packet loss
      • Consistent API server reachability
    • additional info:
      • Even brief network interruptions can cause flapping
      • Check for high latency or intermittent failures
      • Review network infrastructure logs
      • Check for bandwidth saturation
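
    The one-liner above prints every response; for intermittent problems it is easier to record only the failures, with timestamps. A sketch (the log file name is arbitrary):

      # Log only non-200 health checks so brief blips are not missed
      while true; do
        code=$(curl -s -o /dev/null -w "%{http_code}" -k --max-time 5 https://<api-server>:6443/healthz)
        [ "$code" != "200" ] && echo "$(date -Is) healthz returned: $code"
        sleep 2
      done | tee healthz-failures.log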

  6. Review node resource utilization patterns

    • Command / Action:
      • Check if resource spikes correlate with flapping
      • kubectl top node <node-name>

      • SSH to node and monitor resources
      • top -d 5

      • vmstat 5 10

      • iostat -x 5 10

    • Expected result:
      • Stable resource usage without extreme spikes
      • No resource exhaustion causing pressure
    • additional info:
      • CPU/memory spikes may cause temporary NotReady
      • Disk I/O saturation can cause kubelet timeouts
      • Check for periodic batch jobs causing load spikes
      • Review if node is overcommitted
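
    Short-lived spikes are easy to miss in top; kernel pressure counters and periodic sampling can be correlated with the flap timestamps instead. A sketch, assuming PSI is enabled (kernel 4.20+) and that sysstat may or may not be installed:

      # Pressure Stall Information: CPU, memory, and I/O stalls
      cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
      # Sample utilization every 30s for later correlation (falls back to vmstat if sar is absent)
      sar -u -r -b 30 20 > node-usage.log 2>&1 || vmstat 30 20 > node-usage.log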

  7. Check container runtime health

    • Command / Action:
      • Verify container runtime (containerd, docker, CRI-O) stability
      • systemctl status containerd # or docker, crio

      • journalctl -u containerd -n 200 | grep -iE "error|timeout"

      • crictl ps # Test runtime responsiveness

    • Expected result:
      • Container runtime should be stable and responsive
      • No errors or restarts
    • additional info:
      • Runtime issues can cause kubelet to mark node NotReady
      • Check for runtime deadlocks or hangs
      • Look for socket connection errors
      • Verify runtime socket permissions
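
    Beyond confirming the runtime is up, it helps to check what it reports about itself and how quickly it answers; slow runtime responses are a common trigger for "PLEG is not healthy" and NotReady flips. A sketch:

      # Runtime-reported conditions (RuntimeReady / NetworkReady)
      crictl info | jq '.status.conditions'
      # Rough check of runtime responsiveness
      time crictl ps > /dev/null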

  8. Check kubelet configuration parameters

    • Command / Action:
      • Review kubelet timing parameters
      • ps aux | grep kubelet

      • cat /var/lib/kubelet/config.yaml

      • Look for: nodeStatusUpdateFrequency, nodeStatusReportFrequency, nodeLeaseDurationSeconds (or the corresponding --node-status-update-frequency flag)
    • Expected result:
      • Parameters should be appropriately configured
      • Default values: nodeStatusUpdateFrequency=10s, nodeLeaseDurationSeconds=40
    • additional info:
      • Too frequent updates can cause API server load
      • Too infrequent updates make node appear NotReady
      • nodeStatusReportFrequency controls reporting to API server
      • Adjust based on network latency and reliability
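
    Instead of relying on the on-disk file, the running kubelet's effective configuration can be read through the API server. A sketch, assuming the configz endpoint is enabled (the default):

      # Show the live values of the relevant timing parameters
      kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
        | jq '.kubeletconfig | {nodeStatusUpdateFrequency, nodeStatusReportFrequency, nodeLeaseDurationSeconds}'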

  9. Review API server and controller manager logs

    • Command / Action:
      • Check control plane logs for node-related errors
      • kubectl logs -n kube-system kube-apiserver-<node> | grep <flapping-node-name>

      • kubectl logs -n kube-system kube-controller-manager-<node> | grep <flapping-node-name>

    • Expected result:
      • No errors related to the flapping node
      • API server should be processing node updates
    • additional info:
      • Look for rate limiting or throttling messages
      • Check for node lease update failures
      • API server issues can cause widespread flapping
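
    The node's Lease object is the heartbeat the controller manager actually watches, so a stale renewal time points at kubelet-to-API-server connectivity problems. A sketch:

      # When did the kubelet last renew its lease?
      kubectl get lease -n kube-node-lease <node-name> -o jsonpath='{.spec.renewTime}{"\n"}'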

  10. Check for DNS resolution issues

    • Command / Action:
      • Test DNS resolution stability
      • SSH to the node
      • nslookup kubernetes.default

      • dig <api-server-hostname>

      • Test repeated resolution: for i in {1..20}; do nslookup kubernetes.default; sleep 1; done
    • Expected result:
      • DNS should resolve consistently
      • No intermittent resolution failures
    • additional info:
      • DNS failures can prevent API server connectivity
      • Check /etc/resolv.conf configuration
      • Verify DNS pods are healthy
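
    For intermittent DNS trouble, a loop that records only failures is easier to read than many full nslookup dumps. A sketch (note that kubernetes.default resolves only through cluster DNS, e.g. from inside a pod; from the node itself, test the API server hostname):

      # Log only failed or empty lookups with timestamps
      for i in $(seq 1 60); do
        out=$(dig +short +time=2 +tries=1 <api-server-hostname>)
        [ -z "$out" ] && echo "$(date -Is) lookup failed"
        sleep 1
      done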

  11. Check cloud provider and infrastructure logs

    • Command / Action:
      • Review cloud provider events (AWS, GCP, Azure, etc.)
      • Check for underlying infrastructure issues
      • Review cloud provider console for node events
      • Check for maintenance windows or throttling
    • Expected result:
      • No underlying infrastructure problems
      • No cloud provider maintenance or issues
    • additional info:
      • Cloud provider issues can cause node instability
      • Check for instance type throttling or credits exhaustion
      • Look for storage or network infrastructure problems
      • Review VPC/subnet configuration
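
    Where a cloud CLI is available, infrastructure health can be checked without the console. An AWS-specific sketch (GCP and Azure have equivalent commands):

      # Instance status checks and scheduled events for the backing instance
      aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances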

  12. Analyze pod eviction and rescheduling patterns

    • Command / Action:
      • Check if pods are being repeatedly evicted
      • kubectl get events -A --field-selector reason=Evicted --sort-by='.lastTimestamp' | tail -30

      • kubectl get pods -A --field-selector spec.nodeName=<node-name>

      • Check pod restart counts
    • Expected result:
      • Identify pods affected by node flapping
      • Understand impact on workloads
    • additional info:
      • Repeated evictions indicate active flapping
      • High-priority pods may stay during brief NotReady periods
      • StatefulSets are particularly sensitive to node flapping
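
    To see which workloads are being churned the hardest, restart counts can be summed per pod on the affected node. A sketch using jq (counts are summed across containers):

      # Namespace, pod, total restarts; highest first
      kubectl get pods -A --field-selector spec.nodeName=<node-name> -o json | jq -r '
        .items[]
        | [.metadata.namespace, .metadata.name, ([.status.containerStatuses[]?.restartCount] | add // 0)]
        | @tsv' | sort -t$'\t' -k3 -nr | head -20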

  13. Adjust kubelet parameters if needed

    • Command / Action:
      • Tune kubelet timing parameters for stability
      • Edit /var/lib/kubelet/config.yaml or kubelet systemd unit
      • Increase nodeStatusUpdateFrequency: 20s (from default 10s)
      • Increase nodeLeaseDurationSeconds: 60 (from default 40)
      • systemctl restart kubelet

    • Expected result:
      • More tolerant of brief interruptions
      • Reduced flapping sensitivity
    • additional info:
      • Only adjust if network latency is an issue
      • Longer durations delay failure detection
      • Test changes in non-production first
      • Balance between sensitivity and stability
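
    A minimal sketch of the corresponding fragment of /var/lib/kubelet/config.yaml after tuning; the values are examples, not recommendations, and the controller manager's --node-monitor-grace-period should remain comfortably larger than the status update interval:

      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      nodeStatusUpdateFrequency: 20s
      nodeLeaseDurationSeconds: 60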

  14. Cordon node and monitor stability

    • Command / Action:
      • Prevent new pods from scheduling to flapping node
      • kubectl cordon <node-name>

      • Monitor node without additional load
      • watch -n 10 "kubectl get node <node-name>"

    • Expected result:
      • Determine if reduced load stabilizes node
      • Isolate issue to specific workloads or general instability
    • additional info:
      • Helps identify if pod workload causes flapping
      • May need to drain node if issue persists
      • Uncordon after resolving: kubectl uncordon <node-name>
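
    While the node is cordoned, a timestamped record of its readiness makes it obvious whether flapping continues without workload pressure. A sketch (the log file name is arbitrary):

      # Append a readiness sample every 30 seconds for later review
      while true; do
        echo "$(date -Is) $(kubectl get node <node-name> --no-headers | awk '{print $2}')"
        sleep 30
      done >> node-ready.log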

  15. Replace or rebuild node if hardware issue suspected

    • Command / Action:
      • If hardware or persistent issues identified, replace node
      • kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

      • Decommission and replace node with new instance
      • kubectl delete node <node-name>

    • Expected result:
      • Workloads migrate to healthy nodes
      • New node joins cluster without flapping
    • additional info:
      • Last resort for persistent hardware issues
      • Ensure cluster has capacity for workload migration
      • For cloud instances, terminate and launch new instance
      • On-premises may require physical hardware replacement

  16. Verify resolution and monitor for recurrence

    • Command / Action:
      • Confirm node stability after remediation
      • kubectl get node <node-name> -w

      • Monitor for 24+ hours
      • kubectl get events -A --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

    • Expected result:
      • Node remains in Ready state consistently
      • No further flapping episodes
      • Pods running stably
    • additional info:
      • Document root cause and resolution
      • Update monitoring thresholds if needed
      • Consider capacity planning if resource-related
      • Review similar nodes for proactive remediation

Additional resources