Alert Runbooks

KubeNodeReadinessFlapping

Description

A Kubernetes node is repeatedly transitioning between Ready and NotReady states (flapping).

Node readiness flapping indicates instability where a node alternates between Ready and NotReady conditions. This causes repeated pod evictions and rescheduling, leading to service disruption, increased cluster load, and unpredictable workload behavior. Flapping typically indicates an underlying intermittent issue rather than a permanent failure.


Possible Causes:

  • Intermittent network connectivity or high latency between the kubelet and the API server
  • Kubelet crashes or frequent restarts (OOM kills, misconfiguration)
  • Resource pressure spikes (CPU, memory, disk I/O) on the node
  • Container runtime instability or slow responses
  • Kubelet status/lease timing that is too aggressive for the network conditions
  • DNS resolution failures preventing API server connectivity
  • Underlying cloud provider or hardware problems

Severity estimation

High severity - Flapping indicates persistent instability affecting reliability.

Impact assessment:

  • Repeated pod evictions and rescheduling on the affected node
  • Service disruption and unpredictable behavior for workloads scheduled there (StatefulSets are particularly sensitive)
  • Increased load on the control plane and on the remaining nodes that absorb the displaced pods

Troubleshooting steps

  1. Identify flapping nodes and readiness history

    • Command / Action:
      • Check which nodes are flapping
      • kubectl get nodes -w # Watch for state changes

      • kubectl get nodes

      • Check recent node events: kubectl get events -A --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp' | tail -50
    • Expected result:
      • Identify nodes transitioning between Ready/NotReady
      • Review frequency and pattern of transitions
    • additional info:
      • Flapping is typically defined as multiple transitions within a short period (e.g., 2+ times in 15 minutes)
      • Look for patterns: periodic flapping, time-based, load-based
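
    To quantify how often a node has flapped, the transition count can be pulled straight from Prometheus. A minimal sketch, assuming kube-state-metrics is installed and Prometheus is reachable at the placeholder URL http://prometheus:9090 (adjust both for your environment); the threshold of 2 transitions in 15 minutes matches the definition above:

      # Count Ready-condition transitions per node over the last 15 minutes
      curl -sG 'http://prometheus:9090/api/v1/query' \
        --data-urlencode 'query=sum(changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) by (node) > 2' \
        | jq '.data.result'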

  2. Check node conditions during transitions

    • Command / Action:
      • Monitor node conditions in real-time
      • kubectl describe node <node-name> | grep -A 15 Conditions

      • kubectl get node <node-name> -o json | jq '.status.conditions'

      • Watch for condition changes: watch -n 5 "kubectl describe node <node-name> | grep -A 15 Conditions"
    • Expected result:
      • Identify which conditions are flipping (Ready, MemoryPressure, DiskPressure, PIDPressure)
      • Understand transition reasons from condition messages
    • additional info:
      • Condition messages provide clues about why node flaps
      • Look for “NodeStatusUnknown” or network-related messages
      • Check LastHeartbeatTime and LastTransitionTime
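
    A compact way to see every condition's status, reason, and timestamps side by side, building on the jq command above (a sketch):

      # Summarize each node condition with its reason and transition/heartbeat times
      kubectl get node <node-name> -o json \
        | jq '.status.conditions[] | {type, status, reason, lastHeartbeatTime, lastTransitionTime}'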

  3. Review kubelet logs for patterns

    • Command / Action:
      • SSH to the node and examine kubelet logs
      • journalctl -u kubelet -n 500 --no-pager

      • Look for errors around transition times
      • journalctl -u kubelet --since "30 minutes ago" | grep -iE "error|fail|timeout"

    • Expected result:
      • Identify recurring errors or patterns
      • Find root cause of intermittent failures
    • additional info:
      • Look for “Failed to update node status” messages
      • Check for API server connectivity errors
      • Look for certificate authentication issues
      • Watch for container runtime communication failures
      • Check for resource pressure warnings
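
    Correlating log lines with a specific transition is usually more productive than scanning the whole log. A sketch with placeholder timestamps, taken from the LastTransitionTime observed in step 2:

      # Narrow the kubelet log to a window around one observed transition
      journalctl -u kubelet --since "2024-05-01 12:00:00" --until "2024-05-01 12:10:00" \
        | grep -iE "node status|lease|timeout|error"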

  4. Check kubelet restart history

    • Command / Action:
      • Verify if kubelet is restarting frequently
      • systemctl status kubelet

      • journalctl -u kubelet | grep -iE "started|stopped|restarted"

      • Check kubelet uptime
    • Expected result:
      • Kubelet should be stable without frequent restarts
      • No crash loops or repeated failures
    • additional info:
      • Frequent kubelet restarts cause node flapping
      • Check for OOM kills: journalctl -u kubelet | grep -i "out of memory" (also check the kernel log: journalctl -k | grep -i oom)
      • Review kubelet configuration for misconfigurations
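
    To put a number on kubelet stability, systemd's own accounting can be queried directly (a sketch; the NRestarts property requires a reasonably recent systemd):

      # Restart count and the time the service last became active
      systemctl show kubelet -p NRestarts -p ActiveEnterTimestamp -p ExecMainStartTimestamp
      # Was the kubelet killed rather than cleanly restarted?
      journalctl -u kubelet | grep -iE "main process exited|killed"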

  5. Test network connectivity stability

    • Command / Action:
      • Test continuous connectivity to API server
      • SSH to the node
      • ping -c 100 <api-server-ip> # Check for packet loss

      • mtr -r -c 50 <api-server-ip> # Trace route with loss statistics

      • Test API endpoint: while true; do curl -s -o /dev/null -w "%{http_code}\n" -k https://<api-server>:6443/healthz; sleep 2; done
    • Expected result:
      • Stable network connectivity with no packet loss
      • Consistent API server reachability
    • additional info:
      • Even brief network interruptions can cause flapping
      • Check for high latency or intermittent failures
      • Review network infrastructure logs
      • Check for bandwidth saturation
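
    The one-liner above prints every response; for intermittent problems it is easier to record only the failures, with timestamps. A sketch (the log file name is arbitrary):

      # Log only non-200 health checks so brief blips are not missed
      while true; do
        code=$(curl -s -o /dev/null -w "%{http_code}" -k --max-time 5 https://<api-server>:6443/healthz)
        [ "$code" != "200" ] && echo "$(date -Is) healthz returned: $code"
        sleep 2
      done | tee healthz-failures.log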

  6. Review node resource utilization patterns

    • Command / Action:
      • Check if resource spikes correlate with flapping
      • kubectl top node <node-name>

      • SSH to node and monitor resources
      • top -d 5

      • vmstat 5 10

      • iostat -x 5 10

    • Expected result:
      • Stable resource usage without extreme spikes
      • No resource exhaustion causing pressure
    • additional info:
      • CPU/memory spikes may cause temporary NotReady
      • Disk I/O saturation can cause kubelet timeouts
      • Check for periodic batch jobs causing load spikes
      • Review if node is overcommitted
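
    Short-lived spikes are easy to miss in top; kernel pressure counters and periodic sampling can be correlated with the flap timestamps instead. A sketch, assuming PSI is enabled (kernel 4.20+) and that sysstat may or may not be installed:

      # Pressure Stall Information: CPU, memory, and I/O stalls
      cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
      # Sample utilization every 30s for later correlation (falls back to vmstat if sar is absent)
      sar -u -r -b 30 20 > node-usage.log 2>&1 || vmstat 30 20 > node-usage.log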

  7. Check container runtime health

    • Command / Action:
      • Verify container runtime (containerd, docker, CRI-O) stability
      • systemctl status containerd # or docker, crio

      • journalctl -u containerd -n 200 | grep -iE "error|timeout"

      • crictl ps # Test runtime responsiveness

    • Expected result:
      • Container runtime should be stable and responsive
      • No errors or restarts
    • additional info:
      • Runtime issues can cause kubelet to mark node NotReady
      • Check for runtime deadlocks or hangs
      • Look for socket connection errors
      • Verify runtime socket permissions
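
    Beyond confirming the runtime is up, it helps to check what it reports about itself and how quickly it answers; slow runtime responses are a common trigger for "PLEG is not healthy" and NotReady flips. A sketch:

      # Runtime-reported conditions (RuntimeReady / NetworkReady)
      crictl info | jq '.status.conditions'
      # Rough check of runtime responsiveness
      time crictl ps > /dev/null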

  8. Check kubelet configuration parameters

    • Command / Action:
      • Review kubelet timing parameters
      • ps aux | grep kubelet

      • cat /var/lib/kubelet/config.yaml

      • Look for: nodeStatusUpdateFrequency, nodeStatusReportFrequency, nodeLeaseDurationSeconds (or the corresponding --node-status-update-frequency flag)
    • Expected result:
      • Parameters should be appropriately configured
      • Default values: nodeStatusUpdateFrequency=10s, nodeLeaseDurationSeconds=40
    • additional info:
      • Too frequent updates can cause API server load
      • Too infrequent updates make node appear NotReady
      • nodeStatusReportFrequency controls reporting to API server
      • Adjust based on network latency and reliability
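
    Instead of relying on the on-disk file, the running kubelet's effective configuration can be read through the API server. A sketch, assuming the configz endpoint is enabled (the default):

      # Show the live values of the relevant timing parameters
      kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
        | jq '.kubeletconfig | {nodeStatusUpdateFrequency, nodeStatusReportFrequency, nodeLeaseDurationSeconds}'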

  9. Review API server and controller manager logs

    • Command / Action:
      • Check control plane logs for node-related errors
      • kubectl logs -n kube-system kube-apiserver-<node> | grep <flapping-node-name>

      • kubectl logs -n kube-system kube-controller-manager-<node> | grep <flapping-node-name>

    • Expected result:
      • No errors related to the flapping node
      • API server should be processing node updates
    • additional info:
      • Look for rate limiting or throttling messages
      • Check for node lease update failures
      • API server issues can cause widespread flapping
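
    The node's Lease object is the heartbeat the controller manager actually watches, so a stale renewal time points at kubelet-to-API-server connectivity problems. A sketch:

      # When did the kubelet last renew its lease?
      kubectl get lease -n kube-node-lease <node-name> -o jsonpath='{.spec.renewTime}{"\n"}'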

  10. Check for DNS resolution issues

    • Command / Action:
      • Test DNS resolution stability
      • SSH to the node
      • nslookup kubernetes.default

      • dig <api-server-hostname>

      • Test repeated resolution: for i in {1..20}; do nslookup kubernetes.default; sleep 1; done
    • Expected result:
      • DNS should resolve consistently
      • No intermittent resolution failures
    • additional info:
      • DNS failures can prevent API server connectivity
      • Check /etc/resolv.conf configuration
      • Verify DNS pods are healthy
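
    For intermittent DNS trouble, a loop that records only failures is easier to read than many full nslookup dumps. A sketch (note that kubernetes.default resolves only through cluster DNS, e.g. from inside a pod; from the node itself, test the API server hostname):

      # Log only failed or empty lookups with timestamps
      for i in $(seq 1 60); do
        out=$(dig +short +time=2 +tries=1 <api-server-hostname>)
        [ -z "$out" ] && echo "$(date -Is) lookup failed"
        sleep 1
      done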

  11. Check cloud provider and infrastructure logs

    • Command / Action:
      • Review cloud provider events (AWS, GCP, Azure, etc.)
      • Check for underlying infrastructure issues
      • Review cloud provider console for node events
      • Check for maintenance windows or throttling
    • Expected result:
      • No underlying infrastructure problems
      • No cloud provider maintenance or issues
    • additional info:
      • Cloud provider issues can cause node instability
      • Check for instance type throttling or credits exhaustion
      • Look for storage or network infrastructure problems
      • Review VPC/subnet configuration
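
    Where a cloud CLI is available, infrastructure health can be checked without the console. An AWS-specific sketch (GCP and Azure have equivalent commands):

      # Instance status checks and scheduled events for the backing instance
      aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances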

  12. Analyze pod eviction and rescheduling patterns

    • Command / Action:
      • Check if pods are being repeatedly evicted
      • kubectl get events -A --field-selector reason=Evicted --sort-by='.lastTimestamp' | tail -30

      • kubectl get pods -A --field-selector spec.nodeName=<node-name>

      • Check pod restart counts
    • Expected result:
      • Identify pods affected by node flapping
      • Understand impact on workloads
    • additional info:
      • Repeated evictions indicate active flapping
      • High-priority pods may stay during brief NotReady periods
      • StatefulSets are particularly sensitive to node flapping
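
    To see which workloads are being churned the hardest, restart counts can be summed per pod on the affected node. A sketch using jq (counts are summed across containers):

      # Namespace, pod, total restarts; highest first
      kubectl get pods -A --field-selector spec.nodeName=<node-name> -o json | jq -r '
        .items[]
        | [.metadata.namespace, .metadata.name, ([.status.containerStatuses[]?.restartCount] | add // 0)]
        | @tsv' | sort -t$'\t' -k3 -nr | head -20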

  13. Adjust kubelet parameters if needed

    • Command / Action:
      • Tune kubelet timing parameters for stability
      • Edit /var/lib/kubelet/config.yaml or kubelet systemd unit
      • Increase nodeStatusUpdateFrequency: 20s (from default 10s)
      • Increase nodeLeaseDurationSeconds: 60 (from default 40)
      • systemctl restart kubelet

    • Expected result:
      • More tolerant of brief interruptions
      • Reduced flapping sensitivity
    • additional info:
      • Only adjust if network latency is an issue
      • Longer durations delay failure detection
      • Test changes in non-production first
      • Balance between sensitivity and stability
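
    A minimal sketch of the corresponding fragment of /var/lib/kubelet/config.yaml after tuning; the values are examples, not recommendations, and the controller manager's --node-monitor-grace-period should remain comfortably larger than the status update interval:

      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      nodeStatusUpdateFrequency: 20s
      nodeLeaseDurationSeconds: 60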

  14. Cordon node and monitor stability

    • Command / Action:
      • Prevent new pods from scheduling to flapping node
      • kubectl cordon <node-name>

      • Monitor node without additional load
      • watch -n 10 "kubectl get node <node-name>"

    • Expected result:
      • Determine if reduced load stabilizes node
      • Isolate issue to specific workloads or general instability
    • additional info:
      • Helps identify if pod workload causes flapping
      • May need to drain node if issue persists
      • Uncordon after resolving: kubectl uncordon <node-name>
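
    While the node is cordoned, a timestamped record of its readiness makes it obvious whether flapping continues without workload pressure. A sketch (the log file name is arbitrary):

      # Append a readiness sample every 30 seconds for later review
      while true; do
        echo "$(date -Is) $(kubectl get node <node-name> --no-headers | awk '{print $2}')"
        sleep 30
      done >> node-ready.log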

  15. Replace or rebuild node if hardware issue suspected

    • Command / Action:
      • If hardware or persistent issues identified, replace node
      • kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

      • Decommission and replace node with new instance
      • kubectl delete node <node-name>

    • Expected result:
      • Workloads migrate to healthy nodes
      • New node joins cluster without flapping
    • additional info:
      • Last resort for persistent hardware issues
      • Ensure cluster has capacity for workload migration
      • For cloud instances, terminate and launch new instance
      • On-premises may require physical hardware replacement

  16. Verify resolution and monitor for recurrence

    • Command / Action:
      • Confirm node stability after remediation
      • kubectl get node <node-name> -w

      • Monitor for 24+ hours
      • kubectl get events -A --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

    • Expected result:
      • Node remains in Ready state consistently
      • No further flapping episodes
      • Pods running stably
    • additional info:
      • Document root cause and resolution
      • Update monitoring thresholds if needed
      • Consider capacity planning if resource-related
      • Review similar nodes for proactive remediation

Additional resources