Alert Runbooks

KubeNodeUnreachable

Description

A Kubernetes node is unreachable from the control plane, with no status updates received.

When a node becomes unreachable, the control plane cannot communicate with it at all. This is more severe than NotReady because it indicates a complete loss of communication: the node may be down, network connectivity may be severed, or the kubelet may have stopped entirely. Pods on an unreachable node are marked for eviction after a timeout period, but they cannot be gracefully terminated while the node remains unreachable.


Possible Causes:

  • The node is powered off, has crashed, or has suffered a hardware failure
  • Network connectivity between the node and the control plane is severed
  • The kubelet has stopped or crashed on the node

Severity estimation

Critical severity - Complete loss of communication with the node.

Impact assessment:

  • All workloads running on the node are effectively unavailable
  • Pods on the node are marked for eviction after a timeout and may be stuck in Terminating or Unknown status
  • Stateful workloads (e.g., databases) risk data loss if the node cannot be recovered

Troubleshooting steps

  1. Verify node is unreachable from control plane perspective

    • Command / Action:
      • Check node status and conditions
      • kubectl get nodes

      • kubectl describe node <node-name>

      • Check node conditions and last heartbeat
    • Expected result:
      • Node shows Unknown or NotReady status
      • LastHeartbeatTime is significantly in the past
    • Additional info:
      • Unknown status typically indicates the node is unreachable
      • Check how long it has been since the last successful heartbeat (see the jsonpath example below)
      • Review node conditions for error messages
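
    To pull the last heartbeat time directly instead of scanning the full describe output, a jsonpath query against the Ready condition works (a minimal sketch; <node-name> is a placeholder for the affected node):
      # Print the Ready condition's last heartbeat time and status for the node
      kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"  "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}'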

  2. Check if node is powered on and responsive

    • Command / Action:
      • For physical servers: Check physical/IPMI access
      • For VMs: Check hypervisor console
      • For cloud instances: Check cloud provider console
      • ping <node-ip>

      • Attempt SSH: ssh <node-ip>
    • Expected result:
      • Node should respond to ping
      • SSH should be accessible
    • Additional info:
      • No ping response may indicate node is down or network issue
      • If ping works but SSH fails, check SSH service
      • Check cloud provider console for instance state
      • Look for “stopped”, “terminated”, or “unhealthy” status

  3. Test network connectivity from control plane

    • Command / Action:
      • Test connectivity from control plane nodes
      • SSH to control plane node
      • ping <unreachable-node-ip>

      • telnet <unreachable-node-ip> 10250 # kubelet port

      • curl -k https://<unreachable-node-ip>:10250/healthz

    • Expected result:
      • Network connectivity should exist
      • Kubelet port (10250) should be reachable
    • Additional info:
      • If ping fails, a network issue exists
      • If ping works but the kubelet port is blocked, suspect a firewall issue (see the nc example below)
      • Test from multiple control plane nodes
      • Check for a network partition
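
    If telnet is not installed on the control plane node, nc or curl with an explicit timeout can probe the kubelet port instead (a sketch; <unreachable-node-ip> is a placeholder):
      # Port scan only (-z), verbose (-v), 3-second timeout (-w 3)
      nc -zv -w 3 <unreachable-node-ip> 10250
      # Probe the kubelet healthz endpoint without hanging indefinitely
      curl -k --connect-timeout 5 https://<unreachable-node-ip>:10250/healthz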

  4. Check pods on the unreachable node

    • Command / Action:
      • List pods scheduled on the unreachable node
      • kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

      • Check pod status and ages
    • Expected result:
      • Identify critical workloads on the node
      • Pods may show Terminating or Unknown status
    • Additional info:
      • Pods will eventually be evicted to other nodes
      • Check for StatefulSets or databases requiring attention (see the jq filter below)
      • Terminating pods remain stuck while the node is unreachable
      • You may need to force delete pods if the node is lost
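
    To spot stateful workloads quickly, the pod list can be filtered by owner kind with jq (a sketch; assumes jq is available and <node-name> is a placeholder):
      # List pods on the node that are owned by a StatefulSet, as namespace/name
      kubectl get pods -A --field-selector spec.nodeName=<node-name> -o json \
        | jq -r '.items[] | select(.metadata.ownerReferences[]?.kind=="StatefulSet") | "\(.metadata.namespace)/\(.metadata.name)"'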

  5. Review recent node and system events

    • Command / Action:
      • Check Kubernetes events for the node
      • kubectl get events -A --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp' | tail -30

      • kubectl describe node <node-name> | grep -A 20 Events

    • Expected result:
      • Events may indicate reason for unreachability
      • Look for shutdown, crash, or network events
    • Additional info:
      • Check for “NodeNotReady” or “NodeStatusUnknown” events
      • Look for resource pressure events preceding failure
      • Review timing of when node became unreachable
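
    To keep watching for new node-level events while investigating, the event list can be filtered by object kind (a sketch using standard kubectl field selectors):
      # Stream events that reference Node objects as they arrive
      kubectl get events -A --field-selector involvedObject.kind=Node --watch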

  6. Check cloud provider status (for cloud instances)

    • Command / Action:
      • Review cloud provider console
      • Check instance status and health checks
      • Look for system status checks failures
      • Review cloud provider events/logs
      • Check for scheduled maintenance
    • Expected result:
      • Instance should be running and healthy
      • No underlying infrastructure issues
    • Additional info:
      • Instance may have been stopped or terminated
      • System status checks may indicate hardware failure
      • Look for “instance retirement” or maintenance notices
      • Check for resource limits or quota issues
      • Review VPC, security group, and subnet configurations
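
    As one concrete example, if the node is an AWS EC2 instance (an assumption; adapt to your provider's CLI), instance state and status checks can be pulled from the command line (<instance-id> is a placeholder):
      # Show instance state plus system/instance status checks, even for stopped instances
      aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances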

  7. Check if kubelet is running (if node accessible)

    • Command / Action:
      • If able to access node, check kubelet
      • SSH to node (if possible)
      • systemctl status kubelet

      • journalctl -u kubelet -n 100

      • ps aux | grep kubelet

    • Expected result:
      • Kubelet should be running
      • If not, this explains unreachability
    • Additional info:
      • A stopped kubelet means the control plane cannot reach the node
      • Check for crash loops or start failures
      • Review kubelet logs for errors
      • Restart kubelet if needed: systemctl restart kubelet (see the sequence below)
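
    A typical restart-and-verify sequence on a systemd-based node (a sketch; assumes sudo access):
      # Restart the kubelet and confirm it stays up
      sudo systemctl restart kubelet
      sudo systemctl status kubelet --no-pager
      # Follow the log to catch immediate crash loops or API/certificate errors
      sudo journalctl -u kubelet -f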

  8. Check network configuration and firewall

    • Command / Action:
      • Verify network interfaces are up
      • SSH to node (if accessible)
      • ip addr show

      • ip route show

      • iptables -L -n

      • Check firewall rules: firewall-cmd --list-all or ufw status
    • Expected result:
      • Network interfaces should be up
      • Routes to control plane should exist
      • Firewall should allow kubelet communication
    • Additional info:
      • A downed network interface is a common cause
      • Check that the default gateway is reachable
      • Ensure port 10250 (kubelet) is not blocked (see the firewalld example below)
      • Check for recent firewall rule changes
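
    If the node runs firewalld (an assumption; use the equivalent ufw or iptables rules otherwise), allowing kubelet traffic looks like this sketch:
      # Check whether 10250/tcp is currently open
      sudo firewall-cmd --list-ports
      # Permanently allow the kubelet port and reload the rules
      sudo firewall-cmd --permanent --add-port=10250/tcp
      sudo firewall-cmd --reload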

  9. Review system logs for failures

    • Command / Action:
      • Check system logs on the node
      • journalctl -p err -n 100

      • dmesg | tail -100

      • cat /var/log/messages | tail -100

      • Look for OOM kills, kernel panics, hardware errors
    • Expected result:
      • Identify system-level failures
      • No critical errors that would cause unreachability
    • Additional info:
      • Look for “Out of memory” messages
      • Check for disk I/O errors
      • Look for network driver errors
      • Check for CPU/hardware errors
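
    To scan recent kernel messages for the most common culprits in one pass (a sketch; adjust the time window as needed):
      # Look for OOM kills, I/O errors, and hardware faults in the last six hours
      journalctl -k --since "6 hours ago" | grep -iE "out of memory|oom|i/o error|hardware error"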

  10. Check DNS resolution on the node

    • Command / Action:
      • Test DNS from the node
      • SSH to node
      • nslookup kubernetes.default

      • cat /etc/resolv.conf

      • dig <api-server-hostname>

    • Expected result:
      • DNS should resolve correctly
      • API server should be resolvable
    • Additional info:
      • DNS failures can prevent kubelet from reaching API server
      • Check if DNS servers are reachable
      • Verify /etc/resolv.conf is correctly configured
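
    If resolution fails, querying the configured resolver directly helps separate a bad /etc/resolv.conf from an unreachable DNS server (a sketch; <dns-server-ip> comes from resolv.conf and <api-server-hostname> is a placeholder):
      # Query a specific resolver directly, bypassing the local stub configuration
      dig @<dns-server-ip> <api-server-hostname> +short
      # Equivalent check if dig is not installed
      nslookup <api-server-hostname> <dns-server-ip>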

  11. Force delete stuck terminating pods (if node is lost)

    • Command / Action:
      • If node cannot be recovered, force delete pods
      • kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

      • Or delete all pods on node:
      • kubectl get pods -A --field-selector spec.nodeName=<node-name> -o json | jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name)"' | xargs -n 2 sh -c 'kubectl delete pod $1 -n $0 --force --grace-period=0'

    • Expected result:
      • Pods are forcibly removed from the cluster
      • Allows rescheduling to healthy nodes
    • Additional info:
      • Only force delete if node is confirmed lost
      • Risk of data loss for stateful workloads
      • Pods will be rescheduled by controllers
      • Monitor for successful rescheduling

  12. Restart the node (if accessible)

    • Command / Action:
      • Attempt to reboot the node
      • SSH to node: sudo reboot
      • Or via cloud provider console/API
      • Or via out-of-band management (IPMI, etc.)
    • Expected result:
      • Node reboots successfully
      • Kubelet starts and rejoins cluster
      • Node returns to Ready state
    • Additional info:
      • Drain the node first if possible: kubectl drain <node-name> (see the example below)
      • Monitor boot process via console
      • Check that kubelet starts automatically
      • Wait for node to report Ready status
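
    Draining usually needs extra flags to get past DaemonSet pods and emptyDir volumes, and an already-unreachable node requires --force (a sketch; check what --delete-emptydir-data discards before using it):
      # Cordon the node and evict its pods before rebooting
      kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
      # After the node is back and Ready, allow scheduling again
      kubectl uncordon <node-name>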

  13. Replace the node if hardware failure is confirmed

    • Command / Action:
      • If hardware failure confirmed, replace node
      • Remove node from cluster: kubectl delete node <node-name>
      • For cloud: Terminate instance and launch new one
      • For physical: Replace hardware and reinstall
      • Join new node to cluster
    • Expected result:
      • Old node removed from cluster
      • New node joins successfully
      • Workloads rescheduled
    • Additional info:
      • Ensure cluster has capacity for workload migration
      • Backup any persistent data if possible
      • Update inventory/documentation
      • Monitor new node for stability
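
    If the cluster was bootstrapped with kubeadm (an assumption; managed clusters replace nodes via node pools or instance groups), a join command for the replacement node can be generated like this sketch:
      # On a control plane node: create a bootstrap token and print the matching join command
      kubeadm token create --print-join-command
      # Run the printed kubeadm join command on the new node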

  14. Investigate network infrastructure

    • Command / Action:
      • Check network switches, routers, and firewalls
      • Review network topology
      • Check for network partitions
      • Review recent network changes
      • Check with network team for issues
    • Expected result:
      • Identify any network infrastructure problems
      • No network equipment failures
    • Additional info:
      • Network partition can isolate multiple nodes
      • Switch failures can affect rack or zone
      • Check for fiber cuts or cable issues
      • Review BGP/routing if using advanced networking

  15. Verify node recovery and stability

    • Command / Action:
      • Confirm node is back online and stable
      • kubectl get node <node-name>

      • kubectl describe node <node-name>

      • Verify pods are scheduled and running
      • kubectl get pods -A --field-selector spec.nodeName=<node-name>

    • Expected result:
      • Node shows Ready status
      • All conditions healthy
      • Pods successfully running
    • Additional info:
      • Monitor the node for at least 24 hours (see the watch commands below)
      • Watch for recurring unreachability
      • Document the root cause and resolution
      • Consider whether similar nodes are at risk
      • Update runbooks or procedures if needed
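
    To keep an eye on the node while it stabilizes, watch its status and dump all conditions in one pass (a sketch; <node-name> is a placeholder):
      # Watch the node's status column until it settles on Ready
      kubectl get node <node-name> --watch
      # Print every condition as type=status for a quick health summary
      kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'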

  16. Post-incident review and prevention

    • Command / Action:
      • Document the incident and root cause
      • Review monitoring and alerting
      • Identify preventive measures
      • Update infrastructure as needed
    • Expected result:
      • Clear understanding of what caused unreachability
      • Actions taken to prevent recurrence
    • Additional info:
      • Consider redundancy improvements
      • Review capacity planning
      • Improve monitoring and alerting
      • Train team on incident procedures
      • Update disaster recovery plans

Additional resources