Alert Runbooks

KubeNodeNotReady

Description

A Kubernetes node has been in NotReady state for more than 15 minutes.

When a node is NotReady, the kubelet cannot communicate properly with the control plane, or the node is experiencing conditions that prevent it from accepting new pods. This significantly impacts cluster capacity and availability, as pods cannot be scheduled to NotReady nodes, and existing pods may be evicted if the condition persists.


Possible Causes:

  • Kubelet service stopped, crash-looping, or unable to authenticate (e.g. expired certificates)
  • Container runtime (containerd, Docker, CRI-O) down or unresponsive
  • Resource exhaustion on the node: DiskPressure, MemoryPressure, or PIDPressure
  • Network problems: the node cannot reach the API server, or the CNI plugin is failing
  • Kernel, hardware, or underlying infrastructure failures

Severity estimation

High to Critical, depending on the size of the cluster and the number of nodes affected.

Impact assessment:

  • Cluster capacity is reduced: new pods cannot be scheduled to the NotReady node
  • Existing pods on the node are evicted if the condition persists (by default after ~5 minutes)
  • Workloads without enough replicas on other nodes may become unavailable
  • In small clusters, a single NotReady node can remove a large share of total capacity

Troubleshooting steps

  1. Identify NotReady nodes and check their status

    • Command / Action:
      • List all nodes and their conditions
      • kubectl get nodes

      • kubectl get nodes -o wide

      • kubectl describe node <node-name>

    • Expected result:
      • Nodes should be in Ready state
      • NAME     STATUS   ROLES    AGE   VERSION

      • node-1   Ready    worker   10d   v1.28.0

    • Additional info (see the example below):
      • NotReady nodes will show STATUS: NotReady
      • Check node conditions for specific issues (DiskPressure, MemoryPressure, PIDPressure)
      • Note the LastHeartbeatTime to see when kubelet last reported
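    • Example: a quick filter that surfaces only unhealthy nodes (a sketch; cordoned nodes also match because their STATUS is not exactly "Ready"):
      • kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'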

  2. Check node conditions for specific problems

    • Command / Action:
      • Examine detailed node conditions
      • kubectl describe node <node-name> | grep -A 10 Conditions

      • kubectl get node <node-name> -o json | jq '.status.conditions'

    • Expected result:
      • All conditions should show False except Ready: True
      • Look for MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
    • Additional info (see the example below):
      • Condition messages provide specific failure reasons
      • Common conditions: DiskPressure (disk full), MemoryPressure (low memory), PIDPressure (too many processes)
      • Ready condition False indicates kubelet is not healthy
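    • Example: a jq filter that prints only conditions in an unhealthy state (a sketch; Ready should be "True" and every other condition "False", so anything else is flagged):
      • kubectl get node <node-name> -o json | jq '.status.conditions[] | select((.type == "Ready" and .status != "True") or (.type != "Ready" and .status == "True"))'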

  3. Check kubelet service status on the affected node

    • Command / Action:
      • SSH to the node and check kubelet service
      • systemctl status kubelet

      • systemctl is-active kubelet

    • Expected result:
      • Kubelet service should be active (running)
      • ● kubelet.service - kubelet: The Kubernetes Node Agent

      • Active: active (running)

    • Additional info (see the example below):
      • If kubelet is not running, the node will be NotReady
      • Check for recent restarts or crash loops
      • Restart if necessary: systemctl restart kubelet
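    • Example: a quick check for a kubelet crash loop (a sketch; NRestarts requires a reasonably recent systemd, and the exact journal wording varies by distribution):
      • systemctl show kubelet --property=NRestarts
      • journalctl -u kubelet --since today --no-pager | grep -i "started kubelet" | tail -5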

  4. Review kubelet logs for errors

    • Command / Action:
      • Check kubelet logs on the affected node
      • journalctl -u kubelet -n 100 --no-pager

      • journalctl -u kubelet -f

    • Expected result:
      • No critical errors in kubelet logs
      • Kubelet should be reporting node status successfully
    • Additional info (see the example below):
      • Look for API server connectivity errors
      • Check for certificate authentication failures
      • Look for container runtime errors (containerd, docker, CRI-O)
      • Watch for "Failed to update node status" errors
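    • Example: a rough grep for common failure signatures in recent kubelet logs (a sketch; tune the patterns and time window to your environment):
      • journalctl -u kubelet --since "30 minutes ago" --no-pager | grep -iE "error|failed|certificate|timeout|connection refused" | tail -50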

  5. Check disk space on the node

    • Command / Action:
      • Verify disk usage on the node
      • SSH to the node
      • df -h

      • df -i (check inode usage)

    • Expected result:
      • Sufficient disk space available (< 85% usage recommended)
      • Sufficient inodes available
    • Additional info (see the example below):
      • DiskPressure is raised when the kubelet's eviction thresholds are crossed (by default, less than 10% of node filesystem space or 15% of image filesystem space available; configurable)
      • Check both the root filesystem and /var/lib/kubelet
      • Container logs and images consume significant space
      • Clean up unused images: docker image prune, or crictl rmi --prune (newer crictl versions)
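    • Example: a quick look at which kubelet/runtime directories are filling the disk (a sketch; the paths are common defaults and may differ on your nodes):
      • du -sh /var/lib/kubelet /var/lib/containerd /var/log/pods 2>/dev/null
      • df -h / /var/lib/kubelet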

  6. Check node resource usage

    • Command / Action:
      • Review CPU, memory, and process usage
      • top

      • free -h

      • kubectl top node <node-name>

    • Expected result:
      • Sufficient resources available for kubelet operation
      • No OOM conditions or excessive swapping
    • Additional info (see the example below):
      • High memory pressure can cause node to be marked NotReady
      • Check for memory leaks or runaway processes
      • Review pod resource requests vs actual node capacity
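    • Example: identify the processes consuming the most memory and CPU (a sketch using standard procps options):
      • ps aux --sort=-%mem | head -n 10
      • ps aux --sort=-%cpu | head -n 10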

  7. Verify container runtime status

    • Command / Action:
      • Check container runtime service (containerd, docker, CRI-O)
      • systemctl status containerd # or docker, crio

      • crictl ps # verify runtime is responding

    • Expected result:
      • Container runtime should be active and responsive
      • Should be able to list containers
    • Additional info (see the example below):
      • Kubelet requires a working container runtime
      • If runtime is down or unresponsive, restart it
      • Check runtime logs: journalctl -u containerd
      • Verify runtime socket is accessible
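    • Example: confirm the runtime socket answers (a sketch; the endpoint shown is the common containerd default and differs for other runtimes, e.g. /var/run/crio/crio.sock for CRI-O):
      • crictl --runtime-endpoint unix:///run/containerd/containerd.sock info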

  8. Test API server connectivity from the node

    • Command / Action:
      • Verify node can reach the API server
      • SSH to the node
      • curl -k https://<api-server>:6443/healthz

      • ping <api-server-ip>

      • Check DNS resolution: nslookup kubernetes.default
    • Expected result:
      • API server is reachable from the node
      • DNS resolution works correctly
      • Network connectivity is stable
    • Additional info (see the example below):
      • Kubelet must communicate with API server to report status
      • Check firewall rules and security groups
      • Verify network plugins (CNI) are working
      • Test connectivity to cluster DNS
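    • Example: find the API server endpoint the kubelet actually uses and probe it (a sketch; the kubeconfig path is the kubeadm default and may differ in your setup):
      • grep server: /etc/kubernetes/kubelet.conf
      • curl -k https://<api-server>:6443/healthz # expected response: ok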

  9. Check kubelet certificates

    • Command / Action:
      • Verify kubelet certificates are valid
      • openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

      • Check kubelet logs for certificate errors
    • Expected result:
      • Certificates should be valid and not expired
      • No certificate authentication errors in logs
    • Additional info (see the example below):
      • Expired certificates prevent kubelet from authenticating
      • Enable certificate rotation if not configured
      • Check for pending CSRs: kubectl get csr
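    • Example: list certificate signing requests and approve a legitimate pending one (a sketch; a node stuck with a Pending CSR cannot renew its client certificate, so verify the requester before approving):
      • kubectl get csr
      • kubectl certificate approve <csr-name>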

  10. Review network plugin status

    • Command / Action:
      • Check CNI plugin health on the node
      • kubectl get pods -n kube-system -o wide | grep <node-name>

      • Check CNI plugin logs (Calico, Cilium, Weave, etc.)
    • Expected result:
      • CNI pods on the node should be Running
      • No network plugin errors
    • Additional info (see the example below):
      • Network issues can cause NotReady state
      • Verify CNI configuration files in /etc/cni/net.d/
      • Check for network interface issues: ip addr show
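    • Example: confirm a CNI configuration exists on the node and locate the CNI agent pod scheduled there (a sketch; pod names depend on the plugin in use):
      • ls -l /etc/cni/net.d/
      • kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<node-name> | grep -iE "calico|cilium|weave|flannel"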

  11. Check for kernel or system issues

    • Command / Action:
      • Review system logs for critical errors
      • dmesg | tail -100

      • journalctl -p err -n 50

      • Check for OOM kills: dmesg | grep -i "out of memory"
    • Expected result:
      • No critical kernel errors or OOM kills
      • System should be stable
    • Additional info (see the example below):
      • Kernel panics or hardware failures can cause NotReady
      • Check for I/O errors or disk failures
      • Review cloud provider logs for infrastructure issues
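    • Example: a rough scan for disk and hung-task problems that often precede a NotReady node (a sketch; the patterns are illustrative, not exhaustive):
      • dmesg -T | grep -iE "i/o error|read-only|blocked for more than" | tail -20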

  12. Check pods on the NotReady node

    • Command / Action:
      • List pods running on the affected node
      • kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

      • Check if critical system pods are running
    • Expected result:
      • System pods (kube-proxy, CNI, monitoring) should be running
      • Pod status may show issues if node is NotReady
    • Additional info:
      • Pods will be evicted after node is NotReady for ~5 minutes
      • Critical pods may have tolerations to stay longer
      • Check if pod issues are causing node problems

  13. Restart kubelet service

    • Command / Action:
      • Restart kubelet to attempt recovery
      • SSH to the node
      • systemctl restart kubelet

      • Monitor status: journalctl -u kubelet -f
    • Expected result:
      • Kubelet restarts successfully
      • Node returns to Ready state within a few minutes
      • kubectl get node <node-name> shows Ready

    • Additional info (see the example below):
      • Wait 1-2 minutes after restart for node to report status
      • Check kubelet logs for any startup errors
      • If the restart doesn't help, deeper investigation is needed
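    • Example: watch the node's status until it transitions back to Ready:
      • kubectl get node <node-name> -w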

  14. Drain and reboot node if necessary

    • Command / Action:
      • Safely drain the node and reboot
      • kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

      • SSH to node and reboot: sudo reboot
      • After reboot, uncordon: kubectl uncordon <node-name>
    • Expected result:
      • Node reboots cleanly
      • Returns to Ready state after boot
      • Pods can be scheduled again
    • Additional info (see the example below):
      • Only drain if node is truly unhealthy
      • Coordinate with teams running workloads
      • Ensure cluster has capacity to reschedule pods
      • Monitor for pods successfully rescheduling
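    • Example: a rough pre-drain check that the remaining nodes have headroom to absorb evicted pods (a sketch; compare requested against allocatable resources per node):
      • kubectl describe nodes | grep -A 5 "Allocated resources"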

  15. Verify node recovery and pod scheduling

    • Command / Action:
      • Confirm node is Ready and accepting pods
      • kubectl get node <node-name>

      • kubectl describe node <node-name> | grep -A 5 Conditions

      • Check that pods can schedule to the node
    • Expected result:
      • Node shows Ready status
      • All node conditions are healthy
      • Pods successfully schedule and run on the node
    • Additional info (see the example below):
      • Monitor node for stability over time
      • Check if issue is recurring (may indicate hardware problems)
      • Review capacity and ensure node is not overcommitted
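    • Example: schedule a throwaway pod onto the recovered node, then clean it up (a sketch; the pod name and image are illustrative, and it assumes the node's kubernetes.io/hostname label matches its name, which is the usual default):
      • kubectl run node-smoke-test --image=registry.k8s.io/pause:3.9 --restart=Never --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"<node-name>"}}}'
      • kubectl get pod node-smoke-test -o wide
      • kubectl delete pod node-smoke-test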

Additional resources