Alert Runbooks

KubeNodePressure

Description

A Kubernetes node is experiencing resource pressure (DiskPressure, MemoryPressure, or PIDPressure).

When a node experiences pressure conditions, the kubelet begins evicting pods to reclaim resources and prevent node failure. This alert indicates the node is approaching or has exceeded resource thresholds, which can lead to pod evictions, degraded performance, and potential node instability.

Types of pressure conditions:

  • MemoryPressure: available memory on the node has fallen below the kubelet eviction threshold.
  • DiskPressure: free disk space or inodes on the node filesystem (or the container image filesystem) have fallen below the eviction threshold.
  • PIDPressure: the number of process IDs in use on the node is approaching the allowed maximum.

Possible Causes:

DiskPressure:
  • Accumulation of unused container images or terminated pods that were never cleaned up
  • Container logs growing without rotation
  • Large emptyDir or hostPath volumes filling the node disk
  • Inode exhaustion even when free space remains

MemoryPressure:
  • Pods without memory limits consuming unbounded memory
  • Application memory leaks
  • Node overcommitment (total memory requests or usage too close to capacity)

PIDPressure:
  • Process or thread leaks in workloads, including fork bombs
  • Missing or overly generous podPidsLimit configuration

Severity estimation

Medium to High severity, depending on pressure type and duration.

Impact assessment:

  • Pod evictions on the affected node, with BestEffort and Burstable workloads evicted first
  • Degraded performance for workloads remaining on the node
  • Additional load on other nodes as evicted pods are rescheduled
  • Potential node instability or NotReady status if pressure is not relieved

Troubleshooting steps

  1. Identify nodes with pressure and pressure type

    • Command / Action:
      • Check node conditions to identify pressure type
      • kubectl get nodes -o wide

      • kubectl describe node <node-name> | grep -A 10 Conditions

      • kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: .status.conditions}'

    • Expected result:
      • Identify which nodes have pressure conditions
      • Determine if DiskPressure, MemoryPressure, or PIDPressure
    • additional info:
      • Conditions show True when pressure exists
      • Check condition messages for threshold details
      • Look for MemoryPressure, DiskPressure, PIDPressure conditions
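
    A compact way to list only the nodes that are actually reporting a pressure condition, assuming jq 1.5+ is available (this filter is an illustration, not part of the alert definition):

      kubectl get nodes -o json | jq -r '
        .items[] as $n
        | $n.status.conditions[]
        | select(.type | IN("MemoryPressure","DiskPressure","PIDPressure"))
        | select(.status == "True")
        | "\($n.metadata.name)\t\(.type)\t\(.message)"'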

  2. Check for pod evictions on the affected node

    • Command / Action:
      • List recent evictions due to pressure
      • kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

      • kubectl get events -A --field-selector involvedObject.kind=Pod,reason=Evicted --sort-by='.lastTimestamp'

    • Expected result:
      • Identify if pods are being evicted
      • Understand eviction reasons and patterns
    • additional info:
      • Evictions indicate kubelet is reclaiming resources
      • BestEffort pods are evicted first, then Burstable
      • Guaranteed pods are evicted only in extreme cases
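
    Evicted pods stay around in Failed phase until they are deleted, so listing them directly shows what the kubelet has already reclaimed (a sketch assuming jq):

      kubectl get pods -A -o json | jq -r '
        .items[]
        | select(.status.reason == "Evicted")
        | "\(.metadata.namespace)/\(.metadata.name)\t\(.status.message)"'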

DiskPressure Troubleshooting

  1. Check disk space usage on the node

    • Command / Action:
      • SSH to the node and check disk usage
      • df -h

      • df -i # Check inode usage

      • du -sh /var/lib/* | sort -hr | head -10

    • Expected result:
      • Identify partitions with high usage
      • Root and /var/lib/kubelet should have sufficient space
    • additional info:
      • DiskPressure is driven by kubelet eviction thresholds (defaults: nodefs.available<10%, imagefs.available<15%, nodefs.inodesFree<5%; configurable)
      • Check both disk space and inodes
      • /var/lib/containerd or /var/lib/docker holds images
      • /var/log holds container logs
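
    A quick pass over the paths that usually fill up, plus the configured eviction thresholds; this assumes GNU coreutils df and the default kubelet config path on the node:

      for p in / /var/lib/kubelet /var/lib/containerd /var/log; do
        df -h --output=target,pcent,ipcent "$p" 2>/dev/null | tail -n 1
      done
      grep -B 1 -A 6 -i eviction /var/lib/kubelet/config.yaml   # evictionHard / evictionSoft, if set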

  2. Identify large files and directories

    • Command / Action:
      • Find largest directories consuming disk space
      • du -sh /var/lib/kubelet/pods/* | sort -hr | head -20

      • du -sh /var/log/pods/* | sort -hr | head -20

      • du -sh /var/lib/containerd/* | sort -hr | head -10

    • Expected result:
      • Identify pods or logs consuming excessive space
      • Find container images taking up space
    • additional info:
      • Pod logs and emptyDir volumes are common culprits
      • Check for terminated pods that haven’t been cleaned up
      • Look for large image layers
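
    Pods that finished but were never deleted can still hold data under /var/lib/kubelet/pods; listing them for the node points at easy cleanup candidates (standard field selectors):

      kubectl get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Failed
      kubectl get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Succeeded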

  3. Clean up unused container images

    • Command / Action:
      • Prune unused images to reclaim space
      • crictl images

      • crictl rmi --prune # Remove unused images

      • Or for Docker: docker image prune -a
    • Expected result:
      • Reclaim disk space from unused images
      • Disk usage should decrease
    • additional info:
      • Kubelet has automatic image garbage collection
      • Check --image-gc-high-threshold (default 85%)
      • Manual cleanup may be needed if automatic GC is disabled
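
    In a kubelet configuration file the same knobs appear as imageGCHighThresholdPercent and imageGCLowThresholdPercent; checking them on the node (default config path assumed):

      grep -E 'imageGC(High|Low)ThresholdPercent' /var/lib/kubelet/config.yaml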

  4. Clean up container logs

    • Command / Action:
      • Rotate or remove large log files
      • find /var/log/pods -name "*.log" -size +100M

      • Check kubelet log rotation settings
      • grep -i log /var/lib/kubelet/config.yaml

    • Expected result:
      • Log files are rotated and pruned
      • Container log rotation configured
    • additional info:
      • Configure containerLogMaxSize and containerLogMaxFiles
      • Use log forwarding to external systems
      • Consider using log rotation at pod level
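
    containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5) are KubeletConfiguration fields; a quick check on the node, assuming the default config path:

      # Fields to look for in /var/lib/kubelet/config.yaml, e.g.:
      #   containerLogMaxSize: 10Mi
      #   containerLogMaxFiles: 5
      grep -E 'containerLogMaxSize|containerLogMaxFiles' /var/lib/kubelet/config.yaml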

  5. Review pod emptyDir and hostPath usage

    • Command / Action:
      • Check pods using local disk volumes
      • kubectl get pods -A -o json | jq '.items[] | select(.spec.volumes[]?.emptyDir or .spec.volumes[]?.hostPath) | {name: .metadata.name, namespace: .metadata.namespace, volumes: .spec.volumes}'

      • Check emptyDir sizes on node
    • Expected result:
      • Identify pods with large emptyDir usage
      • Review if emptyDir sizeLimit is set
    • additional info:
      • emptyDir volumes share node disk space
      • Set sizeLimit on emptyDir to prevent unbounded growth
      • Consider using PVCs instead of emptyDir for large data
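
    To spot emptyDir volumes that have no sizeLimit at all, a jq sketch (assuming jq 1.5+):

      kubectl get pods -A -o json | jq -r '
        .items[] as $p
        | $p.spec.volumes[]?
        | select(.emptyDir != null and .emptyDir.sizeLimit == null)
        | "\($p.metadata.namespace)/\($p.metadata.name)\tvolume=\(.name)"'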

MemoryPressure Troubleshooting

  1. Check memory usage on the node

    • Command / Action:
      • Review memory consumption on the node
      • SSH to node
      • free -h

      • top -o %MEM

      • kubectl top node <node-name>

      • kubectl top pods -A --sort-by=memory --no-headers | head -20

    • Expected result:
      • Identify total and available memory
      • Find high memory consuming pods
    • additional info:
      • MemoryPressure triggers when available memory < threshold
      • Default hard eviction threshold: memory.available<100Mi (configurable)
      • Check for memory leaks or unexpected usage
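
    The kubelet's summary stats endpoint exposes the figures the eviction manager works from; a sketch via the API server proxy (requires access to the node proxy subresource, plus jq):

      kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.memory'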

  2. Identify pods without memory limits

    • Command / Action:
      • Find pods running without memory limits
      • kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].resources.limits.memory == null) | {name: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName}'

      • Filter by affected node (see the sketch after this step)
    • Expected result:
      • List of pods without memory limits
      • Pods on the affected node
    • additional info:
      • Pods without limits can consume unbounded memory
      • BestEffort pods (no requests/limits) are evicted first
      • Set memory limits on all pods to prevent issues
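
    A node-scoped view that also shows each pod's QoS class (BestEffort pods are first in the eviction order); a sketch assuming jq:

      kubectl get pods -A -o json | jq -r --arg node "<node-name>" '
        .items[]
        | select(.spec.nodeName == $node)
        | "\(.status.qosClass)\t\(.metadata.namespace)/\(.metadata.name)"' | sort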

  3. Check for memory-hungry pods

    • Command / Action:
      • Identify pods consuming excessive memory
      • kubectl top pods -A --sort-by=memory --field-selector spec.nodeName=<node-name>

      • Describe high memory pods to check limits
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Identify pods exceeding memory expectations
      • Check if memory usage is within limits
    • additional info:
      • Compare actual usage to memory requests/limits
      • Check application logs for memory leaks
      • Consider increasing node memory or pod limits

  4. Review node memory allocation

    • Command / Action:
      • Check total memory requests vs capacity
      • kubectl describe node <node-name> | grep -A 10 "Allocated resources"

      • Calculate memory overcommit ratio (see the sketch after this step)
    • Expected result:
      • Memory requests should not exceed allocatable
      • Identify if node is overcommitted
    • additional info:
      • Overcommitment can lead to memory pressure
      • Consider adding nodes or reducing pod density
      • Review pod memory requests for accuracy
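
    The Allocated resources block already reports requests and limits as a percentage of allocatable, so pulling out the memory row is a quick overcommit check (a sketch):

      kubectl describe node <node-name> | grep -A 8 'Allocated resources' | grep -E '^ *memory'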

PIDPressure Troubleshooting

  1. Check process count on the node

    • Command / Action:
      • Count running processes on the node
      • SSH to node
      • ps aux | wc -l

      • cat /proc/sys/kernel/pid_max # Check PID limit

      • kubectl describe node <node-name> | grep -i pid

    • Expected result:
      • Process count within PID limits
      • Identify if approaching PID limit
    • additional info:
      • PIDPressure triggers when PIDs approach max allowed
      • Default kubelet PID limit per pod varies
      • Check --pod-max-pids kubelet setting
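
    Threads count against the kernel PID limit too, so comparing the thread count with pid_max gives a fuller picture (assumes procps ps on the node):

      echo "pid_max:        $(cat /proc/sys/kernel/pid_max)"
      echo "threads in use: $(ps -eLf --no-headers | wc -l)"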

  2. Identify processes consuming PIDs

    • Command / Action:
      • Find containers with high process counts
      • ps -eo pid,ppid,comm,args | grep -iE "docker|containerd" | wc -l

      • Check per-container process counts (see the cgroup sketch after this step)
      • crictl ps -q | xargs -I {} crictl inspect {} | jq '.info.pid'

    • Expected result:
      • Identify containers spawning excessive processes
      • Find potential fork bombs or process leaks
    • additional info:
      • Look for unusual process multiplication
      • Check application logs for errors
      • May need to restart offending pods
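
    Per-pod process counts are exposed by the pids cgroup controller; a sketch assuming cgroup v2 with the systemd cgroup driver (under cgroup v1 the files live beneath /sys/fs/cgroup/pids/ instead):

      find /sys/fs/cgroup/kubepods.slice -name pids.current 2>/dev/null \
        | while read -r f; do printf '%6s  %s\n' "$(cat "$f")" "${f%/pids.current}"; done \
        | sort -rn | head -20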

  3. Review pod PID limits

    • Command / Action:
      • Check kubelet PID configuration
      • ps aux | grep kubelet | grep -i pid

      • cat /var/lib/kubelet/config.yaml | grep -i pid

    • Expected result:
      • PID limits configured appropriately
      • podPidsLimit should be set
    • additional info:
      • Default podPidsLimit varies by Kubernetes version
      • Consider increasing if legitimate workloads need more PIDs
      • Set limits to prevent process exhaustion

General Recovery Steps

  1. Cordon node to prevent new pods

    • Command / Action:
      • Temporarily prevent pod scheduling to node
      • kubectl cordon <node-name>

    • Expected result:
      • Node marked as SchedulingDisabled
      • No new pods scheduled while investigating
    • additional info:
      • Allows investigation without additional load
      • Uncordon after resolving pressure: kubectl uncordon <node-name>

  2. Scale down or migrate workloads

    • Command / Action:
      • Reduce load on affected node
      • kubectl scale deployment <deployment-name> --replicas=<lower-number> -n <namespace>

      • Or drain node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    • Expected result:
      • Reduced resource pressure on node
      • Workloads moved to other nodes
    • additional info:
      • Only if cluster has capacity elsewhere
      • Monitor pressure conditions after scaling
      • May be temporary fix until proper sizing

  3. Restart kubelet if pressure persists

    • Command / Action:
      • Restart kubelet to reset conditions
      • SSH to node
      • systemctl restart kubelet

      • Monitor node conditions
    • Expected result:
      • Kubelet recalculates resource conditions
      • Pressure condition clears if resources now available
    • additional info:
      • Only restart if root cause addressed
      • Restarting doesn’t fix underlying resource issues
      • Monitor for pressure returning

  4. Verify pressure resolution

    • Command / Action:
      • Confirm pressure condition cleared
      • kubectl describe node <node-name> | grep -A 10 Conditions

      • kubectl get node <node-name> -o json | jq '.status.conditions'

    • Expected result:
      • Pressure conditions show False
      • Node returns to normal operation
      • No new evictions occurring
    • additional info:
      • Monitor node for recurring pressure
      • May indicate need for capacity planning
      • Consider alerting thresholds if false positives
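
    To confirm that evictions have actually stopped, watching eviction events for a few minutes is a simple follow-up (interrupt with Ctrl-C):

      kubectl get events -A --field-selector reason=Evicted --watch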

Additional resources