Alert Runbooks

KubeNodePressure

Description

A Kubernetes node is experiencing resource pressure (DiskPressure, MemoryPressure, or PIDPressure).

When a node experiences pressure conditions, the kubelet begins evicting pods to reclaim resources and prevent node failure. This alert indicates the node is approaching or has exceeded resource thresholds, which can lead to pod evictions, degraded performance, and potential node instability.

Types of pressure conditions:

  • MemoryPressure: available memory on the node has fallen below the kubelet eviction threshold.
  • DiskPressure: free disk space or inodes on the node filesystem (or the container image filesystem) have fallen below the eviction threshold.
  • PIDPressure: the number of process IDs in use on the node is approaching the allowed maximum.

Possible Causes:

DiskPressure:
  • Accumulation of unused container images or terminated pods that were never cleaned up
  • Container logs growing without rotation
  • Large emptyDir or hostPath volumes filling the node disk
  • Inode exhaustion even when free space remains

MemoryPressure:
  • Pods without memory limits consuming unbounded memory
  • Application memory leaks
  • Node overcommitment (total memory requests or usage too close to capacity)

PIDPressure:
  • Process or thread leaks in workloads, including fork bombs
  • Missing or overly generous podPidsLimit configuration

Severity estimation

Medium to High severity, depending on pressure type and duration.

Impact assessment:

  • Pod evictions on the affected node, with BestEffort and Burstable workloads evicted first
  • Degraded performance for workloads remaining on the node
  • Additional load on other nodes as evicted pods are rescheduled
  • Potential node instability or NotReady status if pressure is not relieved

Troubleshooting steps

  1. Identify nodes with pressure and pressure type

    • Command / Action:
      • Check node conditions to identify pressure type
      • kubectl get nodes -o wide

      • kubectl describe node <node-name> | grep -A 10 Conditions

      • kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: .status.conditions}'

    • Expected result:
      • Identify which nodes have pressure conditions
      • Determine if DiskPressure, MemoryPressure, or PIDPressure
    • additional info:
      • Conditions show True when pressure exists
      • Check condition messages for threshold details
      • Look for MemoryPressure, DiskPressure, PIDPressure conditions
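
    A compact way to list only the nodes that are actually reporting a pressure condition, assuming jq 1.5+ is available (this filter is an illustration, not part of the alert definition):

      kubectl get nodes -o json | jq -r '
        .items[] as $n
        | $n.status.conditions[]
        | select(.type | IN("MemoryPressure","DiskPressure","PIDPressure"))
        | select(.status == "True")
        | "\($n.metadata.name)\t\(.type)\t\(.message)"'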

  2. Check for pod evictions on the affected node

    • Command / Action:
      • List recent evictions due to pressure
      • kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

      • kubectl get events -A --field-selector involvedObject.kind=Pod,reason=Evicted --sort-by='.lastTimestamp'

    • Expected result:
      • Identify if pods are being evicted
      • Understand eviction reasons and patterns
    • additional info:
      • Evictions indicate kubelet is reclaiming resources
      • BestEffort pods are evicted first, then Burstable
      • Guaranteed pods are evicted only in extreme cases
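
    Evicted pods stay around in Failed phase until they are deleted, so listing them directly shows what the kubelet has already reclaimed (a sketch assuming jq):

      kubectl get pods -A -o json | jq -r '
        .items[]
        | select(.status.reason == "Evicted")
        | "\(.metadata.namespace)/\(.metadata.name)\t\(.status.message)"'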

DiskPressure Troubleshooting

  1. Check disk space usage on the node

    • Command / Action:
      • SSH to the node and check disk usage
      • df -h

      • df -i # Check inode usage

      • du -sh /var/lib/* | sort -hr | head -10

    • Expected result:
      • Identify partitions with high usage
      • Root and /var/lib/kubelet should have sufficient space
    • additional info:
      • DiskPressure is driven by kubelet eviction thresholds (defaults: nodefs.available<10%, imagefs.available<15%, nodefs.inodesFree<5%; configurable)
      • Check both disk space and inodes
      • /var/lib/containerd or /var/lib/docker holds images
      • /var/log holds container logs
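
    A quick pass over the paths that usually fill up, plus the configured eviction thresholds; this assumes GNU coreutils df and the default kubelet config path on the node:

      for p in / /var/lib/kubelet /var/lib/containerd /var/log; do
        df -h --output=target,pcent,ipcent "$p" 2>/dev/null | tail -n 1
      done
      grep -B 1 -A 6 -i eviction /var/lib/kubelet/config.yaml   # evictionHard / evictionSoft, if set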

  2. Identify large files and directories

    • Command / Action:
      • Find largest directories consuming disk space
      • du -sh /var/lib/kubelet/pods/* | sort -hr | head -20

      • du -sh /var/log/pods/* | sort -hr | head -20

      • du -sh /var/lib/containerd/* | sort -hr | head -10

    • Expected result:
      • Identify pods or logs consuming excessive space
      • Find container images taking up space
    • additional info:
      • Pod logs and emptyDir volumes are common culprits
      • Check for terminated pods that haven’t been cleaned up
      • Look for large image layers
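
    Pods that finished but were never deleted can still hold data under /var/lib/kubelet/pods; listing them for the node points at easy cleanup candidates (standard field selectors):

      kubectl get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Failed
      kubectl get pods -A --field-selector spec.nodeName=<node-name>,status.phase=Succeeded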

  3. Clean up unused container images

    • Command / Action:
      • Prune unused images to reclaim space
      • crictl images

      • crictl rmi --prune # Remove unused images

      • Or for Docker: docker image prune -a
    • Expected result:
      • Reclaim disk space from unused images
      • Disk usage should decrease
    • additional info:
      • Kubelet has automatic image garbage collection
      • Check --image-gc-high-threshold (default 85%)
      • Manual cleanup may be needed if automatic GC is disabled
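
    In a kubelet configuration file the same knobs appear as imageGCHighThresholdPercent and imageGCLowThresholdPercent; checking them on the node (default config path assumed):

      grep -E 'imageGC(High|Low)ThresholdPercent' /var/lib/kubelet/config.yaml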

  4. Clean up container logs

    • Command / Action:
      • Rotate or remove large log files
      • find /var/log/pods -name "*.log" -size +100M

      • Check kubelet log rotation settings
      • grep -i log /var/lib/kubelet/config.yaml

    • Expected result:
      • Log files are rotated and pruned
      • Container log rotation configured
    • additional info:
      • Configure containerLogMaxSize and containerLogMaxFiles
      • Use log forwarding to external systems
      • Consider using log rotation at pod level
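
    containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5) are KubeletConfiguration fields; a quick check on the node, assuming the default config path:

      # Fields to look for in /var/lib/kubelet/config.yaml, e.g.:
      #   containerLogMaxSize: 10Mi
      #   containerLogMaxFiles: 5
      grep -E 'containerLogMaxSize|containerLogMaxFiles' /var/lib/kubelet/config.yaml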

  5. Review pod emptyDir and hostPath usage

    • Command / Action:
      • Check pods using local disk volumes
      • kubectl get pods -A -o json | jq '.items[] | select(.spec.volumes[]?.emptyDir or .spec.volumes[]?.hostPath) | {name: .metadata.name, namespace: .metadata.namespace, volumes: .spec.volumes}'

      • Check emptyDir sizes on node
    • Expected result:
      • Identify pods with large emptyDir usage
      • Review if emptyDir sizeLimit is set
    • additional info:
      • emptyDir volumes share node disk space
      • Set sizeLimit on emptyDir to prevent unbounded growth
      • Consider using PVCs instead of emptyDir for large data
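
    To spot emptyDir volumes that have no sizeLimit at all, a jq sketch (assuming jq 1.5+):

      kubectl get pods -A -o json | jq -r '
        .items[] as $p
        | $p.spec.volumes[]?
        | select(.emptyDir != null and .emptyDir.sizeLimit == null)
        | "\($p.metadata.namespace)/\($p.metadata.name)\tvolume=\(.name)"'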

MemoryPressure Troubleshooting

  1. Check memory usage on the node

    • Command / Action:
      • Review memory consumption on the node
      • SSH to node
      • free -h

      • top -o %MEM

      • kubectl top node <node-name>

      • kubectl top pods -A --sort-by=memory --no-headers | head -20

    • Expected result:
      • Identify total and available memory
      • Find high memory consuming pods
    • additional info:
      • MemoryPressure triggers when available memory < threshold
      • Default hard eviction threshold: memory.available<100Mi (configurable)
      • Check for memory leaks or unexpected usage
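
    The kubelet's summary stats endpoint exposes the figures the eviction manager works from; a sketch via the API server proxy (requires access to the node proxy subresource, plus jq):

      kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.memory'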

  2. Identify pods without memory limits

    • Command / Action:
      • Find pods running without memory limits
      • kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].resources.limits.memory == null) | {name: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName}'

      • Filter by affected node (see the sketch after this step)
    • Expected result:
      • List of pods without memory limits
      • Pods on the affected node
    • additional info:
      • Pods without limits can consume unbounded memory
      • BestEffort pods (no requests/limits) are evicted first
      • Set memory limits on all pods to prevent issues
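
    A node-scoped view that also shows each pod's QoS class (BestEffort pods are first in the eviction order); a sketch assuming jq:

      kubectl get pods -A -o json | jq -r --arg node "<node-name>" '
        .items[]
        | select(.spec.nodeName == $node)
        | "\(.status.qosClass)\t\(.metadata.namespace)/\(.metadata.name)"' | sort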

  3. Check for memory-hungry pods

    • Command / Action:
      • Identify pods consuming excessive memory
      • kubectl top pods -A --sort-by=memory --field-selector spec.nodeName=<node-name>

      • Describe high memory pods to check limits
      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Identify pods exceeding memory expectations
      • Check if memory usage is within limits
    • additional info:
      • Compare actual usage to memory requests/limits
      • Check application logs for memory leaks
      • Consider increasing node memory or pod limits

  4. Review node memory allocation

    • Command / Action:
      • Check total memory requests vs capacity
      • kubectl describe node <node-name> | grep -A 10 "Allocated resources"

      • Calculate memory overcommit ratio (see the sketch after this step)
    • Expected result:
      • Memory requests should not exceed allocatable
      • Identify if node is overcommitted
    • additional info:
      • Overcommitment can lead to memory pressure
      • Consider adding nodes or reducing pod density
      • Review pod memory requests for accuracy
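
    The Allocated resources block already reports requests and limits as a percentage of allocatable, so pulling out the memory row is a quick overcommit check (a sketch):

      kubectl describe node <node-name> | grep -A 8 'Allocated resources' | grep -E '^ *memory'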

PIDPressure Troubleshooting

  1. Check process count on the node

    • Command / Action:
      • Count running processes on the node
      • SSH to node
      • ps aux | wc -l

      • cat /proc/sys/kernel/pid_max # Check PID limit

      • kubectl describe node <node-name> | grep -i pid

    • Expected result:
      • Process count within PID limits
      • Identify if approaching PID limit
    • additional info:
      • PIDPressure triggers when PIDs approach max allowed
      • Default kubelet PID limit per pod varies
      • Check --pod-max-pids kubelet setting
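
    Threads count against the kernel PID limit too, so comparing the thread count with pid_max gives a fuller picture (assumes procps ps on the node):

      echo "pid_max:        $(cat /proc/sys/kernel/pid_max)"
      echo "threads in use: $(ps -eLf --no-headers | wc -l)"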

  2. Identify processes consuming PIDs

    • Command / Action:
      • Find containers with high process counts
      • ps -eo pid,ppid,comm,args | grep -iE "docker|containerd" | wc -l

      • Check per-container process counts (see the cgroup sketch after this step)
      • crictl ps -q | xargs -I {} crictl inspect {} | jq '.info.pid'

    • Expected result:
      • Identify containers spawning excessive processes
      • Find potential fork bombs or process leaks
    • additional info:
      • Look for unusual process multiplication
      • Check application logs for errors
      • May need to restart offending pods
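
    Per-pod process counts are exposed by the pids cgroup controller; a sketch assuming cgroup v2 with the systemd cgroup driver (under cgroup v1 the files live beneath /sys/fs/cgroup/pids/ instead):

      find /sys/fs/cgroup/kubepods.slice -name pids.current 2>/dev/null \
        | while read -r f; do printf '%6s  %s\n' "$(cat "$f")" "${f%/pids.current}"; done \
        | sort -rn | head -20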

  3. Review pod PID limits

    • Command / Action:
      • Check kubelet PID configuration
      • ps aux | grep kubelet | grep -i pid

      • cat /var/lib/kubelet/config.yaml | grep -i pid

    • Expected result:
      • PID limits configured appropriately
      • podPidsLimit should be set
    • additional info:
      • Default podPidsLimit varies by Kubernetes version
      • Consider increasing if legitimate workloads need more PIDs
      • Set limits to prevent process exhaustion

General Recovery Steps

  1. Cordon node to prevent new pods

    • Command / Action:
      • Temporarily prevent pod scheduling to node
      • kubectl cordon <node-name>

    • Expected result:
      • Node marked as SchedulingDisabled
      • No new pods scheduled while investigating
    • additional info:
      • Allows investigation without additional load
      • Uncordon after resolving pressure: kubectl uncordon <node-name>

  2. Scale down or migrate workloads

    • Command / Action:
      • Reduce load on affected node
      • kubectl scale deployment <deployment-name> --replicas=<lower-number> -n <namespace>

      • Or drain node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    • Expected result:
      • Reduced resource pressure on node
      • Workloads moved to other nodes
    • additional info:
      • Only if cluster has capacity elsewhere
      • Monitor pressure conditions after scaling
      • May be temporary fix until proper sizing

  3. Restart kubelet if pressure persists

    • Command / Action:
      • Restart kubelet to reset conditions
      • SSH to node
      • systemctl restart kubelet

      • Monitor node conditions
    • Expected result:
      • Kubelet recalculates resource conditions
      • Pressure condition clears if resources now available
    • additional info:
      • Only restart if root cause addressed
      • Restarting doesn’t fix underlying resource issues
      • Monitor for pressure returning

  4. Verify pressure resolution

    • Command / Action:
      • Confirm pressure condition cleared
      • kubectl describe node <node-name> | grep -A 10 Conditions

      • kubectl get node <node-name> -o json | jq '.status.conditions'

    • Expected result:
      • Pressure conditions show False
      • Node returns to normal operation
      • No new evictions occurring
    • additional info:
      • Monitor node for recurring pressure
      • May indicate need for capacity planning
      • Consider alerting thresholds if false positives
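
    To confirm that evictions have actually stopped, watching eviction events for a few minutes is a simple follow-up (interrupt with Ctrl-C):

      kubectl get events -A --field-selector reason=Evicted --watch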

Additional resources