KubePersistentVolumeErrors

Description

This alert fires when one or more PersistentVolumes (PVs) are in a Failed or Pending phase, indicating that the volume cannot be provisioned, bound, or mounted correctly.

Pods depending on the affected PV will be unable to start or will crash, potentially causing data unavailability, service outages, or data loss if the volume backs a stateful workload.

Possible causes

StorageClass provisioner is unavailable or misconfigured
Underlying storage backend is unreachable (NFS server down, cloud disk API errors, etc.)
PersistentVolumeClaim (PVC) and PV have incompatible access modes or storage class
PV was manually deleted or reclaimed while a PVC was still bound to it
Insufficient capacity in the storage backend to provision the requested volume
Node where the volume is attached is unreachable or being drained
CSI driver pod is not running or has errors
Volume attachment stuck after a node failure or restart

Severity estimation

Default severity is High, because stateful workloads relying on the affected PV are likely failing. Adjust according to impact:

Medium: Non-critical workload affected; service degraded but not fully down
High: Production stateful service (database, message queue, cache) cannot start or access data
Critical: Data-bearing volumes in a failed state with risk of data loss or corruption

Severity increases with:

Number of PVs in error state
Criticality of the workloads depending on the affected volumes
Duration of the error state
Whether the volume holds persistent data vs. ephemeral state

Troubleshooting steps

Identify the PVs in error state
- Command / Action:
  - List all PVs and filter for those in Failed or Pending phase
  - kubectl get pv | grep -E 'Failed|Pending'
- Expected result:
  - A list of affected PV names, their status, and associated claim
- additional info:
  - Note the CLAIM column (shown as namespace/pvc-name) to identify which PVC and namespace are affected; PVs themselves are cluster-scoped
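
The grep in this step matches anywhere in the row, so a PV or claim whose name happens to contain "Failed" or "Pending" would also be listed. A minimal sketch of a stricter filter on the phase column follows; the function name list_problem_pvs and the column layout are illustrative choices, not part of kubectl.

```shell
# Hypothetical helper (the name is illustrative, not a kubectl feature).
# Prints PVs whose phase is exactly Failed or Pending, plus the claim they reference.
# PVs are cluster-scoped, so no namespace flag is needed.
list_problem_pvs() {
  kubectl get pv --no-headers \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,CLAIM_NS:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name' \
    | awk '$2 == "Failed" || $2 == "Pending"'
}
```

Usage: run list_problem_pvs after defining the function. PVs with no claim reference show <none> in the last two columns.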

Describe the affected PV for error details
- Command / Action:
  - Inspect the PV object for events and status conditions
  - kubectl describe pv <pv-name>
- Expected result:
  - Events section shows the specific error message (e.g., provisioning failure, attachment error, backend unreachable)
- additional info:
  - The Reason and Message fields in events pinpoint the root cause

Check the associated PVC status
- Command / Action:
  - Inspect the PVC bound to the failing PV
  - kubectl describe pvc <pvc-name> -n <namespace>
- Expected result:
  - PVC status is Bound; if Pending or Lost, the binding itself has failed
- additional info:
  - A PVC in Lost state means the underlying PV was deleted or became unavailable
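
When only the phase and the bound volume name are needed, a jsonpath query is quicker to read than the full describe output. The sketch below assumes nothing beyond standard PVC fields; the function name pvc_summary is hypothetical.

```shell
# Hypothetical helper: prints "<phase><TAB><bound PV name>" for one PVC.
# $1 = namespace, $2 = PVC name
pvc_summary() {
  kubectl get pvc "$2" -n "$1" \
    -o jsonpath='{.status.phase}{"\t"}{.spec.volumeName}{"\n"}'
}
```

Usage: pvc_summary <namespace> <pvc-name>. An empty volume name alongside a Pending phase suggests that no PV has been matched or provisioned for the claim yet.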

Check the CSI driver or provisioner pod
- Command / Action:
  - Verify the storage provisioner is running and healthy
  - kubectl get pods -n kube-system | grep -E 'csi|provisioner|nfs'
  - kubectl logs <provisioner-pod> -n kube-system --tail=50
- Expected result:
  - Provisioner pod is Running with no error logs
- additional info:
  - A crashed or restarting provisioner pod will prevent new PV provisioning and may affect existing volumes
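
CSI pods commonly run several containers (the driver plus sidecars), so a plain kubectl logs call may show only one of them, and a pod that has just restarted may hide the crash in its previous run. A rough sketch that collects both follows; the function name provisioner_logs is hypothetical, and kube-system is assumed as the default namespace because this step uses it.

```shell
# Hypothetical helper: recent logs from every container of a provisioner pod,
# first from the current run, then from the previous run if one exists.
# $1 = pod name, $2 = namespace (defaults to kube-system, as assumed in this step)
provisioner_logs() {
  ns="${2:-kube-system}"
  kubectl logs "$1" -n "$ns" --all-containers --prefix --tail=50
  # --previous fails when no container has restarted; that is not an error here.
  kubectl logs "$1" -n "$ns" --all-containers --prefix --previous --tail=50 || true
}
```

Usage: provisioner_logs <provisioner-pod>. The --prefix flag labels each line with its pod and container, which helps separate driver output from sidecar output.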

Check events in the affected namespace
- Command / Action:
  - Look for volume-related warning events in the namespace
  - kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -iE 'volume|pvc|pv|mount|attach'
- Expected result:
  - Events reveal whether the issue is at provisioning, binding, or mounting stage
- additional info:
  - Mount errors often indicate a node-level issue; provisioning errors point to the storage backend or StorageClass
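
In a busy namespace the grep can return a lot of noise. As an alternative, events can be filtered server-side by type and by the object they refer to. The sketch below is one way to do that; the function name warning_events_for is hypothetical.

```shell
# Hypothetical helper: Warning events for a single object, oldest first.
# $1 = namespace, $2 = kind (for example PersistentVolumeClaim or Pod), $3 = object name
warning_events_for() {
  kubectl get events -n "$1" \
    --field-selector "type=Warning,involvedObject.kind=$2,involvedObject.name=$3" \
    --sort-by=.lastTimestamp
}
```

Usage: warning_events_for <namespace> PersistentVolumeClaim <pvc-name> for provisioning and binding problems, and warning_events_for <namespace> Pod <pod-name> for attach and mount problems, since those events are recorded against the pod.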

Check pods that depend on the affected PVC
- Command / Action:
  - Find pods stuck in Pending or ContainerCreating due to volume errors
  - kubectl get pods -n <namespace> | grep -vE 'Running|Completed'
  - kubectl describe pod <pod-name> -n <namespace>
- Expected result:
  - Pod events show "Unable to attach or mount volumes" or a similar message (event reasons such as FailedMount or FailedAttachVolume) linking back to the PV error
- additional info:
  - If the pod is stuck on a specific node, the issue may be a node-level volume attachment problem
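
Filtering on pod status does not show which pods actually reference the affected claim. A possible sketch that lists only the pods mounting a given PVC, together with the node each one is scheduled on, is shown below; the function name pods_using_pvc is hypothetical.

```shell
# Hypothetical helper: prints "<pod><TAB><node>" for pods in a namespace that mount a given PVC.
# The node column is empty for pods that have not been scheduled yet.
# $1 = namespace, $2 = PVC name
pods_using_pvc() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{" "}{end}{"\n"}{end}' \
    | awk -F '\t' -v claim="$2" '{
        n = split($3, claims, " ")
        for (i = 1; i <= n; i++) if (claims[i] == claim) { print $1 "\t" $2; break }
      }'
}
```

Usage: pods_using_pvc <namespace> <pvc-name>. If every listed pod sits on the same node, that points toward a node-level attachment problem rather than a backend failure.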

Verify the storage backend is reachable
- Command / Action:
  - Confirm the underlying storage system (NFS, EBS, GCE PD, Azure Disk, Ceph, etc.) is accessible
  - For cloud volumes, check the cloud provider console or CLI for disk/volume status
  - For NFS: verify the NFS server is reachable from cluster nodes
- Expected result:
  - The storage backend is online and responding normally
- additional info:
  - Backend outages require resolution at the infrastructure level before Kubernetes can recover the PV
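
For the NFS case, a basic TCP check from a cluster node can separate a network or server outage from a Kubernetes-side problem. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the node; the function name tcp_reachable is hypothetical, and 2049 is the standard NFS port.

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds within the timeout.
# Relies on bash's /dev/tcp feature and on the `timeout` command being installed.
# $1 = host, $2 = port (defaults to 2049, the standard NFS port), $3 = timeout in seconds (defaults to 5)
tcp_reachable() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-2049}" 2>/dev/null
}
```

Usage: tcp_reachable <nfs-server> && echo reachable || echo unreachable. A successful connection only shows that the port is open; it does not prove that the export can be mounted.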

Force-detach a stuck volume if the node is gone
- Command / Action:
  - If a volume is stuck attached to an unreachable node, force-detach via the cloud provider CLI or Kubernetes
  - kubectl get volumeattachment
  - kubectl delete volumeattachment <attachment-name>
- Expected result:
  - The volume detaches from the old node and can be reattached to the new one
- additional info:
  - Only do this if the original node is confirmed unreachable or terminated; force-detaching from a live node risks data corruption
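
Before deleting anything, it helps to narrow the list to the attachments that reference the failed node. A rough sketch follows; the function name attachments_on_node is hypothetical, and the columns are one possible selection of standard VolumeAttachment fields.

```shell
# Hypothetical helper: lists VolumeAttachments that reference a given node.
# Output columns: attachment name, PV name, node, attached (true/false).
# $1 = node name
attachments_on_node() {
  kubectl get volumeattachment --no-headers \
    -o custom-columns='NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached' \
    | awk -v node="$1" '$3 == node'
}
```

Usage: confirm with kubectl get node <node-name> that the node is NotReady or no longer exists, then run attachments_on_node <node-name> and delete only the attachments listed for the affected PV.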

Recreate the PV or PVC if in a permanently failed state
- Command / Action:
  - If the PV cannot recover, back up data (if accessible), delete the PV/PVC, and recreate
  - kubectl delete pvc <pvc-name> -n <namespace>
  - kubectl delete pv <pv-name>
- Expected result:
  - A new PV is provisioned and bound to a new PVC; workloads resume
- additional info:
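  - If you need to preserve the underlying data, set the PV's reclaim policy (persistentVolumeReclaimPolicy) to Retain before deleting the PVC; with a Delete policy, removing the PVC of a dynamically provisioned volume also removes the backing storage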
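
The reclaim policy can be changed on an existing PV with kubectl patch. A minimal sketch that applies the change and then reads the value back is shown below; the function name retain_pv is hypothetical.

```shell
# Hypothetical helper: switch a PV to the Retain reclaim policy and print the resulting policy.
# $1 = PV name
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' &&
    kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"\n"}'
}
```

Usage: retain_pv <pv-name>, run before the delete commands in this step. The second line of output should read Retain.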

Confirm recovery and workload restart
- Command / Action:
  - Verify PV is back to Bound state and dependent pods are running
  - kubectl get pv
  - kubectl get pods -n <namespace>
- Expected result:
  - PV phase is Bound; all dependent pods are in Running state
- additional info:
  - If pods do not restart automatically, delete them to trigger a fresh scheduling cycle: kubectl delete pod <pod-name> -n <namespace>
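
Instead of polling by hand, kubectl wait can block until the claim is bound and the pod is ready. The sketch below assumes a reasonably recent kubectl, since the --for=jsonpath form is not available in older releases (it appeared around v1.23); the function name wait_for_recovery is hypothetical.

```shell
# Hypothetical helper: block until a PVC is Bound and a pod is Ready, or time out.
# Assumes a kubectl version that supports --for=jsonpath.
# $1 = namespace, $2 = PVC name, $3 = pod name, $4 = timeout (defaults to 120s)
wait_for_recovery() {
  t="${4:-120s}"
  kubectl wait -n "$1" --for=jsonpath='{.status.phase}'=Bound "pvc/$2" --timeout="$t" &&
    kubectl wait -n "$1" --for=condition=Ready "pod/$3" --timeout="$t"
}
```

Usage: wait_for_recovery <namespace> <pvc-name> <pod-name>. A non-zero exit status means one of the two conditions was not met within the timeout.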

Additional resources

Kubernetes Persistent Volumes
Troubleshooting PVCs
CSI Driver troubleshooting
Kubernetes Storage Classes
Related alert: KubePersistentVolumeFillingUp