Alert Runbooks

KubePersistentVolumeErrors

Description

This alert fires when one or more PersistentVolumes (PVs) are in a Failed or Pending phase, indicating that the volume cannot be provisioned, bound, or mounted correctly.

Pods depending on the affected PV will be unable to start or will crash, potentially causing data unavailability, service outages, or data loss if the volume backs a stateful workload.


Possible Causes:


Severity estimation

High severity: stateful workloads relying on the affected PV are likely failing.

Severity increases with:


Troubleshooting steps

  1. Identify the PVs in error state

    • Command / Action:
      • List all PVs and filter for those in Failed or Pending phase
      • kubectl get pv | grep -E 'Failed|Pending'

    • Expected result:
      • A list of affected PV names, their status, and associated claim
    • Additional info:
      • The CLAIM column shows the affected PVC as <namespace>/<pvc-name>
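Because grep matches anywhere on the line, a PV whose name or claim happens to contain "Pending" or "Failed" would also be listed. The following is a minimal sketch of a stricter filter that checks only the phase column. It assumes kubectl is configured for the affected cluster; the function name filter_bad_pvs and the column labels are illustrative and not part of this runbook.

```shell
# Hypothetical helper: keep only rows whose PHASE column is Failed or Pending.
# Expects four whitespace-separated columns on stdin: NAME PHASE CLAIM_NS CLAIM.
# The first line is treated as a header and skipped.
filter_bad_pvs() {
  awk 'NR > 1 && ($2 == "Failed" || $2 == "Pending") { print $1, $2, $3 "/" $4 }'
}

# Usage (requires cluster access):
#   kubectl get pv -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,CLAIM_NS:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name' | filter_bad_pvs
```

PVs with no claim show <none> in the claim columns, which usually means the volume was never bound.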

  2. Describe the affected PV for error details

    • Command / Action:
      • Inspect the PV object for events and status conditions
      • kubectl describe pv <pv-name>

    • Expected result:
      • Events section shows the specific error message (e.g., provisioning failure, attachment error, backend unreachable)
    • Additional info:
      • The Reason and Message fields in events pinpoint the root cause

  3. Check the associated PVC status

    • Command / Action:
      • Inspect the PVC bound to the failing PV
      • kubectl describe pvc <pvc-name> -n <namespace>

    • Expected result:
      • PVC status is Bound; if Pending or Lost, the binding itself has failed
    • Additional info:
      • A PVC in Lost state means the underlying PV was deleted or became unavailable
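If the PVC name and namespace are not already known, they can be read from the PV's claimRef. This is a small sketch, assuming kubectl access to the cluster; the helper name pv_claim is hypothetical.

```shell
# Hypothetical helper: print "<namespace>/<pvc-name>" for the claim a PV references.
# $1 is the PV name.
pv_claim() {
  kubectl get pv "$1" -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}'
}

# Usage:
#   pv_claim <pv-name>
```

The two parts of the output can then be passed to kubectl describe pvc as the -n value and the PVC name.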

  4. Check the CSI driver or provisioner pod

    • Command / Action:
      • Verify the storage provisioner is running and healthy
      • kubectl get pods -n kube-system | grep -E 'csi|provisioner|nfs'

      • kubectl logs <provisioner-pod> -n kube-system --tail=50

    • Expected result:
      • Provisioner pod is Running with no error logs
    • Additional info:
      • A crashed or restarting provisioner pod will prevent new PV provisioning and may affect existing volumes

  5. Check events in the affected namespace

    • Command / Action:
      • Look for volume-related warning events in the namespace
      • kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -iE 'volume|pvc|pv|mount|attach'

    • Expected result:
      • Events reveal whether the issue is at provisioning, binding, or mounting stage
    • Additional info:
      • Mount errors often indicate a node-level issue; provisioning errors point to the storage backend or StorageClass
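In a busy namespace the event list can be long. As a sketch, the same search can be restricted to Warning events with a field selector before applying the keyword filter. The helper name volume_warnings is hypothetical, and kubectl access to the cluster is assumed.

```shell
# Hypothetical helper: show only Warning events in a namespace that mention
# volume-related keywords, oldest first.
# $1 is the namespace.
volume_warnings() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by='.lastTimestamp' \
    | grep -iE 'volume|pvc|pv|mount|attach'
}

# Usage:
#   volume_warnings <namespace>
```

The function returns a non-zero exit status when no matching events are found, because grep does so when nothing matches.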

  6. Check pods that depend on the affected PVC

    • Command / Action:
      • Find pods stuck in Pending or ContainerCreating due to volume errors
      • kubectl get pods -n <namespace> | grep -vE 'Running|Completed'

      • kubectl describe pod <pod-name> -n <namespace>

    • Expected result:
      • Pod events show "Unable to attach or mount volumes" or similar messages linking back to the PV error
    • Additional info:
      • If the pod is stuck on a specific node, the issue may be a node-level volume attachment problem
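Filtering on pod status finds every unhealthy pod, not only those that use the affected claim. One possible way to list exactly the pods that reference a given PVC is sketched below, assuming kubectl access; the helper name pods_using_pvc is hypothetical.

```shell
# Hypothetical helper: list pods in a namespace that reference a given PVC.
# $1 is the namespace, $2 is the PVC name.
# The jsonpath prints one line per pod: the pod name, a tab, then the claim
# names of its volumes (empty for volumes that are not PVC-backed).
pods_using_pvc() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{" "}{end}{"\n"}{end}' \
    | awk -v c="$2" '{ for (i = 2; i <= NF; i++) if ($i == c) { print $1; break } }'
}

# Usage:
#   pods_using_pvc <namespace> <pvc-name>
```

Adding -o wide to a follow-up kubectl get pod on the results shows the node each pod is scheduled on, which helps when checking for a node-level attachment problem.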

  7. Verify the storage backend is reachable

    • Command / Action:
      • Confirm the underlying storage system (NFS, EBS, GCE PD, Azure Disk, Ceph, etc.) is accessible
      • For cloud volumes, check the cloud provider console or CLI for disk/volume status
      • For NFS: verify the NFS server is reachable from cluster nodes
    • Expected result:
      • The storage backend is online and responding normally
    • Additional info:
      • Backend outages require resolution at the infrastructure level before Kubernetes can recover the PV

  8. Force-detach a stuck volume if the node is gone

    • Command / Action:
      • If a volume is stuck attached to an unreachable node, force-detach it via the cloud provider CLI or by deleting its VolumeAttachment object in Kubernetes
      • kubectl get volumeattachment

      • kubectl delete volumeattachment <attachment-name>

    • Expected result:
      • The volume detaches from the old node and can be reattached to the new one
    • Additional info:
      • Only do this if the original node is confirmed unreachable or terminated; force-detaching from a live node risks data corruption
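To avoid deleting the wrong attachment, it may help to first list only the VolumeAttachment objects that point at the node in question. The sketch below is read-only and assumes kubectl access; the helper name attachments_on_node is hypothetical.

```shell
# Hypothetical helper: print "<attachment-name> <pv-name>" for every
# VolumeAttachment whose spec.nodeName equals the given node.
# $1 is the node name. Nothing is deleted.
attachments_on_node() {
  kubectl get volumeattachment \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName' \
    | awk -v n="$1" 'NR > 1 && $2 == n { print $1, $3 }'
}

# Usage:
#   kubectl get node <node-name>        # confirm the node is NotReady or gone
#   attachments_on_node <node-name>
```

Only the attachments printed for the confirmed-dead node, and whose PV matches the affected volume, are candidates for kubectl delete volumeattachment.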

  9. Recreate the PV or PVC if in a permanently failed state

    • Command / Action:
      • If the PV cannot recover, back up data (if accessible), delete the PV/PVC, and recreate
      • kubectl delete pvc <pvc-name> -n <namespace>

      • kubectl delete pv <pv-name>

    • Expected result:
      • A new PV is provisioned and bound to a new PVC; workloads resume
    • Additional info:
      • Before deleting, set the PV's persistentVolumeReclaimPolicy to Retain if the underlying data must be preserved; with the Delete policy, deleting the PVC also deletes the PV and its backing storage
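The reclaim policy can be checked and changed on the live PV object with kubectl patch. This is a sketch assuming kubectl access and permission to modify PVs; the helper names pv_reclaim_policy and retain_pv are hypothetical.

```shell
# Hypothetical helper: print the current reclaim policy of a PV.
# $1 is the PV name.
pv_reclaim_policy() {
  kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
}

# Hypothetical helper: switch a PV to the Retain policy so that deleting
# its PVC does not remove the underlying storage.
# $1 is the PV name.
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage:
#   retain_pv <pv-name> && pv_reclaim_policy <pv-name>
```

Running the patch before the delete commands in this step keeps the backing volume available for a later restore or manual rebind.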

  10. Confirm recovery and workload restart

    • Command / Action:
      • Verify PV is back to Bound state and dependent pods are running
      • kubectl get pv

      • kubectl get pods -n <namespace>

    • Expected result:
      • PV phase is Bound; all dependent pods are in Running state
    • Additional info:
      • If pods do not restart automatically, delete them to trigger a fresh scheduling cycle: kubectl delete pod <pod-name> -n <namespace>

Additional resources