Alert Runbooks

KubePersistentVolumeFillingUp

Description

This alert fires when a PersistentVolume (PV) is running low on available disk space, typically above 85% utilization. If the volume fills up completely, the workload using it will likely crash or become read-only, potentially causing data loss, application errors, or service outages.


Possible Causes:


Severity estimation

Medium to High severity, depending on fill rate and remaining space:

Severity increases with:


Troubleshooting steps

  1. Identify the affected PVC and namespace

    • Command / Action:
      • Check alert labels for the PVC name and namespace, then confirm current usage
      • kubectl get pvc -n <namespace>

      • kubectl describe pvc <pvc-name> -n <namespace>

    • Expected result:
      • The affected PVC is identified and its bound PV is confirmed
    • additional info:
      • Note the StorageClass — it determines whether the volume can be expanded online

  2. Check actual disk usage inside the pod

    • Command / Action:
      • Exec into the pod using the volume and check disk usage
      • kubectl exec -it <pod-name> -n <namespace> -- df -h

      • kubectl exec <pod-name> -n <namespace> -- sh -c 'du -sh /<mount-path>/*'

    • Expected result:
      • The mount path shows high utilization; the du output identifies which directories are largest
    • additional info:
      • Replace /<mount-path> with the actual volume mount path from the pod spec
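
As a rough sketch of how the df output can be narrowed down, the helper below reads POSIX-format `df -P` output on standard input and prints only the mounts at or above a given usage percentage. The function name mounts_over_threshold is hypothetical (it is not part of kubectl), and the sketch assumes mount paths contain no spaces.

```shell
# Hypothetical helper: filter `df -P` output (read from stdin) down to the
# mounts whose usage is at or above a threshold percentage (default 85).
# Assumes the POSIX column layout: filesystem, blocks, used, available,
# capacity, mount point, and that mount paths contain no spaces.
mounts_over_threshold() {
  awk -v t="${1:-85}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)          # strip the trailing percent sign
    if (pct + 0 >= t + 0) print $6, $5
  }'
}
```

For example, `kubectl exec <pod-name> -n <namespace> -- df -P | mounts_over_threshold 85` would list only the mounts at or above the 85% level mentioned in this runbook.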

  3. Identify what is consuming the most space

    • Command / Action:
      • Find the largest files and directories on the volume
      • kubectl exec <pod-name> -n <namespace> -- sh -c 'du -sh /<mount-path>/*' | sort -rh | head -20

    • Expected result:
      • The top space consumers are identified (logs, data files, temp files, dumps, etc.)
    • additional info:
      • Log files and database write-ahead logs are common culprits; identify the pattern before deleting anything

  4. Clean up unnecessary files to recover space immediately

    • Command / Action:
      • Remove stale logs, temporary files, or completed dump files that are safe to delete
      • kubectl exec -it <pod-name> -n <namespace> -- find /<mount-path> -name "*.log" -mtime +7 -delete

    • Expected result:
      • Disk usage drops below the alert threshold; immediate pressure is relieved
    • additional info:
      • Only delete files you are certain are safe to remove; coordinate with the application team if unsure
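
Because deletion is irreversible, it can help to preview the matches first. The sketch below wraps the same find expression with -print instead of -delete so the candidate list can be reviewed. The function name list_stale_logs is hypothetical, and the 7-day default simply mirrors the command in this step.

```shell
# Hypothetical helper: list (but do not delete) regular files ending in .log
# under a directory that were last modified more than N days ago (default 7).
list_stale_logs() {
  find "$1" -type f -name '*.log' -mtime +"${2:-7}" -print
}
```

The function runs wherever it is defined, so it is mainly useful for trying the expression on a scratch directory. Inside the pod, the same find arguments can be placed after `kubectl exec <pod-name> -n <namespace> --`, keeping -print until the list looks right.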

  5. Check and configure log rotation or data retention

    • Command / Action:
      • Review the application’s log rotation and data retention settings to prevent recurrence
      • Check application config for log rotation (e.g., logrotate, application-level retention settings)
    • Expected result:
      • Retention policies are configured to prevent unbounded growth
    • additional info:
      • For databases, review WAL retention, vacuum settings (PostgreSQL), or purge policies

  6. Expand the PersistentVolume if the StorageClass supports it

    • Command / Action:
      • Edit the PVC to request more storage (requires allowVolumeExpansion: true in the StorageClass)
      • kubectl get storageclass <storageclass-name> -o yaml | grep allowVolumeExpansion

      • kubectl edit pvc <pvc-name> -n <namespace>

    • Expected result:
      • The PVC storage request is increased; the underlying volume expands (may require pod restart)
    • additional info:
      • After editing, monitor kubectl describe pvc <pvc-name> -n <namespace> for the resize condition
      • Some storage backends require the pod to be restarted for the filesystem resize to take effect inside the container
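
As a non-interactive alternative to kubectl edit, the storage request can be raised with kubectl patch. The sketch below only builds the JSON patch body. The function name pvc_resize_patch is hypothetical, and the size passed to it (20Gi in the example) is a placeholder that must be larger than the current request, since a PVC request can only be increased.

```shell
# Hypothetical helper: print a JSON patch body that sets the PVC's
# spec.resources.requests.storage field to the given size (e.g. 20Gi).
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}
```

It could then be applied with `kubectl patch pvc <pvc-name> -n <namespace> -p "$(pvc_resize_patch 20Gi)"`, which is convenient when the change needs to be scripted or recorded in an incident log.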

  7. Monitor the fill rate to predict when the volume will be full

    • Command / Action:
      • Query Prometheus for the available bytes predicted 4 hours from now, extrapolated from the last 6 hours of samples
      • predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim="<pvc-name>"}[6h], 4 * 3600)

    • Expected result:
      • The predicted value is positive (volume won’t fill in the next 4 hours)
    • additional info:
      • A negative result means the volume is predicted to fill within 4 hours — treat as urgent
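
To make the extrapolation concrete, the sketch below estimates the time until the volume is full from just two samples of available bytes. This is a simplification: predict_linear fits a linear regression over every sample in the range, whereas this hypothetical helper (seconds_until_full) draws a straight line through two points. It assumes the interval between the samples is greater than zero.

```shell
# Hypothetical helper: two-point linear extrapolation of available bytes.
# Arguments: available bytes at the earlier sample, available bytes now,
# and the number of seconds between the two samples (must be > 0).
# Prints the estimated seconds until zero bytes remain, or "not-filling"
# when free space is flat or growing.
seconds_until_full() {
  awk -v earlier="$1" -v now="$2" -v dt="$3" 'BEGIN {
    rate = (now - earlier) / dt      # bytes per second; negative when filling
    if (rate >= 0) { print "not-filling"; exit }
    printf "%d\n", now / (-rate)
  }'
}
```

For instance, if free space dropped from 1000 to 800 bytes over 100 seconds, the rate is 2 bytes per second and the remaining 800 bytes would last about 400 seconds.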

  8. Confirm recovery and monitor usage trend

    • Command / Action:
      • Verify disk usage is back below the threshold and stable
      • kubectl exec -it <pod-name> -n <namespace> -- df -h

    • Expected result:
      • Volume usage is below 85% and the fill rate has stabilized
    • additional info:
      • Set up a recurring check or dashboard panel to track volume usage over time and catch growth early
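
For a recurring check, the usage percentage can be derived from the same kubelet metrics used above, namely available bytes and capacity bytes (kubelet_volume_stats_capacity_bytes). The sketch below shows the arithmetic only. The function name usage_percent is hypothetical, and how the two values are fetched is left to whatever tooling is already in place.

```shell
# Hypothetical helper: convert available bytes and capacity bytes into a
# usage percentage with one decimal place. Prints "unknown" when the
# capacity is zero or negative.
usage_percent() {
  awk -v avail="$1" -v cap="$2" 'BEGIN {
    if (cap <= 0) { print "unknown"; exit }
    printf "%.1f\n", (cap - avail) * 100 / cap
  }'
}
```

A result below 85.0 corresponds to the recovery condition in this step, so the value can be compared against that threshold in a scheduled job or shown as a dashboard panel.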

Additional resources