Runbook: RDSDiskSpaceLimit Alert

Alert Details

  • Alert Name: RDSDiskSpaceLimit
  • Expression: rds_free_storage_bytes{dbidentifier!~".*copy.*", %(clusterSelector)s} /1024 /1024 /1024 < 15

Description

Alert is triggered when RDS instance is low on storage.

Possible Causes

  • accumulation of WAL files
  • accumulation of logs

Troubleshooting Steps

1. Determine whether it’s a long-term growth trend requiring storage increase or abnormal disk usage reflecting another problem:

  • Check database size growth over the last 24 hours to identify abnormal disk usage growth Look at Storage usage panel in RDS instance details Grafana dashboard

  • Check if there are long-running transactions (or queries). A long running transaction (or query) forces PostgreSQL to maintain visibility of all rows since the beginning of the transaction. So PostgreSQL accumulate WAL files until the transaction is close. SELECT pid, leader_pid, datname as database, usename AS user, application_name, EXTRACT(EPOCH FROM now() - xact_start) as transaction_duration, CASE WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start) ELSE null END query_duration, state, wait_event_type, wait_event, CASE WHEN state = 'active' THEN query ELSE null END query, xact_start as transaction_start, CASE WHEN state = 'active' THEN query_start ELSE null END query_start, CASE WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start) ELSE null END last_query_duration, CASE WHEN state != 'active' THEN query ELSE null END last_query, pg_blocking_pids(pid) as blocked_by, client_addr, backend_start, ceil(EXTRACT(EPOCH FROM now() - backend_start)) as backend_duration FROM pg_stat_activity WHERE query_start is not null AND usename!='rdsrepladmin' AND query not like 'START_REPLICATION %' AND pid != pg_backend_pid();

Look at Postgresql live activity Grafana dashboard to find long-running query

  • Check the status of PostgreSQL replication slots. PostgreSQL keeps WAL files on its disk until the replication slot client acknowledges they consumed it.

  • Check log file sizes

2. Fix the system that blocks PostgreSQL to recycle its WAL files

  • If long-running transactions/queries: Cancel or kill the transactions
  • If non-running replication slot: Delete replication slot

3. Increase RDS disk space

  • Set AWS_PROFILE

export AWS_PROFILE=<AWS account>

  • Determine the minimum storage for the increase 💡 RDS requires a minimal storage increase of 10%

INSTANCE_IDENTIFIER=<replace with the RDS instance identifier> aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \ | jq -r '{"Current IOPS": .DBInstances[0].Iops, "Current Storage Limit": .DBInstances[0].AllocatedStorage, "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}'

  • Increase storage:

NEW_ALLOCATED_STORAGE=<replace with new allocated storage in GB> aws rds modify-db-instance --db-instance-identifier ${INSTANCE_IDENTIFIER} --allocated-storage ${NEW_ALLOCATED_STORAGE} --apply-immediately \ | jq .DBInstance.PendingModifiedValues

❗ If the RDS instance has replicas instances (replica or reporting), you must repeat the operation for all replicas to keep the same configuration between instances

  • Backport changes in Terraform