Runbook: HighDiskWriteRate

Alert Details

  • Alert Name: HighDiskWriteRate
  • Expression: irate(node_disk_written_bytes_total{nanocosmosGroup=~".+", instance=~".+", diskDeviceSelector=~".+", environment=~".+"}[2m]) / 1024 / 1024 >

Description

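This alert fires when the write rate of a disk exceeds the configured threshold. The expression applies irate() to the node_disk_written_bytes_total counter with a 2-minute lookback window, which yields the per-second rate between the two most recent samples in that window, then divides by 1024 twice to convert bytes per second to megabytes per second (MB/s, strictly MiB/s).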
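
The unit conversion in the expression can be reproduced by hand from two readings of a byte counter. The following is a minimal sketch of that arithmetic only; the function name write_rate_mib and the sample values are made up for illustration and are not part of the alert rule.

```shell
# Convert two samples of a cumulative byte counter into a rate in MiB/s.
# Arguments: first sample (bytes), second sample (bytes), seconds between samples.
# write_rate_mib is a hypothetical helper name used only in this runbook sketch.
write_rate_mib() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / s / 1024 / 1024 }'
}

# Illustrative values: 120 MiB (125829120 bytes) written in 60 seconds.
write_rate_mib 0 125829120 60   # prints 2.00
```

Comparing a value computed this way with the alert threshold shows how far above the limit the instance is.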

Possible Causes

  • High write activity from applications or services
  • Log files being written excessively
  • Backup or data replication processes
  • Misconfigured applications causing excessive writes
  • Disk-intensive operations such as database transactions

Troubleshooting Steps

1. Check Current Disk Write Rate

Use the following command to check the current disk write rate on the affected instance:

iostat -dxm 1 10

Expected Output

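You should see output similar to the following. The wMB/s column shows the write rate per device; the first report covers the period since boot, so judge current activity from the later reports: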

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   10.00     0.00     1.00   200.00     0.50    5.00    0.00    5.00   0.50   5.00
...
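
When many devices are listed, it can help to filter the output down to the devices above a given write rate. The sketch below assumes the output contains a wMB/s column (as in the sample above) and finds that column by its header name, since column order differs between sysstat versions. The function name high_write_devices and the limit of 50 in the usage comment are hypothetical.

```shell
# Print "device wMB/s" for every device whose wMB/s exceeds a limit.
# Argument: limit in MB/s. Input: iostat extended output on stdin.
# high_write_devices is a hypothetical helper name.
high_write_devices() {
  awk -v limit="$1" '
    /^Device/ {
      # Locate the wMB/s column by name in each header line.
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "wMB/s") col = i
      next
    }
    # Skip blank or short lines; compare numerically against the limit.
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# Usage (the limit of 50 MB/s is an assumed example value):
#   iostat -dxm 1 10 | high_write_devices 50
```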

2. Identify Processes with High Disk Write Activity

To identify processes causing high disk write activity, use:

sudo iotop -o

Expected Output

With the -o option, iotop shows only the processes or threads that are actually performing I/O:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 1.00 MB/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
1234 be/4 root        0.00 B/s    1.00 MB/s  0.00 %     0.00 % myapp
...
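
If iotop is not installed, the kernel's per-process I/O counters in /proc/<pid>/io offer a rough alternative. The sketch below reads the write_bytes field twice and prints the average bytes written per second over the interval. The function names are hypothetical, the PID 1234 is taken from the sample output above, and reading another user's /proc/<pid>/io normally requires root.

```shell
# Extract the write_bytes counter from /proc/<pid>/io content on stdin.
# io_write_bytes and pid_write_rate are hypothetical helper names.
io_write_bytes() {
  awk '$1 == "write_bytes:" { print $2 }'
}

# Print the average bytes written per second by a PID over an interval.
# Arguments: PID, interval in seconds (defaults to 5).
pid_write_rate() {
  pid="$1"; interval="${2:-5}"
  before=$(io_write_bytes < "/proc/$pid/io") || return 1
  sleep "$interval"
  after=$(io_write_bytes < "/proc/$pid/io") || return 1
  echo $(( (after - before) / interval ))
}

# Usage (PID from the iotop sample above):
#   sudo sh -c '. ./this_script.sh; pid_write_rate 1234 5'
```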

3. Check Log Files

If log files are causing high disk writes, rotate them or clean them up. For example, to empty a specific log file without deleting it (the contents are discarded, so archive anything you still need first):

sudo truncate -s 0 /var/log/large_log_file.log

Expected Output

Confirm that the log file size is now 0 bytes:

ls -lh /var/log/large_log_file.log
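
To find out which log files are actually being written heavily, one option is to list files that are both large and recently modified. This is a sketch under assumptions: the function name recent_large_files is hypothetical, the values in the usage comment are examples, and the -mmin test and the M size suffix rely on GNU or BSD find rather than strict POSIX.

```shell
# List files under a directory that were modified within the last N minutes
# and are larger than a given size (find -size syntax, for example 100M).
# recent_large_files is a hypothetical helper name.
recent_large_files() {
  dir="$1"; minutes="$2"; size="$3"
  find "$dir" -type f -mmin "-$minutes" -size "+$size" -print
}

# Usage (directory, window, and size are example values):
#   sudo sh -c '. ./this_script.sh; recent_large_files /var/log 10 100M'
```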

4. Adjust Application Configurations

Review the configuration of the applications causing high disk writes. For example, lower the logging verbosity, or batch and buffer write operations so that fewer, larger writes reach the disk.

5. Check for Backup or Replication Processes

Check whether a backup or data replication job was running when the alert fired. If so, consider moving it to off-peak hours or limiting its I/O rate.

6. Monitor Disk Health

Check the health of the disk to ensure it is functioning properly. Use smartctl to check disk health:

sudo smartctl -a /dev/sda

Expected Output

You should see detailed health information about the disk:

SMART overall-health self-assessment test result: PASSED
...
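
For scripted checks, the overall health verdict can be extracted from the smartctl output. The sketch below matches only the result line shown in the sample above; other device types may word this line differently, so treat a missing match as "unknown" rather than "healthy". The function name smart_health is hypothetical.

```shell
# Print the overall SMART health verdict (for example PASSED) from smartctl
# output on stdin. Exits non-zero if the expected line is not present.
# smart_health is a hypothetical helper name.
smart_health() {
  awk -F': *' '
    /SMART overall-health self-assessment test result/ { print $2; found = 1 }
    END { exit !found }
  '
}

# Usage:
#   sudo smartctl -a /dev/sda | smart_health
```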

Additional Steps

1. Monitor Disk Write Rates

Continuously monitor disk write rates to ensure they remain within acceptable limits. Use tools such as Prometheus and Grafana to set up dashboards and alerts.

2. Optimize Disk Usage

Consider optimizing disk usage by using faster disks (e.g., SSDs), implementing caching mechanisms, or distributing disk I/O across multiple disks.

By following these steps, you should be able to troubleshoot and resolve the “HighDiskWriteRate” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.