Runbook: HighDiskWriteRate
Alert Details
- Alert Name: HighDiskWriteRate
- Expression:
irate(node_disk_written_bytes_total{nanocosmosGroup=~".+", instance=~".+", diskDeviceSelector=~".+", environment=~".+"}[2m]) / 1024 / 1024 >
Description
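This alert fires when the disk write rate exceeds the threshold configured in the alert rule. The expression uses irate(), which computes a per-second rate from the two most recent samples inside a 2-minute lookback window, so it reacts to short bursts rather than to a 2-minute average. Dividing by 1024 twice converts bytes per second to megabytes per second (MB/s, where 1 MB = 1024 × 1024 bytes).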
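To make the arithmetic concrete, here is a minimal sketch of the same calculation done by hand on two counter samples. The function name write_rate_mib is hypothetical, and the sketch ignores counter resets, which irate() itself handles.

```shell
# write_rate_mib <bytes_prev> <time_prev> <bytes_last> <time_last>
# Prints the per-second write rate in MB/s (1024 * 1024 bytes) between two
# samples of a byte counter. Timestamps are in seconds.
# Returns non-zero if the timestamps do not increase.
write_rate_mib() {
    awk -v b1="$1" -v t1="$2" -v b2="$3" -v t2="$4" 'BEGIN {
        if (t2 <= t1) exit 1
        printf "%.2f\n", (b2 - b1) / (t2 - t1) / 1024 / 1024
    }'
}

# Example: the counter grew by 157286400 bytes in 15 seconds.
# write_rate_mib 1000000000 0 1157286400 15    # prints 10.00
```

Comparing this value against the alert threshold shows how far over the limit a given pair of samples is.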
Possible Causes
- High write activity from applications or services
- Log files being written excessively
- Backup or data replication processes
- Misconfigured applications causing excessive writes
- Disk-intensive operations such as database transactions
Troubleshooting Steps
1. Check Current Disk Write Rate
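Use the following command to check the current disk write rate on the affected instance. The -m flag reports throughput in MB/s so the numbers are comparable with the alert expression. The first report shows averages since boot, so judge the current rate from the later samples:
iostat -dxm 1 10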
Expected Output
You should see output similar to this. The wMB/s column is the write rate per device:
Device:  rrqm/s  wrqm/s  r/s   w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  10.00  0.00   1.00   200.00    0.50      5.00   0.00     5.00     0.50   5.00
...
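When many devices are listed, a small filter can pick out the ones above a limit. The following is a sketch only: flag_busy_writers is a hypothetical helper, the limit of 50 MB/s in the usage line is an arbitrary example value, and the filter assumes the output contains a wMB/s column (that is, iostat was run with -m). It finds that column by name because column positions differ between sysstat versions.

```shell
# flag_busy_writers <limit_mb_per_s>
# Reads `iostat -dxm` output on stdin and prints "<device> <wMB/s>" for each
# device row whose write rate is above the limit.
flag_busy_writers() {
    awk -v limit="$1" '
        # Header row: remember which column holds wMB/s.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "wMB/s") col = i
            next
        }
        # Data row: print the device if its write rate exceeds the limit.
        col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
    '
}

# Usage (hypothetical limit):
# iostat -dxm 1 10 | flag_busy_writers 50
```

Because the first iostat report covers the period since boot, a device that only appears in the first block of output is not necessarily busy right now.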
2. Identify Processes with High Disk Write Activity
To identify processes causing high disk write activity, use:
sudo iotop -o
Expected Output
This command lists processes by disk I/O usage:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1.00 MB/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1234 be/4 root 0.00 B/s 1.00 MB/s 0.00 % 0.00 % myapp
...
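If iotop is not installed, the kernel's per-process I/O counters in /proc/<pid>/io give a rough alternative. The sketch below is an assumption-laden fallback, not a replacement for iotop: top_writers is a hypothetical helper, write_bytes is cumulative since each process started (not a rate), and reading other users' counters normally requires root.

```shell
# top_writers [proc_root] [count]
# Prints "<write_bytes> <pid> <command>" for the processes with the largest
# cumulative write_bytes counter, biggest first.
# proc_root defaults to /proc; count defaults to 5.
top_writers() {
    proc_root="${1:-/proc}"
    count="${2:-5}"
    for io_file in "$proc_root"/[0-9]*/io; do
        # Skip processes that exited or whose counters we may not read.
        [ -r "$io_file" ] || continue
        pid_dir="${io_file%/io}"
        pid="${pid_dir##*/}"
        bytes=$(awk '$1 == "write_bytes:" { print $2 }' "$io_file" 2>/dev/null)
        [ -n "$bytes" ] || continue
        name=$(cat "$pid_dir/comm" 2>/dev/null || echo "?")
        printf '%s %s %s\n' "$bytes" "$pid" "$name"
    done | sort -rn | head -n "$count"
}

# Usage:
# sudo sh -c '. ./top_writers.sh; top_writers /proc 10'
```

Running it twice a few seconds apart and comparing the counters gives an approximate per-process write rate.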
3. Check Log Files
If log files are driving the writes, rotate them or lower the logging level at the source. As a stopgap, you can truncate a specific log file. Truncation discards the file's contents, so copy anything you still need first:
sudo truncate -s 0 /var/log/large_log_file.log
Expected Output
After truncation, the following command should report a size of 0 bytes:
ls -lh /var/log/large_log_file.log
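Before truncating anything, it helps to know which files are actually large or actively growing. The helpers below are a sketch: large_files and recently_written are hypothetical names, the size and age values in the usage lines are arbitrary examples, and the M size suffix and -mmin test assume GNU or BSD find.

```shell
# large_files <dir> <min_size_mb>
# Lists regular files under <dir> that are larger than <min_size_mb> MB.
large_files() {
    find "$1" -type f -size +"$2"M -print 2>/dev/null
}

# recently_written <dir> <minutes>
# Lists regular files under <dir> modified within the last <minutes> minutes.
recently_written() {
    find "$1" -type f -mmin -"$2" -print 2>/dev/null
}

# Usage (example values):
# large_files /var/log 100
# recently_written /var/log 5
```

A file that appears in both lists is a likely candidate for rotation or a lower log level.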
4. Adjust Application Configurations
Review the configuration of the applications identified in step 2. Typical changes include lowering log verbosity (for example, from DEBUG to INFO) and batching many small writes into fewer, larger ones.
5. Check for Backup or Replication Processes
Check whether a backup or data replication job was running when the alert fired. If so, move it to an off-peak schedule or adjust its configuration to limit how much it writes at once.
6. Monitor Disk Health
Check that the disk itself is healthy, since a failing disk can make normal write loads look abnormal. Use smartctl (from the smartmontools package), replacing /dev/sda with the affected device:
sudo smartctl -a /dev/sda
Expected Output
You should see detailed health information about the disk:
SMART overall-health self-assessment test result: PASSED
...
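For scripted checks, the health line can be extracted and turned into an exit status. This is a sketch that assumes the ATA-style result line shown above; other device types may word the line differently, and smart_health is a hypothetical helper name.

```shell
# smart_health
# Reads smartctl output on stdin, prints the overall-health result, and
# returns 0 only if that result is exactly PASSED.
smart_health() {
    awk -F': *' '
        /SMART overall-health self-assessment test result/ {
            print $2
            found = 1
            ok = ($2 == "PASSED")
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}

# Usage:
# sudo smartctl -H /dev/sda | smart_health || echo "disk needs attention"
```

A non-zero status means either that the result was not PASSED or that no result line was found, so inspect the full smartctl output in both cases.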
Additional Steps
1. Monitor Disk Write Rates
Continuously monitor disk write rates to ensure they remain within acceptable limits. Use tools such as Prometheus and Grafana to set up dashboards and alerts.
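For an ad hoc check without a dashboard, the same metric can be queried through the Prometheus HTTP API. This is a sketch under assumptions: both helper names are hypothetical, the server URL and instance value in the usage line are placeholders, and the query keeps only the instance label from the alert expression.

```shell
# write_rate_query <instance>
# Prints a PromQL expression for the per-device write rate in MB/s on one
# instance, mirroring the alert expression without its threshold.
write_rate_query() {
    printf 'irate(node_disk_written_bytes_total{instance="%s"}[2m]) / 1024 / 1024\n' "$1"
}

# query_prometheus <base_url> <promql>
# Sends an instant query to the Prometheus HTTP API and prints the raw JSON.
query_prometheus() {
    curl -sS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage (placeholder URL and instance):
# query_prometheus http://localhost:9090 "$(write_rate_query node1:9100)"
```

The response is JSON with one result per disk device, which can be compared against the alert threshold by eye or piped into a JSON tool.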
2. Optimize Disk Usage
Consider optimizing disk usage by using faster disks (e.g., SSDs), implementing caching mechanisms, or distributing disk I/O across multiple disks.
By following these steps, you should be able to troubleshoot and resolve the “HighDiskWriteRate” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.