Runbook: HighDiskReadRate
Alert Details
- Alert Name: HighDiskReadRate
- Expression:
irate(node_disk_read_bytes_total{nanocosmosGroup=~".+", instance=~".+", diskDeviceSelector=~".+", environment=~".+"}[2m]) / 1024 / 1024 >
Description
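This alert fires when the disk read rate exceeds the threshold configured in the alert rule. The expression uses irate over a 2-minute range, so it reflects the rate between the two most recent samples in that window rather than a 2-minute average. Dividing by 1024 twice converts bytes per second to MiB/s (commonly shown as MB/s).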
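The arithmetic behind the expression can be sketched as a small shell function. This is only an illustration: the function name read_rate_mib and the sample values are assumptions, and unlike irate it does not handle counter resets.

```shell
# Convert two counter samples into MiB/s, mirroring irate(...) / 1024 / 1024.
# Arguments: earlier counter value (bytes), later counter value (bytes),
# and the number of seconds between the two samples.
# read_rate_mib is a hypothetical helper name, not part of any tool.
read_rate_mib() {
  awk -v v1="$1" -v v2="$2" -v dt="$3" \
    'BEGIN { printf "%.2f\n", (v2 - v1) / dt / 1024 / 1024 }'
}
```

For example, assuming a 15-second scrape interval, read_rate_mib 0 31457280 15 prints 2.00, because 30 MiB read in 15 seconds is 2 MiB/s.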
Possible Causes
- High read activity from applications or services
- Backup or data replication processes
- Misconfigured applications causing excessive reads
- Disk-intensive operations such as database queries
- Ineffective caching (for example, a working set larger than the available page cache), which forces repeated reads from disk
Troubleshooting Steps
1. Check Current Disk Read Rate
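Use the following command to check the current disk read rate on the affected instance. The -m flag reports throughput in MB/s, and the first report shows averages since boot, so judge by the later reports:
iostat -dxm 1 10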
Expected Output
You should see output similar to the following, showing disk read rates in the rMB/s column (column names and order vary between sysstat versions):
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 20.00 0.00 2.00 0.00 200.00 0.50 5.00 5.00 0.00 0.50 10.00
...
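To pick out only the busy devices from a long iostat run, the output can be filtered by the rMB/s column. The following is a minimal sketch; busy_readers is a hypothetical helper name, and it assumes iostat was run with -m so that an rMB/s column exists.

```shell
# Print devices whose rMB/s exceeds a limit, reading `iostat -dxm` output on stdin.
# The rMB/s column is located by name in the header row, because its position
# differs between sysstat versions.
# busy_readers is a hypothetical helper name.
busy_readers() {
  awk -v limit="$1" '
    /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "rMB/s") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

Example usage, with 50 standing in for whatever limit matters in your environment: LC_ALL=C iostat -dxm 1 10 | busy_readers 50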
2. Identify Processes with High Disk Read Activity
To identify processes causing high disk read activity, use:
sudo iotop -o
Expected Output
With -o, iotop shows only the processes or threads that are currently performing I/O:
Total DISK READ: 2.00 MB/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1234 be/4 root 2.00 MB/s 0.00 B/s 0.00 % 0.00 % myapp
...
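If iotop is not installed, the read_bytes counter in /proc/<pid>/io gives a rough alternative. The sketch below ranks processes by that counter; top_readers is a hypothetical helper name. Note that the counter is cumulative since each process started, not a rate, and that reading other users' entries typically requires root.

```shell
# List the top N processes by cumulative bytes read from storage,
# using the read_bytes field of /proc/<pid>/io.
# Arguments: N (default 5), proc root (default /proc; a parameter only so the
# function can be tried against sample data).
# Output lines: <read_bytes> <pid> <command name>
# top_readers is a hypothetical helper name.
top_readers() (
  n="${1:-5}"
  proc_root="${2:-/proc}"
  for io in "$proc_root"/[0-9]*/io; do
    [ -r "$io" ] || continue
    pid=${io%/io}
    pid=${pid##*/}
    bytes=$(awk '$1 == "read_bytes:" { print $2 }' "$io" 2>/dev/null)
    comm=$(cat "$proc_root/$pid/comm" 2>/dev/null)
    [ -n "$bytes" ] && printf '%s %s %s\n' "$bytes" "$pid" "${comm:-?}"
  done | sort -rn | head -n "$n"
)
```

Example usage: sudo sh -c '. ./top_readers.sh; top_readers 10' (the file name is an assumption). Running it twice a few seconds apart and comparing the counters shows which processes are reading right now.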
3. Check Log Files
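If log files are causing high disk reads (for example, a log shipper or search tool repeatedly scanning a very large file), consider rotating them or cleaning them up. Truncating a file discards its contents, so archive anything you still need first. For example, to clear a specific log file: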
sudo truncate -s 0 /var/log/large_log_file.log
Expected Output
The log file size should be reduced to 0 bytes. Verify with:
ls -lh /var/log/large_log_file.log
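Before truncating anything, it helps to know which log files are actually large. The following is a minimal sketch using find; large_logs is a hypothetical helper name, and the 100 MiB default is an arbitrary assumption.

```shell
# Print regular files under a directory that are larger than a given size.
# Arguments: directory (default /var/log), minimum size in MiB (default 100).
# large_logs is a hypothetical helper name.
large_logs() (
  dir="${1:-/var/log}"
  min_mib="${2:-100}"
  find "$dir" -type f -size +"${min_mib}"M -print 2>/dev/null
)
```

Example usage: sudo sh -c '. ./large_logs.sh; large_logs /var/log 500' (the file name is an assumption). A large file is only a candidate; confirm with step 2 that some process is really reading it.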
4. Adjust Application Configurations
Review the configuration of the applications identified in step 2. For example, add indexes so that database queries avoid full table scans, or increase cache sizes so that frequently read data is served from memory instead of disk.
5. Check for Backup or Replication Processes
Check whether backup or data replication jobs coincide with the alert, overlap with each other, or run more often than needed. Move them to off-peak hours or throttle them if necessary.
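A quick way to see whether such a job is running right now is to filter the process list for typical tool names. This is a sketch only: backup_procs is a hypothetical helper name, and the list of tools is an illustrative assumption that should be adapted to what your environment actually runs.

```shell
# Filter a process listing (one "<pid> <command name>" pair per line on stdin)
# for common backup and replication tools.
# The tool list is an assumption for illustration; adjust it as needed.
# backup_procs is a hypothetical helper name.
backup_procs() {
  awk '$2 ~ /^(rsync|tar|dd|pg_dump|mysqldump|restic|borg)$/ { print $1, $2 }'
}
```

Example usage: ps -eo pid=,comm= | backup_procs. If a match turns out to be the cause, ionice -c 3 -p <pid> moves it to the idle I/O class, which only takes effect with I/O schedulers that support priorities (such as BFQ).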
6. Monitor Disk Health
Check the health of the disk to rule out a failing device. Use smartctl (from the smartmontools package) to check disk health:
sudo smartctl -a /dev/sda
Expected Output
You should see detailed health information about the disk:
SMART overall-health self-assessment test result: PASSED
...
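When checking many hosts, the overall result can be turned into an exit status. The sketch below reads smartctl output on stdin; smart_health_ok is a hypothetical helper name, and it assumes the two result lines smartctl prints for ATA/NVMe devices ("test result: PASSED") and SCSI devices ("SMART Health Status: OK").

```shell
# Succeed (exit 0) if smartctl output on stdin reports a healthy disk,
# fail (non-zero) otherwise, including when no result line is present.
# smart_health_ok is a hypothetical helper name.
smart_health_ok() {
  grep -Eq 'self-assessment test result: PASSED|SMART Health Status: OK'
}
```

Example usage: sudo smartctl -H /dev/sda | smart_health_ok || echo "disk needs attention". A PASSED result does not rule out problems, so still review the attribute table from smartctl -a for reallocated or pending sectors.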
Additional Steps
1. Monitor Disk Read Rates
Continuously monitor disk read rates to ensure they remain within acceptable limits. Use tools such as Prometheus and Grafana to set up dashboards and alerts.
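For ad hoc checks outside a dashboard, the same metric can be queried through the Prometheus HTTP API. The sketch below only builds the query string; read_rate_query is a hypothetical helper name, and it assumes the instance label from the alert expression is enough to select the host.

```shell
# Print a PromQL query for the disk read rate of one instance, in MiB/s.
# Argument: the value of the instance label (for example host1:9100).
# read_rate_query is a hypothetical helper name.
read_rate_query() {
  printf 'irate(node_disk_read_bytes_total{instance="%s"}[2m]) / 1024 / 1024\n' "$1"
}
```

Example usage, where PROM_URL is an assumed variable holding the base URL of your Prometheus server: curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(read_rate_query host1:9100)". For dashboards, rate over a longer range is usually smoother than irate.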
2. Optimize Disk Usage
Consider optimizing disk usage by using faster disks (e.g., SSDs), implementing caching mechanisms, or distributing disk I/O across multiple disks.
By following these steps, you should be able to troubleshoot and resolve the “HighDiskReadRate” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.