Runbook: HighDiskReadRate

Alert Details

Alert Name: HighDiskReadRate
Expression: irate(node_disk_read_bytes_total{nanocosmosGroup=~".+", instance=~".+", diskDeviceSelector=~".+", environment=~".+"}[2m]) / 1024 / 1024 >

Description

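This alert fires when the per-device disk read rate exceeds the configured threshold. The expression applies irate() to node_disk_read_bytes_total, which calculates the per-second rate from the two most recent samples within a 2-minute lookback window, and then divides by 1024 twice to express the result in megabytes per second (MB/s, with 1 MB = 1024 × 1024 bytes).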
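
To make the arithmetic in the expression concrete, the following shell sketch computes the same kind of value from two counter samples. The function name read_rate_mb and its arguments are illustrative and not part of the alert definition; in practice the two values would be consecutive readings of node_disk_read_bytes_total.

```shell
# read_rate_mb PREV_BYTES CURR_BYTES SECONDS
# Per-second increase between two counter samples, divided by 1024 twice,
# mirroring the "/ 1024 / 1024" in the alert expression.
# Assumption: the counter did not reset between the two samples.
read_rate_mb() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (curr - prev) / secs / 1024 / 1024 }'
}

# Example: the counter grew by 300 MB over 15 seconds.
read_rate_mb 1048576000 1363148800 15   # prints 20.00
```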

Possible Causes

High read activity from applications or services
Backup or data replication processes
Misconfigured applications causing excessive reads
Disk-intensive operations such as database queries
Caching mechanisms not working efficiently

Troubleshooting Steps

1. Check Current Disk Read Rate

Use the following command to check the current disk read rate on the affected instance. The -m flag reports throughput in MB/s, and "1 10" takes ten samples at one-second intervals:

iostat -dxm 1 10

Expected Output

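Output similar to the following shows per-device read rates in the rMB/s column. The first report covers averages since boot, so read the later samples for the current rate: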

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   20.00    0.00     2.00     0.00   200.00     0.50    5.00    5.00    0.00   0.50  10.00
...
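
When many devices are present, it can help to filter the samples down to the busy ones. The sketch below assumes the iostat output contains an rMB/s column (that is, it was produced with -m); the function name busy_readers and the threshold argument are illustrative.

```shell
# busy_readers THRESHOLD
# Reads iostat extended output on stdin and prints "device rMB/s" for every
# sample whose read throughput exceeds THRESHOLD (in MB/s).
busy_readers() {
  awk -v limit="$1" '
    # Header row: find the rMB/s column, since its position differs
    # between sysstat versions.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "rMB/s") col = i
      next
    }
    # Data row: print it only when a header has been seen and the value is high.
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

For example, iostat -dxm 1 10 | busy_readers 50 would print only the samples above 50 MB/s; the value 50 is an arbitrary example, not the alert threshold.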

2. Identify Processes with High Disk Read Activity

To identify processes causing high disk read activity, use:

sudo iotop -o

Expected Output

With -o, iotop shows only the processes or threads that are currently performing I/O:

Total DISK READ: 2.00 MB/s | Total DISK WRITE: 0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
1234 be/4 root        2.00 MB/s    0.00 B/s  0.00 %     0.00 % myapp
...
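
If iotop is not installed, a rough alternative is to rank processes by the read_bytes counter in /proc/<pid>/io. This is a sketch under a few assumptions: the function name top_readers is illustrative, the counter is cumulative since each process started (not a rate), and reading other users' io files generally requires root.

```shell
# top_readers [PROC_ROOT] [COUNT]
# Prints "read_bytes pid command" for the COUNT processes that have read the
# most bytes from storage, highest first. PROC_ROOT defaults to /proc.
top_readers() {
  proc_root="${1:-/proc}"
  count="${2:-5}"
  for io_file in "$proc_root"/[0-9]*/io; do
    # Skip processes whose io file cannot be read (permissions, or exited).
    [ -r "$io_file" ] || continue
    pid_dir="${io_file%/io}"
    pid="${pid_dir##*/}"
    bytes=$(awk '$1 == "read_bytes:" { print $2 }' "$io_file" 2>/dev/null)
    name=$(cat "$pid_dir/comm" 2>/dev/null)
    if [ -n "$bytes" ]; then
      printf '%s %s %s\n' "$bytes" "$pid" "${name:-unknown}"
    fi
  done | sort -rn | head -n "$count"
}
```

Running sudo sh -c with this function, or calling it twice a few seconds apart and comparing the numbers, gives an approximate picture of which process is responsible for the reads.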

3. Check Log Files

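Very large log files that are repeatedly scanned (for example by log shippers or search tools) can contribute to high disk reads. If that is the case, rotate them or clean them up. For example, to empty a specific log file (this permanently discards its contents, so archive the file first if it may still be needed):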

sudo truncate -s 0 /var/log/large_log_file.log

Expected Output

After truncation, confirm that the file size is 0 bytes:

ls -lh /var/log/large_log_file.log
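
Because truncation is irreversible, a slightly safer variant keeps a compressed copy before emptying the file. This is a minimal sketch; the function name archive_and_truncate and the archive naming scheme are illustrative, and any lines written between the copy and the truncation are lost.

```shell
# archive_and_truncate LOG_FILE [ARCHIVE_PATH]
# Writes a gzip copy of LOG_FILE, then empties the original in place.
# ARCHIVE_PATH defaults to LOG_FILE plus a timestamp and ".gz".
archive_and_truncate() {
  log_file="$1"
  archive="${2:-$log_file.$(date +%Y%m%d%H%M%S).gz}"
  # Stop without touching the log if the archive could not be written.
  gzip -c "$log_file" > "$archive" || return 1
  # ": >" empties the file, with the same effect as "truncate -s 0".
  : > "$log_file"
}
```

For example: sudo sh -c with this function defined, called as archive_and_truncate /var/log/large_log_file.log.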

4. Adjust Application Configurations

Review and adjust configurations of applications causing high disk reads. For example, optimize query performance or adjust caching mechanisms.

5. Check for Backup or Replication Processes

Ensure that backup or data replication processes are not running excessively. Adjust their schedules or configurations if necessary.
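
One way to see whether such a job is running right now, and for how long, is to filter the process list by command name. The sketch below is illustrative: the function name backup_like_processes is hypothetical, and the default tool names are examples only and should be replaced with the backup and replication tools actually used on the host.

```shell
# backup_like_processes [PATTERN]
# Reads `ps -eo pid,etime,comm` output on stdin and prints the rows whose
# command name (third column) matches PATTERN, an extended regular expression.
# Assumption: the default names are placeholders, not a complete list.
backup_like_processes() {
  pattern="${1:-rsync|pg_dump|mysqldump|restic|borg}"
  # NR > 1 skips the ps header row; the anchors force a whole-name match.
  awk -v pat="^($pattern)\$" 'NR > 1 && $3 ~ pat'
}
```

Usage: ps -eo pid,etime,comm | backup_like_processes. The ELAPSED column shows how long each matching process has been running, which helps spot jobs that overrun their window.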

6. Monitor Disk Health

Check that the disk itself is healthy. Use smartctl to query its SMART data, replacing /dev/sda with the affected device:

sudo smartctl -a /dev/sda

Expected Output

You should see detailed health information about the disk:

SMART overall-health self-assessment test result: PASSED
...
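
For scripted checks, the overall result can be extracted from the smartctl output. This is a sketch, assuming the output contains either the ATA/NVMe style line shown above or the "SMART Health Status: OK" line that SCSI/SAS devices typically print; the function name smart_passed is illustrative.

```shell
# smart_passed
# Reads smartctl output on stdin. Returns 0 if the overall health result is
# reported as passed, and non-zero otherwise (including when no result line
# is present at all).
smart_passed() {
  grep -Eq 'self-assessment test result: PASSED|SMART Health Status: OK'
}
```

Usage: sudo smartctl -H /dev/sda | smart_passed && echo "healthy" || echo "check disk". A passing result does not rule out problems, so reviewing the attribute table from -a is still worthwhile.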

Additional Steps

1. Monitor Disk Read Rates

Continuously monitor disk read rates to ensure they remain within acceptable limits. Use tools such as Prometheus and Grafana to set up dashboards and alerts.
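
For ad hoc checks outside a dashboard, the same metric can be queried through the Prometheus HTTP API. The sketch below reuses the metric and scaling from the alert expression; the function names and the single instance label filter are illustrative, and the base URL must be supplied by the caller.

```shell
# read_rate_query INSTANCE
# Prints a PromQL expression for the per-device read rate of one instance,
# using the same metric and scaling as the alert expression.
read_rate_query() {
  printf 'irate(node_disk_read_bytes_total{instance="%s"}[2m]) / 1024 / 1024\n' "$1"
}

# query_prometheus BASE_URL QUERY
# Sends an instant query to the Prometheus HTTP API (/api/v1/query) and
# prints the JSON response.
query_prometheus() {
  curl -sS -G --data-urlencode "query=$2" "$1/api/v1/query"
}
```

Usage, with a placeholder URL and instance name: query_prometheus http://prometheus.example.internal:9090 "$(read_rate_query 'node1:9100')".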

2. Optimize Disk Usage

Consider optimizing disk usage by using faster disks (e.g., SSDs), implementing caching mechanisms, or distributing disk I/O across multiple disks.
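
Before deciding on faster disks, it can be useful to confirm what kind of device is currently in use. The sketch below reads the kernel's queue/rotational flag from sysfs; the function name disk_kind is illustrative, and virtual or RAID-backed devices may report a value that does not reflect the underlying hardware.

```shell
# disk_kind DEVICE [SYS_ROOT]
# Prints "rotational" for spinning disks and "non-rotational" for SSD/NVMe,
# based on /sys/block/DEVICE/queue/rotational (1 = rotational, 0 = not).
# SYS_ROOT defaults to /sys.
disk_kind() {
  flag_file="${2:-/sys}/block/$1/queue/rotational"
  [ -r "$flag_file" ] || { echo "unknown"; return 1; }
  case "$(cat "$flag_file")" in
    1) echo "rotational" ;;
    0) echo "non-rotational" ;;
    *) echo "unknown" ;;
  esac
}
```

Usage: disk_kind sda.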

By following these steps, you should be able to troubleshoot and resolve the “HighDiskReadRate” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.
