Runbook: NodeRAIDDegraded

Alert Details

  • Alert Name: NodeRAIDDegraded
  • Expression: node_md_disks_required{nanocosmosGroup=~".+", instance=~".+", diskRaidDeviceSelector=~".+", environment=~".+"} - ignoring (state) (node_md_disks{nanocosGroup=~".+", instance=~".+", state="active", diskRaidDeviceSelector=~".+", environment=~".+"}) >

Description

This alert fires when a RAID array is degraded, that is, when the array has fewer active disks than it requires. A degraded array has lost one or more members to failure or malfunction, which reduces redundancy and can hurt performance; a further disk failure may cause data loss.
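To make the arithmetic behind the expression concrete, here is a minimal sketch that computes "required minus active" per array from node_exporter text-format metrics. It assumes each series carries a device="mdN" label and no trailing timestamp; the function name md_disk_deficit is hypothetical, and the custom labels in the alert expression are ignored.

```shell
# Sketch: print "<device> <required - active>" for each md array.
# Reads node_exporter text-format metrics on stdin.
# Assumption: every series has a device="mdN" label and the sample
# value is the last whitespace-separated field on the line.
md_disk_deficit() {
    awk '
        # Extract the value of the device label from a series line.
        function device(line,    s) {
            if (match(line, /device="[^"]*"/) == 0) return ""
            s = substr(line, RSTART, RLENGTH)
            sub(/^device="/, "", s)
            sub(/"$/, "", s)
            return s
        }
        /^node_md_disks_required[{]/ { required[device($0)] = $NF }
        /^node_md_disks[{]/ && /state="active"/ { active[device($0)] = $NF }
        END {
            for (d in required)
                printf "%s %d\n", d, required[d] - active[d]
        }
    ' | sort
}
```

One possible invocation, assuming node_exporter listens on its default port: curl -s http://localhost:9100/metrics | md_disk_deficit. Any array with a result above zero is missing at least one active disk.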

Possible Causes

  • Disk failure or malfunction
  • Loose or disconnected cables
  • RAID controller issues
  • Power supply problems
  • Firmware or driver issues

Troubleshooting Steps

1. Check RAID Status

Use the following command to check the status of the RAID array:

cat /proc/mdstat

Expected Output

You should see an output similar to this, showing the status of the RAID arrays:

Personalities : [raid1] 
md0 : active raid1 sda1[0] sdb1[1](F)
      104320 blocks [2/1] [U_]

unused devices: <none>

2. Identify the Failed Disk

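The output of /proc/mdstat indicates which disk has failed. In the example above, [2/1] means the array requires two members but only one is active, and the underscore in [U_] marks the second slot as down, so sdb1 is the failed disk. A member that the kernel has marked as faulty may also carry an (F) flag after its name.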
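On hosts with several arrays, reading the status by eye is error-prone. The following is a minimal sketch that lists degraded arrays and any members flagged (F). It assumes arrays are named mdN and that the status line ends in the [U_] style map; the function name degraded_arrays is hypothetical.

```shell
# Sketch: list degraded md arrays and any members flagged (F).
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sda1[0] sdb1[1](F)"
        /^md[0-9]+ :/ {
            name = $1
            failed = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[.*/, "", member)   # strip the "[1](F)" suffix
                    failed = failed " " member
                }
            next
        }
        # Status line, e.g. "104320 blocks [2/1] [U_]"
        name != "" && /blocks/ {
            if ($NF ~ /_/)
                print name ":" (failed == "" ? " (no member flagged)" : failed)
            name = ""
        }
    '
}
```

It can be run as degraded_arrays < /proc/mdstat. An array that is degraded because a member is missing entirely, rather than flagged, is still listed, with a note that no member carries the flag.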

3. Check Disk Health

Use smartctl to check the health of the failed disk:

sudo smartctl -a /dev/sdb

Expected Output

The output contains detailed health information about the disk. For a failing disk, the overall-health line reports a failure; a healthy disk reports PASSED:

SMART overall-health self-assessment test result: FAILED
...
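When the check is scripted, the exit status of smartctl is often easier to act on than its text output. Below is a minimal sketch that interprets that status; per the smartctl manual page it is a bitmask in which bit 3 (value 8) means the SMART status check returned "DISK FAILING". The function name smart_verdict and the four verdict labels are hypothetical.

```shell
# Sketch: map a smartctl exit status to a coarse verdict.
# Bit 3 (value 8): SMART status check returned "DISK FAILING".
# Bits 0-2 (values 1, 2, 4): command line error, device open failed,
# or a SMART command failed, so no verdict could be obtained.
# Bits 4-7: attribute or log warnings, reported here as "warning".
smart_verdict() {
    status=$1
    if [ $(( status & 8 )) -ne 0 ]; then
        echo "failing"
    elif [ $(( status & 7 )) -ne 0 ]; then
        echo "unknown"
    elif [ "$status" -ne 0 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

A possible use is to run sudo smartctl -H /dev/sdb and then call smart_verdict "$?" immediately afterwards.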

4. Replace the Failed Disk

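If the disk has failed, mark the member as failed and remove it from the array before pulling the disk (sudo mdadm --manage /dev/md0 --fail /dev/sdb1, then sudo mdadm --manage /dev/md0 --remove /dev/sdb1). Install the replacement disk, partition it to match the surviving member, and add the new partition to the RAID array: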

sudo mdadm --manage /dev/md0 --add /dev/sdb1
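Because these commands are destructive if pointed at the wrong device, it can help to print the full sequence for review first. The sketch below only echoes the commands and executes nothing; the function name replace_member_plan is hypothetical, and the array and member paths are whatever you pass in.

```shell
# Sketch: print the mdadm commands for swapping out a failed member.
# Nothing is executed; review the output before running it by hand.
# $1: array device (for example /dev/md0)
# $2: member partition (for example /dev/sdb1)
replace_member_plan() {
    array=$1
    member=$2
    echo "mdadm --manage $array --fail $member"
    echo "mdadm --manage $array --remove $member"
    echo "# replace the disk and partition it to match the surviving member"
    echo "mdadm --manage $array --add $member"
}
```

For the example in this runbook, replace_member_plan /dev/md0 /dev/sdb1 prints the three mdadm commands in order, with a reminder about the physical swap between the remove and add steps.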

Expected Output

Check the RAID status again to ensure the disk is being rebuilt:

cat /proc/mdstat

You should see the RAID array rebuilding:

md0 : active raid1 sda1[0] sdb1[2]
      104320 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (128/104320) finish=10.0min speed=100K/sec
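To follow the rebuild without rereading the full status each time, the progress figure can be pulled out of the same output. This is a minimal sketch that assumes arrays are named mdN and that the progress line has the "recovery = N%" or "resync = N%" shape shown above; the function name rebuild_progress is hypothetical.

```shell
# Sketch: print "<array> <percent>" for every array that is rebuilding.
# Reads /proc/mdstat-formatted text on stdin.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) {
                    pct = $i
                    sub(/%$/, "", pct)     # drop the percent sign
                    print name, pct
                }
        }
    '
}
```

Running rebuild_progress < /proc/mdstat at intervals shows whether the percentage is advancing; no output means no array is currently rebuilding.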

5. Check RAID Controller and Cables

Ensure that the RAID controller and cables are functioning correctly. Check for loose or disconnected cables and reseat them if necessary.
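Cabling and controller faults often leave traces in the kernel log before a disk is dropped from the array. The sketch below filters log text for a few messages that commonly accompany such faults. The pattern list is illustrative rather than exhaustive, and the function name link_errors is hypothetical.

```shell
# Sketch: keep kernel log lines that often accompany cabling or
# controller trouble. Reads log text on stdin.
# The patterns are examples only; extend them for your hardware.
link_errors() {
    grep -E 'hard resetting link|link is slow to respond|SATA link down|I/O error'
}
```

For example, sudo dmesg | link_errors. Repeated link resets on the same port after the disk has been replaced point toward the cable, backplane, or controller rather than the disk itself.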

6. Update Firmware and Drivers

Ensure that the firmware and drivers for the RAID controller and disks are up to date. Check the manufacturer’s website for updates and follow their instructions for updating.

Additional Steps

1. Monitor RAID Status

Continuously monitor the status of the RAID arrays to ensure they remain healthy. Use tools like Prometheus and Grafana to set up dashboards and alerts.
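As a host-local complement to dashboards, a small check can be run from cron or a systemd timer. The sketch below reads the per-array degraded counter that the md driver exposes in sysfs for redundant arrays. The function name check_md_degraded is hypothetical, and the root directory is a parameter only so the function can be pointed at a test directory.

```shell
# Sketch: report arrays whose sysfs "degraded" counter is non-zero.
# $1: block-device sysfs root (defaults to /sys/block).
# Returns 1 if any array is degraded, 0 otherwise.
check_md_degraded() {
    root=${1:-/sys/block}
    rc=0
    for f in "$root"/md*/md/degraded; do
        [ -r "$f" ] || continue            # glob matched nothing readable
        missing=$(cat "$f")
        if [ "$missing" -gt 0 ]; then
            dev=${f#"$root"/}              # e.g. md0/md/degraded
            echo "${dev%%/*} degraded: $missing member(s) missing"
            rc=1
        fi
    done
    return "$rc"
}
```

The non-zero return status makes it easy to chain a notification, for example check_md_degraded || followed by whatever alerting command the host already uses.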

2. Perform Regular Backups

Ensure that regular backups are performed to protect against data loss in case of RAID failure. Use tools like rsync, tar, or backup software to automate the backup process.

By following these steps, you should be able to troubleshoot and resolve the “NodeRAIDDegraded” alert. If the issue persists, further investigation into the specific RAID configuration and hardware may be necessary.