Runbook for “HighMemoryUsage” Alert

1. Identify the Problem

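When the “HighMemoryUsage” alert fires, available memory (MemAvailable) on one or more instances has stayed at or below 10% of total memory for at least 5 minutes, as defined by the alert rule in step 7. The instance label on the alert identifies the affected host.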

2. Check Current Memory Usage

Use the following command to check the current memory usage on the affected instance:

free -m

Expected Output

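You should see output similar to this. The available column is the one to watch: it corresponds to the MemAvailable value the alert is based on and already counts reclaimable cache, so a low free value alone is not a problem: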

              total        used        free      shared  buff/cache   available
Mem:           7977        1234        3456         123        2345        4567
Swap:          2047          12        2035
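
To compare the instance against the alert threshold directly, the ratio used by the alert expression can be computed from /proc/meminfo. The following is a minimal sketch; the function name mem_available_pct and its optional file argument are illustrative assumptions, not part of any existing tooling.

```shell
# Print MemAvailable as a percentage of MemTotal.
# $1 (optional): a meminfo-format file; defaults to /proc/meminfo.
# The function name and the file argument are hypothetical helpers.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) { printf "%.1f\n", 100 * avail / total }
             else           { exit 1 }
         }' "${1:-/proc/meminfo}"
}
```

Run mem_available_pct with no arguments on the affected instance; a value at or below 10 matches the alert condition in step 7.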

3. Identify Memory-Consuming Processes

To identify the processes consuming the most memory, use:

ps aux --sort=-%mem | head -n 11

Expected Output

This command prints the header row followed by the top 10 processes by memory usage. RSS is the resident memory of each process in KiB:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1234  5.0 10.0 1234560 789012 ?      S    10:00   0:30 /usr/bin/python3
...
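
A per-process listing can hide a service that runs many small workers. As a rough sketch, the resident memory can be summed per command name; the function name rss_by_command is an assumption made for illustration, and command names containing spaces are truncated to their first word.

```shell
# Sum resident memory (RSS, KiB) per command name and print the largest
# totals in whole MiB. Reads "RSS COMMAND" pairs on stdin.
# $1 (optional): number of rows to print; defaults to 5.
# rss_by_command is a hypothetical helper name.
rss_by_command() {
    awk '{ rss[$2] += $1 }
         END { for (c in rss) printf "%d %s\n", rss[c] / 1024, c }' |
        sort -rn | head -n "${1:-5}"
}

# On the affected instance (not executed here):
#   ps -eo rss=,comm= | rss_by_command 5
```

Shared pages are counted once per process in RSS, so the totals overstate the real footprint of forked workers; treat them as an upper bound.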

4. Restart Memory-Consuming Services

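If a specific service is consuming too much memory, consider restarting it. A restart interrupts in-flight requests and only buys time if the service leaks memory, so record the offending PID and its RSS first. For example, if an Apache web server is using excessive memory (the unit is named apache2 on Debian and Ubuntu, httpd on RHEL-family systems):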

sudo systemctl restart apache2

Expected Output

Check the status to ensure the service restarted successfully:

sudo systemctl status apache2

You should see an output indicating that the service is active and running.
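
To confirm that the restart actually released memory, the unit's total memory as tracked by systemd can be compared before and after. This is a sketch that assumes memory accounting is enabled for the unit (systemd reports [not set] otherwise); unit_memory_mib is a hypothetical helper name.

```shell
# Convert a "MemoryCurrent=<bytes>" line (as printed by systemctl show)
# into whole MiB. Prints "unknown" if the value is missing or not numeric.
# unit_memory_mib is a hypothetical helper name.
unit_memory_mib() {
    awk -F= '$1 == "MemoryCurrent" && $2 ~ /^[0-9]+$/ {
                 printf "%d\n", $2 / 1048576
                 found = 1
             }
             END { if (!found) print "unknown" }'
}

# On the affected instance (not executed here):
#   systemctl show apache2 --property=MemoryCurrent | unit_memory_mib
```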

5. Clear Cache

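Dropping the page cache, dentries, and inodes increases the free column, but it rarely clears this alert: the kernel already counts most reclaimable cache in MemAvailable, which is what the alert measures. It also forces subsequent reads to go back to disk, so treat it as a diagnostic step rather than a fix. Use the following command: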

sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

Expected Output

Re-check the memory usage to see if the available memory has increased:

free -m
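
To quantify the effect instead of comparing two free -m outputs by eye, a single field can be read from /proc/meminfo before and after the drop. The sketch below assumes a hypothetical helper named meminfo_kib; the optional file argument exists only to make the helper easy to try against a saved copy.

```shell
# Print one field of a meminfo-format file in KiB.
# $1: field name without the colon, e.g. MemAvailable or MemFree.
# $2 (optional): file to read; defaults to /proc/meminfo.
# meminfo_kib is a hypothetical helper name.
meminfo_kib() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Example session on the affected instance (not executed here):
#   before=$(meminfo_kib MemAvailable)
#   sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
#   after=$(meminfo_kib MemAvailable)
#   echo "MemAvailable changed by $(( (after - before) / 1024 )) MiB"
```

Running the same comparison with MemFree usually shows a much larger change, which illustrates why a jump in free memory does not necessarily move the alert metric.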

6. Add Swap Space

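If the system frequently runs out of memory, consider adding swap space as a stopgap. Swap does not raise MemAvailable by itself, but it gives the kernel somewhere to move idle pages. Here’s how to add a 1 GiB swap file; it stays active only until the next reboot unless it is also listed in /etc/fstab: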

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Expected Output

Verify the swap space:

sudo swapon --show

You should see the new swap file listed.
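
To keep the swap file across reboots, it needs an entry in /etc/fstab. The following sketch adds one only if it is missing, so it is safe to run twice; ensure_swap_in_fstab is a hypothetical helper name, and the mount options shown (sw 0 0) are a common default rather than a requirement.

```shell
# Append a swap entry for the file $1 unless one already exists.
# $1: path of the swap file, e.g. /swapfile.
# $2 (optional): fstab to edit; defaults to /etc/fstab (needs root).
# ensure_swap_in_fstab is a hypothetical helper name.
ensure_swap_in_fstab() {
    if ! awk -v f="$1" '$1 == f && $3 == "swap" { found = 1 }
                        END { exit !found }' "${2:-/etc/fstab}"; then
        printf '%s none swap sw 0 0\n' "$1" >> "${2:-/etc/fstab}"
    fi
}

# In a root shell on the affected instance (not executed here):
#   ensure_swap_in_fstab /swapfile
```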

7. Update Prometheus Configuration

If the memory usage threshold needs adjustment, update the alert expression and reload Prometheus.

Alerting rules are defined in a rules file listed under rule_files in prometheus.yml, not in prometheus.yml itself. Edit the HighMemoryUsage rule in that file:

- alert: HighMemoryUsage
  expr: 100 * (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) <= 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is above 90% on {{ $labels.instance }}."

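Reload the Prometheus configuration. The HTTP reload endpoint is available only when Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process instead: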

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

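A successful reload returns HTTP 200. If the rules file is invalid, the reload fails, Prometheus keeps running with the previous configuration, and the error is written to its log. After a successful reload, the updated rule should appear on the Alerts page of the Prometheus web UI.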
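
Validating the rules file before reloading avoids a failed reload on a production server. A minimal sketch, assuming promtool (shipped with Prometheus) and curl are on the PATH; the function name check_and_reload and both of its arguments are illustrative.

```shell
# Validate a rules file and reload Prometheus only if validation passes.
# $1: path to the rules file that contains the HighMemoryUsage rule.
# $2: Prometheus base URL, e.g. http://<prometheus_host>:<prometheus_port>
# check_and_reload is a hypothetical helper name.
check_and_reload() {
    if ! promtool check rules "$1"; then
        echo "rule check failed; not reloading" >&2
        return 1
    fi
    # -f makes curl exit non-zero on an HTTP error response.
    curl -fsS -X POST "$2/-/reload"
}
```

The function returns non-zero if either the validation or the reload request fails, so it can be used as a gate in a deployment script.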

Conclusion

By following these steps, you should be able to troubleshoot and resolve the “HighMemoryUsage” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.