Runbook for “HighMemoryUsage” Alert
1. Identify the Problem
When the “HighMemoryUsage” alert fires, available memory on one or more instances has stayed at or below 10% of total memory for at least 5 minutes (see the alert rule in step 7).
2. Check Current Memory Usage
Use the following command to check the current memory usage on the affected instance:
free -m
Expected Output
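You should see output similar to the following. The available column, rather than free, estimates how much memory applications can still claim without swapping, and it is the figure the alert is based on:
               total        used        free      shared  buff/cache   available
Mem:            7977        1234        3456         123        2345        4567
Swap:           2047          12        2035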
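The alert compares available memory with total memory, so it can help to compute the same percentage directly on the host. The following is a minimal sketch, assuming a Linux host whose /proc/meminfo exposes MemAvailable; the function name avail_pct is hypothetical and not part of any existing tooling.

```shell
# avail_pct [FILE]: print available memory as a percentage of total memory.
# FILE defaults to /proc/meminfo; any file in the same format also works.
avail_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of dividing by zero if MemTotal was not found.
            if (total > 0) printf "%.1f\n", 100 * avail / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

Running avail_pct with no arguments reads the live system. A result at or below 10 corresponds to the condition in the alert expression in step 7.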
3. Identify Memory-Consuming Processes
To identify processes consuming the most memory, use:
ps aux --sort=-%mem | head -n 11
Expected Output
This command prints the header row followed by the top 10 processes by memory usage:
USER         PID %CPU %MEM     VSZ    RSS TTY      STAT START   TIME COMMAND
root        1234  5.0 10.0 1234567 789012 ?        S    10:00   0:30 /usr/bin/python3
...
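A service that runs many worker processes may not stand out in a per-process listing, so summing resident memory per command name can be more telling. This is a rough sketch; the helper name rss_by_command is hypothetical, and it keys on the first word of the command name only.

```shell
# rss_by_command: read "RSS_KIB COMMAND" lines on stdin and print the total
# resident memory per command in MiB, largest first.
rss_by_command() {
    awk '{ sum[$2] += $1 }
         END { for (c in sum) printf "%.1f %s\n", sum[c] / 1024, c }' |
        sort -rn
}
```

One way to feed it, assuming the procps version of ps, is: ps -e -o rss= -o comm= | rss_by_command | head -n 5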
4. Restart Memory-Consuming Services
If a specific service is consuming too much memory, consider restarting it. For example, if a web server is using excessive memory:
sudo systemctl restart apache2
Expected Output
Check the status to ensure the service restarted successfully:
sudo systemctl status apache2
You should see an output indicating that the service is active and running.
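The restart and the status check can be combined so that a failed restart is noticed immediately. The sketch below assumes a systemd host; the function name restart_and_wait and the 30-second default timeout are illustrative choices, not values from this runbook.

```shell
# restart_and_wait UNIT [TIMEOUT_SECONDS]: restart a systemd unit, then poll
# once per second until it reports active or the timeout expires.
restart_and_wait() {
    unit=$1
    timeout=${2:-30}
    sudo systemctl restart "$unit" || return 1
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$unit did not become active within ${timeout}s" >&2
    return 1
}
```

For the example above, this would be called as: restart_and_wait apache2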
5. Clear Cache
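Dropping the page cache, dentries, and inodes raises the free figure, but the kernel reclaims this memory on demand and already counts it in the available column, so this step rarely resolves the alert and can temporarily slow disk I/O. Use it mainly to confirm whether the pressure comes from cache or from process memory:
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
Expected Output
Re-check the memory usage. If the available column barely changes, the pressure comes from process memory rather than cache: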
free -m
6. Add Swap Space
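If the system frequently runs out of memory, consider adding swap space as a stopgap. Swap does not raise MemAvailable, so it will not clear the alert on its own, but it lowers the risk of the kernel OOM killer terminating processes. To add a 1 GiB swap file (active until the next reboot unless it is also listed in /etc/fstab):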
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Expected Output
Verify the swap space:
sudo swapon --show
You should see the new swap file listed.
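A swap file enabled with swapon is not re-enabled after a reboot unless it has an entry in /etc/fstab. The following sketch adds that entry only if it is missing; the function name persist_swap is hypothetical, and the optional second argument exists so the function can be tried against a copy of the file first.

```shell
# persist_swap SWAPFILE [FSTAB]: append a swap entry for SWAPFILE to FSTAB
# (default /etc/fstab) unless an entry for that exact path already exists.
persist_swap() {
    swapfile=$1
    fstab=${2:-/etc/fstab}
    # Exact match on the first field avoids regex surprises in the path.
    if awk -v p="$swapfile" '$1 == p { found = 1 } END { exit !found }' \
        "$fstab" 2>/dev/null; then
        echo "entry for $swapfile already present"
    else
        echo "$swapfile none swap sw 0 0" >> "$fstab"
        echo "added $swapfile to $fstab"
    fi
}
```

Writing to /etc/fstab requires root, so the function has to run in a root shell. Calling persist_swap /swapfile twice leaves a single entry.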
7. Update Prometheus Configuration
If the memory usage threshold needs adjustment, update the alert expression and reload the configuration:
Edit the alerting rule. It normally lives in a rules file listed under rule_files in prometheus.yml, not in prometheus.yml itself:
- alert: HighMemoryUsage
  expr: 100 * (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) <= 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is above 90% on {{ $labels.instance }}."
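Reload the Prometheus configuration. The HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process instead: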
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
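Prometheus should reload the configuration without errors, and the alert should be updated accordingly. If the new configuration is invalid, the reload request fails and Prometheus keeps running with the previous configuration.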
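Validating the rules file before reloading avoids pushing a broken rule. The sketch below assumes promtool (shipped with Prometheus) is on the PATH and that the lifecycle endpoint is enabled; the function name reload_prometheus and the arguments shown in the usage line are placeholders.

```shell
# reload_prometheus RULES_FILE BASE_URL: validate the rules file with promtool
# and, only if validation passes, ask Prometheus to reload its configuration.
reload_prometheus() {
    rules_file=$1
    base_url=$2
    if ! promtool check rules "$rules_file"; then
        echo "rule validation failed; not reloading" >&2
        return 1
    fi
    # --fail makes curl exit non-zero when the server answers with an HTTP error.
    curl --fail --silent --show-error -X POST "$base_url/-/reload"
}
```

Usage, with placeholders to fill in: reload_prometheus <rules_file> http://<prometheus_host>:<prometheus_port>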
Conclusion
By following these steps, you should be able to troubleshoot and resolve the “HighMemoryUsage” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.