Runbook: NodeFileDescriptorLimit
Alert Details
- Alert Name: NodeFileDescriptorLimit
- Expression:
node_filefd_allocated{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} * 100 / node_filefd_maximum{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} >
Description
This alert fires when the system-wide percentage of allocated file descriptors (node_filefd_allocated relative to node_filefd_maximum) exceeds the configured threshold. File descriptors are a finite resource that the operating system uses to track open files, sockets, pipes, and similar handles. When they run out, applications fail as soon as they try to open a new file or network connection.
Possible Causes
- Applications opening too many files or sockets
- File descriptor leaks in applications
- Insufficient file descriptor limits set on the system
- High number of concurrent connections or open files
Troubleshooting Steps
1. Check Current File Descriptor Usage
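Use the following command to read the kernel's system-wide counters on the affected instance. These are the values behind node_filefd_allocated and node_filefd_maximum:
cat /proc/sys/fs/file-nr
Expected Output
Three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum. For example:
12345 0 1048576
For a rough count of open files as seen by lsof, use the command below. It overcounts, because lsof also lists memory-mapped files, working directories, and a header line:
lsof | wc -l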
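The percentage that the alert expression computes can be reproduced on the host itself. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name fd_usage_percent is illustrative and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Print system-wide file descriptor usage as a percentage.
# The input file holds three fields: allocated, allocated-but-unused, maximum.
fd_usage_percent() {
    # Optional first argument: an alternative file in the same three-field format
    LC_ALL=C awk '{ printf "%.2f\n", $1 * 100 / $3 }' "${1:-/proc/sys/fs/file-nr}"
}

# Only call it when the kernel counter file is available
if [ -r /proc/sys/fs/file-nr ]; then
    fd_usage_percent
fi
```

Compare the printed value with the alert threshold to confirm that the host agrees with what Prometheus reports.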
2. Identify Processes with High File Descriptor Usage
To identify processes using a large number of file descriptors, use:
lsof | awk '{print $2}' | sort | uniq -c | sort -nr | head -n 10
Expected Output
This command lists the top 10 processes by number of open files. The first column is the count and the second column is the PID:
5000 1234
3000 5678
...
To get more details about a specific process, use:
lsof -p <PID>
Replace <PID> with the process ID.
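On busy hosts lsof can be slow, and it counts more than real descriptors. As an alternative, the entries under /proc/<PID>/fd can be counted directly. This is a sketch under the assumption that it runs as root (otherwise processes owned by other users report 0); the function name top_fd_processes is illustrative.

```shell
#!/usr/bin/env bash
# List processes by number of open descriptors, highest first.
# Output format matches the lsof pipeline: "<count> <PID>".
top_fd_processes() {
    proc_root="${1:-/proc}"   # alternative root, mainly useful for testing
    limit="${2:-10}"          # number of rows to print
    for dir in "$proc_root"/[0-9]*; do
        [ -d "$dir/fd" ] || continue
        pid="${dir##*/}"
        # ls fails for processes that exited or that we may not inspect;
        # those are reported with a count of 0
        count=$(( $(ls "$dir/fd" 2>/dev/null | wc -l) ))
        echo "$count $pid"
    done | sort -nr | head -n "$limit"
}

top_fd_processes
```

Pass a different row count as the second argument, for example top_fd_processes /proc 20.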
3. Check File Descriptor Limits
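Check the per-process soft limit on open files for the current shell:
ulimit -n
Expected Output
You should see the soft limit for open file descriptors, for example:
1024
Note that this is a per-process limit. The system-wide maximum, which is the value this alert compares against, is shown by:
cat /proc/sys/fs/file-max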
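The limit shown by ulimit applies to the current shell, which may differ from the limit of an already running service. The limits of a specific process can be read from /proc/<PID>/limits. The helper below is a small sketch; the name nofile_limits_of is illustrative.

```shell
#!/usr/bin/env bash
# Print the soft and hard open-file limits of a running process.
nofile_limits_of() {
    pid="$1"
    proc_root="${2:-/proc}"   # alternative root, mainly useful for testing
    # The line looks like: "Max open files   <soft>   <hard>   files"
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "$proc_root/$pid/limits"
}

# Example: show the limits of the current shell
if [ -r "/proc/$$/limits" ]; then
    nofile_limits_of "$$"
fi
```

Use a PID found in step 2 to see whether that process is close to its own limit.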
4. Increase File Descriptor Limits
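If a process is hitting its per-process limit, raise it. Edit the /etc/security/limits.conf file, which sets per-user limits for PAM login sessions. It does not change the system-wide fs.file-max value that this alert measures, and services started by systemd take their limit from LimitNOFILE in the unit file instead: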
sudo nano /etc/security/limits.conf
Add the following lines to set the limits for all users (the * wildcard does not apply to root, which needs its own entries):
* soft nofile 65536
* hard nofile 65536
Expected Output
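The edit itself produces no output. Save the file, then log out and back in (or restart the system) so that new sessions pick up the limits. Running ulimit -n in a new session should then print 65536.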
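Because the alert compares against the system-wide maximum (fs.file-max), that value may also need raising when the whole host is close to the limit. One way to persist it is a sysctl drop-in file. The sketch below only writes the file; the function name, the file name 90-file-max.conf, and the value 2097152 are assumptions for illustration and should be sized for the actual workload.

```shell
#!/usr/bin/env bash
# Write a sysctl drop-in that persists a system-wide fs.file-max value.
write_file_max_dropin() {
    value="$1"
    path="$2"
    # Reject anything that is not a plain decimal number
    case "$value" in
        ''|*[!0-9]*) echo "value must be a decimal number" >&2; return 1 ;;
    esac
    printf 'fs.file-max = %s\n' "$value" > "$path"
}
```

As root, this could be used as write_file_max_dropin 2097152 /etc/sysctl.d/90-file-max.conf, followed by sysctl --system to load the setting and sysctl fs.file-max to confirm it.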
5. Restart Affected Services
If specific services are causing high file descriptor usage, consider restarting them to release unused file descriptors. For example, to restart a web server:
sudo systemctl restart apache2
Expected Output
Check the status to ensure the service restarted successfully:
sudo systemctl status apache2
The output should show the service as active (running).
6. Check for File Descriptor Leaks
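Investigate applications for potential file descriptor leaks. A leak typically shows up as a descriptor count that keeps growing for one process while its workload stays flat. Review application logs for "Too many open files" errors, review the code paths that open files or sockets without closing them, and use debugging tools to identify and fix the leak. Restarting the service only resets the count; it does not fix a leak.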
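One simple way to spot a leak is to sample the descriptor count of a suspect process at a fixed interval and watch the trend. The following is a minimal sketch, assuming a Linux host with /proc; the function name watch_fd_count and its defaults are illustrative.

```shell
#!/usr/bin/env bash
# Sample the open descriptor count of one process at a fixed interval.
# A count that climbs steadily under constant load suggests a leak.
watch_fd_count() {
    pid="$1"
    samples="${2:-5}"         # number of samples to take
    interval="${3:-10}"       # seconds between samples
    proc_root="${4:-/proc}"   # alternative root, mainly useful for testing
    i=0
    while [ "$i" -lt "$samples" ]; do
        if [ ! -d "$proc_root/$pid/fd" ]; then
            echo "process $pid not found" >&2
            return 1
        fi
        count=$(( $(ls "$proc_root/$pid/fd" 2>/dev/null | wc -l) ))
        echo "$(date +%H:%M:%S) pid=$pid fds=$count"
        i=$(( i + 1 ))
        # Skip the sleep after the last sample
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, watch_fd_count 1234 30 60 takes one sample per minute for half an hour for PID 1234. Run it as root or as the owner of the process, otherwise the count reads as 0.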
Additional Steps
1. Monitor File Descriptor Usage
Continuously monitor file descriptor usage to ensure it remains within acceptable limits. Use tools like Prometheus and Grafana to set up dashboards and alerts.
2. Optimize Application Configurations
Review and optimize application configurations to reduce file descriptor usage. This may involve adjusting connection pooling settings, reducing the number of open files, or optimizing resource management.
By following these steps, you should be able to troubleshoot and resolve the “NodeFileDescriptorLimit” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.