Runbook: NodeFileDescriptorLimit

Alert Details

  • Alert Name: NodeFileDescriptorLimit
  • Expression: node_filefd_allocated{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} * 100 / node_filefd_maximum{nanocosmosGroup=~".+", instance=~".+", environment=~".+"} >

Description

This alert fires when the number of allocated file handles on a node (node_filefd_allocated), expressed as a percentage of the system-wide maximum (node_filefd_maximum), exceeds the configured threshold. File descriptors are a finite resource that the operating system uses to track open files, sockets, pipes, and similar objects. When they run out, applications fail as soon as they try to open a new file or network connection.

Possible Causes

  • Applications opening too many files or sockets
  • File descriptor leaks in applications
  • Insufficient file descriptor limits set on the system
  • High number of concurrent connections or open files

Troubleshooting Steps

1. Check Current File Descriptor Usage

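Use the following command on the affected instance to read the counters that the alert is based on:

cat /proc/sys/fs/file-nr

Expected Output

You should see three numbers: the allocated file handles, the allocated but unused handles, and the system-wide maximum. For example:

12345 0 1048576

The first and third values correspond to node_filefd_allocated and node_filefd_maximum. Note that lsof | wc -l is not a reliable substitute, because lsof also lists entries that are not descriptors (such as memory-mapped files and working directories).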
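To compare the node directly against the alert expression, the same ratio can be computed locally. The following is a minimal sketch, assuming a Linux host; the function name fd_usage_percent and its optional path argument are illustrative and not part of any existing tooling.

```shell
#!/bin/sh
# Print allocated file handles as a percentage of the system-wide maximum.
# Input format (Linux): "<allocated> <unused> <maximum>", as found in
# /proc/sys/fs/file-nr.
fd_usage_percent() {
    # $1: optional path to a file-nr style file (hypothetical parameter,
    #     mainly useful for testing); defaults to the kernel file.
    awk '{ printf "%.2f\n", $1 * 100 / $3 }' "${1:-/proc/sys/fs/file-nr}"
}

# Only run against the real file when it exists (Linux).
if [ -r /proc/sys/fs/file-nr ]; then
    fd_usage_percent
fi
```

A value close to the alert threshold confirms that the alert reflects the current state of the node rather than a stale scrape.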

2. Identify Processes with High File Descriptor Usage

To identify processes using a large number of file descriptors, use:

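lsof | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -nr | head -n 10

Expected Output

This command lists the 10 processes with the most lsof entries. The first column is the number of entries and the second column is the PID. The counts include entries that are not descriptors (such as memory-mapped files), so treat them as a relative ranking rather than an exact descriptor count: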

5000 1234
3000 5678
...

To get more details about a specific process, use:

lsof -p <PID>

Replace <PID> with the process ID.
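For exact per-process descriptor counts, the entries under /proc/<PID>/fd can be counted instead of parsing lsof output. The sketch below assumes a Linux host and should be run as root so that other users' processes are readable; the function name top_fd_processes and its two optional arguments are illustrative.

```shell
#!/bin/sh
# List the processes with the most open file descriptors by counting the
# entries in /proc/<PID>/fd. Output lines: "<count> <PID> <command name>".
top_fd_processes() {
    proc_root="${1:-/proc}"   # hypothetical parameter: alternate root for testing
    limit="${2:-10}"          # number of processes to print
    for dir in "$proc_root"/[0-9]*; do
        [ -d "$dir/fd" ] || continue
        # Directories that cannot be read (no root access) count as 0.
        count=$(ls "$dir/fd" 2>/dev/null | wc -l | tr -d ' ')
        name=$(cat "$dir/comm" 2>/dev/null)
        echo "$count ${dir##*/} $name"
    done | sort -nr | head -n "$limit"
}

top_fd_processes
```

Unlike the lsof pipeline, each line here also carries the command name, which saves a separate lookup of the PID.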

3. Check File Descriptor Limits

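Check the soft limit on open file descriptors for processes started from the current shell. This is a per-process limit, not the system-wide maximum (fs.file-max) that the alert expression uses: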

ulimit -n

Expected Output

You should see the current soft per-process limit for open file descriptors:

1024
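A process that is already running may have limits that differ from those of the current shell, so it is often more useful to read them from /proc/<PID>/limits. The following is a minimal sketch, assuming a Linux host; the function name nofile_limits and its optional second argument are illustrative.

```shell
#!/bin/sh
# Print "<soft> <hard>" open-file limits of a running process, taken from
# the "Max open files" row of /proc/<PID>/limits.
nofile_limits() {
    # $1: PID to inspect
    # $2: optional proc root (hypothetical parameter, for testing only)
    awk '/^Max open files/ { print $4, $5 }' "${2:-/proc}/$1/limits"
}

# Example: limits of the current shell, if /proc is available.
if [ -r "/proc/$$/limits" ]; then
    nofile_limits "$$"
fi
```

Comparing the soft limit with the descriptor count of the same process shows whether that process is close to its own ceiling, independently of the system-wide figure.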

4. Increase File Descriptor Limits

If the per-process limit is insufficient, increase it. Edit the /etc/security/limits.conf file to set higher limits:

sudo nano /etc/security/limits.conf

Add the following lines to set the limits for all users (the wildcard does not apply to root, which needs its own explicit entry):

* soft nofile 65536
* hard nofile 65536

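Save the file. The limits are applied by pam_limits when a session starts, so log out and back in, or restart the system, for them to take effect. Services started by systemd generally do not read this file; their limit comes from the LimitNOFILE= setting of the unit. Neither setting changes the system-wide maximum, which is controlled by the fs.file-max sysctl.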

5. Restart Affected Services

If specific services are causing high file descriptor usage, consider restarting them to release the descriptors they hold. If the service leaks descriptors, a restart only provides temporary relief. For example, to restart a web server:

sudo systemctl restart apache2

Expected Output

Check the status to ensure the service restarted successfully:

sudo systemctl status apache2

The output should show the service as active (running).

6. Check for File Descriptor Leaks

Investigate applications for potential file descriptor leaks. This may involve reviewing application logs, code, or using debugging tools to identify and fix leaks.
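One simple way to confirm a suspected leak is to sample the descriptor count of the process over time and check whether it keeps growing while load stays flat. The sketch below assumes a Linux host and sufficient permissions to read /proc/<PID>/fd; the function name sample_fd_count and its arguments are illustrative.

```shell
#!/bin/sh
# Sample the number of open descriptors of one process at a fixed interval.
# Output lines: "<HH:MM:SS> <count>".
sample_fd_count() {
    pid="$1"                   # PID to observe
    interval="${2:-5}"         # seconds between samples
    samples="${3:-12}"         # number of samples to take
    proc_root="${4:-/proc}"    # hypothetical parameter, for testing only
    i=0
    while [ "$i" -lt "$samples" ]; do
        if [ ! -d "$proc_root/$pid/fd" ]; then
            echo "process $pid is gone" >&2
            return 1
        fi
        count=$(ls "$proc_root/$pid/fd" | wc -l | tr -d ' ')
        echo "$(date +%H:%M:%S) $count"
        i=$((i + 1))
        # Do not sleep after the final sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Example: three samples of the current shell, one second apart.
if [ -d "/proc/$$/fd" ]; then
    sample_fd_count "$$" 1 3
fi
```

If the count rises steadily, lsof -p <PID> taken at two points in time usually shows which kind of file or socket is accumulating.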

Additional Steps

1. Monitor File Descriptor Usage

Continuously monitor file descriptor usage to ensure it remains within acceptable limits. Use tools like Prometheus and Grafana to set up dashboards and alerts.

2. Optimize Application Configurations

Review and optimize application configurations to reduce file descriptor usage. This may involve adjusting connection pooling settings, reducing the number of open files, or optimizing resource management.

By following these steps, you should be able to troubleshoot and resolve the “NodeFileDescriptorLimit” alert. If the issue persists, further investigation into the specific processes and system configuration may be necessary.