Runbook: NodeNetworkReceiveErrs
Alert Details
- Alert Name: NodeNetworkReceiveErrs
- Expression:
rate(node_network_receive_errs_total{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[2m]) / rate(node_network_receive_packets_total{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[2m]) >
Description
This alert fires when the ratio of receive errors to received packets on a network interface, each measured as a per-second rate over a 2-minute window, exceeds the configured threshold (the sample rule later in this runbook uses 0.01, i.e. 1%). It usually points to a problem with a network interface, its link, or its driver on one or more instances.
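To make the ratio concrete, the following is a minimal sketch of the same arithmetic done by hand from two counter readings. The function name rx_error_ratio and the sample numbers are illustrative assumptions, not part of the alert definition. Because both rates cover the same window, the ratio of the rates is approximated here by the ratio of the counter deltas; unlike Prometheus' rate(), this sketch does not handle counter resets.

```shell
# Ratio of receive errors to received packets between two counter samples.
# Arguments: errs_before pkts_before errs_after pkts_after
rx_error_ratio() {
  awk -v e1="$1" -v p1="$2" -v e2="$3" -v p2="$4" 'BEGIN {
    dp = p2 - p1
    if (dp <= 0) { print "NaN"; exit }   # no packets received in the window
    printf "%.4f\n", (e2 - e1) / dp
  }'
}

# Hypothetical samples taken two minutes apart:
# 30 new errors over 2000 new packets.
rx_error_ratio 10 123456 40 125456   # prints 0.0150
```

A result of 0.0150 means 1.5% of the packets received in the window had errors, which would exceed a threshold of 0.01.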
Possible Causes
- Network interface hardware issues
- Network configuration errors
- High network traffic overrunning the interface's receive buffers
- Faulty network cables or connections
- Driver or firmware issues
Troubleshooting Steps
1. Check Current Network Errors
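Use the following command to check the current network errors on the affected instance (ifconfig is part of the legacy net-tools package; where it is not installed, ip -s link reports the same counters):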
ifconfig -a
Expected Output
You should see output similar to this, with error counts for each network interface:
eth0      Link encap:Ethernet  HWaddr 00:0c:29:68:8c:7b
          inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123456 errors:10 dropped:0 overruns:0 frame:0
          TX packets:123456 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12345678 (12.3 MB)  TX bytes:12345678 (12.3 MB)
2. Identify the Affected Interface
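Determine which network interface is experiencing high receive errors. In the example above, eth0 has 10 receive errors out of 123456 received packets. These counters are cumulative, so take two readings a short time apart: the alert reacts to how fast the error count grows relative to the packet count, not to its absolute value.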
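On Linux the same counters are also exposed per interface under /sys/class/net/<interface>/statistics/, which makes it easy to list every interface at once. The sketch below assumes that layout; the function name rx_counters and its optional root-directory parameter are illustrative and exist only so the function can be pointed at a test directory.

```shell
# Print "iface rx_packets rx_errors" for every interface under a sysfs root.
# The root defaults to /sys/class/net.
rx_counters() {
  local root="${1:-/sys/class/net}" dir iface
  for dir in "$root"/*/; do
    # Skip entries that are not interfaces or expose no statistics.
    [ -r "${dir}statistics/rx_errors" ] || continue
    iface="$(basename "$dir")"
    printf '%s %s %s\n' "$iface" \
      "$(cat "${dir}statistics/rx_packets")" \
      "$(cat "${dir}statistics/rx_errors")"
  done
}

rx_counters
```

Running it twice, a minute or so apart, shows which interface's error counter is actually increasing.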
3. Check Network Interface Statistics
To get more detailed statistics for the affected interface, use:
ethtool -S eth0
Replace eth0 with the actual interface name.
Expected Output
This command prints driver-level counters for the network interface. Counter names vary by driver, so the output on your system may differ from this example:
NIC statistics:
     rx_packets: 123456
     tx_packets: 123456
     rx_errors: 10
     tx_errors: 0
     ...
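On real hardware ethtool -S can print hundreds of counters, so it helps to keep only the non-zero receive-side error counters. The filter below is a sketch: the function name nonzero_rx_errors and the keyword list (err, drop, crc, fifo, miss, over) are assumptions, and because counter names are driver-specific the list may need adjusting.

```shell
# Read "ethtool -S" output on stdin and keep only non-zero counters whose
# name mentions rx together with an error-like keyword.
nonzero_rx_errors() {
  awk -F': *' '
    $1 ~ /rx/ && $1 ~ /(err|drop|crc|fifo|miss|over)/ && $2 + 0 > 0 {
      gsub(/^[ \t]+/, "", $1)   # strip leading indentation
      print $1 ": " $2
    }'
}

# Sample input taken from the example output above.
printf '%s\n' 'NIC statistics:' '     rx_packets: 123456' \
  '     tx_packets: 123456' '     rx_errors: 10' '     tx_errors: 0' \
  | nonzero_rx_errors   # prints rx_errors: 10
```

On a live system the input would come from ethtool instead, for example: ethtool -S eth0 | nonzero_rx_errors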
4. Restart Network Interface
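If the errors persist, consider restarting the network interface. This interrupts all traffic on the interface until it comes back up, so run it from a console or out-of-band session rather than over an SSH connection that uses the same interface. The ifdown and ifup commands are available on systems that manage interfaces through /etc/network/interfaces: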
sudo ifdown eth0 && sudo ifup eth0
Expected Output
Check the interface status again to ensure it is up and running:
ifconfig eth0
You should see the interface listed as UP.
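The same check can be scripted by reading the interface's operstate file from sysfs. This is a minimal sketch: the function name iface_is_up and its optional root-directory parameter are illustrative. Note that loopback and some virtual interfaces report "unknown" rather than "up", so the check is meant for physical interfaces.

```shell
# Succeed if the interface's operational state is "up".
# Arguments: interface name, optional sysfs root (defaults to /sys/class/net).
iface_is_up() {
  local iface="$1" root="${2:-/sys/class/net}"
  [ "$(cat "$root/$iface/operstate" 2>/dev/null)" = "up" ]
}

if iface_is_up eth0; then
  echo "eth0 is up"
else
  echo "eth0 is not up"
fi
```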
5. Check for Network Configuration Issues
Ensure that the network configuration is correct. Check the network configuration file (usually /etc/network/interfaces or /etc/netplan/*.yaml):
cat /etc/network/interfaces
Expected Output
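Review the configuration of the affected interface: confirm that its address, netmask, gateway, and MTU match what the network expects and that the interface is not defined more than once.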
6. Check for Hardware Issues
Network receive errors can sometimes be caused by hardware issues. Check the physical network connections and cables. If possible, replace the network cable and test again.
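Before swapping cables, it can be worth confirming what the link actually negotiated, since an unexpectedly low speed or half duplex often goes together with a bad cable or port. The sketch below summarizes the Speed, Duplex, and Link detected lines printed by ethtool for an interface; the function name link_summary and the sample values are illustrative assumptions.

```shell
# Read "ethtool <iface>" output on stdin and print a one-line link summary.
link_summary() {
  awk -F': *' '
    /Speed:/         { speed = $2 }
    /Duplex:/        { duplex = $2 }
    /Link detected:/ { link = $2 }
    END { printf "speed=%s duplex=%s link=%s\n", speed, duplex, link }'
}

# Hypothetical sample input in the format ethtool prints.
printf '\tSpeed: 1000Mb/s\n\tDuplex: Full\n\tLink detected: yes\n' \
  | link_summary   # prints speed=1000Mb/s duplex=Full link=yes
```

On a live system: ethtool eth0 | link_summary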
Additional Steps
1. Update Prometheus Configuration
If the network error threshold needs adjustment, update the alert rule and reload the Prometheus configuration.
Alerting rules are defined in a rule file listed under rule_files in prometheus.yml, not in prometheus.yml itself. Edit the rule that defines this alert:
- alert: NodeNetworkReceiveErrs
  expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High network receive errors on {{ $labels.instance }}"
    description: "Network receive errors are above 1% on {{ $labels.instance }}."
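Reload the Prometheus configuration. The /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle; otherwise send SIGHUP to the Prometheus process instead: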
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
Prometheus should reload the configuration without errors, and the alert should be updated accordingly.
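A reload can fail quietly if only the curl exit status is checked, so a script should look at the HTTP status code instead. The wrapper below is a sketch under assumptions: the function name prom_reload is illustrative, and the base URL in the usage comment (localhost on port 9090, the Prometheus default) must be replaced with your own host and port.

```shell
# POST to the reload endpoint and succeed only on HTTP 200.
# Argument: Prometheus base URL, e.g. http://<prometheus_host>:<prometheus_port>
prom_reload() {
  local base_url="$1" code
  code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' \
    -X POST "$base_url/-/reload")" || return 1
  [ "$code" = "200" ]
}

# Example (assumed address):
#   prom_reload "http://localhost:9090" && echo "reload ok" || echo "reload failed"
```

If the reload fails, the Prometheus log normally names the rule file and line that could not be loaded.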
2. Monitor Network Performance
Keep monitoring the network after the fix to confirm that the issue does not recur. Tools like iftop or nload show real-time traffic on an interface (the install command below assumes a Debian or Ubuntu system):
sudo apt-get install iftop
sudo iftop -i eth0
Expected Output
You should see real-time network traffic statistics for the specified interface.
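iftop and nload show traffic volume rather than error counts, so a quick way to confirm that errors have stopped is to sample the rx_errors counter twice. This is a minimal sketch assuming the Linux sysfs layout; the function name rx_error_delta, its default interval, and its optional root-directory parameter are illustrative.

```shell
# Print how many receive errors an interface accumulated over an interval.
# Arguments: interface, interval in seconds (default 10),
#            optional sysfs root (default /sys/class/net).
rx_error_delta() {
  local iface="$1" interval="${2:-10}" root="${3:-/sys/class/net}"
  local before after
  before="$(cat "$root/$iface/statistics/rx_errors")" || return 1
  sleep "$interval"
  after="$(cat "$root/$iface/statistics/rx_errors")" || return 1
  echo $(( after - before ))
}

# Example: errors received on eth0 during the next 60 seconds.
#   rx_error_delta eth0 60
```

A result of 0 over several consecutive intervals suggests the interface is no longer accumulating receive errors.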
These steps should be enough to troubleshoot and resolve most “NodeNetworkReceiveErrs” alerts. If the issue persists, investigate the affected network interface further, including its driver, firmware, and configuration.