Runbook for “NodeNetworkTransmitErrs” Alert
1. Identify the Problem
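When the “NodeNetworkTransmitErrs” alert fires, the ratio of transmit errors to transmitted packets on at least one network interface of an instance has stayed above the alert threshold (1% in the rule shown in step 7) for the whole duration of the rule's for clause.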
2. Check Current Network Errors
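Use the following command to check the current error counters on the affected instance. ifconfig is part of the net-tools package, which many current distributions no longer install by default; ip -s link reports the same counters on those hosts: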
ifconfig -a
Expected Output
You should see an output similar to this, with error counts for each network interface:
eth0      Link encap:Ethernet  HWaddr 00:0c:29:68:8c:7b
          inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123456 errors:0 dropped:0 overruns:0 frame:0
          TX packets:123456 errors:10 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12345678 (12.3 MB)  TX bytes:12345678 (12.3 MB)
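On hosts where ifconfig is not installed, the same transmit error counters can be read directly from sysfs. The following is a minimal sketch, assuming a Linux host that exposes /sys/class/net/<interface>/statistics/tx_errors; the function name list_tx_errors and the SYSFS_NET override are hypothetical names introduced for this example.

```shell
# Root of the per-interface counters. Overridable so the sketch can be run
# against a fake directory tree (an assumption of this example).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# list_tx_errors: print "<interface> <tx_errors>" for every interface found.
list_tx_errors() {
    for f in "$SYSFS_NET"/*/statistics/tx_errors; do
        [ -r "$f" ] || continue          # glob matched nothing, or file unreadable
        iface="${f#"$SYSFS_NET"/}"       # strip the sysfs prefix
        iface="${iface%%/*}"             # keep only the interface name
        echo "$iface $(cat "$f")"
    done
}
```

Calling list_tx_errors prints one line per interface, so an interface with a non-zero second column is a candidate for the next step.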
3. Identify the Affected Interface
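Determine which network interface is reporting transmit errors. In the example above, eth0 shows 10 transmit errors. These counters are cumulative since the interface came up, so take two readings a minute or so apart to confirm that the count is still increasing. Note that 10 errors in 123456 packets is roughly 0.008%, well below the 1% threshold; the alert fires on the recent error rate, not on the lifetime total.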
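To judge whether an interface is currently above the threshold, compare the growth of the error counter with the growth of the packet counter over an interval. The sketch below approximates what the alert expression measures; it is not identical to the Prometheus rate() function, which also handles counter resets. The function names and the SYSFS_NET override are hypothetical and exist only for this example.

```shell
# Root of the per-interface counters; overridable for testing (assumption).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# read_tx_counters IFACE: print "<tx_errors> <tx_packets>" for one interface.
read_tx_counters() {
    stats_dir="$SYSFS_NET/$1/statistics"
    echo "$(cat "$stats_dir/tx_errors") $(cat "$stats_dir/tx_packets")"
}

# tx_error_ratio ERRS1 PKTS1 ERRS2 PKTS2: new errors divided by new packets
# between an earlier sample (1) and a later sample (2).
tx_error_ratio() {
    awk -v e1="$1" -v p1="$2" -v e2="$3" -v p2="$4" 'BEGIN {
        dp = p2 - p1
        if (dp <= 0) { print "0.000000"; exit }   # no traffic in the interval
        printf "%.6f\n", (e2 - e1) / dp
    }'
}
```

Run read_tx_counters eth0 twice, about 60 seconds apart, and pass the four numbers to tx_error_ratio. A result above 0.010000 corresponds to the 1% threshold in the alert rule. With the figures from the sample output, tx_error_ratio 0 0 10 123456 prints 0.000081.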
4. Check Network Interface Statistics
To get more detailed statistics for the affected interface, use:
ethtool -S eth0
Replace eth0 with the actual interface name.
Expected Output
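This command prints the statistics kept by the NIC driver. The set of counters and their names depend on the driver, so the output on your hardware may differ from this example: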
NIC statistics:
     rx_packets: 123456
     tx_packets: 123456
     rx_errors: 0
     tx_errors: 10
     ...
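Driver statistics can run to hundreds of lines, so it helps to show only the error-related counters that are non-zero. The filter below is a sketch that assumes the usual "name: value" layout of ethtool -S output; the function name nonzero_error_counters and the list of name fragments it matches are choices made for this example, not an exhaustive list.

```shell
# nonzero_error_counters: read `ethtool -S IFACE` output on stdin and print
# only counters whose name looks error-related and whose value is non-zero.
nonzero_error_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|collision|carrier|fifo/ && $2 + 0 > 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 ": " $2
        }'
}
```

Typical use is ethtool -S eth0 | nonzero_error_counters. For the sample output above it prints the single line tx_errors: 10.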
5. Restart Network Interface
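If the errors persist, consider restarting the network interface. This briefly drops all traffic on the interface, so run it from a console or out-of-band session rather than over an SSH connection that depends on that interface. The ifdown and ifup commands come from the ifupdown package and may be absent on hosts managed by netplan or NetworkManager: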
sudo ifdown eth0 && sudo ifup eth0
Expected Output
Check the interface status again to ensure it is up and running:
ifconfig eth0
You should see the interface listed as UP.
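The same check can be scripted by reading the operational state that the kernel publishes in sysfs. This is a minimal sketch, assuming /sys/class/net/<interface>/operstate is available; iface_is_up and SYSFS_NET are hypothetical names for this example. Some virtual interfaces report "unknown" rather than "up", so treat the result as a hint for physical NICs only.

```shell
# Root of the per-interface attributes; overridable for testing (assumption).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# iface_is_up IFACE: exit 0 if operstate is "up", 1 if it is anything else,
# 2 if the interface (or its operstate file) cannot be read.
iface_is_up() {
    state="$(cat "$SYSFS_NET/$1/operstate" 2>/dev/null)" || return 2
    [ "$state" = "up" ]
}
```

For example, iface_is_up eth0 && echo "eth0 is up" prints the message only when the link is operational again.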
6. Check for Network Configuration Issues
Ensure that the network configuration is correct. Check the network configuration file (usually /etc/network/interfaces or /etc/netplan/*.yaml):
cat /etc/network/interfaces
Expected Output
Review the configuration for errors or misconfigurations, for example a forced speed or duplex setting that does not match what the switch port is configured for.
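To compare the configured values with what the link actually negotiated, the output of ethtool for the interface can be summarised on one line. The sketch below assumes the usual "Speed:", "Duplex:" and "Auto-negotiation:" lines in ethtool output; the function name link_settings is hypothetical.

```shell
# link_settings: read `ethtool IFACE` output on stdin and print the
# negotiated speed, duplex mode and auto-negotiation state on one line.
link_settings() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }              # strip leading indentation
        $1 == "Speed"            { speed = $2 }
        $1 == "Duplex"           { duplex = $2 }
        $1 == "Auto-negotiation" { autoneg = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }'
}
```

Typical use is ethtool eth0 | link_settings. A half-duplex result, or a speed lower than the NIC and switch support, is worth checking against the configuration file.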
7. Update Prometheus Configuration
If the network error threshold needs adjustment, update the alert expression and reload the Prometheus configuration:
Edit the rule file that defines the alert (alerting rules live in the files listed under rule_files in prometheus.yml, not in prometheus.yml itself):
- alert: NodeNetworkTransmitErrs
  expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High network transmit errors on {{ $labels.instance }}"
    description: "Network transmit errors are above 1% on {{ $labels.instance }}."
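Reload the Prometheus configuration (the /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle; sending SIGHUP to the Prometheus process is the alternative):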
curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload
Expected Output
The reload request should return HTTP 200, and Prometheus should log no configuration errors. Confirm on the Prometheus Alerts page that the rule shows the updated expression.
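A syntax error in a rule file makes the reload fail, so it is safer to validate the file before reloading. The following sketch chains promtool check rules with the reload request and reports the HTTP status. It assumes promtool and curl are on the PATH; the function name validate_and_reload, the file name, and the URL in the usage line are placeholders for this example.

```shell
# validate_and_reload RULES_FILE PROM_URL
# Validate the rule file first and only then ask Prometheus to reload.
validate_and_reload() {
    rules_file="$1"
    prom_url="$2"
    if ! promtool check rules "$rules_file"; then
        echo "rule file failed validation; not reloading" >&2
        return 1
    fi
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    status="$(curl -s -o /dev/null -w '%{http_code}' -X POST "$prom_url/-/reload")"
    if [ "$status" != "200" ]; then
        echo "reload returned HTTP $status" >&2
        return 1
    fi
    echo "reloaded"
}
```

For example: validate_and_reload /path/to/rules.yml http://<prometheus_host>:<prometheus_port>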
Conclusion
These steps cover the usual host-side causes of the “NodeNetworkTransmitErrs” alert. If the error rate stays high after working through them, investigate the specific network interface and its configuration further.