Runbook for “NodeNetworkTransmitErrs” Alert

1. Identify the Problem

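When the “NodeNetworkTransmitErrs” alert fires, the ratio of transmit errors to transmitted packets on one or more instances has exceeded the alert threshold. With the rule shown in step 7, that means transmit errors exceeded 1% of transmitted packets, measured over a 2-minute window and sustained for 5 minutes.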

2. Check Current Network Errors

Use the following command to check the current network errors on the affected instance (on systems without the net-tools package, ip -s link shows the same counters):

ifconfig -a

Expected Output

You should see output similar to the following, with error counts for each network interface (newer net-tools versions lay out the same fields differently):

eth0      Link encap:Ethernet  HWaddr 00:0c:29:68:8c:7b  
          inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123456 errors:0 dropped:0 overruns:0 frame:0
          TX packets:123456 errors:10 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12345678 (12.3 MB)  TX bytes:12345678 (12.3 MB)

3. Identify the Affected Interface

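Determine which network interface is experiencing high transmit errors. In the example above, eth0 has 10 transmit errors. These counters are cumulative totals, not rates, so run the command twice a few seconds apart and compare the TX errors values to confirm the count is still increasing.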
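
To compare all interfaces at once, the same counters can be read from sysfs. The following is a minimal sketch, assuming a Linux host that exposes /sys/class/net/<interface>/statistics/tx_errors; the function name list_tx_errors and its optional root-directory argument are hypothetical helpers, not part of any standard tool.

```shell
# list_tx_errors [SYSFS_ROOT]
# Print "<interface> <tx_errors>" for every interface under SYSFS_ROOT
# (defaults to /sys/class/net). The optional argument is a hypothetical
# convenience so the function can be pointed at a test directory.
list_tx_errors() {
    root="${1:-/sys/class/net}"
    for dir in "$root"/*; do
        counter="$dir/statistics/tx_errors"
        # Skip entries that are not interfaces or expose no counter.
        [ -r "$counter" ] || continue
        printf '%s %s\n' "$(basename "$dir")" "$(cat "$counter")"
    done
}
```

Calling list_tx_errors twice a few seconds apart and comparing the two listings shows which interface's counter is still growing.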

4. Check Network Interface Statistics

To get more detailed statistics for the affected interface, use:

ethtool -S eth0

Replace eth0 with the actual interface name.

Expected Output

This command prints driver-level counters for the network interface. The exact counter names vary between NIC drivers:

NIC statistics:
     rx_packets: 123456
     tx_packets: 123456
     rx_errors: 0
     tx_errors: 10
     ...
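
Because ethtool -S can print hundreds of counters, it may help to filter for the non-zero ones that look like problems. The sketch below assumes the "name: value" layout shown above and assumes that relevant counter names contain err, drop, or collision; since names are driver-specific, that pattern is a guess and may need adjusting. The function name nonzero_error_counters is hypothetical.

```shell
# nonzero_error_counters
# Read `ethtool -S` output on stdin and print only counters whose name
# contains err, drop, or collision AND whose value is greater than zero.
# The name pattern is an assumption; adjust it for your NIC driver.
nonzero_error_counters() {
    awk -F':' '
        NF >= 2 && $1 ~ /err|drop|collision/ {
            name = $1
            value = $2
            # Trim leading and trailing whitespace from both fields.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (value + 0 > 0) print name ": " value
        }'
}
```

Typical usage would be: ethtool -S eth0 | nonzero_error_counters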

5. Restart Network Interface

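If the errors persist, consider restarting the network interface. This briefly drops all traffic on the interface, and if ifup fails you will lose remote access over it, so make sure console access is available first. The ifdown and ifup commands come from the ifupdown package (Debian and Ubuntu); on other systems, ip link set eth0 down followed by ip link set eth0 up is the closest equivalent: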

sudo ifdown eth0 && sudo ifup eth0

Expected Output

Check the interface status again to ensure it is up and running:

ifconfig eth0

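You should see the interface listed as UP. The flags line should also include RUNNING, which indicates that the link is operational and not just administratively enabled.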
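
When the restart is scripted, the interface state can also be polled from sysfs. This is a sketch under the assumption that the host exposes /sys/class/net/<interface>/operstate; the function names iface_operstate and wait_for_up, the default 10-second timeout, and the optional root-directory argument are all hypothetical.

```shell
# iface_operstate IFACE [SYSFS_ROOT]
# Print the kernel's operational state for IFACE (for example "up" or
# "down"). Prints "missing" and returns 1 if the interface is not found.
iface_operstate() {
    state_file="${2:-/sys/class/net}/$1/operstate"
    if [ -r "$state_file" ]; then
        cat "$state_file"
    else
        echo "missing"
        return 1
    fi
}

# wait_for_up IFACE [TIMEOUT_SECONDS] [SYSFS_ROOT]
# Poll once per second until IFACE reports "up" or the timeout elapses.
# Returns 0 on success, 1 on timeout. Note that some interfaces (such as
# lo) report "unknown" even when working, so this check suits physical NICs.
wait_for_up() {
    w_iface="$1"
    w_timeout="${2:-10}"
    w_root="${3:-/sys/class/net}"
    w_elapsed=0
    while :; do
        if [ "$(iface_operstate "$w_iface" "$w_root")" = "up" ]; then
            return 0
        fi
        if [ "$w_elapsed" -ge "$w_timeout" ]; then
            return 1
        fi
        sleep 1
        w_elapsed=$((w_elapsed + 1))
    done
}
```

For example, wait_for_up eth0 30 would wait up to 30 seconds after the restart before reporting failure.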

6. Check for Network Configuration Issues

Ensure that the network configuration is correct. Check the network configuration file (usually /etc/network/interfaces or /etc/netplan/*.yaml):

cat /etc/network/interfaces

Expected Output

Review the configuration for errors or misconfigurations, such as an incorrect MTU or a forced speed or duplex setting that does not match the switch port.

7. Update Prometheus Configuration

If the network error threshold needs adjustment, update the Prometheus alert expression and reload the configuration:

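Edit the rule file that defines the alert. Alerting rules live in the files listed under rule_files in the Prometheus configuration (usually prometheus.yml), not in prometheus.yml itself: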

- alert: NodeNetworkTransmitErrs
  expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High network transmit errors on {{ $labels.instance }}"
    description: "Network transmit errors are above 1% on {{ $labels.instance }}."
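
Before changing the threshold, it can be useful to see what ratio the host is actually producing. The sketch below computes the same errors-to-packets ratio from two manual counter samples. It is only an approximation of the alert expression, since Prometheus rate() also handles counter resets and extrapolation. The function name tx_error_ratio is hypothetical, and it is an assumption that the sampled values (for example from /sys/class/net/eth0/statistics/tx_errors and tx_packets) match the counters exported by node_exporter.

```shell
# tx_error_ratio ERRS_BEFORE PKTS_BEFORE ERRS_AFTER PKTS_AFTER
# Print (ERRS_AFTER - ERRS_BEFORE) / (PKTS_AFTER - PKTS_BEFORE) as a
# decimal fraction with four digits, or "nan" if no packets were sent
# between the two samples. Compare the result with the 0.01 threshold.
tx_error_ratio() {
    awk -v e1="$1" -v p1="$2" -v e2="$3" -v p2="$4" 'BEGIN {
        dp = p2 - p1
        if (dp <= 0) {
            print "nan"
            exit
        }
        printf "%.4f\n", (e2 - e1) / dp
    }'
}
```

For example, 15 new errors across 1000 new packets gives tx_error_ratio 10 1000 25 2000, which prints 0.0150 and would exceed the 0.01 threshold.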

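Reload the Prometheus configuration. The /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle; otherwise, send SIGHUP to the Prometheus process: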

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

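Prometheus should reload the configuration without errors. A successful reload returns HTTP 200, and the updated rule appears on the Alerts page of the Prometheus web UI. If the rule file is invalid, the reload fails and Prometheus keeps running with the previous configuration.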

Conclusion

By following these steps, you should be able to troubleshoot and resolve the “NodeNetworkTransmitErrs” alert. If the errors persist after restarting the interface and verifying its configuration, investigate the physical layer (cable, transceiver, switch port) and the NIC driver and firmware.