Runbook: NodeNetworkReceiveErrs

Alert Details

Alert Name: NodeNetworkReceiveErrs
Expression: rate(node_network_receive_errs_total{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[2m]) / rate(node_network_receive_packets_total{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}[2m]) >

Description

This alert fires when the ratio of network receive errors to received packets, each computed as a per-second rate over a 2-minute window, exceeds the configured threshold. It indicates a potential problem with a network interface, or with the path leading to it, on one or more instances.
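As a rough illustration of what the expression computes, the sketch below derives the error ratio from two hand-taken counter readings. The function name rx_error_ratio and the sample figures are made up for illustration; Prometheus's rate() also handles counter resets and extrapolation, which this sketch does not.

```shell
# Hypothetical helper: ratio of new receive errors to new received packets
# between two counter readings (the same idea as dividing the two rate() terms).
# Usage: rx_error_ratio ERRS_BEFORE ERRS_AFTER PKTS_BEFORE PKTS_AFTER
rx_error_ratio() {
  awk -v e0="$1" -v e1="$2" -v p0="$3" -v p1="$4" 'BEGIN {
    dp = p1 - p0
    if (dp <= 0) { print "n/a"; exit }   # no packets arrived in the window
    printf "%.4f\n", (e1 - e0) / dp
  }'
}

# Illustrative numbers only: 30 new errors over 2000 new packets.
rx_error_ratio 10 40 120000 122000   # prints 0.0150
```

A result of 0.0150 means 1.5% of the packets received in the window were counted as errors.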

Possible Causes

Network interface hardware issues
Network configuration errors
High network traffic causing packet loss
Faulty network cables or connections
Driver or firmware issues

Troubleshooting Steps

1. Check Current Network Errors

Use the following command to check the current network errors on the affected instance (on systems without the net-tools package, ip -s link reports the same counters):

ifconfig -a

Expected Output

You should see output similar to this, with error counts for each network interface:

eth0      Link encap:Ethernet  HWaddr 00:0c:29:68:8c:7b  
          inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123456 errors:10 dropped:0 overruns:0 frame:0
          TX packets:123456 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12345678 (12.3 MB)  TX bytes:12345678 (12.3 MB)

2. Identify the Affected Interface

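Determine which network interface is experiencing high receive errors. In the example above, eth0 has 10 receive errors. The counters are cumulative, so run the command twice a short interval apart and look for the interface whose RX errors value is still increasing.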
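On Linux the same counters are exposed as files under /sys/class/net, which makes it easy to list only the interfaces that have recorded receive errors. The following is a minimal sketch; the function name list_rx_errors and its optional directory argument are assumptions made for illustration.

```shell
# Hypothetical helper: print "<interface> <rx_errors>" for every interface
# whose receive-error counter is non-zero.
# The optional argument (default /sys/class/net) exists only so the function
# can be tried against a test directory.
list_rx_errors() {
  base="${1:-/sys/class/net}"
  for dir in "$base"/*; do
    file="$dir/statistics/rx_errors"
    [ -r "$file" ] || continue          # skip entries without statistics
    count=$(cat "$file")
    if [ "$count" -gt 0 ]; then
      printf '%s %s\n' "$(basename "$dir")" "$count"
    fi
  done
}

list_rx_errors
```

An empty result means no interface has counted a receive error since its counters were last reset.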

3. Check Network Interface Statistics

To get more detailed statistics for the affected interface, use:

ethtool -S eth0

Replace eth0 with the actual interface name.

Expected Output

This command provides detailed statistics for the network interface:

NIC statistics:
     rx_packets: 123456
     tx_packets: 123456
     rx_errors: 10
     tx_errors: 0
     ...
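The full ethtool output can run to hundreds of counters, most of them zero. As a sketch, the filter below keeps only non-zero counters whose names suggest a receive-side problem. The function name nonzero_rx_problems is hypothetical, and counter names are driver-specific, so the name pattern is a guess that will need adapting.

```shell
# Hypothetical helper: read "name: value" lines (as printed by ethtool -S) on
# stdin and print only non-zero counters that look like receive-side problems.
# The name pattern is an assumption; drivers name their counters differently.
nonzero_rx_problems() {
  awk -F': *' '
    $1 ~ /rx/ && $1 ~ /(err|drop|crc|miss|fifo|over)/ && $2 + 0 > 0 {
      sub(/^[ \t]+/, "", $1)            # strip the leading indentation
      print $1 ": " $2
    }'
}

# Example (replace eth0 with the actual interface name):
#   sudo ethtool -S eth0 | nonzero_rx_problems
```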

4. Restart Network Interface

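If the errors are persistent, consider restarting the network interface. This briefly interrupts all traffic on the interface, so run it from a console or over a different interface than the one being restarted: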

sudo ifdown eth0 && sudo ifup eth0

Expected Output

Check the interface status again to ensure it is up and running:

ifconfig eth0

You should see the interface listed as UP.
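On hosts where ifdown and ifup are not installed, the ip tool from iproute2 can toggle the link instead, and the kernel's view of the link state can be read from sysfs. The sketch below assumes a Linux host; the function name link_is_up is hypothetical. Note that ip link only toggles the link and does not re-run the configuration that ifup applies, so routes may need to be restored afterwards, and some virtual interfaces report "unknown" rather than "up".

```shell
# Hypothetical helper: succeed only if the kernel reports the link as "up".
# The second argument (default /sys/class/net) is only there for testing.
link_is_up() {
  state_file="${2:-/sys/class/net}/$1/operstate"
  [ -r "$state_file" ] && [ "$(cat "$state_file")" = "up" ]
}

# Example (commented out because it interrupts traffic on eth0):
#   sudo ip link set dev eth0 down && sudo ip link set dev eth0 up
#   link_is_up eth0 && echo "eth0 is up"
```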

5. Check for Network Configuration Issues

Ensure that the network configuration is correct. Check the network configuration file (usually /etc/network/interfaces or /etc/netplan/*.yaml):

cat /etc/network/interfaces

Expected Output

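Review the configuration for the affected interface and confirm there are no errors or misconfigurations, paying particular attention to the address, the MTU, and any manual speed or duplex overrides.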

6. Check for Hardware Issues

Network receive errors can sometimes be caused by hardware issues. Check the physical network connections and cables. If possible, replace the network cable and test again.

Additional Steps

1. Update Prometheus Configuration

If the network error threshold needs adjustment, update the alert expression and reload the Prometheus configuration.

Edit the rule file that defines the alert (a file listed under rule_files in prometheus.yml; alerting rules are not written in prometheus.yml itself):

- alert: NodeNetworkReceiveErrs
  expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High network receive errors on {{ $labels.instance }}"
    description: "Network receive errors are above 1% on {{ $labels.instance }}."

Reload Prometheus configuration:

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

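Prometheus should reload the configuration without errors, and the alert should be updated accordingly. The /-/reload endpoint is only available when Prometheus was started with the --web.enable-lifecycle flag; otherwise, send SIGHUP to the Prometheus process to trigger the reload.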
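A syntax error in the rule file is easier to catch before the reload than after it. As a sketch, the helper below runs promtool check rules on the file and only then posts to the reload endpoint. The function name safe_reload is hypothetical, and the rule file path and URL in the example are placeholders, not values taken from this environment.

```shell
# Hypothetical helper: validate a rule file with promtool, then ask
# Prometheus to reload. Both arguments are placeholders to fill in.
# Usage: safe_reload RULE_FILE BASE_URL
safe_reload() {
  rule_file="$1"
  base_url="$2"
  command -v promtool >/dev/null 2>&1 || {
    echo "promtool not found" >&2
    return 1
  }
  promtool check rules "$rule_file" || return 1
  # -f makes curl exit non-zero on an HTTP error response.
  curl -fsS -X POST "$base_url/-/reload"
}

# Example (placeholder path and host):
#   safe_reload /path/to/alert_rules.yml http://<prometheus_host>:<prometheus_port>
```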

2. Monitor Network Performance

Continuously monitor the network performance to ensure that the issue does not recur. Use tools like iftop or nload to monitor real-time network traffic:

sudo apt-get install iftop
sudo iftop -i eth0

Expected Output

You should see real-time network traffic statistics for the specified interface.
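iftop and nload show throughput rather than error counts. To watch the quantity the alert is based on, the kernel counters can be sampled twice and compared. This is a minimal sketch assuming a Linux host; the function name sample_rx_ratio and its optional third argument are made up for illustration.

```shell
# Hypothetical helper: sample rx_errors and rx_packets twice, INTERVAL seconds
# apart, and print the error ratio for that window.
# Usage: sample_rx_ratio IFACE [INTERVAL_SECONDS] [SYSFS_BASE]
sample_rx_ratio() {
  iface="$1"
  interval="${2:-10}"
  stats="${3:-/sys/class/net}/$iface/statistics"
  e0=$(cat "$stats/rx_errors") || return 1
  p0=$(cat "$stats/rx_packets") || return 1
  sleep "$interval"
  e1=$(cat "$stats/rx_errors") || return 1
  p1=$(cat "$stats/rx_packets") || return 1
  awk -v de="$((e1 - e0))" -v dp="$((p1 - p0))" 'BEGIN {
    if (dp <= 0) { print "no packets received"; exit }
    printf "%d errors / %d packets = %.4f\n", de, dp, de / dp
  }'
}

# Example: measure eth0 over a 120-second window, matching the alert's [2m].
#   sample_rx_ratio eth0 120
```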

By following these steps, you should be able to troubleshoot and resolve the “NodeNetworkReceiveErrs” alert. If the issue persists, further investigation into the specific network interface and its configuration may be necessary.
