Runbook: BlackboxProbeSlow Alert

Alert Details

  • Alert Name: BlackboxProbeSlow
  • Expression: avg_over_time(probe_duration_seconds{job=~".+", nanocosmosGroup=~".+", environment=~".+"}[1m]) >

Description

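This alert fires when the average probe duration (probe_duration_seconds), taken over a 1-minute window, exceeds the configured threshold. The matchers job=~".+", nanocosmosGroup=~".+", and environment=~".+" only require those labels to be present and non-empty, so the alert is evaluated for every probed series that carries a job, a group (nanocosmosGroup), and an environment (environment) label. A firing alert means the blackbox probe is taking longer than expected to complete.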
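
To make the averaging concrete, here is a minimal sketch that mimics the comparison for a handful of samples. The function name window_exceeds, the sample values, and the 1-second threshold in the usage note are hypothetical; the real threshold is whatever the alert rule defines, and the number of samples in the window depends on the scrape interval.

```shell
#!/usr/bin/env bash
# Sketch only: reproduces "mean of the samples in the window > threshold".
# window_exceeds THRESHOLD SAMPLE...
# Prints "firing" if the mean of the samples is above THRESHOLD, "ok" otherwise.
window_exceeds() {
  local threshold=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "no data"
    return 0
  fi
  printf '%s\n' "$@" | awk -v t="$threshold" '
    { sum += $1; n++ }
    END { print ((sum / n > t) ? "firing" : "ok") }'
}
```

For example, window_exceeds 1 0.2 0.4 3.0 reports "firing" because the mean is 1.2 seconds: a single slow probe can pull a short window over the threshold even when the other samples are fast.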

Possible Causes

  1. Network latency or congestion.
  2. High load on the target service.
  3. Suboptimal probe configuration.
  4. Resource constraints on the probing or target system.
  5. DNS resolution delays.

Troubleshooting Steps

1. Check Network Latency

Measure the round-trip latency to the target, ideally from the host that runs the probe.

# Example: Measure network latency to a target (send 5 packets, then stop)
ping -c 5 <target_hostname_or_ip>

Expected Output:

PING <target_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <target_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
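
To compare the ICMP round-trip time with the probe duration, the average can be pulled out of the ping summary. This is a sketch that assumes the summary line has the form "rtt min/avg/max/mdev = a/b/c/d ms" (Linux iputils) or "round-trip min/avg/max/stddev = ..." (BSD/macOS); avg_rtt_ms is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# avg_rtt_ms: read ping output on stdin and print the average RTT in ms.
# Assumes a summary line containing "min/avg/max" followed by "= a/b/c/d ms".
avg_rtt_ms() {
  awk '/min\/avg\/max/ {
    split($0, halves, "= ")     # halves[2] holds "a/b/c/d ms"
    split(halves[2], v, "/")    # v[2] is the average
    print v[2]
  }'
}
```

Usage: ping -c 5 <target_hostname_or_ip> | avg_rtt_ms. If the result is a small fraction of the probe duration, the network path is unlikely to be the main contributor.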

2. Verify Target Service Load

Check the load on the host running the target service to ensure it is not overloaded.

# Example: Check the load on the target service
ssh <target_hostname_or_ip> 'uptime'

Expected Output:

 13:45:01 up 10 days,  3:22,  1 user,  load average: 0.15, 0.10, 0.05
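
A load average is easier to judge relative to the number of CPU cores. The sketch below assumes the Linux uptime format shown above ("load average: x, y, z") and a core count obtained separately, for example with nproc on the target; load_1m and load_per_core are hypothetical helper names.

```shell
#!/usr/bin/env bash
# load_1m: read uptime output on stdin and print the 1-minute load average.
load_1m() {
  sed -n 's/.*load average: *\([0-9.]*\).*/\1/p'
}

# load_per_core LOAD CORES: print LOAD divided by CORES, to two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) exit 1      # refuse to divide by zero
    printf "%.2f\n", load / cores
  }'
}
```

Usage: load_per_core "$(ssh <target_hostname_or_ip> uptime | load_1m)" "$(ssh <target_hostname_or_ip> nproc)". As a rough rule of thumb, a value that stays near or above 1.00 suggests the host is saturated.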

3. Check Probe Configuration

Confirm that the scrape job uses the intended probe module and targets.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <target_hostname_or_ip>
...
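
For an HTTP module such as http_2xx, the exporter's /probe response also reports probe_http_duration_seconds per phase (resolve, connect, tls, processing, transfer), which helps narrow down where the time goes. The sketch below picks the slowest phase from that output; slowest_phase is a hypothetical helper, and the exporter address in the usage note (default port 9115) is an assumption about the deployment.

```shell
#!/usr/bin/env bash
# slowest_phase: read blackbox exporter /probe output on stdin and print the
# phase label with the largest probe_http_duration_seconds value.
slowest_phase() {
  awk -F'[" ]' '
    /^probe_http_duration_seconds[{]phase=/ {
      # With this separator: $2 is the phase name, $NF is the duration.
      if (best == "" || $NF + 0 > max) { max = $NF + 0; best = $2 }
    }
    END { if (best != "") print best }'
}
```

Usage: curl -s 'http://<blackbox_exporter_host>:9115/probe?target=<target_hostname_or_ip>&module=http_2xx' | slowest_phase. A slow "resolve" phase points toward DNS, "connect" or "tls" toward the network path, and "processing" toward the target service itself.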

4. Review Logs

Check the logs of the target service and the probe for any errors or warnings.

# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...

# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log

Expected Output:

<timestamp> <log_level> <log_message>
...
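
To get a quick sense of how noisy a log is before reading it line by line, the errors and warnings can be counted. This is a rough sketch: count_problem_lines is a hypothetical helper, and matching the keywords "error" and "warn" case-insensitively is an assumption about the log format, which the output above shows only as a placeholder.

```shell
#!/usr/bin/env bash
# count_problem_lines: read log lines on stdin and print how many of them
# mention "error" or "warn" (case-insensitive keyword match).
count_problem_lines() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that from
  # being treated as a failure.
  grep -Eci 'error|warn' || true
}
```

Usage: journalctl -u <service_name> --since "1 hour ago" | count_problem_lines. For journald-managed services, journalctl's -p warning option filters by priority directly and does not depend on keywords.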

5. DNS Resolution Check

Ensure that DNS resolution for the target is working correctly and not causing delays.

# Example: Check DNS resolution
nslookup <target_hostname>

Expected Output:

Server:         <dns_server>
Address:        <dns_server_ip>

Name:   <target_hostname>
Address: <target_ip>
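
nslookup confirms that the name resolves but does not show how long the lookup took. As a sketch, the lookup time can be read from dig, which prints a line of the form ";; Query time: N msec"; dns_query_ms is a hypothetical helper name, and the sketch assumes dig is installed on the probing host.

```shell
#!/usr/bin/env bash
# dns_query_ms: read dig output on stdin and print the query time in
# milliseconds, taken from the ";; Query time: N msec" line.
dns_query_ms() {
  sed -n 's/^;; Query time: \([0-9]*\) msec.*/\1/p'
}
```

Usage: dig <target_hostname> | dns_query_ms. Repeated queries may be answered from a resolver cache, so the first lookup after a quiet period is usually the most representative.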

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking for resource constraints on the probing or target system.
  • Contacting the network or system administrator for further investigation.