Runbook: BlackboxProbeUnsuccessful Alert

Alert Details

  • Alert Name: BlackboxProbeUnsuccessful
  • Expression: probe_success{job=~".+", nanocosmosGroup=~".+", environment=~".+"} == 0

Description

This alert fires when the probe_success metric equals 0 for any series that carries non-empty job, nanocosmosGroup, and environment labels. A value of 0 means the most recent blackbox probe of that target failed.
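
To see which targets are currently failing, the alert expression can be run directly against the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at http://localhost:9090 (an assumption; substitute your server's address) and that curl is installed. failing_probes is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: run the alert expression against the Prometheus HTTP API.
# The default URL is an assumption; pass your own server address as $1.
failing_probes() {
    prom_url="${1:-http://localhost:9090}"
    # -G sends the URL-encoded query as GET parameters.
    curl -sG "${prom_url}/api/v1/query" \
        --data-urlencode 'query=probe_success{job=~".+", nanocosmosGroup=~".+", environment=~".+"} == 0'
}

# Usage (not executed here):
#   failing_probes http://<prometheus_host>:9090
```

Each entry in the JSON result carries the labels of one failing series, which identifies the target to use in the steps below.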

Possible Causes

  1. Network issues affecting the reachability of the target.
  2. The target service is down or unresponsive.
  3. Misconfiguration of the probe or monitoring tools.
  4. DNS resolution issues.
  5. Firewall or security group rules blocking the probe.

Troubleshooting Steps

1. Check Network Connectivity

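Verify that the target is reachable over the network. Run the check from the host that runs the probe, since that is the path the probe itself uses.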

# Example: Check network connectivity to a target
ping <target_hostname_or_ip>

Expected Output:

PING <target_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <target_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
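
A successful ping only shows that ICMP gets through. The probe usually talks to a specific TCP port, and ICMP may be blocked even when the service is healthy. A minimal sketch of a port check follows, assuming a netcat build (nc) that supports -z and -w is installed on the probing host. check_tcp_port is a hypothetical helper name, and the port in the usage line is a placeholder for whatever your probe module targets.

```shell
# Hypothetical helper: report whether a TCP port accepts connections.
# Assumes a netcat build that supports -z (connect only, send no data)
# and -w (timeout in seconds).
check_tcp_port() {
    host="$1"
    port="$2"
    if nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
        echo "open: ${host}:${port}"
    else
        echo "closed or filtered: ${host}:${port}"
        return 1
    fi
}

# Usage (not executed here):
#   check_tcp_port <target_hostname_or_ip> 443
```

A "closed or filtered" result with a working ping points toward the service itself or a firewall rule rather than general routing.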

2. Verify Target Service Status

Ensure that the target service is running and reachable.

# Example: Check the status of the target service
ssh <target_hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...
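
systemctl can report a service as active while it still returns errors to clients. When the probe uses an HTTP module such as http_2xx, comparing the actual status code is closer to what the probe sees. The sketch below assumes curl is available; http_status and is_2xx are hypothetical helper names, and the URL in the usage lines is a placeholder.

```shell
# Hypothetical helper: print only the final HTTP status code for a URL.
# -L follows redirects; curl prints 000 when no HTTP response was received
# (for example on a timeout or refused connection).
http_status() {
    curl -s -L -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Hypothetical helper: succeed only for status codes 200-299.
is_2xx() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not executed here):
#   code=$(http_status "https://<target_hostname_or_ip>/")
#   if is_2xx "$code"; then echo "OK ($code)"; else echo "FAILED ($code)"; fi
```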

3. Check Probe Configuration

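Ensure that the probe is correctly configured and running. In particular, a blackbox scrape job needs relabel_configs that pass each listed target to the exporter as the target URL parameter and set __address__ to the exporter's own address. Without them, Prometheus scrapes the target directly at /probe instead of asking the exporter to probe it.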

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <target_hostname_or_ip>
...
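
The exporter can also be asked to run a probe on demand, which helps separate a configuration problem from a target problem. The sketch below assumes the blackbox exporter listens on its default port 9115 on localhost (an assumption; adjust to your deployment) and that curl is installed. probe_debug is a hypothetical helper name. The debug=true parameter asks the exporter to return the probe's log output along with the metrics.

```shell
# Hypothetical helper: trigger a single probe through the blackbox exporter.
#   $1 = target to probe
#   $2 = module name (defaults to http_2xx, the module in the config above)
#   $3 = exporter base URL (default is an assumption)
probe_debug() {
    target="$1"
    module="${2:-http_2xx}"
    exporter="${3:-http://localhost:9115}"
    curl -sG "${exporter}/probe" \
        --data-urlencode "target=${target}" \
        --data-urlencode "module=${module}" \
        --data-urlencode "debug=true"
}

# Usage (not executed here):
#   probe_debug <target_hostname_or_ip>
#   probe_debug <target_hostname_or_ip> http_2xx http://<exporter_host>:9115
```

If the manual probe succeeds while the alert keeps firing, the scrape job configuration (module, target list, relabeling) is the more likely culprit.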

4. Review Logs

Check the logs of the target service and the probe for errors.

# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...

# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log

Expected Output:

<timestamp> <log_level> <log_message>
...
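
On a busy host the last 50 lines may be mostly routine messages. The sketch below filters for warnings and errors, assuming the probe logs use a key=value (logfmt-style) layout with a level= field. That layout is an assumption; if your logs look different, adjust the pattern. probe_errors is a hypothetical helper name.

```shell
# Hypothetical helper: keep only warning and error lines from logs on stdin.
# Assumes key=value lines that carry a level= field; matching is
# case-insensitive, so level=error and level=ERROR both match.
probe_errors() {
    grep -iE 'level=(warn|error)'
}

# Usage (not executed here):
#   tail -n 500 /var/log/prometheus/probe.log | probe_errors
```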

5. DNS Resolution Check

Ensure that the DNS resolution for the target is working correctly.

# Example: Check DNS resolution
nslookup <target_hostname>

Expected Output:

Server:         <dns_server>
Address:        <dns_server_ip>

Name:   <target_hostname>
Address: <target_ip>
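
When the probing host and your workstation use different resolvers, it helps to query a specific resolver and to fail loudly on an empty answer. The sketch below assumes dig is installed; resolve_a and check_dns are hypothetical helper names, and the resolver address in the usage lines is a placeholder.

```shell
# Hypothetical helper: print what dig returns for an A lookup,
# optionally through a specific resolver given as $2.
resolve_a() {
    name="$1"
    resolver="${2:-}"
    if [ -n "$resolver" ]; then
        dig +short A "$name" "@${resolver}"
    else
        dig +short A "$name"
    fi
}

# Hypothetical helper: fail when the lookup returns nothing.
check_dns() {
    ips=$(resolve_a "$1" "${2:-}")
    if [ -z "$ips" ]; then
        echo "no A record for $1"
        return 1
    fi
    echo "$ips"
}

# Usage (not executed here):
#   check_dns <target_hostname>
#   check_dns <target_hostname> <dns_server_ip>
```

Different answers from different resolvers suggest a stale cache or a split-horizon DNS setup rather than a problem with the target itself.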

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking firewall or security group rules.
  • Contacting the network or system administrator.