Runbook: EndpointDown Alert

Alert Details

  • Alert Name: EndpointDown
  • Expression: sum without (cluster, job) (probe_success{job=~"blackbox.*", nanocosmosGroup=~".+", environment=~".+", component=~".+", instanceNumber=""}) == 0

Description

This alert covers only endpoints that do not have an instance number (instanceNumber=""), such as Bintu API, Token, and Dashboard. It fires when the sum of successful probes (probe_success) for a specific instance with a given group (nanocosmosGroup), environment (environment), and component (component) equals zero. A sum of zero means that no probe reached the endpoint, so all hosts serving it in this group and environment should be treated as unreachable.
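
To see which series are behind a firing alert, the underlying probe_success values can be fetched from the Prometheus HTTP API. The following is a minimal sketch, not part of the original runbook: the PROM_URL default and the helper names build_query and query_probes are assumptions, and the group, environment, and component values must be taken from the alert labels.

```shell
# Assumption: Prometheus is reachable at this URL; override PROM_URL as needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the selector used by the alert for one group/environment/component.
# Arguments: <nanocosmosGroup> <environment> <component>
build_query() {
  printf 'probe_success{job=~"blackbox.*", nanocosmosGroup="%s", environment="%s", component="%s", instanceNumber=""}' "$1" "$2" "$3"
}

# Send the selector to the instant-query endpoint and print the raw JSON.
# A value of "0" for a series means that probe failed.
query_probes() {
  curl -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$(build_query "$1" "$2" "$3")"
}

# Usage (hypothetical label values):
#   query_probes mygroup prod api
```

Listing the individual series shows whether every prober reports 0 or whether some series are simply missing.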

Possible Causes

  1. Network issues affecting the reachability of the hosts.
  2. All hosts in the group are down or powered off.
  3. Misconfiguration of probes or monitoring tools.
  4. Power supply issues or hardware failures.

Troubleshooting Steps

1. If the alert is triggered by a Bintu service (Bintu API, Token, Dashboard), refer to this Runbook instead.

2. Check Network Connectivity

Verify the network connections to the affected hosts.

# Example: Check network connectivity to a host
ping <hostname_or_ip>

Expected Output:

PING <hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
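
A successful ping only confirms ICMP reachability. The probe configuration shown in this runbook uses the http_2xx module, so it may also help to request the endpoint over HTTP. The sketch below assumes the endpoint is served over HTTP(S); the helper names is_2xx and check_endpoint and the 10-second timeout are illustrative choices.

```shell
# Return success if the given HTTP status code is in the 2xx range.
is_2xx() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Request a URL and report its status code.
# curl prints 000 when no HTTP response was received (DNS, connect, or timeout failure).
check_endpoint() {
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
  if is_2xx "$code"; then
    echo "OK $code"
  else
    echo "FAIL $code"
  fi
}

# Usage:
#   check_endpoint https://<hostname_or_ip>/
```

If ping succeeds but check_endpoint reports FAIL, the host is up and the problem is more likely in the service or in a layer in front of it.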

3. Verify Host Status

Ensure that the hosts are running and reachable.

# Example: Check the status of a host
ssh <hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...

4. Check Probe Configuration

Ensure that the probes are correctly configured and running.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <hostname_or_ip>
...
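
Reading the configuration does not show whether a probe actually succeeds. One way to check is to trigger a probe manually through the blackbox exporter's /probe endpoint and read the probe_success line from its output. This is a sketch under assumptions: BLACKBOX_URL defaults to the exporter's default port 9115 on localhost, and the helper names parse_probe_success and manual_probe are hypothetical.

```shell
# Assumption: the blackbox exporter runs on its default port; override as needed.
BLACKBOX_URL="${BLACKBOX_URL:-http://localhost:9115}"

# Read exporter output on stdin and print the probe_success value (1 or 0).
# Comment lines such as "# HELP probe_success ..." are skipped because
# their first field is "#".
parse_probe_success() {
  awk '$1 == "probe_success" { print $2 }'
}

# Run a one-off probe of a target with the given module (default: http_2xx).
manual_probe() {
  curl -sG "${BLACKBOX_URL}/probe" \
    --data-urlencode "target=$1" \
    --data-urlencode "module=${2:-http_2xx}" | parse_probe_success
}

# Usage:
#   manual_probe <hostname_or_ip>
# Adding debug=true to the /probe request returns the exporter's probe log.
```

A result of 1 for a target that the alert reports as down suggests a labeling or scrape configuration problem rather than an outage.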

5. Review Logs

Check the logs of the affected hosts and probes for errors.

# Example: Review logs
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
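
On busy services the last hour of logs can be long. As a rough sketch, the output can be narrowed to lines that mention common failure keywords. The keyword list and the helper name filter_failures are assumptions and should be adapted to the services involved.

```shell
# Keep only log lines that mention common failure keywords (case-insensitive).
# "|| true" keeps the exit status at zero when nothing matches.
filter_failures() {
  grep -iE 'error|fail|timeout|refused|unreachable' || true
}

# Usage:
#   journalctl -u <service_name> --since "1 hour ago" --no-pager | filter_failures
# Alternatively, let journalctl filter by priority:
#   journalctl -u <service_name> -p err --since "1 hour ago"
```

An empty result does not prove the service is healthy; it only means none of the chosen keywords appeared in that time window.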

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking the hardware for failures.
  • Contacting the network or system administrator.