Runbook: HostDown Alert

Alert Details

  • Alert Name: HostDown
  • Expression: sum without (cluster, job) (probe_success{nanocosmosGroup=~".+", environment=~".+"}) == 0

Description

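This alert fires when the sum of successful probes (probe_success), aggregated across the cluster and job labels, equals zero. Only series with non-empty nanocosmosGroup and environment labels are considered. Because sum without (cluster, job) keeps every other label, the sum is evaluated separately for each remaining label combination, typically one per probed host within a group and environment. A result of zero therefore means that no probe of that host succeeded, so the host is considered unreachable.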
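
To see which series currently satisfy the alert condition, the expression can be evaluated directly against the Prometheus HTTP API. The following is a minimal sketch: the server address http://localhost:9090 and the function names hostdown_expr and query_hostdown are assumptions for illustration, not values taken from this runbook.

```shell
# Sketch: evaluate the HostDown expression through the Prometheus HTTP API.
# PROM_URL is an assumed address; point it at your own Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the alert expression exactly as listed under "Alert Details".
hostdown_expr() {
  printf '%s\n' 'sum without (cluster, job) (probe_success{nanocosmosGroup=~".+", environment=~".+"}) == 0'
}

# Run an instant query. Each series in the JSON result is a label set
# for which the alert condition currently holds.
query_hostdown() {
  curl --silent --show-error --get \
    --data-urlencode "query=$(hostdown_expr)" \
    "${PROM_URL}/api/v1/query"
}

# Usage:
#   query_hostdown
```

An empty result list means that no series matches the condition at the moment, which is the expected state once the hosts have recovered.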

Possible Causes

  1. Network issues affecting the reachability of the hosts.
  2. The affected hosts are down or powered off.
  3. Misconfiguration of probes or monitoring tools.
  4. Power supply issues or hardware failures.

Troubleshooting Steps

1. Check Network Connectivity

Verify the network connections to the affected hosts.

# Example: Check network connectivity to a host
ping -c 4 <hostname_or_ip>

Expected Output:

PING <hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
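
When several hosts are affected, checking them one at a time is slow. The sketch below loops over a list of hosts and prints one status line per host. It assumes the Linux iputils version of ping, where -c is the packet count and -W the reply timeout in seconds; the function name check_hosts and the host names in the usage comment are hypothetical.

```shell
# Sketch: report ICMP reachability for several hosts at once.
# Assumes Linux iputils ping (-c = packet count, -W = timeout in seconds).
check_hosts() {
  for host in "$@"; do
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
      echo "$host UP"
    else
      echo "$host DOWN"
    fi
  done
}

# Usage (hypothetical host names):
#   check_hosts host-a.example.com host-b.example.com
```

A DOWN result only shows that ICMP echo failed. A firewall that drops ICMP produces the same output, so treat it as a hint rather than proof that the host is off.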

2. Verify Host Status

Log in to each affected host and confirm that it is up and that the relevant services are running.

# Example: Check the status of a service on a host
ssh <hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...
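
For a quick, script-friendly version of the same check, systemctl is-active prints a single state word instead of the full status page. The sketch below wraps it in ssh; the function name remote_service_state is hypothetical, and it assumes key-based SSH access to the host.

```shell
# Sketch: print the systemd state of one unit on a remote host.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# ConnectTimeout=5 gives up after five seconds if the host does not answer.
remote_service_state() {
  rc=0
  state="$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
    systemctl is-active "$2" 2>/dev/null)" || rc=$?
  # ssh itself exits with 255 when the connection fails.
  if [ "$rc" -eq 255 ]; then
    echo "unreachable"
  else
    echo "${state:-unknown}"
  fi
}

# Usage:
#   remote_service_state <hostname_or_ip> <service_name>
```

Separating "unreachable" from states such as "inactive" or "failed" helps distinguish a network or host problem from a service problem on a host that is otherwise up.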

3. Check Probe Configuration

Ensure that the probes are correctly configured and running.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <hostname_or_ip>
...
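
The metrics_path: /probe and module: [http_2xx] settings match the way a Prometheus blackbox exporter is usually scraped. Assuming that is the setup here, the configuration can be validated with promtool and a single probe can be triggered by hand. This is a sketch: the exporter address http://localhost:9115 and the function names are assumptions, not values from this runbook.

```shell
# Sketch: run one probe by hand and read its result.
# Assumes probes go through a blackbox exporter at BLACKBOX_URL
# (hypothetical default below; adjust to your environment).
BLACKBOX_URL="${BLACKBOX_URL:-http://localhost:9115}"

# Extract the probe_success sample (1 = success, 0 = failure) from
# exporter output read on stdin. Comment lines start with "#" and are skipped.
probe_success_value() {
  awk '$1 == "probe_success" { print $2 }'
}

# Ask the exporter to probe one target with the http_2xx module.
manual_probe() {
  curl --silent --show-error --get \
    --data-urlencode "target=$1" \
    --data-urlencode "module=http_2xx" \
    "${BLACKBOX_URL}/probe" | probe_success_value
}

# Usage:
#   promtool check config /etc/prometheus/prometheus.yml
#   manual_probe <hostname_or_ip>
```

If the manual probe prints 1 while the alert is still firing, the problem is more likely in the scrape configuration or labels than on the host itself.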

4. Review Logs

Check the logs of the affected hosts and probes for errors.

# Example: Review logs
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
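
On a busy service the last hour of logs can be long. The sketch below narrows the output to entries of priority err and above. The function name recent_errors is hypothetical; the journalctl options used (-u, --since, -p, --no-pager) are standard.

```shell
# Sketch: show only error-level (and more severe) entries for one unit.
# -p err filters by priority; --no-pager prints straight to stdout.
# The second argument is optional and defaults to "1 hour ago".
recent_errors() {
  journalctl -u "$1" --since "${2:-1 hour ago}" -p err --no-pager
}

# Usage:
#   recent_errors <service_name>
#   recent_errors <service_name> "30 minutes ago"
```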

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking the hardware for failures.
  • Contacting the network or system administrator.