Runbook: InstanceDown Alert

Alert Details

Alert Name: InstanceDown
Expression: sum without (cluster, job) (probe_success{job=~"blackbox.*", nanocosmosGroup=~".+", environment=~".+", component=~".+", instanceNumber!=""}) ==

Description

This alert covers only instances or endpoints that have an instance number (instanceNumber!=""), such as t2devteam-edge-eu-hc-nbg1-01 or t3dev-edge-ma-lnd-mia-01. It fires when the sum of successful probes (probe_success) across all hosts in a given group (nanocosmosGroup) and environment (environment) equals zero, which means every host in that group and environment is unreachable.
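To see which individual probes are failing behind the alert, it can help to query Prometheus directly. The following is a minimal sketch: the helper name build_probe_query and the PROM_URL address are assumptions for illustration, the label names come from the alert expression, and the comparison with zero follows the description above.

```shell
# Build a PromQL query that lists every blackbox probe currently failing
# for one group ($1) and one environment ($2).
# build_probe_query is a hypothetical helper, not part of any existing tooling.
build_probe_query() {
  printf 'probe_success{job=~"blackbox.*", nanocosmosGroup="%s", environment="%s", instanceNumber!=""} == 0\n' "$1" "$2"
}

# Usage (PROM_URL is an assumed address; adjust to your Prometheus server):
#   PROM_URL="http://localhost:9090"
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(build_probe_query <group> <environment>)"
```

Each series returned by this query corresponds to one target whose last probe failed.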

Possible Causes

Network issues affecting the reachability of the hosts.
Misconfiguration of probes or monitoring tools.
Power supply issues or hardware failures.

Troubleshooting Steps

1. Check Network Connectivity

Verify the network connections to the affected hosts.

# Example: Check network connectivity to a host
ping <hostname_or_ip>

Expected Output:

PING <hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
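Because the alert means that every host in a group is down, it is usually quicker to check all hosts in one pass. A minimal sketch, assuming Linux iputils ping (where -c is the packet count and -W the reply timeout in seconds); check_hosts and PING_CMD are hypothetical names introduced here:

```shell
# PING_CMD can be overridden (for example in tests); defaults to the system ping.
PING_CMD="${PING_CMD:-ping}"

# Ping each host 3 times with a 2-second reply timeout and print one line per host.
# Assumes Linux iputils ping; other ping implementations interpret -W differently.
check_hosts() {
  for host in "$@"; do
    if "$PING_CMD" -c 3 -W 2 "$host" >/dev/null 2>&1; then
      echo "$host reachable"
    else
      echo "$host UNREACHABLE"
    fi
  done
}

# Usage:
#   check_hosts <hostname_or_ip_1> <hostname_or_ip_2>
```

A host that drops ICMP can still be up, so treat UNREACHABLE as a hint and not as proof.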

2. Verify Host Status

Ensure that the hosts are running and reachable.

# Example: Check the status of a host
ssh <hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...
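When a host is down, a plain ssh call can hang or prompt for a password. The sketch below bounds the wait and reduces the result to a single line. The names remote_service_state and SSH_CMD are hypothetical; the ssh options BatchMode and ConnectTimeout and the systemctl is-active subcommand are standard.

```shell
# SSH_CMD can be overridden (for example in tests); defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

# Print "<host>: <service> is <state>" based on `systemctl is-active`.
# BatchMode=yes prevents password prompts; ConnectTimeout=5 bounds the wait
# when the host is down. If SSH itself fails, the state is reported as "unknown".
remote_service_state() {
  host="$1"
  service="$2"
  state="$("$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$host" \
    "systemctl is-active $service" 2>/dev/null)" || true
  echo "$host: $service is ${state:-unknown}"
}

# Usage:
#   remote_service_state <hostname_or_ip> <service_name>
```

A state of "unknown" means the SSH connection itself failed, which points back to the network checks in step 1.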

3. Check Probe Configuration

Ensure that the probes are correctly configured and running.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <hostname_or_ip>
...
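Reading the configuration does not show whether a probe actually succeeds. One way to check is to call the blackbox exporter's /probe endpoint by hand and read probe_success from the response. This is a sketch under assumptions: the exporter host, the default port 9115, and the helper name probe_result are not taken from this runbook.

```shell
# Read blackbox exporter /probe output on stdin and print the probe_success value
# (1 = probe succeeded, 0 = probe failed). Comment lines starting with '#' are skipped
# because their first field is '#', not the metric name.
# probe_result is a hypothetical helper name.
probe_result() {
  awk '$1 == "probe_success" { print $2 }'
}

# Usage (exporter host and port 9115 are assumptions; adjust to your setup):
#   curl -s 'http://<blackbox_host>:9115/probe?target=<hostname_or_ip>&module=http_2xx' | probe_result
#
# Validate the Prometheus configuration file syntax:
#   promtool check config /etc/prometheus/prometheus.yml
```

If the manual probe returns 1 while the alert is still firing, the scrape or relabeling configuration is the more likely culprit than the host itself.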

4. Review Logs

Check the logs of the affected hosts and probes for errors.

# Example: Review logs
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
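An hour of logs can be long, so a quick count of suspicious lines helps decide where to look first. A minimal sketch: count_error_lines is a hypothetical helper, and the keyword list is an assumption that should be tuned to the service in question.

```shell
# Count lines on stdin that mention an error, failure, or timeout (case-insensitive).
# grep exits non-zero when nothing matches, so '|| true' keeps the helper from
# failing in scripts that run with 'set -e'; the printed count is still 0.
count_error_lines() {
  grep -ciE 'error|fail|timeout' || true
}

# Usage:
#   journalctl -u <service_name> --since "1 hour ago" --no-pager | count_error_lines
#
# Alternatively, let journalctl filter by priority (err and more severe):
#   journalctl -u <service_name> --since "1 hour ago" -p err --no-pager
```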

Additional Steps

If the issue persists, consider:

Restarting the affected services or hosts.
Checking the hardware for failures.
Contacting the network or system administrator.
