Runbook: InstanceDown Alert
Alert Details
- Alert Name: InstanceDown
- Expression:
sum without (cluster, job) (probe_success{job=~"blackbox.*", nanocosmosGroup=~".+", environment=~".+", component=~".+", instanceNumber!=""}) == 0
Description
This alert only covers instances/endpoints that have an instance number (instanceNumber!=""), such as t2devteam-edge-eu-hc-nbg1-01 or t3dev-edge-ma-lnd-mia-01.
The alert fires when the sum of successful probes (probe_success), aggregated across the cluster and job labels, equals zero for the hosts in a given group (nanocosmosGroup) and environment (environment). A sum of zero means that no probe succeeded, so all affected hosts in that group and environment are unreachable.
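The condition can be illustrated with a small sketch. It assumes the probe_success samples that the expression adds together (one value per cluster/job, each 0 or 1) arrive one per line on standard input; the function name instance_down is hypothetical and not part of any tooling.

```shell
# Sketch: mimic the "sum == 0" condition for one set of probe_success samples.
# Input: one sample value (0 or 1) per line on stdin.
instance_down() {
  awk '
    { sum += $1; n++ }
    END {
      if (n == 0) {
        # No samples at all: the expression returns nothing, so no alert.
        print "no-data"
      } else if (sum == 0) {
        # Every probe failed.
        print "firing"
      } else {
        # At least one probe succeeded.
        print "ok"
      }
    }'
}
```

A single successful probe from any cluster or job keeps the sum above zero, so the alert stays silent until every probe fails.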
Possible Causes
- Network issues affecting the reachability of the hosts.
- Misconfiguration of probes or monitoring tools.
- Power supply issues or hardware failures.
Troubleshooting Steps
1. Check Network Connectivity
Verify the network connections to the affected hosts.
# Example: Check network connectivity to a host
ping -c 4 <hostname_or_ip>
Expected Output:
PING <hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
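When several hosts are affected, the same check can be looped. This is a minimal sketch, assuming Linux iputils ping (-c sets the packet count, -W the reply timeout in seconds); the function name check_hosts is hypothetical.

```shell
# Sketch: ping each host passed as an argument and report which ones answer.
# Returns non-zero if at least one host did not reply.
check_hosts() {
  local host failed
  failed=0
  for host in "$@"; do
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
      echo "$host reachable"
    else
      echo "$host UNREACHABLE"
      failed=1
    fi
  done
  return "$failed"
}
```

Example call: check_hosts <hostname_or_ip_1> <hostname_or_ip_2>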
2. Verify Host Status
Ensure that the hosts are running and reachable.
# Example: Check the status of a host
ssh <hostname_or_ip> 'systemctl status <service_name>'
Expected Output:
● <service_name>.service - <Service Description>
     Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
     Active: active (running) since <date>; <time> ago
...
3. Check Probe Configuration
Ensure that the probes are correctly configured and running.
# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml
Expected Output:
scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <hostname_or_ip>
...
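To see what a probe actually returns, the blackbox exporter can be queried directly and its probe_success line extracted. This is a sketch under assumptions: the exporter is taken to listen on its default port 9115 on localhost, the module is http_2xx as in the example configuration, and the function name probe_result is hypothetical.

```shell
# Sketch: extract the probe_success value from blackbox exporter /probe output.
# Reads the exporter's text-format metrics on stdin and prints 1 (success),
# 0 (failure), or nothing if the metric is absent.
probe_result() {
  awk '$1 == "probe_success" { print $2 }'
}

# Possible usage (exporter address is an assumption, target is a placeholder):
#   curl -s 'http://localhost:9115/probe?module=http_2xx&target=<hostname_or_ip>' | probe_result
# The Prometheus configuration itself can be validated with:
#   promtool check config /etc/prometheus/prometheus.yml
```

A result of 0 from a direct probe points at the target or the network path; a result of 1 while the alert still fires suggests a scrape or relabeling problem on the Prometheus side.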
4. Review Logs
Check the logs of the affected hosts and probes for errors.
# Example: Review logs
journalctl -u <service_name> --since "1 hour ago"
Expected Output:
Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
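On a busy host the full log can be noisy, so it may help to restrict the output to error-level entries. A minimal sketch using standard journalctl options (-u, --since, -p, --no-pager); the function name recent_errors and the default window of one hour are assumptions chosen for illustration.

```shell
# Sketch: show only entries of priority "err" or worse for one systemd unit.
# $1 = unit name, $2 = optional start of the time window (default: 1 hour ago).
recent_errors() {
  local unit since
  unit="$1"
  since="${2:-1 hour ago}"
  journalctl -u "$unit" --since "$since" -p err --no-pager
}
```

Example call: recent_errors <service_name> "30 minutes ago"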
Additional Steps
If the issue persists, consider:
- Restarting the affected services or hosts.
- Checking the hardware for failures.
- Contacting the network or system administrator.