InstanceDown
Runbook: InstanceDown Alert
Alert Details
- Alert Name: InstanceDown
- Expression:
sum without (cluster, job) (probe_success{job=~"blackbox.*", nanocosmosGroup=~".+", environment=~".+", component=~".+", instanceNumber!=""}) == 0
Description
This alert only considers instances/endpoints that have an instance number (instanceNumber!=""), such as t2devteam-edge-eu-hc-nbg1-01 or t3dev-edge-ma-lnd-mia-01.
This alert is triggered when the sum of successful probes (probe_success) for all hosts in a specific group (nanocosmosGroup) and environment (environment) is equal to zero. This indicates that all hosts in this group and environment are unreachable.
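To make the aggregation concrete, the following Python sketch mimics how the expression evaluates on a handful of made-up samples. It assumes the comparison is `== 0`, as the description states. The sample label values (clusters `a` and `b`, the job `blackbox-http`, the group and component `edge`, the environment `dev`, the instance number `01`) are hypothetical and serve only as illustration. Note that `sum without (cluster, job)` keeps every other label, so series are summed together only when they differ in nothing but `cluster` and `job`; which series end up in one group therefore depends on the labels present in your setup.

```python
import re
from collections import defaultdict

# Regex matchers from the alert expression. PromQL regex matchers are fully
# anchored, which re.fullmatch reproduces.
REGEX_MATCHERS = {
    "job": r"blackbox.*",
    "nanocosmosGroup": r".+",
    "environment": r".+",
    "component": r".+",
}


def selected(labels):
    """Return True if a series passes the selector of the alert expression."""
    for name, pattern in REGEX_MATCHERS.items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    # instanceNumber!="" drops series where the label is missing or empty.
    return labels.get("instanceNumber", "") != ""


def firing_groups(samples):
    """Mimic `sum without (cluster, job) (...) == 0` on (labels, value) pairs."""
    sums = defaultdict(float)
    for labels, value in samples:
        if not selected(labels):
            continue
        # Drop cluster and job; all remaining labels form the group key.
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k not in ("cluster", "job")))
        sums[key] += value
    return [dict(key) for key, total in sums.items() if total == 0]


# Hypothetical samples: one endpoint probed from two clusters, one from one.
base = {"job": "blackbox-http", "nanocosmosGroup": "edge",
        "environment": "dev", "component": "edge", "instanceNumber": "01"}
samples = [
    ({**base, "cluster": "a", "instance": "t2devteam-edge-eu-hc-nbg1-01"}, 0.0),
    ({**base, "cluster": "b", "instance": "t2devteam-edge-eu-hc-nbg1-01"}, 1.0),
    ({**base, "cluster": "a", "instance": "t3dev-edge-ma-lnd-mia-01"}, 0.0),
]

for group in firing_groups(samples):
    print(group["instance"])
```

In this example only t3dev-edge-ma-lnd-mia-01 is reported: the first endpoint still has one successful probe from cluster `b`, so its sum is 1 and the comparison filters it out.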
Possible Causes
- Network issues affecting the reachability of the hosts.
- Misconfiguration of probes or monitoring tools.
- Power supply issues or hardware failures.
Troubleshooting Steps
1. Check Network Connectivity
Verify the network connections to the affected hosts.
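The original command for this step is not available here. As a rough substitute, the sketch below tests whether a TCP connection to each affected host can be opened. The helper names `tcp_reachable` and `check_hosts` are hypothetical, and port 443 in the commented example is an assumption; use whichever port the probe actually targets.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_hosts(hosts, port):
    """Map each host name to its TCP reachability on the given port."""
    return {host: tcp_reachable(host, port) for host in hosts}


# Example call (host names from the alert description, port is an assumption):
# check_hosts(["t2devteam-edge-eu-hc-nbg1-01", "t3dev-edge-ma-lnd-mia-01"], 443)
```

A result of False for every host in the group points towards a network or host-level problem rather than a single misbehaving service.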
2. Verify Host Status
Ensure that the hosts are running and reachable.
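One way to see which hosts are currently failing is to ask Prometheus directly. The sketch below runs an instant query through the Prometheus HTTP API (`/api/v1/query`) and lists the instances whose `probe_success` value is 0. The server address `PROMETHEUS_URL` is a placeholder, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with the address of your Prometheus server.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def failing_instances(response):
    """Return the instance labels whose sample value is 0 in an instant-query response."""
    failing = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value is encoded as a string
        if float(value) == 0:
            failing.append(series["metric"].get("instance", "<no instance label>"))
    return sorted(failing)


def query_prometheus(expr, base_url=PROMETHEUS_URL):
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Example call (requires network access to Prometheus):
# expr = 'probe_success{job=~"blackbox.*", instanceNumber!=""}'
# print(failing_instances(query_prometheus(expr)))
```

Splitting the parsing from the HTTP call keeps `failing_instances` usable on a saved API response as well.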
3. Check Probe Configuration
Ensure that the probes are correctly configured and running.
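Since the alert matches jobs named `blackbox.*`, the probes are presumably served by the Prometheus Blackbox exporter. Under that assumption, a probe can be triggered by hand through the exporter's `/probe` endpoint and the `probe_success` line read from the response. The exporter address and the module name below are assumptions: 9115 is the exporter's default port, and `http_2xx` is the module name used in its example configuration.

```python
import urllib.parse

# Assumption: replace with the address of your Blackbox exporter.
BLACKBOX_URL = "http://blackbox.example.internal:9115"


def probe_url(target, module="http_2xx", debug=False, base_url=BLACKBOX_URL):
    """Build the URL that asks the exporter to probe `target` with `module`."""
    params = {"target": target, "module": module}
    if debug:
        # debug=true makes the exporter include probe logs in the response.
        params["debug"] = "true"
    return f"{base_url}/probe?" + urllib.parse.urlencode(params)


def parse_probe_success(metrics_text):
    """Return True/False from the probe_success line, or None if it is absent."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1
    return None


# Example: open this URL with curl or a browser and inspect the response.
# print(probe_url("t2devteam-edge-eu-hc-nbg1-01", debug=True))
```

If a manual probe succeeds while the alert keeps firing, the scrape configuration in Prometheus (targets, module, relabeling) is the next place to look.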
4. Review Logs
Check the logs of the affected hosts and probes for errors.
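Where the logs live and how they are formatted is not specified here, so the following is only a generic sketch: it scans text lines for common failure keywords and returns the matching lines with their line numbers. The keyword list and the log path in the commented example are assumptions to adapt.

```python
import re

# Assumption: these keywords are a starting point, not an exhaustive list.
ERROR_PATTERN = re.compile(
    r"\b(error|fail(?:ed|ure)?|timeout|timed out|refused|unreachable)\b",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines that match ERROR_PATTERN."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if ERROR_PATTERN.search(line)]


# Example call (hypothetical log path):
# with open("/var/log/syslog", encoding="utf-8", errors="replace") as log:
#     for number, text in suspicious_lines(log):
#         print(f"{number}: {text}")
```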
Additional Steps
If the issue persists, consider:
- Restarting the affected services or hosts.
- Checking the hardware for failures.
- Contacting the network or system administrator.