Runbook: BlackboxProbeUnsuccessful Alert
Alert Details
- Alert Name: BlackboxProbeUnsuccessful
- Expression:
probe_success{job=~".+", nanocosmosGroup=~".+", environment=~".+"} == 0
Description
This alert fires when the probe_success metric equals 0 for any series that carries non-empty job, nanocosmosGroup, and environment labels. A value of 0 means the blackbox probe for that target failed.
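To see which targets are currently failing, the alert expression can be run as an instant query against the Prometheus HTTP API. The following is a minimal sketch, not a definitive procedure: the Prometheus address http://localhost:9090 is an assumption (override it with PROM_URL), and failed_probes is a hypothetical helper name.

```shell
# Assumption: Prometheus base URL; override via the PROM_URL environment variable.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# failed_probes: hypothetical helper that sends the alert expression to the
# instant-query endpoint and prints the raw JSON response.
failed_probes() {
  curl -sS -G "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=probe_success{job=~".+", nanocosmosGroup=~".+", environment=~".+"} == 0'
}

# Usage:
#   failed_probes
```

Each entry under data.result in the response carries the labels of one failing series, which identifies the job and target to investigate in the steps below.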
Possible Causes
- Network issues affecting the reachability of the target.
- The target service is down or unresponsive.
- Misconfiguration of the probe or monitoring tools.
- DNS resolution issues.
- Firewall or security group rules blocking the probe.
Troubleshooting Steps
1. If the alert was triggered by a bintu service (Bintu API, Token, Dashboard), refer to this Runbook instead.
2. Check Network Connectivity
Verify that the target is reachable over the network, ideally from the host that runs the probe.
# Example: Check network connectivity to a target (send 4 packets, then stop)
ping -c 4 <target_hostname_or_ip>
Expected Output:
PING <target_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <target_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
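ping only exercises ICMP, which can be blocked while the service port is open, or succeed while the port is filtered. A TCP check against the probed port is often closer to what the probe does. The sketch below assumes bash (for its /dev/tcp feature) and the coreutils timeout command are available; check_tcp is a hypothetical helper name.

```shell
# check_tcp: hypothetical helper that tries to open a TCP connection.
#   $1 = host, $2 = port, $3 = timeout in seconds (default 5)
check_tcp() {
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>;
  # host and port are passed as $0 and $1 of the inner shell.
  if timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo "open: $1:$2"
  else
    echo "unreachable: $1:$2"
    return 1
  fi
}

# Usage:
#   check_tcp <target_hostname_or_ip> 443
```

A result of "unreachable" while ping succeeds points toward a firewall rule or a service that is not listening, rather than a general network outage.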
3. Verify Target Service Status
Ensure that the target service is running and reachable.
# Example: Check the status of the target service
ssh <target_hostname_or_ip> 'systemctl status <service_name>'
Expected Output:
● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...
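A service can be active in systemd and still answer with errors. If the probe uses an HTTP module such as http_2xx, requesting the probed URL directly shows the status code the probe would see. This is a sketch under that assumption; http_check is a hypothetical helper name and the URL is whatever the probe targets.

```shell
# http_check: hypothetical helper that prints "<status_code> <total_seconds>".
#   $1 = URL to request, $2 = timeout in seconds (default 10)
http_check() {
  curl -sS -o /dev/null --max-time "${2:-10}" \
    -w '%{http_code} %{time_total}\n' "$1"
}

# Usage:
#   http_check https://<target_hostname_or_ip>/
```

A status code outside the 2xx range, or a total time close to the probe timeout, would be consistent with probe_success being 0.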
4. Check Probe Configuration
Ensure that the probe is correctly configured and running.
# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml
Expected Output:
scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - <target_hostname_or_ip>
...
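The probe can also be triggered by hand against the blackbox exporter, which accepts a debug=true parameter on its /probe endpoint and then returns the probe log together with the resulting metrics. The sketch below assumes the exporter listens on http://localhost:9115 (its default port; override with BLACKBOX_URL); probe_debug is a hypothetical helper name.

```shell
# Assumption: blackbox exporter base URL; override via BLACKBOX_URL.
BLACKBOX_URL="${BLACKBOX_URL:-http://localhost:9115}"

# probe_debug: hypothetical helper that runs one probe with debug output.
#   $1 = target to probe, $2 = module name (default http_2xx)
probe_debug() {
  curl -sS -G "${BLACKBOX_URL}/probe" \
    --data-urlencode "target=$1" \
    --data-urlencode "module=${2:-http_2xx}" \
    --data-urlencode "debug=true"
}

# Usage:
#   probe_debug <target_hostname_or_ip>
```

The debug output usually states which stage of the probe failed (DNS lookup, connection, TLS, status code), which narrows down which of the remaining steps to follow.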
5. Review Logs
Check the logs of the target service and the probe for errors.
# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"
Expected Output:
Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log
Expected Output:
<timestamp> <log_level> <log_message>
...
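When the service log is noisy, restricting the journal to error priority and above makes failures easier to spot. A minimal sketch, assuming the service logs to the systemd journal; recent_errors is a hypothetical helper name.

```shell
# recent_errors: hypothetical helper that shows only error-level (and more
# severe) journal entries for one unit.
#   $1 = systemd unit name, $2 = start of the time window (default "1 hour ago")
recent_errors() {
  journalctl -u "$1" -p err --since "${2:-1 hour ago}" --no-pager
}

# Usage:
#   recent_errors <service_name>
#   recent_errors <service_name> "15 minutes ago"
```

An empty result only means the service logged nothing at error level; the full log from the previous example may still contain relevant warnings.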
6. DNS Resolution Check
Ensure that the DNS resolution for the target is working correctly.
# Example: Check DNS resolution
nslookup <target_hostname>
Expected Output:
Server: <dns_server>
Address: <dns_server_ip>
Name: <target_hostname>
Address: <target_ip>
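nslookup queries a DNS server directly, while getent resolves through the system's name service configuration (including /etc/hosts), which is closer to what most local processes see. Comparing both can reveal a stale hosts entry or a resolver misconfiguration. The sketch below assumes getent and awk are available on the host; resolve_host is a hypothetical helper name.

```shell
# resolve_host: hypothetical helper that prints the addresses the system
# resolver returns for a hostname, one per line.
#   $1 = hostname
resolve_host() {
  addrs="$(getent hosts "$1" | awk '{print $1}')"
  if [ -z "$addrs" ]; then
    echo "no address found for $1" >&2
    return 1
  fi
  echo "$addrs"
}

# Usage:
#   resolve_host <target_hostname>
```

If the address printed here differs from the one nslookup reports, the probe host may be resolving the target through a local override rather than DNS.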
Additional Steps
If the issue persists, consider:
- Restarting the affected services or hosts.
- Checking firewall or security group rules between the probe host and the target.
- Contacting the network or system administrator.