Runbook: BlackboxProbeSlow Alert
Alert Details
- Alert Name: BlackboxProbeSlow
- Expression:
avg_over_time(probe_duration_seconds{job=~".+", nanocosmosGroup=~".+", environment=~".+"}[1m]) >
Description
This alert fires when the average probe duration (probe_duration_seconds) over a 1-minute window exceeds the configured threshold. The label matchers use =~".+", so the alert applies to every series that has a non-empty job, group (nanocosmosGroup), and environment (environment) label. A firing alert indicates that a blackbox probe is taking longer than expected to complete.
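To see the value the alert is evaluating, the left-hand side of the expression can be run against the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (an assumption; override it with PROM_URL) and that curl is installed. The helper names probe_slow_query and run_probe_slow_query are hypothetical and not part of any tool.

```shell
# Hypothetical helper: print the left-hand side of the alert expression
# for a given averaging window (defaults to 1m, as in the alert).
probe_slow_query() {
  window="${1:-1m}"
  printf 'avg_over_time(probe_duration_seconds{job=~".+", nanocosmosGroup=~".+", environment=~".+"}[%s])\n' "$window"
}

# Hypothetical helper: send the expression to the Prometheus instant-query API.
# PROM_URL is an assumption; set it to the address of your Prometheus server.
run_probe_slow_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$(probe_slow_query "$1")"
}

# Usage (requires network access to Prometheus):
#   run_probe_slow_query 1m
```

The JSON response lists one result per label set, which shows which job, group, and environment are slow.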
Possible Causes
- Network latency or congestion.
- High load on the target service.
- Suboptimal probe configuration.
- Resource constraints on the probing or target system.
- DNS resolution delays.
Troubleshooting Steps
1. Check Network Latency
Measure the network latency to the target.
# Example: Measure network latency to a target (send 5 packets, then stop)
ping -c 5 <target_hostname_or_ip>
Expected Output:
PING <target_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <target_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
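When comparing latency across several targets, it can help to pull just the average round-trip time out of the ping summary line. This is a rough sketch, assuming the summary uses the usual min/avg/max layout printed by Linux and macOS ping; avg_rtt_ms is a hypothetical helper name.

```shell
# Hypothetical helper: read ping output on stdin and print the average
# round-trip time in milliseconds (the second value of min/avg/max).
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# Usage:
#   ping -c 5 <target_hostname_or_ip> | avg_rtt_ms
```

An average that is a large share of the observed probe_duration_seconds points at the network rather than the target service.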
2. Verify Target Service Load
Check the load on the target service to ensure it is not overloaded.
# Example: Check the load on the target service
ssh <target_hostname_or_ip> 'uptime'
Expected Output:
13:45:01 up 10 days, 3:22, 1 user, load average: 0.15, 0.10, 0.05
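A load average is only meaningful relative to the number of CPU cores: a 1-minute load persistently above the core count suggests the host is saturated. The sketch below makes that comparison; load_1min and is_overloaded are hypothetical helper names, and the "load greater than core count" rule is a common rule of thumb rather than a fixed limit.

```shell
# Hypothetical helper: read `uptime` output on stdin and print the
# 1-minute load average (the first of the three values).
load_1min() {
  sed -E 's/.*load averages?: *([0-9.]+).*/\1/'
}

# Hypothetical helper: exit 0 when the load exceeds the CPU count, else exit 1.
is_overloaded() {
  awk -v val="$1" -v ncpu="$2" 'BEGIN { exit !(val > ncpu) }'
}

# Usage:
#   load=$(ssh <target_hostname_or_ip> uptime | load_1min)
#   cpus=$(ssh <target_hostname_or_ip> nproc)
#   is_overloaded "$load" "$cpus" && echo "load exceeds CPU count"
```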
3. Check Probe Configuration
Confirm that the scrape job uses the intended module, metrics path, and targets.
# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml
Expected Output:
scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - <target_hostname_or_ip>
    ...
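The scrape configuration only shows which module and target are in use. Running the probe by hand against the exporter shows where the time is actually spent. This is a sketch, assuming the blackbox exporter listens on localhost:9115 (its default port; adjust for your deployment) and that the http_2xx module from the configuration above is used. probe_timings is a hypothetical helper name; probe_http_duration_seconds is only emitted for HTTP modules.

```shell
# Hypothetical helper: keep only the timing metrics from /probe output on stdin.
# Comment lines (# HELP, # TYPE) are dropped because they do not start with "probe_".
probe_timings() {
  grep -E '^probe_(duration_seconds|dns_lookup_time_seconds|http_duration_seconds)'
}

# Usage (assumes the exporter is at localhost:9115):
#   curl -sG 'http://localhost:9115/probe' \
#     --data-urlencode 'target=<target_hostname_or_ip>' \
#     --data-urlencode 'module=http_2xx' | probe_timings
```

For HTTP probes, the phase label on probe_http_duration_seconds separates name resolution, connection setup, TLS, server processing, and transfer time.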
4. Review Logs
Check the logs of the target service and the probe for any errors or warnings.
# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"
Expected Output:
Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...
# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log
Expected Output:
<timestamp> <log_level> <log_message>
...
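On a busy service the last hour of logs can be long, so it may be worth narrowing the output to lines that are likely to matter. The sketch below uses journalctl's priority filter and a simple keyword match; suspicious_lines is a hypothetical helper name, and the keyword list is an assumption that should be tuned to the messages your services actually emit.

```shell
# Hypothetical helper: keep log lines on stdin that mention errors,
# warnings, or timeouts (case-insensitive).
suspicious_lines() {
  grep -Ei 'error|warn|timeout|timed out'
}

# Usage:
#   journalctl -u <service_name> --since "1 hour ago" -p warning
#   tail -n 50 /var/log/prometheus/probe.log | suspicious_lines
```

The -p warning option limits journalctl output to messages at warning severity or worse.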
5. DNS Resolution Check
Ensure that DNS resolution for the target is working correctly and not causing delays.
# Example: Check DNS resolution
nslookup <target_hostname>
Expected Output:
Server: <dns_server>
Address: <dns_server_ip>
Name: <target_hostname>
Address: <target_ip>
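nslookup confirms that the name resolves but does not report how long the lookup took. If dig is available, its statistics line includes the query time. This is a small sketch under that assumption; dns_query_ms is a hypothetical helper name.

```shell
# Hypothetical helper: read dig output on stdin and print the query time
# in milliseconds, taken from the ";; Query time: N msec" line.
dns_query_ms() {
  awk '/Query time:/ { print $4 }'
}

# Usage:
#   dig +noall +stats <target_hostname> | dns_query_ms
```

The blackbox exporter also reports its own lookup time in the probe_dns_lookup_time_seconds metric, which can be compared with the value measured here.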
Additional Steps
If the issue persists, consider:
- Restarting the affected services or hosts.
- Checking CPU, memory, and I/O usage on both the probing and target systems.
- Contacting the network or system administrator for further investigation.