Runbook: HttpProbeFail Alert

Alert Details

  • Alert Name: HttpProbeFail
  • Expression: probe_http_status_code{job=~".+", nanocosmosGroup=~".+", environment=~".+"} <= 199 or probe_http_status_code{job=~".+", nanocosmosGroup=~".+", environment=~".+"} >= 400

Description

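This alert is triggered when the HTTP status code returned by a probe is less than or equal to 199 or greater than or equal to 400 for any job within a specific group (nanocosmosGroup) and environment (environment). This indicates that the HTTP probe has failed. The cause is usually a client error (4xx), a server error (5xx), or no HTTP response at all; in the last case the Prometheus blackbox exporter typically reports a status code of 0. Successful responses (2xx) and redirects (3xx) do not trigger the alert.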
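The ranges covered by the expression can be sketched as a small shell helper. This is only an illustration: the function name classify_status and its labels are hypothetical and not part of the alerting setup.

```shell
# Hypothetical helper: mirrors the alert expression (<= 199 or >= 400).
# Prints a label for the given numeric status code.
classify_status() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "no-response"      # no HTTP response was received
  elif [ "$code" -le 199 ]; then
    echo "informational"    # 1xx, still matched by "<= 199"
  elif [ "$code" -ge 500 ]; then
    echo "server-error"     # 5xx
  elif [ "$code" -ge 400 ]; then
    echo "client-error"     # 4xx
  else
    echo "ok"               # 2xx and 3xx do not fire the alert
  fi
}
```

For example, classify_status 503 prints server-error and classify_status 301 prints ok.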

Possible Causes

  1. The target service is down or unresponsive.
  2. Misconfiguration of the probe or target service.
  3. Network issues affecting the reachability of the target.
  4. DNS resolution issues.
  5. Firewall or security group rules blocking the probe.

Troubleshooting Steps

1. If the alert was triggered by a bintu service (Bintu API, Token, Dashboard), refer to this Runbook instead.

2. Check HTTP Status Code

Verify the HTTP status code returned by the target service.

# Example: Check HTTP status code using curl
curl -I <target_url>

Expected Output:

HTTP/1.1 200 OK
Date: Wed, 13 Nov 2024 13:45:00 GMT
...
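The curl -I call above sends a HEAD request, while an http_2xx probe normally uses GET unless configured otherwise, so some services may answer the two differently. A minimal sketch that prints only the status code of a GET request follows; the helper name http_status and the 10 second timeout are assumptions.

```shell
# Hypothetical helper: print only the HTTP status code of a GET request.
# $1 = target URL
http_status() {
  # -s            silent, no progress meter
  # -o /dev/null  discard the response body
  # -w            print the numeric status code after the transfer
  # --max-time    give up after 10 seconds (assumed value)
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}
```

Run it as http_status <target_url>. curl prints 000 when it receives no HTTP response, which corresponds to the "no response" case of the alert.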

3. Verify Target Service Status

Ensure that the target service is running and reachable.

# Example: Check the status of the target service
ssh <target_hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...

4. Check Probe Configuration

Ensure that the probe is correctly configured and running.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <target_hostname_or_ip>
...
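Assuming the probe is served by the Prometheus blackbox exporter (the /probe path and http_2xx module suggest this), the exporter can be asked to run the probe on demand and return debug output. The sketch below is a hedged example: the helper name probe_debug is hypothetical, and the exporter address must be supplied by you (9115 is the exporter's default port, but your deployment may differ).

```shell
# Hypothetical helper: run a probe on demand and show its debug output.
# $1 = exporter address (host:port), $2 = target URL, $3 = module name
probe_debug() {
  # --get turns the --data-urlencode pairs into a URL-encoded query string.
  # debug=true makes the exporter include probe logs alongside the metrics.
  curl -sS --get \
    --data-urlencode "target=$2" \
    --data-urlencode "module=$3" \
    --data-urlencode "debug=true" \
    "http://$1/probe"
}
```

Run it as probe_debug <exporter_host>:9115 <target_url> http_2xx and look for probe_http_status_code and probe_success in the output.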

5. Review Logs

Check the logs of the target service and the probe for any errors or warnings.

# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...

# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log

Expected Output:

<timestamp> <log_level> <log_message>
...

6. DNS Resolution Check

Ensure that DNS resolution for the target is working correctly and not causing delays.

# Example: Check DNS resolution
nslookup <target_hostname>

Expected Output:

Server:         <dns_server>
Address:        <dns_server_ip>

Name:   <target_hostname>
Address: <target_ip>
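
nslookup shows whether the name resolves but not how long resolution takes. One way to check for DNS delays is curl's time_namelookup timing, sketched below. The helper names and the threshold passed to is_slow are assumptions; choose a limit that fits your probe timeout.

```shell
# Hypothetical helper: print the seconds curl spent resolving the hostname.
# $1 = target URL
dns_lookup_time() {
  curl -s -o /dev/null -w '%{time_namelookup}' --max-time 10 "$1"
}

# Hypothetical helper: succeed (exit 0) if $1 seconds exceeds the limit $2.
is_slow() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}
```

For example: is_slow "$(dns_lookup_time <target_url>)" 1 && echo "DNS lookup took more than 1s".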

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking firewall or security group rules.
  • Contacting the network or system administrator for further investigation.