Runbook: SslProbeFail Alert

Alert Details

  • Alert Name: SslProbeFail
  • Expression: sum without (cluster) (probe_http_ssl{job=~".+", nanocosmosGroup=~".+", environment=~".+"}) == 0

Description

This alert fires when probe_http_ssl, summed across all clusters (sum without (cluster)), equals zero for a given combination of the remaining labels, including job, nanocosmosGroup, and environment. In other words, every cluster that probes that label combination reported 0 for the SSL probe, so the failure is not limited to a single probing location.
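
To make the aggregation concrete, the following sketch mimics `sum without (cluster) (...) == 0` with awk. The input rows (cluster, target, probe_http_ssl value), the cluster names, and the URLs are hypothetical sample data, and the grouping is simplified to a single target column instead of the full label set. A target is printed only when its value is 0 from every cluster.

```shell
# zero_sum_targets is a hypothetical helper for illustration only.
# It reads rows of "cluster target value" on stdin, sums the value per
# target across clusters, and prints the targets whose sum is 0.
zero_sum_targets() {
  awk '{ sum[$2] += $3 } END { for (t in sum) if (sum[t] == 0) print t }' | sort
}

# Hypothetical sample data: a.example.com fails from both clusters,
# b.example.com still succeeds from one cluster.
printf '%s\n' \
  'eu-1 https://a.example.com 0' \
  'us-1 https://a.example.com 0' \
  'eu-1 https://b.example.com 1' \
  'us-1 https://b.example.com 0' | zero_sum_targets
# prints https://a.example.com
```

Only the first target would raise the alert; the second still has a sum of 1.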

Possible Causes

  1. SSL certificate issues (expired, misconfigured, or invalid).
  2. Network issues affecting the reachability of the target.
  3. Misconfiguration of the probe or monitoring tools.
  4. DNS resolution issues.
  5. Firewall or security group rules blocking the probe.

Troubleshooting Steps

1. Check SSL Certificate

Verify that the target's SSL certificate is currently valid: the current date must fall between the notBefore and notAfter values.

# Example: Check SSL certificate using OpenSSL (-servername sends SNI so the matching certificate is returned)
echo | openssl s_client -connect <target_hostname_or_ip>:443 -servername <target_hostname> 2>/dev/null | openssl x509 -noout -dates

Expected Output:

notBefore=Nov 13 00:00:00 2023 GMT
notAfter=Nov 13 00:00:00 2024 GMT
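
Reading the dates by eye is error-prone, so a small helper can turn the notAfter value into a number of remaining days. This is a minimal sketch: `cert_days_left` is a hypothetical name, and it assumes GNU date (for the `-d` option). A negative result means the certificate has already expired.

```shell
# cert_days_left is a hypothetical helper, not part of OpenSSL.
# It assumes GNU date (for the -d option).
# $1: notAfter value as printed by `openssl x509 -noout -dates`
# $2: optional reference time as a Unix epoch (defaults to now)
cert_days_left() {
  not_after=$1
  now_epoch=${2:-$(date +%s)}
  expiry_epoch=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# The reference time 1699833600 is Nov 13 00:00:00 2023 GMT.
cert_days_left "Nov 13 00:00:00 2024 GMT" 1699833600   # prints 366
```

For a simple yes/no answer, `openssl x509 -noout -checkend <seconds>` offers a similar check without any date arithmetic.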

2. Check Network Connectivity

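Verify that the target is reachable over the network. ICMP may be blocked even when the service itself is reachable, so a failed ping alone is not conclusive.

# Example: Check network connectivity to a target (-c 4 stops after four packets)
ping -c 4 <target_hostname_or_ip>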

Expected Output:

PING <target_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <target_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
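
Because the probe connects over TCP, not ICMP, a TCP-level check on the HTTPS port is often more telling than ping. The sketch below is one way to do that; `tcp_port_open` is a hypothetical helper, and it assumes bash (for the /dev/tcp pseudo-device) and the coreutils `timeout` command are available.

```shell
# tcp_port_open is a hypothetical helper. It assumes bash (for the
# /dev/tcp pseudo-device) and the coreutils `timeout` command.
# $1: host, $2: port (defaults to 443)
# Returns 0 if a TCP connection could be opened within 5 seconds.
tcp_port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-443}" 2>/dev/null
}

# Usage (replace the placeholder before running):
# tcp_port_open <target_hostname_or_ip> 443 && echo "open" || echo "closed or filtered"
```

A successful ping combined with a closed port 443 points toward a firewall rule or a stopped service, not a general network outage.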

3. Verify Target Service Status

Ensure that the target service is running and reachable.

# Example: Check the status of the target service
ssh <target_hostname_or_ip> 'systemctl status <service_name>'

Expected Output:

● <service_name>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<service_name>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...

4. Check Probe Configuration

Ensure that the probe is correctly configured and running.

# Example: Check probe configuration
grep -A 10 'scrape_configs:' /etc/prometheus/prometheus.yml

Expected Output:

scrape_configs:
  - job_name: 'probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - <target_hostname_or_ip>
...
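
Beyond reading the configuration, the probe can be run by hand. The metric name and the /probe path suggest the Prometheus blackbox exporter, so this sketch assumes one is in use; the exporter host is a placeholder, 9115 is the exporter's default port, and `probe_metric` is a hypothetical helper that extracts a single value from the text output.

```shell
# probe_metric is a hypothetical helper: it reads Prometheus text-format
# output on stdin and prints the value of the unlabelled metric named in $1.
probe_metric() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage against a blackbox exporter (host is an assumption, replace it):
# curl -s 'http://<blackbox_exporter_host>:9115/probe?target=https://<target_hostname>&module=http_2xx' \
#   | probe_metric probe_http_ssl

# Self-contained demonstration on sample text:
printf 'probe_http_ssl 0\nprobe_success 0\n' | probe_metric probe_http_ssl   # prints 0
```

Appending `&debug=true` to the probe URL makes the blackbox exporter include the probe's log output, which usually names the failing step.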

5. Review Logs

Check the logs of the target service and the probe for any errors or warnings.

# Example: Review logs of the target service
journalctl -u <service_name> --since "1 hour ago"

Expected Output:

Nov 13 12:00:00 <hostname> <service_name>[1234]: Starting <service_name>...
Nov 13 12:00:01 <hostname> <service_name>[1234]: <Log message>
...

# Example: Review logs of the probe
tail -n 50 /var/log/prometheus/probe.log

Expected Output:

<timestamp> <log_level> <log_message>
...

6. DNS Resolution Check

Ensure that the target hostname resolves to the expected address and that the lookup returns promptly.

# Example: Check DNS resolution
nslookup <target_hostname>

Expected Output:

Server:         <dns_server>
Address:        <dns_server_ip>

Name:   <target_hostname>
Address: <target_ip>
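
nslookup queries the DNS server directly, which can differ from what a service on the host actually sees. As a complementary sketch, `getent hosts` resolves through the system's configured name service (including /etc/hosts); `resolve_addrs` is a hypothetical helper name. Ideally run it on the host that executes the probe.

```shell
# resolve_addrs is a hypothetical helper built on getent, which resolves
# through the system's name service configuration (including /etc/hosts).
# $1: hostname; prints one resolved address per line.
resolve_addrs() {
  getent hosts "$1" | awk '{ print $1 }' | sort -u
}

# Demonstration with a name that normally resolves locally:
resolve_addrs localhost
```

If this prints nothing for the target while nslookup succeeds, the host's resolver configuration is the likely culprit.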

Additional Steps

If the issue persists, consider:

  • Restarting the affected services or hosts.
  • Checking firewall or security group rules.
  • Contacting the network or system administrator for further investigation.