Runbook: ExporterDown

Alert Details

  • Alert Name: ExporterDown
  • Expression: sum by (hostname) (up{job="health-check-proxy", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0

Description

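This alert fires for a hostname when every health-check-proxy scrape target on that host reports up == 0, meaning Prometheus can no longer scrape the exporter there. The =~".+" matchers only require that the nanocosmosGroup, instance, and environment labels are set; they do not restrict the alert to a particular group or environment.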
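
To see which hostnames currently match, the alert expression can be run as an instant query against the Prometheus HTTP API. The following is a minimal sketch: the helper name query_exporter_down and the base URL passed to it are assumptions, and the endpoint may require credentials (for example via curl's -u option) that are not shown here.

```shell
# Sketch: run the ExporterDown expression as an instant query.
# The first argument is the base URL of a Prometheus-compatible API
# (hypothetical here; use the one for your environment).
query_exporter_down() {
    prom_url="$1"
    expr='sum by (hostname) (up{job="health-check-proxy", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0'
    # --get sends the url-encoded data as a query string on a GET request.
    curl -fsS --get --data-urlencode "query=${expr}" "${prom_url}/api/v1/query"
}
```

Each hostname returned in the JSON result is one for which the alert condition currently holds; an empty result means no host matches at the moment.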

Possible Causes

  • The exporter service is not running.
  • Network issues preventing the exporter from reporting.
  • Configuration errors in the exporter setup.
  • Issues with the underlying infrastructure (e.g., server down).

Troubleshooting Steps

  1. Check Exporter Service Status

    • Command: systemctl status hcp
    • Expected Output: The unit state. A healthy service shows “Active: active (running)”; “inactive (dead)” or “failed” means the service is not running.
    • Example:
      $ systemctl status hcp
      ● hcp.service - health check proxy
      Loaded: loaded (/etc/systemd/system/hcp.service; enabled; vendor preset: e>
      Active: active (running) since Sun 2025-07-27 08:51:32 UTC; 1 weeks 2 days>
      Main PID: 2225299 (hcp)
       Tasks: 8 (limit: 9417)
      Memory: 150.5M
      CGroup: /system.slice/hcp.service
              └─2225299 /opt/nanostream/hcp/hcp
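
For scripted checks, the same information is available from systemctl is-active, which prints the unit state and exits 0 only when the unit is active. A minimal sketch, assuming the unit name hcp from this runbook; the helper name check_unit is illustrative:

```shell
# Sketch: report whether a systemd unit is running.
# Defaults to the hcp unit used in this runbook.
check_unit() {
    unit="${1:-hcp}"
    # is-active prints the state (active, inactive, failed, ...).
    state="$(systemctl is-active "$unit" 2>/dev/null)"
    if [ "$state" = "active" ]; then
        echo "OK: $unit is active"
        return 0
    fi
    echo "PROBLEM: $unit is ${state:-unknown}"
    return 1
}
```

Calling check_unit hcp returns a non-zero exit status when the service is not active, so it can gate later steps in a script.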
      
  2. Restart Exporter Service

    • Command: sudo systemctl restart hcp
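    • Expected Output: No output on success; systemctl prints an error message only if the restart fails. Re-run step 1 afterwards to confirm the service is “active (running)”.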
    • Example:
      $ sudo systemctl restart hcp
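
Because a successful restart prints nothing, it can help to poll the unit state for a few seconds afterwards. The sketch below assumes the unit name hcp; the helper name wait_for_active and the default of 10 checks are arbitrary choices, not part of the service.

```shell
# Sketch: poll until the unit reports active, or give up after N checks.
wait_for_active() {
    unit="${1:-hcp}"
    tries="${2:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # --quiet suppresses the state output; only the exit code is used.
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "$unit did not become active after $tries checks" >&2
    return 1
}
```

A typical use would be sudo systemctl restart hcp followed by wait_for_active hcp. If the function gives up, the service is probably crashing on start and the logs are the next place to look.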
      
  3. Check Network Connectivity

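    • Command: ping -c 4 exporter-hostname (replace exporter-hostname with the hostname label from the alert)
    • Expected Output: Four replies with 0% packet loss. Note that ping only verifies ICMP reachability, not that the exporter port accepts connections.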
    • Example:
      $ ping -c 4 exporter-hostname
      PING exporter-hostname (192.168.1.1) 56(84) bytes of data.
      64 bytes from exporter-hostname: icmp_seq=1 ttl=64 time=0.123 ms
      64 bytes from exporter-hostname: icmp_seq=2 ttl=64 time=0.124 ms
      64 bytes from exporter-hostname: icmp_seq=3 ttl=64 time=0.125 ms
      64 bytes from exporter-hostname: icmp_seq=4 ttl=64 time=0.126 ms
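
A successful ping does not prove that the exporter answers HTTP requests. The sketch below probes the exporter endpoint with curl. It assumes that the exporter listens on the Server port from the configuration in step 4 (9590) and serves metrics under /metrics; both are assumptions to verify against the actual scrape configuration, and the helper name probe_exporter is illustrative.

```shell
# Sketch: check that the exporter endpoint answers over HTTP.
# Port 9590 and the /metrics path are assumptions, not confirmed values.
probe_exporter() {
    host="$1"
    port="${2:-9590}"
    path="${3:-/metrics}"
    # -f makes curl fail on HTTP errors; --max-time bounds the whole request.
    if curl -fsS --max-time 5 -o /dev/null "http://${host}:${port}${path}"; then
        echo "reachable: ${host}:${port}${path}"
        return 0
    fi
    echo "unreachable: ${host}:${port}${path}"
    return 1
}
```

Running probe_exporter exporter-hostname from the machine that scrapes the target gives the most representative result, since firewalls may treat other sources differently.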
      
  4. Verify Exporter Configuration

    • Command: cat /etc/nanostream/hcp/config.yml
    • Expected Output: The configuration file contents. Check in particular the Server port, the Elastic and PromHealth URLs and credentials, and the LogFile path.
    • Example:
      $ cat /etc/nanostream/hcp/config.yml
       Server:
         port: 9590
       Elastic:
         url: https://api.elasticsearch.fsn.hz.k8s.nanostream.cloud
         user: $secret 
         password: $secret
         index: streamcloud_accounting
         timePeriode: 600
         services: h5live;nginx-rtmp
       LogFile: /var/log/hcp/hcp.log
       MaintenanceFile: /var/www/maintenance
       TestStream:
         url: https://bintu-play.nanocosmos.de/h5live/http/stream.mp4?url=rtmp://bintu-play.nanocosmos.de:1935/play&stream=CD6oL-2kE1g
         duration: 5
         interval: 15
       PromHealth:
         url: https://mimir.nanocosmos.cloud
         urlPath: /prometheus
         query: (group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) (ALERTS{instance="%s",alertstate="firing", health="unhealthy", nanocosmosGroup="streamcloud"}) ) or ( group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) ( up{nanocosmosGroup="streamcloud",instance="%s"}) ) * 0
         user: $secret
         pass: $secret
         interval: 60
       NetworkSpeedFile: /etc/nanostream/hcp/networkspeed
       Debug: 
         on: false
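
To compare the configured port with what the host is actually listening on, the port can be read from the file. This is a rough sketch that assumes the layout shown above (a Server: key followed by an indented port: key); it is not a YAML parser, and the helper name config_port is illustrative.

```shell
# Sketch: print the first "port:" value that follows the "Server:" key.
# Relies on the layout of the sample config; not a general YAML parser.
config_port() {
    awk '
        /^[ \t]*Server:/ { in_server = 1; next }
        in_server && /^[ \t]*port:/ { print $2; exit }
    ' "$1"
}
```

For example, config_port /etc/nanostream/hcp/config.yml should print 9590 for the configuration above. The value can then be compared with the listening sockets reported by ss -ltn.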
      

Additional Steps

  • Check Logs for Errors

    • Command: journalctl -u hcp --since "1 hour ago"
    • Expected Output: Recent logs for the exporter service. Look for any error messages.
  • Check Underlying Infrastructure

    • Ensure the server hosting the exporter is up and running.
    • Verify there are no ongoing maintenance activities or outages.
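
For the log check above, filtering by priority cuts the output down to entries logged at error level or worse. A minimal sketch, assuming the unit name hcp; the helper name recent_errors is illustrative. Whether the service logs with journal priorities is not confirmed here, so an empty result does not rule out problems. The sample configuration also names a LogFile (/var/log/hcp/hcp.log) that may contain more detail.

```shell
# Sketch: show only error-level (and worse) journal entries for a unit.
recent_errors() {
    unit="${1:-hcp}"
    since="${2:-1 hour ago}"
    # -p err selects priorities err, crit, alert and emerg.
    journalctl -u "$unit" --since "$since" -p err --no-pager
}
```

Calling recent_errors hcp "2 hours ago" widens the time window without changing the filter.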