Alert Runbooks

ExporterDown

Runbook: ExporterDown

Description

This alert triggers when the exporter for the health-check-proxy job is down across all instances in the specified nanocosmosGroup and environment.


Possible Causes:


Severity estimation

This alert is critical and must be resolved. The exporter exposes the metric that shows whether a server is in maintenance.


Troubleshooting steps

  1. Check Exporter Service Status

    • Command / Action:
      • Open a terminal session on the server and check the systemd service status
      • systemctl status hcp
    • Expected Result:
      • service is running
      $ systemctl status hcp
      ● hcp.service - health check proxy
      Loaded: loaded (/etc/systemd/system/hcp.service; enabled; vendor preset: e>
      Active: active (running) since Sun 2025-07-27 08:51:32 UTC; 1 weeks 2 days>
      Main PID: 2225299 (hcp)
       Tasks: 8 (limit: 9417)
      Memory: 150.5M
      CGroup: /system.slice/hcp.service
              └─2225299 /opt/nanostream/hcp/hcp

  2. Restart Exporter Service

    • Command / Action:
      • If hcp is not running, restart the service
      • sudo systemctl restart hcp
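
Steps 1 and 2 can be combined into one conditional restart. The following is a minimal sketch, not an official tool: the function name `restart_if_down` and the variable `RESTART_WAIT_SECONDS` are hypothetical, and the unit name `hcp` is taken from this runbook.

```shell
#!/usr/bin/env bash
# Sketch: restart the service only if systemd reports it as not active,
# then verify that it came back up.
# restart_if_down and RESTART_WAIT_SECONDS are illustrative names.

restart_if_down() {
    local service="${1:-hcp}"

    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    if systemctl is-active --quiet "$service"; then
        echo "$service is running, no restart needed"
        return 0
    fi

    echo "$service is not running, restarting"
    sudo systemctl restart "$service" || return 1

    # Give the service a moment to start before checking again.
    sleep "${RESTART_WAIT_SECONDS:-2}"
    if systemctl is-active --quiet "$service"; then
        echo "$service is running after restart"
    else
        echo "$service is still down, check the logs" >&2
        return 1
    fi
}
```

Usage: source the snippet on the affected server and run `restart_if_down hcp`. A non-zero exit status means the service did not come back and the logs should be checked next.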

  3. Check Logs for Errors

    • Command / Action:
      • Look for error messages that indicate why the exporter is failing
      • journalctl -u hcp --since "1 hour ago"
    • Expected Result:
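
To narrow the log output of this step to errors only, journalctl can filter by priority. A minimal sketch; the function name `recent_errors` is hypothetical, while the unit name and time window come from the command above.

```shell
# Sketch: print only entries of priority "err" or more severe for a unit.
# recent_errors is an illustrative helper name.

recent_errors() {
    local service="${1:-hcp}"
    local since="${2:-1 hour ago}"

    # -p err      keeps priority "err" and anything more severe
    # --no-pager  writes straight to stdout so the output can be piped
    journalctl -u "$service" --since "$since" -p err --no-pager
}
```

Usage: `recent_errors hcp "1 hour ago"`. If nothing is printed, widen the time window or drop the priority filter to see the full log.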

  4. Check Network Connectivity

    • Command / Action:
      • ping -c 4 <hostname|IP>
    • Example:
      $ ping -c 4 t3b-vtrans-sa-gc-gru-02.vtrans-b.nanocosmos.de
      PING t3b-vtrans-sa-gc-gru-02.vtrans-b.nanocosmos.de (213.156.149.183) 56(84) bytes of data.
      64 bytes from 213.156.149.183: icmp_seq=1 ttl=48 time=216 ms
      64 bytes from 213.156.149.183: icmp_seq=2 ttl=48 time=215 ms
      64 bytes from 213.156.149.183: icmp_seq=3 ttl=48 time=218 ms
      64 bytes from 213.156.149.183: icmp_seq=4 ttl=48 time=215 ms
      --- t3b-vtrans-sa-gc-gru-02.vtrans-b.nanocosmos.de ping statistics ---
      4 packets transmitted, 4 received, 0% packet loss, time 3004ms
      rtt min/avg/max/mdev = 214.672/215.831/217.789/1.177 ms
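
A successful ping only shows that the host is up, not that the exporter answers. The sketch below adds an HTTP check against the exporter port. Assumptions: the function name `check_exporter_reachable` is hypothetical, port 9590 is the `Server.port` value from the example configuration in this runbook, and the `/metrics` path over plain HTTP is the usual Prometheus convention, not something confirmed for hcp.

```shell
# Sketch: check host reachability first, then the exporter endpoint.
# check_exporter_reachable is an illustrative helper name.

check_exporter_reachable() {
    local host="$1"
    local port="${2:-9590}"   # Server.port from the example config

    if ! ping -c 4 "$host" > /dev/null; then
        echo "host $host does not answer ping" >&2
        return 1
    fi

    # Assumption: the exporter serves metrics at /metrics over plain HTTP.
    if curl --silent --fail --max-time 5 --output /dev/null \
            "http://$host:$port/metrics"; then
        echo "exporter on $host:$port is reachable"
    else
        echo "host $host answers ping, but $host:$port/metrics is not reachable" >&2
        return 1
    fi
}
```

Usage: `check_exporter_reachable <hostname|IP>`. If ping succeeds but the HTTP check fails, the problem is more likely the service or a firewall rule than the network path.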

  5. Verify Exporter Configuration

    • Command / Action:
      • cat /etc/nanostream/hcp/config.yml
    • Expected Output:
      • Configuration file contents. Ensure all settings are correct.
    • Example:
      $ cat /etc/nanostream/hcp/config.yml
       Server:
         port: 9590
       Elastic:
         url: https://api.elasticsearch.fsn.hz.k8s.nanostream.cloud
         user: $secret 
         password: $secret
         index: streamcloud_accounting
         timePeriode: 600
         services: h5live;nginx-rtmp
       LogFile: /var/log/hcp/hcp.log
       MaintenanceFile: /var/www/maintenance
       TestStream:
         url: https://bintu-play.nanocosmos.de/h5live/http/stream.mp4?url=rtmp://bintu-play.nanocosmos.de:1935/play&stream=CD6oL-2kE1g
         duration: 5
         interval: 15
       PromHealth:
         url: https://mimir.nanocosmos.cloud
         urlPath: /prometheus
         query: (group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) (ALERTS{instance="%s",alertstate="firing", health="unhealthy", nanocosmosGroup="streamcloud"}) ) or ( group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) ( up{nanocosmosGroup="streamcloud",instance="%s"}) ) * 0
         user: $secret
         pass: $secret
         interval: 60
       NetworkSpeedFile: /etc/nanostream/hcp/networkspeed
       Debug: 
         on: false
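
A quick way to spot a truncated or incomplete configuration file is to check that the top-level sections from the example are present. This is a minimal sketch: the function name `check_config_keys` is hypothetical, and whether hcp treats all of these sections as mandatory is an assumption based only on the example above.

```shell
# Sketch: verify that the top-level sections seen in the example config exist.
# check_config_keys is an illustrative helper name.

check_config_keys() {
    local config="${1:-/etc/nanostream/hcp/config.yml}"
    local missing=0
    local key

    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 1
    fi

    # Section names are taken from the example config; treating all of
    # them as required is an assumption.
    for key in Server Elastic LogFile MaintenanceFile TestStream PromHealth; do
        # Match the key at the start of a line, allowing leading whitespace.
        if ! grep -Eq "^[[:space:]]*${key}:" "$config"; then
            echo "missing section: $key" >&2
            missing=1
        fi
    done

    return "$missing"
}
```

Usage: `check_config_keys /etc/nanostream/hcp/config.yml`. The check only confirms that the sections exist; the values (URLs, credentials, intervals) still need to be reviewed by hand.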

Additional resources

Dashboards:

Git repos: