Runbook: ExporterDown

Alert Details

Alert Name: ExporterDown
Expression: sum by (hostname) (up{job="health-check-proxy", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0

Description

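This alert fires when every health-check-proxy target on a host reports up == 0, meaning Prometheus cannot scrape the exporter on that host. The expression aggregates by hostname, so the alert is raised once per affected host; the label matchers only require that nanocosmosGroup, instance, and environment are set to non-empty values.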
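To see which hosts currently match the alert, the expression can be run as an instant query against a Prometheus-compatible HTTP API. The sketch below assumes that the Mimir endpoint from the hcp configuration (https://mimir.nanocosmos.cloud with path /prometheus) is also where this alert is evaluated and that it accepts basic auth; PROM_URL, PROM_USER, and PROM_PASS are hypothetical variable names.

```shell
# Assumption: PROM_URL is a Prometheus-compatible query endpoint and
# PROM_USER / PROM_PASS hold basic-auth credentials (hypothetical names).
PROM_URL="${PROM_URL:-https://mimir.nanocosmos.cloud/prometheus}"

# The alert expression, copied from Alert Details.
ALERT_EXPR='sum by (hostname) (up{job="health-check-proxy", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0'

# Run an instant query; every series in the result is a host whose exporter is down.
query_alert() {
  curl --silent --show-error --fail --get \
    --user "${PROM_USER:-}:${PROM_PASS:-}" \
    --data-urlencode "query=${ALERT_EXPR}" \
    "${PROM_URL}/api/v1/query"
}
```

Calling `query_alert` prints the JSON response; an empty `result` array means no host matches the alert condition at the moment.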

Possible Causes

- The exporter service is not running.
- Network issues preventing Prometheus from scraping the exporter.
- Configuration errors in the exporter setup.
- Issues with the underlying infrastructure (e.g., server down).

Troubleshooting Steps

Check Exporter Service Status

Command: systemctl status hcp
Expected Output: The status of the exporter service. Look for “active (running)”.

Example:

$ systemctl status hcp
● hcp.service - health check proxy
Loaded: loaded (/etc/systemd/system/hcp.service; enabled; vendor preset: e>
Active: active (running) since Sun 2025-07-27 08:51:32 UTC; 1 weeks 2 days>
Main PID: 2225299 (hcp)
 Tasks: 8 (limit: 9417)
Memory: 150.5M
CGroup: /system.slice/hcp.service
        └─2225299 /opt/nanostream/hcp/hcp
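When the status needs to be checked from a script rather than read by eye, `systemctl is-active` returns a usable exit code. A minimal sketch; the function name `check_unit` is made up for illustration:

```shell
# Print a one-line verdict for a systemd unit (defaults to hcp).
# systemctl is-active --quiet exits 0 only when the unit is active.
check_unit() {
  unit="${1:-hcp}"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is NOT active"
    return 1
  fi
}
```

For example, `check_unit hcp || echo "continue with the restart step"`.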

Restart Exporter Service
- Command: sudo systemctl restart hcp
- Expected Output: The command prints nothing on success and returns exit code 0.
- Example:
```
$ sudo systemctl restart hcp
```
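Because a successful restart is silent, it helps to confirm that the service stays up afterwards. The following sketch restarts the unit, waits briefly, and prints the last journal lines if the unit is not active; the function name and the 5-second default wait are arbitrary choices, not part of the hcp tooling.

```shell
# Restart a unit, wait a moment, then confirm it is still running.
# $1: unit name (default hcp), $2: seconds to wait (default 5).
restart_and_verify() {
  unit="${1:-hcp}"
  sudo systemctl restart "$unit" || return 1
  # Give the process time to exit again if it is going to crash on startup.
  sleep "${2:-5}"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit restarted and is active"
  else
    echo "$unit did not stay up; recent log lines:" >&2
    journalctl -u "$unit" -n 20 --no-pager >&2
    return 1
  fi
}
```

Usage: `restart_and_verify hcp 10`.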

Check Network Connectivity

Command: ping -c 4 exporter-hostname
Expected Output: Successful ping responses.

Example:

$ ping -c 4 exporter-hostname
PING exporter-hostname (192.168.1.1) 56(84) bytes of data.
64 bytes from exporter-hostname: icmp_seq=1 ttl=64 time=0.123 ms
64 bytes from exporter-hostname: icmp_seq=2 ttl=64 time=0.124 ms
64 bytes from exporter-hostname: icmp_seq=3 ttl=64 time=0.125 ms
64 bytes from exporter-hostname: icmp_seq=4 ttl=64 time=0.126 ms
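A successful ping only shows that the host answers ICMP; it does not show that the exporter port is reachable. A hedged sketch of an HTTP probe follows. It assumes the exporter listens on the Server port from the configuration (9590) and serves metrics at /metrics, which is the Prometheus default scrape path but is not confirmed for hcp.

```shell
# Probe the exporter over HTTP and report the status code.
# $1: host, $2: port (default 9590, taken from the Server section of the config).
# Assumption: metrics are served at /metrics.
probe_exporter() {
  host="$1"
  port="${2:-9590}"
  # -o /dev/null discards the body; --write-out prints only the status code.
  # curl reports 000 when no HTTP response was received at all.
  code="$(curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "http://${host}:${port}/metrics")" || true
  if [ "$code" = "200" ]; then
    echo "${host}:${port} answered HTTP 200"
  else
    echo "${host}:${port} probe failed (HTTP code ${code})" >&2
    return 1
  fi
}
```

Run it from the Prometheus host where possible, for example `probe_exporter exporter-hostname`, so the probe follows the same network path as the scrape.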

Verify Exporter Configuration

Command: cat /etc/nanostream/hcp/config.yml
Expected Output: Configuration file contents. Confirm that the Server port, the Elastic and PromHealth URLs, and the credentials are set as expected.

Example:

$ cat /etc/nanostream/hcp/config.yml
 Server:
   port: 9590
 Elastic:
   url: https://api.elasticsearch.fsn.hz.k8s.nanostream.cloud
   user: $secret 
   password: $secret
   index: streamcloud_accounting
   timePeriode: 600
   services: h5live;nginx-rtmp
 LogFile: /var/log/hcp/hcp.log
 MaintenanceFile: /var/www/maintenance
 TestStream:
   url: https://bintu-play.nanocosmos.de/h5live/http/stream.mp4?url=rtmp://bintu-play.nanocosmos.de:1935/play&stream=CD6oL-2kE1g
   duration: 5
   interval: 15
 PromHealth:
   url: https://mimir.nanocosmos.cloud
   urlPath: /prometheus
   query: (group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) (ALERTS{instance="%s",alertstate="firing", health="unhealthy", nanocosmosGroup="streamcloud"}) ) or ( group by (instance, environment, component, geoCluster, nanocosmosGroup, datacenterRegion) ( up{nanocosmosGroup="streamcloud",instance="%s"}) ) * 0
   user: $secret
   pass: $secret
   interval: 60
 NetworkSpeedFile: /etc/nanostream/hcp/networkspeed
 Debug: 
   on: false
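To compare the configured port with the one Prometheus scrapes, the Server port can be pulled out of the file with awk. This is a rough sketch that assumes the layout shown in the example above (section names on their own line, keys indented below them); it is not a YAML parser, and `server_port` is a made-up helper name.

```shell
# Print the port configured under the Server section.
# $1: config path (default /etc/nanostream/hcp/config.yml).
server_port() {
  awk '
    /^ *Server:/ { in_server = 1; next }              # entering the Server section
    in_server && /^ *port:/ { print $2; exit }        # first port key inside it
    in_server && /^ *[A-Za-z]+: *$/ { in_server = 0 } # another section starts
  ' "${1:-/etc/nanostream/hcp/config.yml}"
}
```

Usage: `server_port` or `server_port /path/to/config.yml`.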

Additional Steps

Check Logs for Errors
- Command: journalctl -u hcp --since "1 hour ago"
- Expected Output: Recent logs for the exporter service. Look for any error messages.
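The log check can be narrowed to errors. The sketch below filters the journal by priority and counts matching lines in the LogFile from the configuration (/var/log/hcp/hcp.log). Two assumptions apply: if hcp writes plain text to stdout, the journal stores everything at the default priority and the `-p err` filter may return nothing; and the hcp log format is not documented here, so matching on the word "error" is a guess.

```shell
# Journal entries at priority err or worse for a unit.
# $1: unit (default hcp), $2: time window (default "1 hour ago").
recent_errors() {
  journalctl -u "${1:-hcp}" --since "${2:-1 hour ago}" -p err --no-pager
}

# Count lines containing "error" (case-insensitive) in the exporter log file.
# $1: log path (default /var/log/hcp/hcp.log).
count_log_errors() {
  grep -ci 'error' "${1:-/var/log/hcp/hcp.log}"
}
```

If `recent_errors` prints nothing, fall back to the unfiltered journalctl command above and read the output directly.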

Check Underlying Infrastructure
- Ensure the server hosting the exporter is up and running.
- Verify there are no ongoing maintenance activities or outages.
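The configuration lists a MaintenanceFile (/var/www/maintenance). Assuming that the presence of this file marks the host as being in maintenance, which this runbook does not confirm, a quick check could look like the sketch below; `check_maintenance` is a hypothetical helper.

```shell
# Report whether the maintenance flag file exists.
# $1: flag path (default /var/www/maintenance, from MaintenanceFile in the config).
# Assumption: the file existing means maintenance is in progress.
check_maintenance() {
  file="${1:-/var/www/maintenance}"
  if [ -e "$file" ]; then
    echo "maintenance flag present: $file"
  else
    echo "no maintenance flag at $file"
    return 1
  fi
}
```

Usage: `check_maintenance && echo "confirm with the team before restarting anything"`.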
