ExporterDown
VtransExporterDown
Description
This alert triggers when the exporter service for the vtrans-metrics job is not able to collect metrics from the host.
Possible Causes:
- The exporter service is not running.
- Network issues preventing the exporter from reporting.
- Configuration errors in the exporter setup.
- Issues with the underlying infrastructure (e.g., server down).
Severity estimation
If the vtrans load in the region of the affected server is not at full capacity, i.e. if there are still servers that are not overloaded, this alert is not critical. This dashboard can be used to get a vtrans load overview for all regions; make sure the dashboard filters are set to the desired values.
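As a rough command-line complement to the dashboard, the rule above can be sketched as a small shell function. The input format (one line per server: host name, then that server's vtrans_overloaded value) is an assumption made for illustration, and region_severity is a hypothetical helper name; how the values are gathered per region is not covered here.

```shell
# region_severity (hypothetical helper): reads "host overloaded_flag" pairs on
# stdin, one server per line, where the flag is that server's vtrans_overloaded
# value (0 = not overloaded, 1 = overloaded). The input format is an assumption.
region_severity() {
  awk '
    # Count servers that report spare capacity; skip malformed lines.
    NF == 2 && $2 == 0 { free++ }
    END {
      if (free > 0)
        print "not critical: " free " server(s) not overloaded"
      else
        print "potentially critical: no server with spare capacity"
    }'
}
```

For example, `printf 'host-a 1\nhost-b 0\n' | region_severity` reports the alert as not critical because one server is still not overloaded.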
Troubleshooting steps
- Log into server
  - Build the FQDN of the server according to the Streamcloud server naming scheme and log in to the server.
- Check exporter service status
  - The service runs as the service user; switch user with sudo su and then su service.
  - The service runs with the pm2 process manager.
  - Command / Action:
    - pm2 list
  - Expected output:
    - The service status is online.
    - Example:
      $ pm2 list
      In memory PM2 version: 5.3.0
      Local PM2 version: 6.0.6
      ┌────┬────────────────────┬──────────┬──────┬───────────┬──────────┬──────────┐
      │ id │ name               │ mode     │ ↺    │ status    │ cpu      │ memory   │
      ├────┼────────────────────┼──────────┼──────┼───────────┼──────────┼──────────┤
      │ 0  │ push-worker        │ fork     │ 15   │ online    │ 0%       │ 316.8mb  │
      └────┴────────────────────┴──────────┴──────┴───────────┴──────────┴──────────┘
- Restart exporter service
  - Command / Action:
    - pm2 restart <name/id>
  - Expected result:
    - The service restarts without errors.
- Check service logs
  - Command / Action:
    - Check the logs for errors, and compare them with the logs of healthy instances to see what they should look like.
    - pm2 logs
    - pm2 logs --lines 200
    - pm2 monit
- Verify HTTP endpoint
  - Command / Action:
    - https://<fqdn>/vtrans2stats
  - Expected result:
    - The exporter exposes the HTTP endpoint and provides metric data.
    - Example:
      # HELP ffmpeg_idle A Vtrans/2 srver is IDLE if it is not running any push, pull or passthrough process
      # TYPE ffmpeg_idle gauge
      ffmpeg_idle 1
      # HELP ffmpeg_processes_active The number of active processes at the moment
      # TYPE ffmpeg_processes_active gauge
      # HELP ffmpeg_processes_total The total number of processes, including (if any) the ones that were respawned because of errors
      # TYPE ffmpeg_processes_total counter
      ...
      ...
      ...
      # HELP vtrans_overloaded Whether the host is overloaded at the moment, meaning that it is either out of slots or with a high CPU load
      # TYPE vtrans_overloaded gauge
      vtrans_overloaded 0
      # HELP vtrans_capacity The current capacity of this host in number of concurrent processes
      # TYPE vtrans_capacity gauge
      vtrans_capacity 8
      # HELP vtrans_maxcapacity The maximum capacity of this host in number of concurrent processes, calculated from the Number of Processes Per Core minus a 10% headroom
      # TYPE vtrans_maxcapacity gauge
      vtrans_maxcapacity 8.64
      # HELP vtrans_version The deployed versions of the engine and application
      # TYPE vtrans_version gauge
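The endpoint check in the last step can also be done from a shell instead of a browser. The following is a minimal sketch, not part of the exporter tooling: metric_value and exporter_ok are hypothetical helper names, and the choice of vtrans_capacity as the "is the exporter answering" probe is an assumption based on the example output above. The helpers only read text from stdin, so they can be tried against saved output as well.

```shell
# metric_value NAME (hypothetical helper): reads Prometheus text exposition
# output on stdin and prints the value of the first un-labelled sample NAME.
# Comment lines (# HELP / # TYPE) and labelled samples (name{...}) do not match.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# exporter_ok (hypothetical helper): succeeds if the text on stdin contains a
# vtrans_capacity sample, taken here as a sign that the exporter is answering.
exporter_ok() {
  [ -n "$(metric_value vtrans_capacity)" ]
}

# Possible usage on a real host (replace <fqdn> with the server name):
#   curl -fsS "https://<fqdn>/vtrans2stats" | metric_value vtrans_overloaded
#   curl -fsS "https://<fqdn>/vtrans2stats" | exporter_ok && echo "exporter answers"
```

With curl, -f makes HTTP errors return a non-zero exit code and -sS hides the progress meter while still printing errors, so a failed request is distinguishable from an empty metric.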
Additional resources
- PM2 documentation
- Streamcloud server naming scheme
- todo: streamcloud balancing runbook
- todo: streamcloud load estimation dashboard