Runbook: ExporterDown
Alert Details
- Alert Name: ExporterDown
- Expression:
    sum by (hostname) (up{job="rtmpstat", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0
Description
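This alert fires when, for a given hostname, every matching target of the rtmpstat job reports up == 0, meaning Prometheus cannot scrape any exporter instance on that host. Only series with non-empty nanocosmosGroup, instance, and environment labels are counted.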
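The expression aggregates by hostname, so the alert itself does not say which individual targets failed. A minimal sketch for listing them through the standard Prometheus HTTP query API, assuming the server is reachable at http://localhost:9090 (a hypothetical address; set PROM_URL to your own):

```shell
#!/bin/sh
# Hypothetical Prometheus address; override with PROM_URL=... in the environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Ask Prometheus for every rtmpstat target it currently fails to scrape.
# -G sends the url-encoded data as a GET query string; the reply is JSON.
down_targets() {
  curl -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=up{job="rtmpstat"} == 0'
}
```

Calling down_targets prints the raw JSON response. If jq happens to be installed, piping the output through jq -r '.data.result[].metric.instance' reduces it to the instance labels.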
Possible Causes
- The exporter service is not running.
- Network issues preventing Prometheus from scraping the exporter.
- Configuration errors in the exporter setup.
- Issues with the underlying infrastructure (e.g., server down).
Troubleshooting Steps
- Check Exporter Service Status
  - Command: systemctl status exporter-service
  - Expected Output: The status of the exporter service. Look for “active (running)”.
  - Example:
      $ systemctl status exporter-service
      ● exporter-service.service - Exporter Service
         Loaded: loaded (/etc/systemd/system/exporter-service.service; enabled; vendor preset: enabled)
         Active: active (running) since Wed 2024-11-13 14:00:00 UTC; 19min ago
- Restart Exporter Service
  - Command: sudo systemctl restart exporter-service
  - Expected Output: The service restarts without errors.
  - Example:
      $ sudo systemctl restart exporter-service
- Check Network Connectivity
  - Command: ping -c 4 exporter-hostname
  - Expected Output: Successful ping responses.
  - Example:
      $ ping -c 4 exporter-hostname
      PING exporter-hostname (192.168.1.1) 56(84) bytes of data.
      64 bytes from exporter-hostname: icmp_seq=1 ttl=64 time=0.123 ms
      64 bytes from exporter-hostname: icmp_seq=2 ttl=64 time=0.124 ms
      64 bytes from exporter-hostname: icmp_seq=3 ttl=64 time=0.125 ms
      64 bytes from exporter-hostname: icmp_seq=4 ttl=64 time=0.126 ms
- Verify Exporter Configuration
  - Command: cat /etc/exporter/config.yml
  - Expected Output: Configuration file contents. Ensure all settings are correct.
  - Example:
      $ cat /etc/exporter/config.yml
      job_name: 'rtmpstat'
      static_configs:
        - targets: ['localhost:9100']
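The service and connectivity checks above can be strung together so that the first failing one is reported. The following is only a sketch under stated assumptions: it runs on the exporter host itself, the unit is named exporter-service as in the examples, and the exporter serves metrics over HTTP at localhost:9100/metrics (the port comes from the sample config; the /metrics path is an assumption). It uses an HTTP request instead of ping, because ping only proves ICMP reachability, whereas Prometheus scrapes over HTTP.

```shell
#!/bin/sh
# Names below come from the examples in this runbook; adjust to your setup.
SERVICE="${SERVICE:-exporter-service}"
# Assumption: the exporter listens on localhost:9100 (as in the sample
# config) and serves metrics at /metrics.
METRICS_URL="${METRICS_URL:-http://localhost:9100/metrics}"

# Step 1: succeeds only if systemd reports the unit as active.
service_is_active() {
  systemctl is-active --quiet "$SERVICE"
}

# Step 2: restart the unit, then confirm systemd reports it active.
restart_and_verify() {
  sudo systemctl restart "$SERVICE" && service_is_active
}

# Step 3 (HTTP variant): succeeds only if the endpoint answers with a
# non-error HTTP status within 5 seconds.
metrics_respond() {
  curl -sf --max-time 5 -o /dev/null "$METRICS_URL"
}

# Run the checks in order and stop at the first failure.
triage() {
  if ! service_is_active; then
    echo "service ${SERVICE} is not active"
    return 1
  fi
  if ! metrics_respond; then
    echo "no valid HTTP response from ${METRICS_URL}"
    return 2
  fi
  echo "exporter looks healthy"
}
```

After sourcing the file, run triage first. If it reports that the service is not active, restart_and_verify applies the restart step and re-checks the unit state. A return code of 2 points at the exporter process or its listen address rather than at systemd.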
Additional Steps
- Check Logs for Errors
  - Command: journalctl -u exporter-service --since "1 hour ago"
  - Expected Output: Recent logs for the exporter service. Look for any error messages.
  - Example:
      $ journalctl -u exporter-service --since "1 hour ago"
      -- Logs begin at Wed 2024-11-13 13:00:00 UTC, end at Wed 2024-11-13 14:00:00 UTC. --
      Nov 13 13:45:00 hostname exporter-service[1234]: Exporter started
      Nov 13 13:50:00 hostname exporter-service[1234]: Error: Connection refused
- Check Underlying Infrastructure
  - Ensure the server hosting the exporter is up and running.
  - Verify there are no ongoing maintenance activities or outages.
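When the journal is long, the log check can be narrowed to the lines that mention an error. A small sketch, assuming the unit name exporter-service from the examples and that the exporter writes the word "error" (in any letter case) into its messages:

```shell
#!/bin/sh
# Unit name taken from the examples in this runbook; adjust as needed.
SERVICE="${SERVICE:-exporter-service}"

# Print only the journal lines from the last hour that mention "error".
# --no-pager keeps journalctl from opening an interactive pager.
recent_errors() {
  journalctl -u "$SERVICE" --since "1 hour ago" --no-pager | grep -i 'error'
}
```

recent_errors returns a non-zero status when nothing matches, so it can also be used in a condition. If the service logs with syslog priorities, journalctl -p err is an alternative filter, but it only helps when the exporter actually sets those priorities.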