Runbook: ExporterDown

Alert Details

  • Alert Name: ExporterDown
  • Expression: sum by (hostname) (up{job=~".+", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0

Description

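This alert fires when every scrape target on a host is down. Exporters expose metrics over HTTP for Prometheus to scrape, and Prometheus records up as 1 for a target when a scrape succeeds and 0 when it fails. The expression sums up across all targets that share a hostname label, so the alert fires only when that sum is 0; a host with at least one healthy target does not trigger it.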
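To see which hosts currently satisfy the alert condition, the expression can be evaluated through the Prometheus HTTP API. The sketch below is one way to do that; firing_hosts is a hypothetical helper name, and the default port 9090 is an assumption about the Prometheus deployment.

```shell
# Hypothetical helper: evaluate the ExporterDown expression via the
# Prometheus HTTP API and print the raw JSON response.
firing_hosts() {
  local prom_host="$1" prom_port="${2:-9090}"   # 9090 assumed as the default port
  # The expression from Alert Details, single-quoted so the shell leaves it untouched.
  local expr='sum by (hostname) (up{job=~".+", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}) == 0'
  # -G sends the URL-encoded query as a GET parameter.
  curl -s -G "http://${prom_host}:${prom_port}/api/v1/query" \
    --data-urlencode "query=${expr}"
}

# Usage: firing_hosts <prometheus_host> [<prometheus_port>]
```

Each entry in the data.result array of the response corresponds to one hostname for which the condition holds; an empty array means no host currently matches.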

Possible Causes

  • Exporter service is not running
  • Network issues preventing Prometheus from reaching the exporter
  • Configuration errors in the exporter or Prometheus
  • Resource constraints on the host running the exporter
  • Firewall or security group rules blocking access

Troubleshooting Steps

1. Check the Status of Exporters

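The up metric is recorded by Prometheus for every scrape attempt; exporters do not expose it on their own /metrics endpoint. Query Prometheus for the scrape state of the target:

curl -s -G http://<prometheus_host>:<prometheus_port>/api/v1/query --data-urlencode 'query=up{instance="<exporter_host>:<exporter_port>"}'

Expected Output

The response is JSON. If the last scrape succeeded, the returned sample has the value "1", which corresponds to:

up{instance="<exporter_host>:<exporter_port>",job="<job_name>"} 1

If the last scrape failed, the value is "0":

up{instance="<exporter_host>:<exporter_port>",job="<job_name>"} 0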
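Probing the exporter's /metrics endpoint directly helps separate an exporter failure from a problem on the scrape path. A minimal sketch, assuming curl is installed and the exporter serves plain HTTP; check_exporter is a hypothetical helper name, not part of any exporter or of Prometheus.

```shell
# Hypothetical helper: report whether an exporter answers on /metrics.
check_exporter() {
  local host="$1" port="$2"
  # -f: treat HTTP errors as failures; --max-time: do not hang on a filtered port.
  if curl -sf --max-time 5 "http://${host}:${port}/metrics" >/dev/null; then
    echo "OK: ${host}:${port} is serving /metrics"
  else
    echo "FAIL: ${host}:${port} is not serving /metrics"
    return 1
  fi
}

# Usage: check_exporter <exporter_host> <exporter_port>
```

If this succeeds on the exporter host but Prometheus still reports the target as down, the cause is more likely network reachability or the scrape configuration than the exporter process.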

2. Restart the Exporter Service

If the exporter is down, restart the service. For example, if the exporter is Node Exporter running as a systemd unit named node_exporter (the unit name may differ on your distribution), restart it with:

sudo systemctl restart node_exporter

Expected Output

Check the status again to ensure the exporter is up:

sudo systemctl status node_exporter

The output should include a line such as "Active: active (running)". A state of "failed" or "activating (auto-restart)" means the service did not stay up.
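The restart and the follow-up check can be combined so that a failed start is reported immediately. This is a sketch under assumptions: the unit name defaults to node_exporter as in the example above, the 2-second pause is an arbitrary settling time, and restart_and_verify is a hypothetical helper name.

```shell
# Hypothetical helper: restart a systemd unit and confirm it stays active.
restart_and_verify() {
  local unit="${1:-node_exporter}"   # unit name assumed from the example above
  sudo systemctl restart "$unit" || return 1
  sleep 2                            # arbitrary pause so an immediate crash is caught
  # is-active --quiet prints nothing and exits 0 only when the unit is active.
  if systemctl is-active --quiet "$unit"; then
    echo "${unit} is active"
  else
    echo "${unit} failed to start"
    return 1
  fi
}

# Usage: restart_and_verify node_exporter
```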

3. Check Logs for Errors

If the exporter does not start, check the logs for any errors. For example, for Node Exporter:

sudo journalctl -u node_exporter

Expected Output

Look for any error messages that might indicate why the exporter is failing to start. Common issues include configuration errors, missing dependencies, or port conflicts.
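On a busy host the full journal can be long, so it may help to narrow the output to recent error-level entries and to check for a port conflict directly. The sketch below assumes systemd's journalctl and the ss tool from iproute2 are available; recent_errors and port_owner are hypothetical helper names, and 9100 in the usage line is Node Exporter's default port.

```shell
# Hypothetical helper: show only error-level log entries from a recent window.
recent_errors() {
  local unit="${1:-node_exporter}" since="${2:-1 hour ago}"
  # -p err: priority "err" and more severe; --no-pager: print straight to stdout.
  sudo journalctl -u "$unit" --since "$since" -p err --no-pager
}

# Hypothetical helper: show which process is listening on a TCP port.
port_owner() {
  local port="$1"
  # -l listening, -t TCP, -n numeric, -p owning process; the filter limits output to one port.
  sudo ss -ltnp "sport = :${port}"
}

# Usage:
#   recent_errors node_exporter "30 minutes ago"
#   port_owner 9100
```

If port_owner shows a different process on the exporter's port, the exporter cannot bind to it, and the journal typically contains an "address already in use" error.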

4. Verify Network Connectivity

Verify that the exporter port is reachable from the Prometheus server. Run these checks on the Prometheus host, since that is the path the scrape takes. You can use tools like ping or telnet:

ping <exporter_host>
telnet <exporter_host> <exporter_port>

Expected Output

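  • ping should show successful replies. A failed ping alone is not conclusive, because ICMP is often blocked while TCP is allowed.
  • telnet should establish a connection to the specified port. "Connection refused" suggests nothing is listening on that port, while a timeout suggests a firewall or routing problem.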
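Where telnet is not installed, a TCP check can be scripted with bash's built-in /dev/tcp redirection. This is a sketch, assuming bash and the GNU coreutils timeout command are present on the Prometheus host; check_tcp is a hypothetical helper name.

```shell
# Hypothetical helper: test whether a TCP port accepts connections.
check_tcp() {
  local host="$1" port="$2"
  # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>;
  # timeout bounds the attempt so a silently dropped packet does not hang the check.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "unreachable: ${host}:${port}"
    return 1
  fi
}

# Usage: check_tcp <exporter_host> <exporter_port>
```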

5. Update Prometheus Configuration

If the exporter has been moved to a different host or port, update the Prometheus configuration and reload it:

Edit the Prometheus configuration file (usually prometheus.yml):

scrape_configs:
  - job_name: '<job_name>'
    static_configs:
      - targets: ['<new_exporter_host>:<new_exporter_port>']

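Reload the Prometheus configuration. The HTTP reload endpoint is available only when Prometheus was started with --web.enable-lifecycle; otherwise, send SIGHUP to the Prometheus process instead: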

curl -X POST http://<prometheus_host>:<prometheus_port>/-/reload

Expected Output

Prometheus should reload the configuration without errors, and the target should be shown as UP on the Targets page (/targets) after the next scrape.
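An invalid configuration file is rejected on reload and Prometheus keeps running with the previous one, so validating the file first can save a round trip. A minimal sketch, assuming promtool (shipped with Prometheus) is on the PATH and the lifecycle endpoint is enabled; validate_and_reload is a hypothetical helper name.

```shell
# Hypothetical helper: validate the config with promtool, then request a reload.
validate_and_reload() {
  local config="$1" base_url="$2"
  # promtool check config exits non-zero when the file has errors.
  if ! promtool check config "$config"; then
    echo "config invalid, not reloading"
    return 1
  fi
  # -f makes curl fail on an HTTP error, e.g. when the lifecycle API is disabled.
  if curl -sf -X POST "${base_url}/-/reload"; then
    echo "reload requested"
  else
    echo "reload request failed"
    return 1
  fi
}

# Usage: validate_and_reload <path_to_prometheus.yml> http://<prometheus_host>:<prometheus_port>
```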

Additional Steps

1. Monitor Exporter Performance

Continuously monitor the performance and availability of the exporter to ensure it remains operational. Use tools like Prometheus and Grafana to set up dashboards and alerts.

2. Scale Exporter Instances

If the exporter frequently goes down due to high load, consider scaling the number of exporter instances to distribute the load.

By following these steps, you should be able to troubleshoot and resolve the “ExporterDown” alert. If the issue persists, further investigation into the specific exporter and its environment may be necessary.