TargetDown

Runbook: TargetDown Alert

Description

The alert indicates that one or more Prometheus scrape targets are down. It fires when more than 10% of the scrape targets in a Service are unreachable.

Severity estimation

Low if the target is non-critical or has redundant monitoring
Medium if monitoring visibility is partially lost
High if a critical service, node, or exporter is unreachable
Critical if multiple targets are down or a core component is affected

Severity increases with:

Importance of the affected target
Number of targets impacted
Duration of the downtime
Whether the alert represents a monitoring-only issue or a real outage

Expression

(count by (job, namespace, service) (up{namespace!="",cluster=~".*"} == 0) / count by (job, namespace, service) (up{cluster=~".*"})) * 100 > 10

Monitor the latest result in Grafana dashboards.
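
To make the threshold concrete, the following sketch mirrors the expression's arithmetic for a single (job, namespace, service) group. The helper names down_percent and would_fire are hypothetical and exist only for illustration; the alert itself is evaluated by Prometheus, not by this script.

```shell
# Hypothetical helpers that mirror the alert arithmetic: (down / total) * 100 > 10.

# Print the percentage of down targets.
# $1 = number of targets with up == 0, $2 = total number of targets in the group.
down_percent() {
  awk -v down="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.1f\n", down * 100 / total
  }'
}

# Succeed (exit 0) only when the percentage is strictly above 10.
would_fire() {
  awk -v down="$1" -v total="$2" 'BEGIN {
    exit !(total > 0 && down * 100 > total * 10)
  }'
}
```

For example, 1 down target out of 8 is 12.5% and would fire, while 1 out of 10 is exactly 10% and would not, because the comparison is strictly greater than 10.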

Troubleshooting steps

Identify the affected target
- Command / Action:
  - Check which target is down in Prometheus
  - Prometheus UI → Status → Targets
- Expected result:
  - Target status is UP
- Additional info:
  - Note job name, instance, and error message
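
As an alternative to the UI, the individual down targets can be listed with an instant query. The sketch below only builds the PromQL string; down_targets_query is a hypothetical helper, and PROM_URL plus the example label values are assumptions to be replaced with the values from the firing alert.

```shell
# Hypothetical helper: build the PromQL that lists each down target in a group.
# $1 = job label, $2 = namespace label (both taken from the firing alert).
down_targets_query() {
  printf 'up{job="%s",namespace="%s"} == 0\n' "$1" "$2"
}

# Usage against the Prometheus HTTP API instant-query endpoint.
# PROM_URL is an assumed placeholder, for example http://localhost:9090.
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(down_targets_query node-exporter monitoring)"
```

Each series in the response carries the instance label of a target that is currently down.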

Check target reachability
- Command / Action:
  - Test network connectivity from Prometheus
  - curl http://<target-host>:<port>/<metrics-path>
- Expected result:
  - Metrics endpoint responds with text output
- Additional info:
  - Connection refused or timeout indicates network or service issue
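
The curl exit code helps to tell these failure modes apart. The following is a minimal sketch that maps the most relevant documented curl exit codes to a likely cause; explain_curl_exit is a hypothetical helper and the suggested causes are heuristics, not a diagnosis.

```shell
# Hypothetical helper: translate a curl exit code into a likely cause.
# $1 = exit code returned by curl.
explain_curl_exit() {
  case "$1" in
    0)  echo "endpoint responded" ;;
    6)  echo "could not resolve host: check DNS" ;;
    7)  echo "failed to connect: service not listening or traffic blocked" ;;
    22) echo "HTTP error status: check the metrics path and authentication" ;;  # only reported with --fail
    28) echo "timed out: check network policies, firewalls, or an overloaded target" ;;
    *)  echo "curl exit code $1: see the curl man page" ;;
  esac
}

# Usage (TARGET_URL is an assumed placeholder for the scrape URL):
#   curl -sS --fail --max-time 10 -o /dev/null "$TARGET_URL"; explain_curl_exit $?
```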

Verify target process or pod
- Command / Action:
  - Check pod or service status (Kubernetes)
  - kubectl get pod -n <namespace>
  - kubectl get svc -n <namespace>
- Expected result:
  - Pod is Running and service endpoints exist
- Additional info:
  - No endpoints means traffic cannot reach the target
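
When a namespace holds many pods, the pod listing can be filtered down to the ones that need attention. The sketch below assumes the default kubectl get pod column layout (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper, and it will also list Completed pods, which may be expected for Jobs.

```shell
# Hypothetical helper: read "kubectl get pod" output on stdin and print pods
# that are not Running or whose containers are not all ready.
# Output columns: NAME STATUS READY
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY looks like "1/1" or "0/2"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}

# Usage (the namespace is a placeholder):
#   kubectl get pod -n <namespace> | unhealthy_pods
```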

Check exporter status
- Command / Action:
  - Verify exporter process is running
  - kubectl logs -n <namespace> <pod-name>
- Expected result:
  - Exporter starts successfully and listens on expected port
- Additional info:
  - Crashes or misconfiguration are common causes

Inspect scrape configuration
- Command / Action:
  - Validate scrape config parameters
  - kubectl get configmap -n <namespace>
- Expected result:
  - Correct target labels, ports, and paths
- Additional info:
  - Wrong ports or paths cause immediate scrape failures
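
To compare the configuration with what the target actually serves, it helps to write out the URL that Prometheus would scrape. The sketch below composes it from scheme, host, port, and path, falling back to the Prometheus defaults of http and /metrics; scrape_url is a hypothetical helper and the example hosts are made up.

```shell
# Hypothetical helper: compose the URL Prometheus would scrape.
# $1 = host, $2 = port, $3 = metrics path (optional), $4 = scheme (optional).
scrape_url() {
  host="$1"
  port="$2"
  path="${3:-/metrics}"   # default metrics_path in Prometheus
  scheme="${4:-http}"     # default scheme in Prometheus
  printf '%s://%s:%s%s\n' "$scheme" "$host" "$port" "$path"
}

# Usage:
#   scrape_url 10.0.0.5 9100
#   scrape_url exporter.monitoring.svc 8443 /probe https
```

If the resulting URL does not return metrics when requested from the Prometheus pod, the port, path, or scheme in the scrape configuration is the first thing to check.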

Check DNS and networking
- Command / Action:
  - Validate DNS resolution and network paths
  - nslookup <target-host>
  - ping <target-host>
- Expected result:
  - DNS resolves and network is reachable
- Additional info:
  - Network policies or firewalls may block access
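
For targets behind a Kubernetes Service, the name to resolve is the Service's cluster DNS name. The sketch below builds it under the assumption that the cluster uses the default cluster.local domain; svc_fqdn is a hypothetical helper, and the domain can be overridden with a third argument if the cluster is configured differently.

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# $1 = service name, $2 = namespace, $3 = cluster domain (optional).
svc_fqdn() {
  # cluster.local is the common default; this is an assumption about the cluster.
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage (run from a pod inside the cluster):
#   nslookup "$(svc_fqdn node-exporter monitoring)"
```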

Check node health
- Command / Action:
  - Verify node or VM status
  - kubectl get nodes
  - kubectl describe node <node-name>
- Expected result:
  - Node is Ready with no pressure conditions
- Additional info:
  - Node outages often cause multiple targets to go down
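
On larger clusters the node listing can be reduced to the nodes that are not Ready. The sketch below assumes the default kubectl get nodes column layout (NAME, STATUS, ROLES, AGE, VERSION); not_ready_nodes is a hypothetical helper. Cordoned nodes that report Ready,SchedulingDisabled are treated as Ready.

```shell
# Hypothetical helper: read "kubectl get nodes" output on stdin and print
# the name and status of every node whose status does not start with Ready.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1, $2 }'
}

# Usage:
#   kubectl get nodes | not_ready_nodes
```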

Confirm recovery
- Command / Action:
  - Recheck target status in Prometheus
  - Prometheus UI → Status → Targets
- Expected result:
  - Target state changes from DOWN to UP
- Additional info:
  - Alert should auto-resolve once scraping resumes
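
Recovery can take a few scrape intervals to show up, so a simple retry loop saves manual rechecking. The following is a minimal sketch of such a loop; wait_for is a hypothetical helper, and TARGET_URL in the usage comment is an assumed placeholder for the target's metrics URL.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = maximum attempts, $2 = delay in seconds between attempts, rest = command.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # the check passed
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"      # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                # the check never passed
}

# Usage: poll the metrics endpoint up to 10 times, 30 seconds apart.
#   wait_for 10 30 curl -sf --max-time 5 -o /dev/null "$TARGET_URL"
```

A successful poll only shows that the endpoint answers again; the target state in Prometheus remains the authoritative check.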

Additional resources

Prometheus targets and scraping
Related alert: KubeNodeNotReady