TargetDown
Runbook: TargetDown Alert
Description
This alert means that one or more Prometheus scrape targets are down. It fires when more than 10% of the scrape targets of a Service (grouped by job, namespace, and service) are unreachable.
Severity estimation
- Low if the target is non-critical or has redundant monitoring
- Medium if monitoring visibility is partially lost
- High if a critical service, node, or exporter is unreachable
- Critical if multiple targets are down or a core component is affected
Severity increases with:
- Importance of the affected target
- Number of targets impacted
- Duration of the downtime
- Whether the alert represents a monitoring-only issue or a real outage
Expression
(count by (job, namespace, service) (up{namespace!="",cluster=~".*"} == 0) / count by (job, namespace, service) (up{cluster=~".*"})) * 100 > 10
Monitor the latest result in Grafana dashboards.
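To make the threshold concrete, the following sketch mimics the expression above on a hand-written list of `up` samples. The helper name `percent_down` and the sample data are hypothetical; the real evaluation happens inside Prometheus, and this is only an approximation of its grouping and matching behaviour.

```python
from collections import Counter

GROUP_LABELS = ("job", "namespace", "service")


def percent_down(samples, threshold=10.0):
    """Mimic the alert expression on (labels, value) pairs of the `up` metric.

    Returns {group: percent_down} for every group above the threshold.
    """
    down = Counter()   # numerator: series with up == 0 and a non-empty namespace
    total = Counter()  # denominator: every up series
    for labels, value in samples:
        # A missing label is treated as an empty string, as in PromQL.
        group = tuple(labels.get(name, "") for name in GROUP_LABELS)
        total[group] += 1
        if value == 0 and labels.get("namespace", "") != "":
            down[group] += 1
    # Only groups present in the numerator can produce a result.
    result = {}
    for group, count in down.items():
        percent = 100.0 * count / total[group]
        if percent > threshold:  # strictly greater, like "> 10"
            result[group] = percent
    return result


# Hypothetical data: 2 of 10 targets of one service are down.
labels = {"job": "api", "namespace": "prod", "service": "api"}
samples = [(labels, 0)] * 2 + [(labels, 1)] * 8
print(percent_down(samples))
```

Note that exactly 10% down does not fire, because the comparison is strict.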
Troubleshooting steps
- Identify the affected target
  - Command / Action:
    - Check which target is down in Prometheus
    - Prometheus UI → Status → Targets
  - Expected result:
    - Target status is UP
  - Additional info:
    - Note the job name, instance, and error message
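The same information shown on the Targets page is also available from the Prometheus HTTP API at `/api/v1/targets`. The sketch below extracts job, instance, and last error for every target that is not healthy from such a response. The function name `down_targets` and the sample payload are hypothetical; fetching the JSON from your Prometheus server is left out.

```python
import json


def down_targets(payload):
    """Return (job, instance, lastError) for every active target not reporting "up".

    `payload` is the JSON body of /api/v1/targets, as a string or a parsed dict.
    Targets with health "unknown" are also listed, since they are not confirmed up.
    """
    data = json.loads(payload) if isinstance(payload, str) else payload
    result = []
    for target in data.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            result.append(
                (
                    labels.get("job", ""),
                    labels.get("instance", ""),
                    target.get("lastError", ""),
                )
            )
    return result


# Hypothetical, shortened response used only for illustration.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "10.0.0.1:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "node", "instance": "10.0.0.2:9100"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
print(down_targets(SAMPLE))
```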
- Check target reachability
  - Command / Action:
    - Test network connectivity from Prometheus
    - curl http://<host>:<port>/<path>
  - Expected result:
    - Metrics endpoint responds with text output
  - Additional info:
    - Connection refused or a timeout indicates a network or service issue
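Where `curl` is not available, a small probe can make the same check and name the failure mode. This is a minimal sketch using only the Python standard library; `probe` and `classify_failure` are hypothetical helper names, and the URL you pass in is whatever the Targets page shows for the affected target.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception from a failed probe to a short, human-readable cause."""
    if isinstance(exc, urllib.error.HTTPError):
        # The endpoint answered, but not with a success status.
        return f"http error {exc.code}"
    # urllib wraps low-level socket errors in URLError.reason.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ConnectionRefusedError):
        return "connection refused"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, socket.gaierror):
        return "dns failure"
    return f"other: {reason}"


def probe(url, timeout=5.0):
    """Fetch a metrics URL and return "ok" or a failure cause."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1024)  # read a little to confirm a body is served
        return "ok"
    except (urllib.error.URLError, OSError) as exc:
        return classify_failure(exc)
```

Run it from the same network location as Prometheus (for example, from a pod in the same namespace); a result from your workstation may not reflect what Prometheus sees.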
- Verify target process or pod
  - Command / Action:
    - Check pod or service status (Kubernetes)
    - kubectl get pod -n <namespace>
    - kubectl get svc -n <namespace>
  - Expected result:
    - Pod is Running and service endpoints exist
  - Additional info:
    - No endpoints means traffic cannot reach the target
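The "no endpoints" case can be checked mechanically. The sketch below assumes you have saved the output of `kubectl get endpoints <service> -n <namespace> -o json` and parsed it into a dict; `ready_addresses` is a hypothetical helper name, and the sample objects are made up for illustration.

```python
def ready_addresses(endpoints):
    """Return the ready IPs of a Kubernetes Endpoints object given as a dict.

    An Endpoints object without "subsets", or with only "notReadyAddresses",
    has no ready backends, so traffic cannot reach the target.
    """
    ips = []
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("addresses") or []:
            ips.append(address.get("ip", ""))
    return ips


# Hypothetical objects: one service with a ready backend, one without.
healthy = {"subsets": [{"addresses": [{"ip": "10.1.2.3"}], "ports": [{"port": 9100}]}]}
broken = {"subsets": [{"notReadyAddresses": [{"ip": "10.1.2.4"}]}]}
print(ready_addresses(healthy))
print(ready_addresses(broken))
```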
- Check exporter status
  - Command / Action:
    - Verify the exporter process is running
    - kubectl logs -n <namespace> <pod>
  - Expected result:
    - Exporter starts successfully and listens on the expected port
  - Additional info:
    - Crashes or misconfiguration are common causes
- Inspect scrape configuration
  - Command / Action:
    - Validate scrape config parameters
    - kubectl get configmap -n <namespace>
  - Expected result:
    - Correct target labels, ports, and paths
  - Additional info:
    - Wrong ports or paths cause immediate scrape failures
- Check DNS and networking
  - Command / Action:
    - Validate DNS resolution and network paths
    - nslookup <hostname>
    - ping <host>
  - Expected result:
    - DNS resolves and network is reachable
  - Additional info:
    - Network policies or firewalls may block access
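Because `ping` uses ICMP, it can succeed while the scrape port is still blocked by a network policy, or fail while the port is reachable. As a complement, the sketch below resolves a name and then attempts a TCP connection to the scrape port. It is a swapped-in technique, not a replacement for the commands above; `check_dns_and_tcp` is a hypothetical helper name.

```python
import socket


def check_dns_and_tcp(host, port, timeout=3.0):
    """Resolve `host`, then try a TCP connect to each resolved address.

    Returns a dict with the resolved addresses, whether any connect
    succeeded, and the last error seen (empty on success).
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {"resolved": [], "reachable": False, "error": f"dns: {exc}"}
    addresses = sorted({info[4][0] for info in infos})
    last_error = ""
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return {"resolved": addresses, "reachable": True, "error": ""}
        except OSError as exc:
            last_error = f"tcp: {exc}"
    return {"resolved": addresses, "reachable": False, "error": last_error}
```

A DNS error points at name resolution, while a `tcp:` error with resolved addresses points at the network path or the target process.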
- Check node health
  - Command / Action:
    - Verify node or VM status
    - kubectl get nodes
    - kubectl describe node
  - Expected result:
    - Node is Ready with no pressure conditions
  - Additional info:
    - Node outages often cause multiple targets to go down
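On larger clusters it can be quicker to scan node conditions programmatically. The sketch below assumes the output of `kubectl get nodes -o json` has been parsed into a dict and reports nodes that are not Ready or that report a pressure condition. `unhealthy_nodes` is a hypothetical helper name, and the sample node list is made up.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def unhealthy_nodes(node_list):
    """Return {node_name: [problems]} for nodes that are not Ready or under pressure."""
    problems = {}
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "")
        # Map condition type to its status string ("True", "False", "Unknown").
        conditions = {
            c.get("type"): c.get("status")
            for c in node.get("status", {}).get("conditions", [])
        }
        issues = []
        if conditions.get("Ready") != "True":
            issues.append("NotReady")
        issues.extend(t for t in PRESSURE_TYPES if conditions.get(t) == "True")
        if issues:
            problems[name] = issues
    return problems


# Hypothetical node list for illustration.
NODES = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "DiskPressure", "status": "False"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"},
                                   {"type": "MemoryPressure", "status": "True"}]}},
    ]
}
print(unhealthy_nodes(NODES))
```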
- Confirm recovery
  - Command / Action:
    - Recheck target status in Prometheus
    - Prometheus UI → Status → Targets
  - Expected result:
    - Target state changes from DOWN to UP
  - Additional info:
    - Alert should auto-resolve once scraping resumes
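Recovery is not instant, because the target only flips back to UP after the next successful scrape. A small polling loop can wait for that instead of refreshing the page by hand. In this sketch, `wait_until_up` is a hypothetical helper, and `fetch_health` is a caller-supplied function (an assumption, not a Prometheus API) that returns the current health string of the target, for example by querying the targets endpoint.

```python
import time


def wait_until_up(fetch_health, attempts=10, delay=15.0, sleep=time.sleep):
    """Poll `fetch_health()` until it returns "up".

    Returns the attempt number on which the target was seen up,
    or None if it never recovered within `attempts` polls.
    """
    for attempt in range(1, attempts + 1):
        if fetch_health() == "up":
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next poll, but not after the last one
    return None


# Hypothetical usage with a canned sequence of health states.
states = iter(["down", "down", "up"])
print(wait_until_up(lambda: next(states), attempts=5, delay=0.0))
```

Choose a delay close to the scrape interval of the job; polling faster than Prometheus scrapes adds no information.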
Additional resources
- Prometheus targets and scraping
- Related alert: KubeNodeNotReady