Alert Runbooks

TargetDown

Runbook: TargetDown Alert

Description

This alert indicates that one or more Prometheus scrape targets are down. It fires when more than 10% of the scrape targets in a Service (grouped by job, namespace, and service) are unreachable.

Severity estimation

Severity increases with:

Expression

(count by (job, namespace, service) (up{namespace!="",cluster=~".*"} == 0) / count by (job, namespace, service) (up{cluster=~".*"})) * 100 > 10

Monitor the latest result in Grafana dashboards.
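To make the threshold concrete, the following Python sketch reproduces the arithmetic of the expression above on in-memory samples. It is an illustration only: the function names `down_percentage` and `firing_groups` and the sample data are hypothetical, and the real evaluation happens inside Prometheus.

```python
from collections import defaultdict

# Labels the expression groups by: count by (job, namespace, service)
GROUP_LABELS = ("job", "namespace", "service")


def down_percentage(samples):
    """Return {(job, namespace, service): percent of targets down}.

    `samples` is an iterable of (labels, value) pairs for the `up` metric,
    where `labels` is a dict and `value` is 1 (up) or 0 (down).
    Only groups with at least one down target are returned, mirroring how
    the division drops groups that are missing from the numerator.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in GROUP_LABELS)
        total[key] += 1
        # The numerator only counts series with a non-empty namespace label.
        if value == 0 and labels.get("namespace", "") != "":
            down[key] += 1
    return {key: 100.0 * down[key] / total[key] for key in down}


def firing_groups(samples, threshold=10.0):
    """Return the groups whose down percentage is strictly above the threshold."""
    return {
        key: pct
        for key, pct in down_percentage(samples).items()
        if pct > threshold
    }


if __name__ == "__main__":
    # Hypothetical sample data: 10 targets in one service, 2 of them down.
    labels = {"job": "api", "namespace": "prod", "service": "api"}
    samples = [(labels, 1.0)] * 8 + [(labels, 0.0)] * 2
    print(firing_groups(samples))
```

With 1 of 10 targets down the ratio is exactly 10%, which does not exceed the threshold, so the alert stays silent. With 2 of 10 down the ratio is 20% and the group fires.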

Troubleshooting steps

  1. Identify the affected target

    • Command / Action:
      • Check which target is down in Prometheus
      • Prometheus UI → Status → Targets

    • Expected result:
      • Target status is UP
    • Additional info:
      • Note job name, instance, and error message
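The same information shown on the Targets page is also available from the Prometheus HTTP API at `/api/v1/targets`. The sketch below lists the targets whose health is `down`, together with the job name, instance, and last error. The base URL passed to `fetch_targets` is an assumption and must be replaced with the address of your Prometheus server; the function names are hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url, timeout=10.0):
    """Fetch active targets from the Prometheus HTTP API and return parsed JSON.

    `base_url` is an assumption, for example a port-forwarded
    http://localhost:9090; adjust it to your environment.
    """
    url = base_url.rstrip("/") + "/api/v1/targets?state=active"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(targets_response):
    """Return one dict per active target whose health is "down"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    rows = []
    for target in active:
        if target.get("health") != "down":
            continue
        labels = target.get("labels", {})
        rows.append(
            {
                "job": labels.get("job"),
                "instance": labels.get("instance"),
                "scrapeUrl": target.get("scrapeUrl"),
                "lastError": target.get("lastError"),
            }
        )
    return rows


# Example usage (requires network access to Prometheus):
#   for row in down_targets(fetch_targets("http://localhost:9090")):
#       print(row["job"], row["instance"], row["lastError"])
```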

  1. Check target reachability

    • Command / Action:
      • Test network connectivity from Prometheus
      • curl http://<target-host>:<port>/<metrics-path>

    • Expected result:
      • Metrics endpoint responds with text output
    • Additional info:
      • Connection refused or timeout indicates network or service issue
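When the check has to be repeated for several targets, the curl test can be scripted. The following sketch probes a metrics URL and maps the failure to the categories mentioned above (connection refused, timeout, DNS failure). The function names and category strings are hypothetical, and the URL must be the scrape URL noted in the previous step.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised while probing a target to a short category."""
    if isinstance(exc, urllib.error.HTTPError):
        # The endpoint answered, but with a 4xx or 5xx status.
        return "http_error"
    # URLError wraps the underlying socket error in `reason`.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(reason, socket.gaierror):
        return "dns_failure"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return "unreachable"


def probe_metrics(url, timeout=5.0):
    """Return "ok" if the URL answers with a success status, else a failure category."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1024)  # read a little to confirm the body is served
            return "ok"
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return classify_failure(exc)


# Example usage, run from a host or pod with the same network view as Prometheus:
#   print(probe_metrics("http://<target-host>:<port>/<metrics-path>"))
```

Running the probe from a pod in the same namespace as Prometheus gives a more faithful result than running it from a workstation, because network policies may differ between the two.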

  1. Verify target process or pod

    • Command / Action:
      • Check pod or service status (Kubernetes)
      • kubectl get pod -n <namespace>
      • kubectl get svc -n <namespace>

    • Expected result:
      • Pod is Running and service endpoints exist
    • Additional info:
      • No endpoints means traffic cannot reach the target
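The expected result above can be checked programmatically from the JSON output of kubectl. The sketch below assumes kubectl is installed and configured for the cluster; `kubectl_json`, `not_ready_pods`, and `ready_endpoint_ips` are hypothetical helper names, and the namespace and service name are placeholders.

```python
import json
import subprocess


def kubectl_json(*args):
    """Run kubectl with `-o json` appended and return the parsed output."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def not_ready_pods(pod_list):
    """Return (name, phase) for pods that are not Running with all containers ready."""
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            problems.append((pod["metadata"]["name"], status.get("phase")))
    return problems


def ready_endpoint_ips(endpoints):
    """Return the ready IP addresses behind a Service's Endpoints object."""
    ips = []
    # `subsets` is absent when the Service selects no ready pods.
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("addresses") or []:
            ips.append(address["ip"])
    return ips


# Example usage (placeholders must be filled in):
#   pods = kubectl_json("get", "pods", "-n", "<namespace>")
#   eps = kubectl_json("get", "endpoints", "<service-name>", "-n", "<namespace>")
#   print(not_ready_pods(pods), ready_endpoint_ips(eps))
```

An empty list from `ready_endpoint_ips` corresponds to the "no endpoints" case: the Service exists but has nothing to route traffic to. Note that completed Job pods (phase Succeeded) also appear in `not_ready_pods` and can usually be ignored.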

  1. Check exporter status

    • Command / Action:
      • Verify exporter process is running
      • kubectl logs <pod-name> -n <namespace>

    • Expected result:
      • Exporter starts successfully and listens on expected port
    • Additional info:
      • Crashes or misconfiguration are common causes

  1. Inspect scrape configuration

    • Command / Action:
      • Validate scrape config parameters
      • kubectl get configmap <configmap-name> -n <namespace>

    • Expected result:
      • Correct target labels, ports, and paths
    • Additional info:
      • Wrong ports or paths cause immediate scrape failures

  1. Check DNS and networking

    • Command / Action:
      • Validate DNS resolution and network paths
      • nslookup <target-host>
      • ping <target-host>

    • Expected result:
      • DNS resolves and network is reachable
    • Additional info:
      • Network policies or firewalls may block access
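Where nslookup or ping are not installed (minimal container images often lack them), a rough equivalent can be sketched with the Python standard library. Note that this swaps ping's ICMP check for a TCP connection attempt to the scrape port, which is closer to what Prometheus actually does. The function names are hypothetical.

```python
import socket


def resolve(host):
    """Return the sorted list of IP addresses the host resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (placeholders must be filled in):
#   print(resolve("<target-host>"))
#   print(tcp_reachable("<target-host>", <port>))
```

If `resolve` succeeds but `tcp_reachable` fails, the cause is more likely a network policy, firewall, or a process that is not listening than a DNS problem.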

  1. Check node health

    • Command / Action:
      • Verify node or VM status
      • kubectl get nodes
      • kubectl describe node <node-name>

    • Expected result:
      • Node is Ready with no pressure conditions
    • Additional info:
      • Node outages often cause multiple targets to go down
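On clusters with many nodes, scanning the describe output by hand is slow. The sketch below summarises node conditions from the JSON form of `kubectl get nodes`. The function name `node_problems` is hypothetical, and the input is assumed to be the parsed output of `kubectl get nodes -o json`.

```python
def node_problems(node_list):
    """Return {node name: [issues]} for nodes that are not Ready or report pressure.

    A node is flagged when its Ready condition is not "True", or when any
    other condition (MemoryPressure, DiskPressure, PIDPressure,
    NetworkUnavailable) has status "True".
    """
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        issues = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype = cond.get("type")
            cstatus = cond.get("status")
            if ctype == "Ready":
                if cstatus != "True":
                    issues.append("NotReady")
            elif cstatus == "True":
                issues.append(ctype)
        if issues:
            problems[name] = issues
    return problems


# Example usage (requires kubectl and cluster access):
#   import json, subprocess
#   raw = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                        check=True, capture_output=True, text=True).stdout
#   print(node_problems(json.loads(raw)))
```

Comparing the flagged node names with the nodes hosting the down targets helps confirm whether a single node outage explains several targets going down at once.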

  1. Confirm recovery

    • Command / Action:
      • Recheck target status in Prometheus
      • Prometheus UI → Status → Targets

    • Expected result:
      • Target state changes from DOWN to UP
    • Additional info:
      • Alert should auto-resolve once scraping resumes

Additional resources