Alert Runbooks

KubeSchedulerAbsent

Description

Prometheus target discovery has not found the Kubernetes Scheduler in the past 15 minutes.

The Kube Scheduler is a critical control plane component responsible for assigning newly created pods to nodes based on resource requirements, constraints, affinity/anti-affinity rules, and other scheduling policies. While it is absent from monitoring, no metrics are collected from it, so Prometheus cannot tell whether the Scheduler itself is down or only its scraping or discovery is broken.


Possible Causes:


Severity estimation

Critical severity - The Scheduler is a core control plane component.

While existing pods continue to run normally, without a functioning Scheduler:

Immediate action is required if the Scheduler is actually down (not just a monitoring issue).


Troubleshooting steps

  1. Verify Prometheus target status

    • Command / Action:
      • Check whether the Scheduler appears in Prometheus targets
      • Open the Prometheus UI at /targets
      • Look for kube-scheduler targets
    • Expected result:
      • The target should be present with status “UP”
    • Additional info:
      • If the target is missing from discovery, this is a service discovery issue
      • If the target shows as “DOWN”, Prometheus cannot reach the endpoint or the scrape is rejected (for example, by authentication)
      • In HA setups, only the leader scheduler actively schedules pods, so scheduling-related metrics are only updated on the leader
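
The same check can be scripted against the Prometheus HTTP API instead of the /targets page. This is a minimal sketch, assuming Prometheus is reachable at PROM_URL (default http://localhost:9090, for example through kubectl port-forward) and that the scrape job is labeled job="kube-scheduler"; both are assumptions to adjust for your setup. An empty result for up{job="kube-scheduler"} == 1 means no healthy Scheduler target is being scraped.

```shell
# Sketch: check for a healthy kube-scheduler target through the Prometheus HTTP API.
# Assumptions: PROM_URL points at Prometheus; the scrape job label is "kube-scheduler".
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the raw JSON response for: up{job="kube-scheduler"} == 1
query_scheduler_up() {
  curl -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=up{job="kube-scheduler"} == 1'
}

# Read a /api/v1/query response on stdin and print one word:
#   error   - the query did not succeed (or curl returned nothing)
#   absent  - no kube-scheduler target is currently UP
#   healthy - at least one kube-scheduler target is UP
classify_up_response() {
  local body
  body=$(cat)
  if ! printf '%s' "$body" | grep -qF '"status":"success"'; then
    echo "error"
  elif printf '%s' "$body" | grep -qF '"result":[]'; then
    echo "absent"
  else
    echo "healthy"
  fi
}
```

Usage: query_scheduler_up | classify_up_response. Querying for up == 1 rather than just up makes "absent" cover both a missing target and a target that is DOWN.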

  2. Check Scheduler pod status (for kubeadm clusters)

    • Command / Action:
      • Verify the Scheduler is running as a static pod
      • kubectl get pods -n kube-system | grep scheduler
      • kubectl get pods -n kube-system -l component=kube-scheduler
    • Expected result:
      • The Scheduler pod should be in the Running state, for example:
      • kube-scheduler-<node>   1/1   Running
    • Additional info:
      • For kubeadm clusters, the Scheduler runs as a static pod on each control plane node
      • In HA setups, multiple scheduler pods exist but only one is active (the leader)
      • The pod should be Running with 1/1 containers ready
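
When there are several control plane nodes, the pod listing can be filtered so that only problems are printed. A minimal sketch, assuming the default kubectl get pods --no-headers column order (NAME, READY, STATUS, RESTARTS, AGE); the function name is illustrative.

```shell
# Sketch: summarize kube-scheduler static pods from `kubectl get pods --no-headers`.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
# Exit status: 0 all healthy, 1 some pod not Running/ready, 2 no pods found.
check_scheduler_pods() {
  awk '
    NF == 0 { next }
    {
      total++
      split($2, ready, "/")          # READY is "<ready>/<total>"
      if ($3 != "Running" || ready[1] != ready[2]) {
        bad++
        print "NOT READY: " $1 " " $2 " " $3
      }
    }
    END {
      if (total == 0) { print "NO SCHEDULER PODS FOUND"; exit 2 }
      if (bad > 0) exit 1
      print "OK: " total " scheduler pod(s) Running and ready"
    }'
}
```

Usage: kubectl get pods -n kube-system -l component=kube-scheduler --no-headers | check_scheduler_pods. An empty listing is reported separately because it points to a missing manifest or a stopped kubelet rather than a crashing container.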

  3. Check Scheduler process (for systemd-managed clusters)

    • Command / Action:
      • Verify the Scheduler service is active
      • systemctl status kube-scheduler
      • ps aux | grep kube-scheduler
    • Expected result:
      • The service should be active (running)
      • A kube-scheduler process should be present
    • Additional info:
      • In clusters not built with kubeadm, the Scheduler may run as a systemd service instead of a static pod
      • Check the service status on every control plane node
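
On systemd-managed control planes the two checks above can be combined into one. This sketch assumes the unit and the process are both named kube-scheduler, as in the commands above, and prints the last journal lines only when something looks wrong; run it on each control plane node.

```shell
# Sketch: combine the systemd unit check and the process check on one node.
# Assumes the systemd unit and the binary are both named kube-scheduler.
check_scheduler_unit() {
  local state
  # `systemctl is-active` prints the unit state (active, inactive, failed, ...)
  state=$(systemctl is-active kube-scheduler 2>/dev/null) || true
  if [ "$state" = "active" ] && pgrep -x kube-scheduler >/dev/null; then
    echo "OK: kube-scheduler unit active and process present"
    return 0
  fi
  echo "PROBLEM: unit state '${state:-unknown}'; last log lines:"
  journalctl -u kube-scheduler -n 20 --no-pager
  return 1
}
```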

  4. Inspect Scheduler logs

    • Command / Action:
      • Review logs for errors or crash information
      • For static pod: kubectl logs -n kube-system kube-scheduler-<node>
      • For systemd: journalctl -u kube-scheduler -n 100
    • Expected result:
      • No critical errors or crash logs
      • The Scheduler should be processing pod scheduling requests
    • Additional info:
      • Look for authentication errors, API server connectivity issues, or crashes
      • If the container keeps restarting, kubectl logs --previous shows the output of the last terminated container
      • Check for leader election messages (only the leader actively schedules)
      • Look for “successfully acquired lease kube-system/kube-scheduler” messages
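
For long logs, a quick tally of error lines and leader-election events can show at a glance whether the Scheduler is failing or repeatedly losing its lease. A minimal sketch: the klog "E<MMDD>" error prefix is standard, but the leader-election message texts matched below are assumptions based on client-go leader election logging and may differ between versions.

```shell
# Sketch: count error lines and leader-election events in kube-scheduler logs.
# Matches klog-style error lines ("E0612 10:05:00.000000 ...") either at the start of
# the line (kubectl logs) or after a "...: " prefix (journalctl).
# The lease message texts are assumptions and may vary by Kubernetes version.
summarize_scheduler_logs() {
  awk '
    {
      line = tolower($0)
      if (line ~ /^e[0-9][0-9][0-9][0-9] [0-9]/ || line ~ /: e[0-9][0-9][0-9][0-9] [0-9]/) errors++
      if (line ~ /successfully acquired lease/) acquired++
      if (line ~ /failed to renew lease/ || line ~ /leaderelection lost/) lost++
    }
    END { printf "errors=%d lease_acquired=%d lease_lost=%d\n", errors, acquired, lost }'
}
```

Usage: kubectl logs -n kube-system kube-scheduler-<node> | summarize_scheduler_logs, or journalctl -u kube-scheduler -n 500 --no-pager | summarize_scheduler_logs. A non-zero lease_lost count usually means the Scheduler lost contact with the API server.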

  5. Check for pending pods (may indicate scheduler issues)

    • Command / Action:
      • List pods stuck in the Pending state
      • kubectl get pods -A --field-selector=status.phase=Pending
      • kubectl describe pod <pending-pod> -n <namespace>
    • Expected result:
      • If the scheduler is working, pending pods should have scheduling events
      • No pods should be stuck in Pending because of a scheduler failure
    • Additional info:
      • Pods in Pending with no scheduler events indicate the scheduler is not working
      • Events should show “Successfully assigned” or a scheduling failure reason (FailedScheduling)
      • Pending due to insufficient resources reflects cluster capacity, not a scheduler outage
      • Pending pods that already have spec.nodeName set have been scheduled; the cause is on the node (for example, image pulls), not the scheduler
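
To separate scheduler problems from other reasons a pod can stay Pending, it helps to list only Pending pods that have no node assigned and that rely on the default scheduler. A minimal sketch; the custom-columns selection in the comment is one way to produce the input, and pods that set a different spec.schedulerName are skipped because kube-scheduler does not handle them.

```shell
# Sketch: list Pending pods that the default scheduler has not placed yet.
# Input: output of
#   kubectl get pods -A --field-selector=status.phase=Pending --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,SCHED:.spec.schedulerName
# kubectl prints <none> for an empty spec.nodeName.
unscheduled_pending_pods() {
  awk '$3 == "<none>" && $4 == "default-scheduler" { print $1 "/" $2 }'
}
```

Each namespace/name printed is a candidate for kubectl describe pod; if many of them show no events at all, the Scheduler is most likely not running.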

  6. Verify metrics endpoint accessibility

    • Command / Action:
      • Test whether the metrics endpoint is reachable
      • kubectl get pods -n kube-system -l component=kube-scheduler -o wide
      • curl -k https://<scheduler-ip>:10259/metrics
    • Expected result:
      • The metrics endpoint should return Prometheus-formatted metrics
      • HTTP 200 response with metric data
    • Additional info:
      • The default metrics port is 10259 (secure); the legacy insecure port 10251 is deprecated and disabled or removed in recent Kubernetes releases
      • The endpoint requires authentication: an unauthenticated curl typically gets 401 or 403, which still proves the port is reachable; pass a bearer token (curl -H "Authorization: Bearer <token>") to fetch metrics
      • Check the --bind-address and --secure-port flags
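
Because an unauthenticated request is rejected, the HTTP status code is more informative than the body. This sketch classifies the code returned by curl; TOKEN is a hypothetical variable holding a bearer token that is allowed to GET /metrics (for example one obtained with kubectl create token <service-account> -n <namespace> on Kubernetes 1.24+).

```shell
# Sketch: probe the scheduler's secure metrics port and interpret the HTTP status.
# Assumptions: TOKEN (hypothetical) holds a bearer token allowed to GET /metrics,
# and the scheduler listens on 10259.

# Map an HTTP status code from curl's %{http_code} to a short diagnosis.
classify_metrics_status() {
  case "$1" in
    200)     echo "OK: metrics returned" ;;
    401|403) echo "REACHABLE: request rejected by authn/authz ($1)" ;;
    000|"")  echo "UNREACHABLE: no HTTP response" ;;
    *)       echo "UNEXPECTED: HTTP $1" ;;
  esac
}

check_metrics_endpoint() {
  local host="$1" port="${2:-10259}" code
  # -k skips certificate verification, as in the manual curl command above
  code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' \
    -H "Authorization: Bearer ${TOKEN:-}" "https://${host}:${port}/metrics") || true
  classify_metrics_status "$code"
}
```

Usage: export TOKEN=..., then check_metrics_endpoint <scheduler-ip>. A 401/403 means the network path is fine and the problem is credentials or RBAC; "UNREACHABLE" points to the bind address, firewall, or network policy checks.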

  7. Check Scheduler configuration

    • Command / Action:
      • Review the Scheduler startup configuration
      • For static pod: kubectl get pod -n kube-system kube-scheduler-<node> -o yaml
      • Check the manifest: cat /etc/kubernetes/manifests/kube-scheduler.yaml
    • Expected result:
      • The secure metrics endpoint is listening on an address Prometheus can reach
      • --bind-address should not be 127.0.0.1 if Prometheus scrapes from other nodes or from the pod network (kubeadm sets 127.0.0.1 by default, which only allows scraping from the node itself)
    • Additional info:
      • Ensure --authentication-kubeconfig and --authorization-kubeconfig are set so the Scheduler can authenticate and authorize scrape requests
      • Verify the network mode (the kubeadm static pod uses host networking) allows external metric scraping
      • Check that the --leader-elect flag is set to true for HA setups
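
The relevant flags can be pulled out of the manifest or pod spec without reading the whole file. A minimal sketch that only matches flags written as --flag=value (the style kubeadm uses) and warns about a loopback bind; the function names are illustrative.

```shell
# Sketch: extract metrics- and election-related flags from a scheduler manifest or
# pod spec (YAML on stdin) and warn when metrics are bound to loopback.
# Only flags written as --flag=value are matched (the kubeadm manifest style).
scheduler_flags() {
  grep -Eo -- '--(bind-address|secure-port|leader-elect|authentication-kubeconfig|authorization-kubeconfig)=[^[:space:]"]*'
}

check_bind_address() {
  local flags
  flags=$(scheduler_flags)
  printf '%s\n' "$flags"
  if printf '%s\n' "$flags" | grep -qxF -- '--bind-address=127.0.0.1'; then
    echo "WARNING: metrics bound to loopback; only reachable from the node itself"
  fi
}
```

Usage: sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | check_bind_address, or pipe the output of kubectl get pod ... -o yaml into it. Requiring the "=" avoids false matches on longer flags such as --leader-elect-lease-duration.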

  8. Verify leader election status (for HA clusters)

    • Command / Action:
      • Check which scheduler is the leader
      • kubectl get lease -n kube-system kube-scheduler -o yaml
      • kubectl get endpoints -n kube-system kube-scheduler -o yaml (only on older clusters that use Endpoints-based leader election)
    • Expected result:
      • One scheduler should hold the leader lease
      • spec.holderIdentity should name the current leader, and spec.renewTime should be recent
    • Additional info:
      • Only the leader actively schedules pods
      • Non-leader schedulers are in standby mode
      • Leader election uses a Lease object in the kube-system namespace
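
A lease that exists but is no longer being renewed is a strong sign that the leader has died without a standby taking over. This sketch shows the holder and the age of spec.renewTime; it assumes GNU date (for date -d), and compares against the scheduler's lease duration (--leader-elect-lease-duration, 15s by default).

```shell
# Sketch: show the current leader and how long ago it renewed the lease.
# Requires GNU date (`date -d`). A lease age far above the lease duration
# (--leader-elect-lease-duration, 15s by default) suggests the leader stopped renewing.

# $1 = renewTime in RFC 3339 form, $2 = "now" in epoch seconds (defaults to current time)
lease_age_seconds() {
  local renew_epoch now="${2:-$(date +%s)}"
  renew_epoch=$(date -d "$1" +%s) || return 1
  echo $(( now - renew_epoch ))
}

show_scheduler_leader() {
  local holder renew
  holder=$(kubectl get lease -n kube-system kube-scheduler -o jsonpath='{.spec.holderIdentity}')
  renew=$(kubectl get lease -n kube-system kube-scheduler -o jsonpath='{.spec.renewTime}')
  echo "leader: ${holder:-<none>}"
  if [ -z "$renew" ]; then
    echo "seconds since last renewal: unknown (no renewTime)"
  else
    echo "seconds since last renewal: $(lease_age_seconds "$renew")"
  fi
}
```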

  9. Verify ServiceMonitor configuration (for Prometheus Operator)

    • Command / Action:
      • Check the ServiceMonitor for the Scheduler (its namespace depends on how the monitoring stack was installed)
      • kubectl get servicemonitor -A | grep -i scheduler
      • kubectl get servicemonitor -n monitoring -l app.kubernetes.io/name=kube-scheduler -o yaml
    • Expected result:
      • The ServiceMonitor exists and targets the correct endpoints
      • Its selector matches the Scheduler service/endpoints
    • Additional info:
      • The ServiceMonitor tells Prometheus how to discover the Scheduler
      • Check the endpoint port, scheme, and authentication settings
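
A common ServiceMonitor problem is a port name that does not exist on the Service it selects. A minimal sketch, assuming a ServiceMonitor named kube-scheduler whose first endpoint uses a named port, and a Service named kube-scheduler in kube-system; these names and the default "monitoring" namespace are assumptions (kube-prometheus style), so adjust them to your installation.

```shell
# Sketch: check that the port a ServiceMonitor scrapes is exposed by the Service.
# Assumptions: ServiceMonitor "kube-scheduler" (namespace $1, default "monitoring")
# uses a named port in its first endpoint; Service "kube-scheduler" lives in kube-system.

# Return 0 if $1 is equal to any of the remaining arguments.
servicemonitor_port_matches() {
  local wanted="$1" p
  shift
  for p in "$@"; do
    [ "$p" = "$wanted" ] && return 0
  done
  return 1
}

check_scheduler_servicemonitor() {
  local sm_ns="${1:-monitoring}" sm_port svc_ports
  sm_port=$(kubectl get servicemonitor -n "$sm_ns" kube-scheduler -o jsonpath='{.spec.endpoints[0].port}')
  svc_ports=$(kubectl get svc -n kube-system kube-scheduler -o jsonpath='{.spec.ports[*].name}')
  # $svc_ports is intentionally unquoted so each port name becomes one argument
  if servicemonitor_port_matches "$sm_port" $svc_ports; then
    echo "OK: ServiceMonitor port '$sm_port' exists on the Service"
  else
    echo "MISMATCH: ServiceMonitor scrapes port '$sm_port', Service exposes: ${svc_ports:-none}"
    return 1
  fi
}
```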

  10. Check service and endpoints

    • Command / Action:
      • Verify the service exists and has endpoints
      • kubectl get svc -n kube-system kube-scheduler
      • kubectl get endpoints -n kube-system kube-scheduler
    • Expected result:
      • The service should exist and have valid endpoints
      • The endpoints should point to running Scheduler instances
    • Additional info:
      • Missing endpoints mean the service selector does not match any ready Scheduler pods
      • Static pods are not exposed by a Service by default; you may need to create one manually unless your monitoring stack creates it during installation
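
Comparing the scheduler pod IPs with the endpoint addresses shows which instance the Service is not picking up. A minimal sketch, assuming the Service is named kube-scheduler as in the commands above; for host-network static pods the pod IP is the node IP, and pods that are not ready appear under notReadyAddresses, so they are reported as missing here.

```shell
# Sketch: find scheduler pods whose IP is missing from the Service's ready endpoints.
# Assumes a Service named kube-scheduler in kube-system (as in the commands above).

# Print every IP from list $1 that does not appear in list $2 (space-separated).
missing_ips() {
  local ip
  for ip in $1; do
    case " $2 " in
      *" $ip "*) ;;
      *) echo "$ip" ;;
    esac
  done
}

check_scheduler_endpoints() {
  local pod_ips ep_ips
  pod_ips=$(kubectl get pods -n kube-system -l component=kube-scheduler -o jsonpath='{.items[*].status.podIP}')
  ep_ips=$(kubectl get endpoints -n kube-system kube-scheduler -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null) || true
  if [ -z "$ep_ips" ]; then
    echo "NO ENDPOINTS: the Service is missing or its selector matches no ready scheduler pods"
    return 1
  fi
  missing_ips "$pod_ips" "$ep_ips"
}
```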

  11. Restart Scheduler (if not running)

    • Command / Action:
      • Restart the Scheduler component
      • For static pod: move /etc/kubernetes/manifests/kube-scheduler.yaml out of the manifests directory, wait until the scheduler container stops, then move it back
      • For systemd: systemctl restart kube-scheduler
    • Expected result:
      • The Scheduler starts successfully
      • Pending pods begin to be scheduled
      • Prometheus begins scraping metrics again
    • Additional info:
      • kubectl delete pod -n kube-system kube-scheduler-<node> only deletes the mirror pod; the kubelet recreates it without restarting the static pod's container
      • Verify the pod comes back in the Running state
      • Monitor for successful metric collection in Prometheus
      • Check that pending pods are now being scheduled
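
The manifest move for a static pod can be wrapped so the manifest is always put back. A minimal sketch, assuming the kubeadm layout; MANIFEST_DIR, PARK_DIR and WAIT_SECONDS are hypothetical overrides, and the wait should exceed the kubelet's manifest check interval (fileCheckFrequency, 20s by default). Run it as root on the affected control plane node.

```shell
# Sketch: restart a static kube-scheduler pod by taking its manifest away from the
# kubelet and restoring it. Run as root on the affected control plane node.
# Assumptions: kubeadm layout; the parked copy is kept OUTSIDE the manifests
# directory so the kubelet does not try to run it; the overrides are hypothetical.
restart_static_scheduler() {
  local dir="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
  local park="${PARK_DIR:-/etc/kubernetes}"
  local delay="${WAIT_SECONDS:-30}"
  [ -f "$dir/kube-scheduler.yaml" ] || { echo "manifest not found in $dir" >&2; return 1; }
  mv "$dir/kube-scheduler.yaml" "$park/kube-scheduler.yaml.parked" || return 1
  sleep "$delay"   # give the kubelet time to notice and stop the static pod
  mv "$park/kube-scheduler.yaml.parked" "$dir/kube-scheduler.yaml"
}
```

After it returns, repeat the pod status check to confirm the new container is Running.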

  12. Verify network policies and firewall rules

    • Command / Action:
      • Check for network policies blocking metric scraping
      • kubectl get networkpolicies -n kube-system
      • Check firewall rules on control plane nodes
    • Expected result:
      • No policies block traffic from Prometheus to the Scheduler
      • Port 10259 (or the legacy 10251) should be accessible
    • Additional info:
      • Network policies may prevent Prometheus from reaching the metrics endpoint; also check egress policies in the namespace where Prometheus runs
      • Verify Prometheus can reach the Scheduler pod or host network
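
To tell a network block apart from an authentication problem, it can help to probe the port from inside the pod network, ideally from the namespace Prometheus runs in. A minimal sketch, assuming the public curlimages/curl image can be pulled and that PROBE_NS (default "monitoring", an assumption) is Prometheus's namespace. It requests /healthz, which kube-scheduler normally serves without authorization, so any HTTP status shows the path is open while 000 points to a block or a loopback-only bind.

```shell
# Sketch: probe the scheduler port from a throwaway pod inside the cluster network.
# Assumptions: the curlimages/curl image is available; PROBE_NS is the namespace
# Prometheus runs in ("monitoring" is only an example default).

# Map curl's %{http_code} output to a short verdict.
interpret_probe() {
  case "$1" in
    000|"") echo "BLOCKED: no connection (network policy, firewall, or loopback-only bind)" ;;
    *)      echo "OPEN: HTTP $1 received, the network path works" ;;
  esac
}

probe_scheduler_port() {
  local target_ip="$1" port="${2:-10259}" code
  # Keep only the last 3-digit number in case kubectl adds its own output.
  code=$(kubectl run scheduler-port-probe -n "${PROBE_NS:-monitoring}" \
    --rm -i --quiet --restart=Never --image=curlimages/curl -- \
    curl -sk -o /dev/null -m 5 -w '%{http_code}' "https://${target_ip}:${port}/healthz" \
    | grep -Eo '[0-9]{3}' | tail -n 1) || true
  interpret_probe "$code"
}
```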

Additional resources