Runbook: ManyUnhealthy Alert

Alert Details

  • Alert Name: ManyUnhealthy
  • Expression:
    round(
      (count by (geoCluster, component) (
        group by (instance, geoCluster, component) (
          ALERTS{health="unhealthy", alertstate="firing", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}
        )
        unless on(instance, geoCluster, component)
        group by (instance, geoCluster, component) (
          maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1
        )
      ) / count by (geoCluster, component) (
        group by (instance, geoCluster, component) (
          up{nanocosmosGroup=~".+", instance=~".+"}
        )
        unless on(instance, geoCluster, component)
        group by (instance, geoCluster, component) (
          maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1
        )
      ) * 100) or (
        count by (component, geoCluster) (
          group by (instance, component, geoCluster) (
            up{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}
          )
        ) * 0
      )
    ) >= <threshold>

Description

This alert fires when the percentage of unhealthy instances within a geoCluster/component pair meets or exceeds the configured threshold. The expression counts instances with a firing unhealthy alert, excluding any instance whose maintenance metric equals 1, divides by the total number of instances reporting up (with the same maintenance exclusion), and rounds the result to a whole percentage. The final "or ... * 0" branch keeps a 0% value present for every geoCluster/component pair even when nothing is unhealthy, so the threshold comparison always has data to evaluate.
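
To see the current unhealthy percentage per geoCluster and component, you can evaluate a simplified form of the alert expression against the Prometheus HTTP API. This is a sketch: it omits the maintenance exclusion and the zero-fallback branch for readability, and <prometheus_server> is a placeholder.

# Example: Query the unhealthy percentage per geoCluster and component (simplified)
curl -sG 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=count by (geoCluster, component) (ALERTS{health="unhealthy", alertstate="firing"}) / count by (geoCluster, component) (up) * 100'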

Possible Causes

  1. Widespread network issues affecting multiple components.
  2. A common dependency or service failure impacting multiple components.
  3. Misconfiguration or deployment issues.
  4. Resource constraints or hardware failures.
  5. Ongoing maintenance activities not properly accounted for (see the maintenance query sketch after this list).
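
For cause 5, you can check which instances Prometheus currently considers under maintenance; any unhealthy instance missing from this result is not being excluded by the alert expression. Same placeholder conventions as the queries below.

# Example: List instances currently marked as under maintenance
curl -sG 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1'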

Troubleshooting Steps

1. Identify Unhealthy Components

List all instances that currently have a firing unhealthy alert.

# Example: Query Prometheus for unhealthy components
curl -G 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=ALERTS{health="unhealthy", alertstate="firing", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}'

Expected Output:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "ComponentDown",
          "instance": "instance1",
          "job": "component",
          "nanocosmosGroup": "group1",
          "environment": "prod"
        },
        "value": [<timestamp>, "1"]
      },
      ...
    ]
  }
}
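
To turn the raw JSON into a quick per-cluster summary, you can pipe the same query through jq (assuming jq is installed; the label names match those used in the alert expression).

# Example: Count unhealthy instances per geoCluster and component
curl -sG 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=ALERTS{health="unhealthy", alertstate="firing"}' | jq -r '.data.result[].metric | "\(.geoCluster) \(.component)"' | sort | uniq -c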

2. Check Network Connectivity

Verify network connectivity to the affected components.

# Example: Check network connectivity to a component
ping <component_hostname_or_ip>

Expected Output:

PING <component_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <component_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
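
ICMP is sometimes blocked by firewalls even when the service itself is reachable, so it is worth also testing the component's TCP port directly. This sketch uses nc (netcat); <component_port> is a placeholder for the port the component listens on.

# Example: Verify TCP connectivity to the component's service port
nc -zv <component_hostname_or_ip> <component_port>

Expected Output (format varies by netcat implementation):

Connection to <component_hostname_or_ip> <component_port> port [tcp/*] succeeded!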

3. Verify Component Status

Ensure that the affected components are running and reachable.

# Example: Check the status of a component
ssh <component_hostname_or_ip> 'systemctl status <component_service>'

Expected Output:

● <component_service>.service - <Service Description>
   Loaded: loaded (/etc/systemd/system/<component_service>.service; enabled; vendor preset: enabled)
   Active: active (running) since <date>; <time> ago
...
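
When many instances are affected, checking them one by one is slow. A small shell loop can query the service state on each host; affected_hosts.txt is a placeholder file with one hostname per line, and SSH key access is assumed.

# Example: Check the service state on all affected hosts
while read -r host; do
  echo "== $host =="
  ssh -n "$host" 'systemctl is-active <component_service>'  # -n keeps ssh from consuming the loop's stdin
done < affected_hosts.txt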

4. Review Logs

Check the logs of the affected components for any errors or warnings.

# Example: Review logs of a component
ssh <component_hostname_or_ip> 'sudo journalctl -u <component_service> --since "1 hour ago"'

Expected Output:

Nov 13 12:00:00 <hostname> <component_service>[1234]: Starting <component_service>.
Nov 13 12:00:01 <hostname> <component_service>[1234]: <Log message>
...
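
If the full log is noisy, filtering by priority narrows the output to errors and worse: journalctl's -p flag takes a syslog priority, so -p err shows only entries at error severity or higher.

# Example: Show only error-level log entries from the last hour
ssh <component_hostname_or_ip> 'sudo journalctl -u <component_service> -p err --since "1 hour ago"'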

5. Check Resource Utilization

Ensure that the affected components have sufficient resources (CPU, memory, disk).

# Example: Check resource utilization
ssh <component_hostname_or_ip> 'top -b -n 1 | head -n 10'

Expected Output:

top - 13:00:00 up 10 days,  3:22,  1 user,  load average: 0.15, 0.10, 0.05
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  0.5 sy,  0.0 ni, 98.0 id,  0.5 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2048000 total,  1024000 free,   512000 used,   512000 buff/cache
KiB Swap:  1024000 total,  1024000 free,        0 used.  1536000 avail Mem
...
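
top covers CPU and memory at a glance but not disk. df and free fill that gap; a filesystem near 100% use or heavy swap activity are common reasons for components turning unhealthy.

# Example: Check disk and memory usage
ssh <component_hostname_or_ip> 'df -h; free -m'

Expected Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   28G  42% /
...
              total        used        free      shared  buff/cache   available
Mem:           2000         500        1000          10         500        1400
Swap:          1000           0        1000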

Additional Steps

If the issue persists, consider:

  • Restarting the affected components (see the restart example below).
  • Checking for common dependencies or services that might be failing.
  • Reviewing recent changes or deployments that might have caused the issue.
  • Contacting the network or system administrator for further investigation.
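
If you decide to restart a component, do it one instance at a time to avoid making a partial outage worse, and confirm the service comes back before moving on. A minimal sketch:

# Example: Restart a component and confirm it is active again
ssh <component_hostname_or_ip> 'sudo systemctl restart <component_service> && systemctl is-active <component_service>'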