Runbook: ManyUnhealthy Alert
Alert Details
- Alert Name: ManyUnhealthy
- Expression:
round(
  (
    count by (geoCluster, component) (
      group by (instance, geoCluster, component) (
        ALERTS{health="unhealthy", alertstate="firing", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}
      )
      unless on(instance, geoCluster, component)
      group by (instance, geoCluster, component) (
        maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1
      )
    )
    /
    count by (geoCluster, component) (
      group by (instance, geoCluster, component) (
        up{nanocosmosGroup=~".+", instance=~".+"}
      )
      unless on(instance, geoCluster, component)
      group by (instance, geoCluster, component) (
        maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1
      )
    )
    * 100
  )
  or
  (
    count by (component, geoCluster) (
      group by (instance, component, geoCluster) (
        up{nanocosmosGroup=~".+", instance=~".+", environment=~".+"}
      )
    ) * 0
  )
) >= <threshold>
Description
This alert fires when a significant percentage of a component's instances within a geoCluster are unhealthy. The expression divides the number of instances with a firing health="unhealthy" alert by the total number of instances reporting the up metric, excludes instances under maintenance from both counts, and multiplies by 100. The alert triggers when this percentage reaches or exceeds the configured threshold. The trailing "or ... * 0" branch returns 0 for (geoCluster, component) pairs with no unhealthy instances, so the series is always present.
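To see the current per-geoCluster percentages without waiting for the alert, the core ratio can be evaluated against the Prometheus HTTP API. The query below is a simplified sketch of the alert expression that omits the maintenance exclusion and the label filters; <prometheus_server> is a placeholder.
# Example: Evaluate the unhealthy percentage per geoCluster and component (simplified sketch)
curl -G 'http://<prometheus_server>/api/v1/query' \
  --data-urlencode 'query=count by (geoCluster, component) (group by (instance, geoCluster, component) (ALERTS{health="unhealthy", alertstate="firing"})) / count by (geoCluster, component) (group by (instance, geoCluster, component) (up)) * 100'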
Possible Causes
- Widespread network issues affecting multiple components.
- A common dependency or service failure impacting multiple components.
- Misconfiguration or deployment issues.
- Resource constraints or hardware failures.
- Ongoing maintenance activities not properly accounted for (see the maintenance query below).
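To rule out the maintenance case, list the instances Prometheus currently treats as under maintenance. This is a sketch against the maintenance metric used in the alert expression; <prometheus_server> is a placeholder.
# Example: List instances currently flagged as under maintenance
curl -G 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=maintenance{nanocosmosGroup=~".+", instance=~".+"} == 1'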
Troubleshooting Steps
1. Identify Unhealthy Components
List all components that are currently unhealthy.
# Example: Query Prometheus for unhealthy components
curl -G 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=ALERTS{health="unhealthy", alertstate="firing", nanocosmosGroup=~".+", instance=~".+", environment=~".+"}'
Expected Output:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "ComponentDown",
          "alertstate": "firing",
          "health": "unhealthy",
          "instance": "instance1",
          "geoCluster": "cluster1",
          "component": "component1",
          "job": "component",
          "nanocosmosGroup": "group1",
          "environment": "prod"
        },
        "value": [<timestamp>, "1"]
      },
      ...
    ]
  }
}
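For a quick overview of which alerts and clusters contribute most, the firing unhealthy alerts can be aggregated by their labels. This sketch reuses the ALERTS query above; <prometheus_server> is a placeholder.
# Example: Count unhealthy instances per geoCluster, component, and alert name
curl -G 'http://<prometheus_server>/api/v1/query' --data-urlencode 'query=count by (geoCluster, component, alertname) (ALERTS{health="unhealthy", alertstate="firing"})'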
2. Check Network Connectivity
Verify network connectivity to the affected components.
# Example: Check network connectivity to a component
ping -c 4 <component_hostname_or_ip>
Expected Output:
PING <component_hostname_or_ip> (<ip_address>) 56(84) bytes of data.
64 bytes from <component_hostname_or_ip>: icmp_seq=1 ttl=64 time=0.123 ms
...
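If ICMP is blocked on the network path, a TCP-level check against the component's service port can still confirm reachability. This is a sketch; <component_port> is a placeholder and depends on the component.
# Example: Check TCP reachability of the component's service port
nc -zv <component_hostname_or_ip> <component_port>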
3. Verify Component Status
Ensure that the affected components are running and reachable.
# Example: Check the status of a component
ssh <component_hostname_or_ip> 'systemctl status <component_service>'
Expected Output:
● <component_service>.service - <Service Description>
Loaded: loaded (/etc/systemd/system/<component_service>.service; enabled; vendor preset: enabled)
Active: active (running) since <date>; <time> ago
...
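If the service is not active, listing all failed units on the host can reveal related failures on the same machine (a sketch using the same SSH access as above).
# Example: List failed systemd units on the host
ssh <component_hostname_or_ip> 'systemctl --failed'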
4. Review Logs
Check the logs of the affected components for any errors or warnings.
# Example: Review logs of a component
ssh <component_hostname_or_ip> 'sudo journalctl -u <component_service> --since "1 hour ago"'
Expected Output:
Nov 13 12:00:00 <hostname> <component_service>[1234]: Starting <component_service>.
Nov 13 12:00:01 <hostname> <component_service>[1234]: <Log message>
...
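On a busy host it can help to narrow the output to error-level entries. This sketch uses journalctl's priority filter (-p err).
# Example: Show only error-level log entries from the last hour
ssh <component_hostname_or_ip> 'sudo journalctl -u <component_service> -p err --since "1 hour ago"'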
5. Check Resource Utilization
Ensure that the affected components have sufficient resources (CPU, memory, disk).
# Example: Check resource utilization
ssh <component_hostname_or_ip> 'top -b -n 1 | head -n 10'
Expected Output:
top - 13:00:00 up 10 days, 3:22, 1 user, load average: 0.15, 0.10, 0.05
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.0 us, 0.5 sy, 0.0 ni, 98.0 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 2048000 total, 1024000 free, 512000 used, 512000 buff/cache
KiB Swap: 1024000 total, 1024000 free, 0 used. 1536000 avail Mem
...
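Because top does not report disk usage, check disk space and memory separately with standard utilities (a sketch).
# Example: Check disk space and memory usage on the host
ssh <component_hostname_or_ip> 'df -h; free -m'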
Additional Steps
If the issue persists, consider:
- Restarting the affected components (see the restart sketch below).
- Checking for common dependencies or services that might be failing.
- Reviewing recent changes or deployments that might have caused the issue.
- Contacting the network or system administrator for further investigation.
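If a restart is considered safe, a single component can be restarted and verified as follows. This is a sketch assuming a systemd-managed service, as in step 3.
# Example: Restart a component and verify it is running again
ssh <component_hostname_or_ip> 'sudo systemctl restart <component_service> && systemctl status <component_service>'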