ManyUnhealthy
Description
This alert is triggered when a significant percentage of components within a geoCluster are unhealthy. The expression calculates the percentage of unhealthy components, excluding those under maintenance, and triggers if this percentage exceeds a specified threshold.
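The calculation described above can be sketched in shell as follows. This is only an illustration of the arithmetic, not the actual alert expression: the function names, the way the counts are passed in, and the threshold of 30 are all assumptions. It also assumes that components under maintenance are excluded both from the unhealthy count and from the total.

```shell
# unhealthy_percent TOTAL UNHEALTHY MAINTENANCE  (hypothetical helper)
#   TOTAL        all components in the geoCluster
#   UNHEALTHY    unhealthy components that are NOT under maintenance
#   MAINTENANCE  components under maintenance
# Prints the unhealthy percentage as an integer (rounded down).
unhealthy_percent() {
  # Components under maintenance are left out of the base (assumption).
  considered=$(( $1 - $3 ))
  if [ "$considered" -le 0 ]; then
    # Nothing left to evaluate, so report 0 instead of dividing by zero.
    echo 0
    return
  fi
  echo $(( $2 * 100 / considered ))
}

# alert_fires PERCENT THRESHOLD  (hypothetical helper)
# Succeeds only if the percentage is strictly above the threshold.
alert_fires() {
  [ "$1" -gt "$2" ]
}

# Example: 10 components, 3 unhealthy, 2 under maintenance, threshold 30 (assumed).
percent=$(unhealthy_percent 10 3 2)
if alert_fires "$percent" 30; then
  echo "ManyUnhealthy would fire at ${percent}%"
fi
```

Integer arithmetic is enough for a quick estimate; the real expression may round differently.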
Possible Causes:
- infrastructure load
- widespread network issues affecting multiple components
- common dependency or service failure impacting multiple components
- misconfiguration or deployment issues
- ongoing maintenance activities not properly accounted for
Severity estimation
If there are still healthy servers in the environment and a geoRegion is also available, this alert is not severe. If a majority of servers are unavailable, action should be taken to bring servers back into balancing.
Troubleshooting steps
- Get overview of server health
  - go to the Grafana load overview dashboard via the links provided and check the healthy/unhealthy/maintenance distribution across the production environment
- Check unhealthy alerts
  - the currently firing critical alerts are listed at the bottom of the dashboard from step 1
  - try to resolve the critical alerts with the corresponding runbooks to bring the servers back into balancing
- Check for servers in maintenance
  - the dashboard from step 1 also shows how many servers are in maintenance mode
  - check the cloud status channel in Mattermost for maintenance or deployment announcements
  - check Redmine for tickets regarding actions on the corresponding server
  - deactivate maintenance mode:
    - the presence of the file /var/www/maintenance is responsible for the maintenance mode
    - the file is made publicly available via HTTPS at the URL https://$serverUrl/maintenance
    - to disable maintenance mode on a server, use the following command:
      - rm /var/www/maintenance
- Contact lvl 3
  - message a lvl 3 member via Mattermost
  - call lvl 3 via the number listed in PagerDuty
    - select the user and navigate to the contact information tab
  - additional info:
    - alerts can also be escalated to lvl 3 via the escalate button in PagerDuty
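The maintenance check and removal from the "servers in maintenance" step can be wrapped in a few small helpers. This is a minimal sketch, not an existing tool: the function names and the MAINTENANCE_FILE override are hypothetical, and the remote check assumes the web server answers with HTTP 200 while the file exists and with another status once it is gone.

```shell
# Path from this runbook; overridable so the sketch can be tried on a scratch file.
MAINTENANCE_FILE="${MAINTENANCE_FILE:-/var/www/maintenance}"

# Remote check (hypothetical helper).
# $1: host name of the server, used as $serverUrl in the runbook URL.
# Assumption: HTTP 200 means the maintenance file is present.
maintenance_active_remote() {
  status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$1/maintenance")
  [ "$status" = "200" ]
}

# Local check: maintenance mode is on while the file is present.
maintenance_active_local() {
  [ -e "$MAINTENANCE_FILE" ]
}

# Remove the file only if it exists and report what happened.
disable_maintenance() {
  if maintenance_active_local; then
    rm -- "$MAINTENANCE_FILE" && echo "maintenance mode disabled"
  else
    echo "maintenance mode was not active"
  fi
}
```

On the affected server, run `disable_maintenance`, then confirm from another machine with `maintenance_active_remote "$serverUrl"`, which should no longer succeed. Before removing the file, check Mattermost and Redmine as described above to make sure the maintenance is not intentional.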
Additional resources
Runbooks:
Dashboards:
Docs: