Alert Runbooks

ManyUnhealthy

Description

This alert is triggered when a significant percentage of components within a geoCluster are unhealthy. The expression calculates the percentage of unhealthy components, excluding those under maintenance, and triggers if this percentage exceeds a specified threshold.
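
The calculation described above can be sketched roughly as follows. The actual alert expression and threshold are not shown in this runbook, so the function names, the arguments, and the choice to subtract maintenance servers from the total are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: the real alert expression is not part of this runbook.
# Assumption: components in maintenance are removed from the total before the
# ratio is computed, and the unhealthy count already excludes them.
unhealthy_percent() {
    total=$1        # all components in the geoCluster
    unhealthy=$2    # unhealthy components that are not in maintenance
    maintenance=$3  # components currently in maintenance mode
    considered=$((total - maintenance))
    if [ "$considered" -le 0 ]; then
        # Nothing left to evaluate, so report 0 instead of dividing by zero.
        echo 0
        return 0
    fi
    # Integer percentage, rounded down.
    echo $((unhealthy * 100 / considered))
}

# Succeeds when the percentage is strictly above the threshold.
# The threshold is passed in because the configured value is not stated here.
exceeds_threshold() {
    [ "$1" -gt "$2" ]
}
```

For example, with 10 components of which 2 are in maintenance and 4 are unhealthy, `unhealthy_percent 10 4 2` evaluates 4 of the 8 remaining components as unhealthy.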


Possible Causes:


Severity estimation

If there are still healthy servers in the environment and the geoRegion is also available, this alert is not severe. If the majority of servers are unavailable, take action to bring servers back into balancing.


Troubleshooting steps

  1. Get an overview of server health

  2. Check unhealthy alerts

    • at the bottom of the dashboard from step 1, check the currently firing critical alerts
      • try to resolve critical alerts with the corresponding runbook to bring the servers back into balancing
  3. Check for servers in maintenance

    • the dashboard from step 1 also shows how many servers are in maintenance mode
      • check the cloud status channel in Mattermost for maintenance or deployment announcements
      • check Redmine for tickets regarding actions on the corresponding server
    • deactivate maintenance mode:
      • a server is in maintenance mode while the file /var/www/maintenance exists
      • the file is publicly available via HTTPS at the URL https://$serverUrl/maintenance
      • to disable maintenance mode on a server, run the following command on that server:
      • rm /var/www/maintenance
        
  4. Contact lvl 3

    • message a lvl 3 member via Mattermost
    • call lvl 3 via the number listed in PagerDuty
      • select the user and navigate to the contact information tab
    • additional info:
      • alerts can also be escalated to lvl 3 via the escalate button in PagerDuty
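
The maintenance check from step 3 can also be done from the command line, since the maintenance file is served at https://$serverUrl/maintenance. The sketch below is only an illustration: the function names are hypothetical, and it assumes the web server answers with HTTP 200 when the file exists and 404 when it does not, which this runbook does not confirm.

```shell
#!/bin/sh
# Map an HTTP status code to a maintenance state.
# Assumption: 200 means the maintenance file exists, 404 means it is absent.
# Any other code (including 000 when curl cannot connect) is left undecided.
maintenance_state_from_status() {
    case "$1" in
        200) echo "maintenance" ;;
        404) echo "active" ;;
        *)   echo "unknown" ;;
    esac
}

# Query one server. The argument is the value the runbook calls $serverUrl.
check_maintenance() {
    # -s silences progress output, -o /dev/null discards the body,
    # -w prints only the HTTP status code, --max-time limits the wait.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/maintenance")
    maintenance_state_from_status "$status"
}
```

Calling `check_maintenance "$serverUrl"` prints one of `maintenance`, `active`, or `unknown`. An `unknown` result means the server should be checked directly before removing the file.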

Additional resources

Runbooks:
Dashboards:
Docs: