Runbook: Bintu Services Kubernetes (K8s) Troubleshooting

Alert Details

  • Alert Name: HostDown Alerts: applications bintu-api dev
  • Expression: N/A

Description

This runbook provides instructions for troubleshooting common issues that may occur with Bintu Services running in K8s. It covers checking K8s cluster health, restarting the deployment, and manually restarting cluster nodes through the hosting provider’s interface.

The Bintu API service is used as the example throughout the resolution steps. If there is a problem with a related service (cloud-token-service, dashboards, guardian, etc.), the same steps apply; simply substitute the name of the service in question.

Possible Causes

  • Resource shortages or network issues
  • Problems retrieving the container image
  • Application crashes repeatedly

Troubleshooting Steps

  1. Problem Analysis:

    • One Ring to rule them all - Grafana:
      • Always start by checking the metrics of the K8s cluster.
      • Everything you need can be seen here: Grafana Dashboard. (To authenticate to Grafana, use Azure AD.)
      • By default, the prod clusters are selected with the bintu namespace as a filter (change if required).
      • The first thing you will see is the CPU and RAM gauges. The most important are the two in the middle: CPU Utilisation (from requests) and Memory Utilisation (from requests).
      • If the values there are high (more than 60%), several problems might exist:
        • several nodes are not available at once () # TODO: => proceed defining the steps to identify hosts issues
    • Verify that the bintu-api pods exist and are running stably:
      • Go to Rancher: LINK.
      • Log in using Azure SSO.
      • Navigate to the cluster where Bintu API is deployed. Cluster names are the following:
        • Production Environment:
          • app-prod-euc-a
          • app-prod-euc-b
          • app-prod-euc-c (usually not in prod, as used for canary deployments only)
          • app-prod-ass (bintu API is not deployed here)
          • app-prod-use (bintu API is not deployed here)
        • Staging Environment:
          • app-staging-euc
        • Development Environment:
          • app-dev-euc
      • Click on Workloads in the left-hand menu.
      • Navigate to Deployments.
      • Click on the bintu-bintu-production deployment. If you do not see it in the list, check that one of the following namespaces is selected (dropdown menu at the top right of the screen): All namespaces, Only User Namespaces, bintu.
      • All pods should be in the Running state (shown in green). For prod, you need to check all clusters to verify completely.
    • Redeployment of the application:
      • Required if you see a pod in a state other than Running.
      • Navigate back to the Deployments tab.
      • Select the bintu-bintu-production deployment (click on the checkbox, not the name).
      • With the deployment selected, click Redeploy at the top of the page.
    • Verify cluster nodes are running stably:
      • Click on Cluster in the left-hand menu.
      • Navigate to Nodes.
      • All nodes should be in the Active state (shown in green).
      • If not, copy the affected node’s IP and see the Manually Restarting Cluster Nodes step.
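The Rancher checks in this step can also be run from a terminal if you have kubectl access to the cluster (a kubeconfig can be downloaded from the Rancher UI). This is a sketch; the namespace and deployment names are the ones used above:

```shell
#!/usr/bin/env bash
# Assumes a kubeconfig for the target cluster (downloadable from Rancher).
NAMESPACE=bintu
DEPLOYMENT=bintu-bintu-production

# Pods that are NOT Running (STATUS is column 3 of `kubectl get pods`)
kubectl get pods -n "$NAMESPACE" --no-headers \
  | awk '$3 != "Running" {print $1, $3}'

# Nodes that are NOT Ready (STATUS is column 2 of `kubectl get nodes`)
kubectl get nodes --no-headers \
  | awk '$2 != "Ready" {print $1, $2}'

# CLI counterpart of the Rancher "Redeploy" button: a rolling restart
# kubectl rollout restart deployment/"$DEPLOYMENT" -n "$NAMESPACE"
```

Empty output from the first two commands means all pods are Running and all nodes are Ready.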
  2. Manually Restarting Cluster Nodes:

    • Go to AWS console: https://aws.amazon.com/

    • Sign in to the prod AWS account (ID: 300926307005)

    • In the top right corner, select the region Frankfurt (eu-central-1)

    • Go to the EC2 service and open the list of instances in the left-hand menu.

    • Enter the IP you copied earlier into the search bar and hit Enter.

    • Select the instance (click on checkbox).

    • Click on Instance State in the upper right corner.

    • Select Reboot instance.

    • The node will restart and should soon appear as Active in the Rancher UI (this can take several minutes).

    • Checking API Reachability:

      • Open a terminal on an Ubuntu system.
      • Use the following command to check the API reachability:
        curl -I https://bintu.nanocosmos.de
        
      • Check the output. A successful connection should return an HTTP status line like HTTP/1.1 200 OK. If the API is unreachable, you may receive an error message like curl: (7) Failed to connect to bintu.nanocosmos.de port 443: Connection refused.
      • To check API reachability from different geographic locations, you can use SSH tunnels to servers in Singapore, the USA, and the EU:
        • Singapore:
          ssh user@singapore-server "curl -I https://bintu.nanocosmos.de"
          
        • USA:
          ssh user@usa-server "curl -I https://bintu.nanocosmos.de"
          
        • EU:
          ssh user@eu-server "curl -I https://bintu.nanocosmos.de"
          
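The three regional checks can be combined into one loop. The jump-host names are the placeholders from the commands above (substitute your real hosts); `-w '%{http_code}'` makes curl print only the numeric status code:

```shell
#!/usr/bin/env bash
# Placeholder hostnames -- replace with the actual Singapore/USA/EU servers.
API_URL="https://bintu.nanocosmos.de"

for host in user@singapore-server user@usa-server user@eu-server; do
  # -sS: quiet but show errors; -o /dev/null: discard response;
  # -w '%{http_code}': print only the numeric status code
  code=$(ssh "$host" "curl -sS -o /dev/null -w '%{http_code}' -I $API_URL" 2>/dev/null) \
    && echo "$host: HTTP $code" \
    || echo "$host: unreachable"
done
```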
    • Kibana:

      • Log in to Kibana and navigate to the logs.
      • Check the logs for error messages or unusual activities that might indicate a problem.
      • Example:
        # Access Kibana URL and search for specific logs
        https://kibana.elasticsearch.fsn.hz.k8s.nanostream.cloud/app/discover
        
    • Grafana: re-check the dashboard from the Problem Analysis step to confirm that CPU and memory utilisation have returned to normal.

  3. Example Issues That Can Occur in Kubernetes:

  • Pods Stuck in Pending Status:

    • Cause: Resource shortages or network issues.
    • Solution: Check resource availability and network configuration. Use kubectl describe pod <pod-name> -n <namespace> for detailed information.
  • ImagePullBackOff or ErrImagePull Errors:

    • Cause: Problems retrieving the container image.
    • Solution: Check the image URL and authentication details. Use kubectl describe pod <pod-name> -n <namespace> for more details.
    • Example:
    kubectl describe pod <pod-name> -n <namespace>
    
  • CrashLoopBackOff Errors:

    • Cause: Application crashes repeatedly.
    • Solution: Check the container logs with kubectl logs <pod-name> -n <namespace>. Ensure all environment variables and configuration files are correct.
    • Example:
    kubectl logs <pod-name> -n <namespace>
    
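The per-pod commands above can be combined into a small triage sketch that prints the recent events and last log lines for every pod that is not Running (assumes kubectl access; bintu is used as the default namespace):

```shell
#!/usr/bin/env bash
# Triage every non-Running pod in the namespace: recent events + last logs.
NAMESPACE="${1:-bintu}"

kubectl get pods -n "$NAMESPACE" --no-headers \
  | awk '$3 != "Running" {print $1}' \
  | while read -r pod; do
      echo "=== $pod ==="
      # The Events section is at the end of `kubectl describe` output
      kubectl describe pod "$pod" -n "$NAMESPACE" | tail -n 20
      # Logs may be unavailable if the container never started
      kubectl logs "$pod" -n "$NAMESPACE" --tail=50 || true
    done
```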
  4. Escalation to Level 3 Support:

    • If the problem still persists after performing the above steps, escalate the issue to Level 3 Support.
    • Contact the current support team member and request assistance from Level 3 Support.
    • Provide detailed information about the issue, including logs, steps taken, and any error messages encountered.
  5. Final Analysis:

    • Solution Verification: confirm the alert has resolved, all pods are Running, and the API reachability check succeeds.

    • Post-Incident Analysis:

      • Refer to point 4 “Post-Incident” of the Incident Response Guideline Runbook to ensure all necessary steps for post-incident review and documentation are completed.