Runbook: Bintu Services Kubernetes (K8s) Troubleshooting
Alert Details
- Alert Name: HostDown Alerts: applications bintu-api dev
- Expression: N/A
Description
This runbook provides instructions for troubleshooting common issues that may occur with Bintu Services running in K8s. It covers checking K8s cluster health, restarting the deployment, and manually restarting cluster nodes through the hosting provider's interface.
The Bintu API service is used as an example to demonstrate the resolution steps. If there is a problem with a related service (cloud-token-service, dashboards, guardian, etc.), the same manual applies; simply substitute the name of the service in question.
Possible Causes
- Resource shortages or network issues
- Problems retrieving the container image
- Application crashes repeatedly
Troubleshooting Steps
- Problem Analysis:
- One Ring to rule them all - Grafana:
- Always start by checking the metrics of the K8s cluster.
- Everything you need can be seen here: Grafana Dashboard. (To authenticate to Grafana, use Azure AD.)
- By default, you will have prod clusters selected and the `bintu` namespace as a filter (change if required).
- The first thing you will see is the CPU and RAM gauges. The most important ones are the two in the middle (`CPU Utilisation (from requests)` / `Memory Utilisation (from requests)`).
- If the values there are high (more than 60%), several problems might exist:
  - several nodes are not available at once () # TODO: => proceed defining the steps to identify host issues
- Verify that bintu-api pods exist and are running stably:
- Go to Rancher: LINK.
- Log in using Azure SSO.
- Navigate to the cluster where Bintu API is deployed. Cluster names are the following:
  - Production Environment:
    - `app-prod-euc-a`
    - `app-prod-euc-b`
    - `app-prod-euc-c` (usually not in prod, as used for canary deployments only)
    - `app-prod-ass` (Bintu API is not deployed here)
    - `app-prod-use` (Bintu API is not deployed here)
  - Staging Environment: `app-staging-euc`
  - Development Environment: `app-dev-euc`
- Click on Workloads in the left-hand menu.
- Navigate to Deployments.
- Click on the `bintu-bintu-production` deployment. If you do not see it in the list, check that you have any of the following namespaces selected (dropdown menu at the top right of the screen): `All namespaces`, `Only User Namespaces`, `bintu`.
- All pods' state should be `Running` (in green). For prod clusters, you will need to check all clusters to verify completely.
- Redeployment of application:
  - In case you see a pod in a state other than `Running`:
    - Navigate back to the Deployments tab.
    - Select the `bintu-bintu-production` deployment (click on the checkbox, not the name).
    - With the `bintu-bintu-production` deployment selected, click Redeploy at the top of the page.
- Verify cluster nodes are running stably:
  - Click on Cluster in the left-hand menu.
  - Navigate to Nodes.
  - All nodes' state should be `Active` (in green).
  - If not, copy the affected node's IP and see the Manually Restarting Cluster Nodes step.
  - A kubectl-based cross-check of these verifications (including a CLI redeploy) is sketched below.
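For clusters where you have CLI access, the same health checks can be cross-verified from a terminal. This is a minimal sketch, assuming a configured kubeconfig context for the affected cluster and an installed metrics-server; the `bintu` namespace and `bintu-bintu-production` deployment names are taken from the steps above.

```bash
# Minimal kubectl cross-check (assumes a kubeconfig context for the affected
# cluster and that metrics-server is available for `kubectl top`).

# Resource pressure, mirroring the Grafana CPU/RAM gauges:
kubectl top nodes
kubectl top pods -n bintu

# Pod health: all bintu pods should be Running.
kubectl get pods -n bintu -o wide

# Node health: NotReady nodes point to host-level issues.
kubectl get nodes

# CLI equivalent of Rancher's Redeploy button:
kubectl rollout restart deployment/bintu-bintu-production -n bintu
kubectl rollout status deployment/bintu-bintu-production -n bintu
```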
- Manually Restarting Cluster Nodes:
  - Go to AWS console: https://aws.amazon.com/
  - Sign in to the prod AWS account (ID: 300926307005).
  - In the top right corner, select the region `Frankfurt eu-central-1`.
  - Go to the EC2 service and select the list of instances in the left-hand menu.
  - Enter the IP you copied before in the search bar and hit Enter.
  - Select the instance (click on the checkbox).
  - Click on `Instance State` in the upper right corner.
  - Select `Reboot instance`.
  - The node will be restarted and should soon appear as `Active` in the Rancher UI (this can take several minutes). An AWS CLI alternative is sketched below.
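The same reboot can be done from the AWS CLI. This is a minimal sketch, assuming the AWS CLI is configured with credentials for the prod account and that the address copied from Rancher is the instance's private IP; the IP value shown is a hypothetical placeholder.

```bash
# Reboot a cluster node via the AWS CLI (assumes credentials for the prod
# account, ID 300926307005, and region eu-central-1).
NODE_IP="203.0.113.10"   # hypothetical placeholder; use the IP copied from Rancher

# Resolve the EC2 instance ID from the node's private IP.
INSTANCE_ID=$(aws ec2 describe-instances \
  --region eu-central-1 \
  --filters "Name=private-ip-address,Values=${NODE_IP}" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

# Reboot; the node should reappear as Active in Rancher within minutes.
aws ec2 reboot-instances --region eu-central-1 --instance-ids "${INSTANCE_ID}"
```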
- Checking API Reachability:
  - Open a terminal on an Ubuntu system.
  - Use the following command to check API reachability: `curl -I https://bintu.nanocosmos.de`
  - Check the output. A successful connection should return an HTTP status line like `HTTP/1.1 200 OK`. If the API is unreachable, you may receive an error message like `curl: (7) Failed to connect to bintu.nanocosmos.de port 443: Connection refused`.
  - To check API reachability from different geographic locations, you can use SSH tunnels to servers in Singapore, the USA, and the EU (a combined sketch follows this list):
    - Singapore: `ssh user@singapore-server "curl -I https://bintu.nanocosmos.de"`
    - USA: `ssh user@usa-server "curl -I https://bintu.nanocosmos.de"`
    - EU: `ssh user@eu-server "curl -I https://bintu.nanocosmos.de"`
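A slightly richer variant that reports the HTTP status code and total response time from each location. A minimal sketch; the hostnames are the placeholder server names from the examples above, not real endpoints.

```bash
# Check HTTP status and latency from each region (hostnames are the
# placeholders used in the examples above).
for HOST in singapore-server usa-server eu-server; do
  echo -n "${HOST}: "
  ssh "user@${HOST}" \
    'curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" https://bintu.nanocosmos.de'
done
```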
- Kibana:
- Log in to Kibana and navigate to the logs.
- Check the logs for error messages or unusual activities that might indicate a problem.
- Example: access the Kibana URL and search for specific logs: https://kibana.elasticsearch.fsn.hz.k8s.nanostream.cloud/app/discover (if Kibana is unavailable, a kubectl fallback is sketched below)
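If Kibana is unreachable, the same application logs can usually be pulled directly from the pods. A minimal sketch, assuming kubectl access and the deployment/namespace names used earlier in this runbook:

```bash
# Pull the last hour of logs straight from the deployment's pods and filter
# for common error markers (deployment/namespace names as used above).
kubectl logs deployment/bintu-bintu-production -n bintu --since=1h \
  | grep -iE "error|fatal|exception"
```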
- Grafana:
- Log in to Grafana and check the dashboards for anomalies in the metrics.
- Use the logs visualization to get detailed information about system events.
- Example Issues That Can Occur in Kubernetes (a combined triage sketch follows this list):
  - Pods Stuck in Pending Status:
    - Cause: Resource shortages or network issues.
    - Solution: Check resource availability and network configuration. Use `kubectl describe pod <pod-name> -n <namespace>` for detailed information.
  - ImagePullBackOff or ErrImagePull Errors:
    - Cause: Problems retrieving the container image.
    - Solution: Check the image URL and authentication details. Use `kubectl describe pod <pod-name> -n <namespace>` for more details.
    - Example: `kubectl describe pod <pod-name> -n <namespace>`
  - CrashLoopBackOff Errors:
    - Cause: Application crashes repeatedly.
    - Solution: Check the container logs with `kubectl logs <pod-name> -n <namespace>`. Ensure all environment variables and configuration files are correct.
    - Example: `kubectl logs <pod-name> -n <namespace>`
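The three issues above share the same first diagnostic steps. A combined triage sketch; `<pod-name>` and `<namespace>` are the same placeholders as in the original commands:

```bash
# Combined triage for Pending / ImagePullBackOff / CrashLoopBackOff pods.
POD="<pod-name>"       # placeholder, as in the commands above
NS="<namespace>"       # placeholder, as in the commands above

# Pending and image-pull problems: the Events section at the bottom of the
# describe output usually names the cause (insufficient resources, bad image
# URL, registry auth failure).
kubectl describe pod "${POD}" -n "${NS}"

# CrashLoopBackOff: inspect the current and the previous (crashed) container logs.
kubectl logs "${POD}" -n "${NS}"
kubectl logs "${POD}" -n "${NS}" --previous

# Recent events across the namespace, oldest first.
kubectl get events -n "${NS}" --sort-by=.metadata.creationTimestamp
```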
- Escalation to Level 3 Support:
- If the problem persists after performing the above steps, escalate the issue to Level 3 Support.
- Contact the current support team member and request assistance from Level 3 Support.
- Provide detailed information about the issue, including logs, steps taken, and any error messages encountered (a snapshot-collection sketch follows).
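When escalating, it helps to attach a snapshot of the current cluster state. A minimal sketch, assuming kubectl access; the output file name is illustrative:

```bash
# Collect a diagnostic snapshot for Level 3 Support (file name is illustrative).
{
  date -u
  kubectl get pods -n bintu -o wide
  kubectl get events -n bintu --sort-by=.metadata.creationTimestamp
  kubectl logs deployment/bintu-bintu-production -n bintu --since=1h
} > "bintu-escalation-$(date +%Y%m%d-%H%M).log"
```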
- Final Analysis:
- Solution Verification:
- Ensure the problem is fully resolved and no further errors occur.
- Check the logs and metrics in Kibana and Grafana to ensure no new anomalies appear.
- Example (a CLI verification sketch follows this list):
- Access Kibana URL and check logs:
- Access Grafana URL and check dashboards:
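A quick CLI confirmation that the service is healthy again, reusing the reachability and pod checks from earlier steps:

```bash
# Confirm recovery: the API should answer 200 and all bintu pods should be Running.
curl -I https://bintu.nanocosmos.de
kubectl get pods -n bintu
```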
- Post-Incident Analysis:
- Refer to point 4 “Post-Incident” of the Incident Response Guideline Runbook to ensure all necessary steps for post-incident review and documentation are completed.