KubeVersionMismatch

Description

This alert fires when Kubernetes cluster components are running different semantic versions — typically control plane nodes (API server, controller-manager, scheduler) and worker nodes are on mismatched versions.

This most commonly occurs during a rolling cluster upgrade and is expected to be transient. However, if the mismatch persists after an upgrade completes, it indicates that some nodes or components were not successfully updated, which can cause API incompatibilities, unexpected behavior, or feature degradation.

Possible Causes:

- Cluster upgrade in progress (expected, transient state)
- Upgrade partially failed: some nodes did not complete the update
- Node stuck in a cordon/drain loop, preventing the upgrade from proceeding
- Node added manually with a different Kubernetes version
- RKE2/K3s or other distribution upgrade applied to the control plane but not to all worker nodes
- Node unreachable during the upgrade, causing it to be skipped

Severity estimation

Severity ranges from Low to Critical, depending on the duration and degree of the mismatch:

- Low: Mismatch is recent and the cluster is actively being upgraded; expected transient state
- Medium: Upgrade completed but mismatch persists; some nodes were not updated
- High: Minor version mismatch between control plane and workers; API incompatibilities may affect workloads
- Critical: Version skew exceeds what the Kubernetes version skew policy supports; cluster stability at risk

The Kubernetes version skew policy allows the kubelet on worker nodes to be up to three minor versions older than kube-apiserver (two for kubelet versions older than v1.25), and never newer. kube-controller-manager and kube-scheduler may be at most one minor version older than kube-apiserver. Exceeding these limits is unsupported.
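
To judge how far apart two components are, the minor-version difference can be computed from their version strings. This is a minimal sketch; the helper names `k8s_minor` and `minor_skew` are illustrative and not part of kubectl, and the versions in the example are hypothetical.

```shell
# Print the minor-version component of a Kubernetes version string,
# e.g. "v1.33.1+rke2r1" -> "33".
k8s_minor() {
  printf '%s\n' "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Print how many minor versions the node (second argument) is behind
# the control plane (first argument). A negative result means the node is newer.
minor_skew() {
  echo $(( $(k8s_minor "$1") - $(k8s_minor "$2") ))
}

# Example with hypothetical versions: control plane on 1.33, node on 1.31.
minor_skew v1.33.1+rke2r1 v1.31.4+rke2r1   # prints 2
```

Compare the result against the skew limits that apply to the component in question.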

Troubleshooting steps

Check current versions across all nodes
- Command / Action:
  - List all nodes and their Kubernetes versions
  - kubectl get nodes
- Expected result:
  - All nodes show the same version in the VERSION column

    NAME                         STATUS   ROLES                       AGE    VERSION
    elk-cp-eu-hc-fsn1-01         Ready    control-plane,etcd,master   476d   v1.33.1+rke2r1
    elk-alloy-eu-hc-fsn1-01      Ready    worker                      476d   v1.33.1+rke2r1
    elk-warm-eu-hz-hel1-07       Ready    worker                      376d   v1.33.1+rke2r1
- additional info:
  - Identify which nodes are on the old version; these are the ones that need to be upgraded
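
On larger clusters it can be easier to count nodes per version than to read the full node list. A minimal sketch: `summarize_versions` is a hypothetical helper name, and the jsonpath query in the usage comment reads each node's `kubeletVersion` field.

```shell
# Read one version per line on stdin and print "<count> <version>" per distinct version.
summarize_versions() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
#     | summarize_versions
```

More than one output line means at least two kubelet versions are present in the cluster.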

Check if an upgrade is currently in progress
- Command / Action:
  - Verify whether a cluster upgrade is actively running in Rancher or the cluster management tool
  - Check Rancher UI: Cluster → Overview → Kubernetes Version
- Expected result:
  - If an upgrade is in progress, the mismatch is expected and will resolve automatically
- additional info:
  - Do not intervene if an upgrade is actively running; monitor until it completes

Identify nodes that failed to upgrade
- Command / Action:
  - Compare node versions to find those still on the old version
  - kubectl get nodes -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,READY:.status.conditions[?(@.type=="Ready")].status'
- Expected result:
  - Nodes on the old version are clearly listed and distinguishable from updated ones
- additional info:
  - Also check if any nodes are in NotReady or SchedulingDisabled state, which may have caused the upgrade to stall
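
The comparison can also be scripted so that only the lagging nodes are printed. A minimal sketch, assuming the target version is known and passed in by hand; `outdated_nodes` is a hypothetical helper name.

```shell
# Read "NAME VERSION" lines on stdin and print the names of nodes whose
# version differs from the target version given as the first argument.
outdated_nodes() {
  awk -v target="$1" '$2 != target { print $1 }'
}

# Usage against a live cluster (requires kubectl access); replace the
# target with the version the control plane is running:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion' \
#     | outdated_nodes v1.33.1+rke2r1
```

An empty result means every node already reports the target version.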

Check for nodes stuck in cordon or drain
- Command / Action:
  - Look for nodes that are cordoned and may be blocking the rolling upgrade
  - kubectl get nodes | grep SchedulingDisabled
  - kubectl describe node <node-name> | grep -A5 Taints
- Expected result:
  - No nodes are unexpectedly cordoned after the upgrade completes
- additional info:
  - If a node is stuck cordoned after the upgrade, uncordon it: kubectl uncordon <node-name>
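
As an alternative to grepping the STATUS column, cordoned nodes can be found through the `.spec.unschedulable` field, which is `true` on a cordoned node and unset otherwise. A minimal sketch; `cordoned_nodes` is a hypothetical helper name.

```shell
# Read "NAME UNSCHEDULABLE" lines on stdin and print the names of nodes
# whose second column is "true" (kubectl prints "<none>" when the field is unset).
cordoned_nodes() {
  awk '$2 == "true" { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable' \
#     | cordoned_nodes
```

Review each listed node before uncordoning it, since a node may be cordoned deliberately for maintenance.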

Trigger upgrade of remaining nodes via Rancher
- Command / Action:
  - In Rancher, navigate to the cluster and trigger the upgrade for nodes still on the old version
  - Rancher UI: Cluster → Nodes → select outdated node → Edit → update Kubernetes version
- Expected result:
  - The node is upgraded to match the control plane version; the alert clears
- additional info:
  - For RKE2 clusters, node upgrades are managed through the system-upgrade-controller or directly via Rancher

Verify all components are on the same version after upgrade
- Command / Action:
  - Confirm all nodes now report the same version
  - kubectl get nodes
- Expected result:
  - All nodes show the same version; alert clears within the next evaluation cycle
- additional info:
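  - Also verify control plane components: kubectl version reports the API server version as Server Version; kubectl get pods -n kube-system -o wide shows where the apiserver, scheduler, and controller-manager pods run, and the image tag in each pod spec indicates its version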
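
The final node check can be turned into a pass/fail test, for example in a post-upgrade script. A minimal sketch; `all_same_version` is a hypothetical helper name.

```shell
# Succeed (exit status 0) only if stdin contains exactly one distinct version.
# Empty input fails, so an unreachable cluster is not reported as consistent.
all_same_version() {
  [ "$(sort -u | wc -l | tr -d ' ')" -eq 1 ]
}

# Usage against a live cluster (requires kubectl access):
#   if kubectl get nodes \
#        -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
#        | all_same_version; then
#     echo "all nodes report the same kubelet version"
#   else
#     echo "version mismatch still present"
#   fi
```

This covers only kubelet versions on nodes; control plane components still need the separate check described above.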
