
KubeNodeReadinessFlapping

Meaning

The readiness status of a node has changed several times in the last 15 minutes.
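The idea behind the alert can be sketched as counting how often the node's Ready condition flips within the window, similar in spirit to PromQL's changes() function. This is a minimal sketch: the helpers count_readiness_changes and is_flapping are hypothetical, the input format (one "True" or "False" sample per line) is an assumption, and the default threshold of two changes is illustrative only. The alerting rule deployed in your cluster is authoritative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count transitions in a sequence of Ready samples.
# Input: one sample per line on stdin, each "True" or "False".
count_readiness_changes() {
  awk '
    NR > 1 && $1 != prev { changes++ }  # value differs from the previous sample
    { prev = $1 }
    END { print changes + 0 }           # print 0 when there were no transitions
  '
}

# Hypothetical helper: report whether the samples count as flapping.
# The default threshold (2) is an illustrative assumption, not the upstream rule.
is_flapping() {
  local threshold="${1:-2}"
  local changes
  changes=$(count_readiness_changes)   # reads the function's stdin
  if [ "$changes" -gt "$threshold" ]; then
    echo "flapping ($changes changes)"
  else
    echo "stable ($changes changes)"
  fi
}
```

For example, printf 'True\nFalse\nTrue\nFalse\n' | is_flapping reports three changes, which exceeds the assumed threshold.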

Impact

The performance of deployments in the cluster is affected; how much depends on the overall workload and the type of the node.

Diagnosis

The notification details should list the node whose readiness is flapping. For example:

 - alertname = KubeNodeReadinessFlapping
...
 - node = node1.example.com
...

Log in to the cluster and check the status of that node:

$ kubectl get node $NODE -o yaml

The status.conditions section of the output should describe why the node's readiness keeps changing.
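To read the conditions without scrolling through the full YAML, they can be printed one per line with kubectl's JSONPath output. This is a sketch; the function name show_node_conditions is hypothetical. Besides Ready, conditions such as MemoryPressure, DiskPressure or PIDPressure, and the lastTransitionTime of each, may point at the underlying cause. kubectl describe node "$NODE" shows the same conditions together with recent events.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each node condition as
# TYPE<TAB>STATUS<TAB>REASON<TAB>LAST_TRANSITION_TIME
show_node_conditions() {
  local node="$1"   # value of the "node" label from the alert
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.lastTransitionTime}{"\n"}{end}'
}
```

Usage: show_node_conditions node1.example.com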

Common failure scenarios:

  • disruptive software upgrades
  • network partitioning due to hardware failures
  • firewall rules
  • virtual machines suspended due to storage area network problems
  • system crashes / freezes due to software or hardware malfunctions
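A sketch for collecting evidence on these scenarios: the events recorded for the node (reboots, readiness transitions) and the kubelet log around the time of the alert. The function names are hypothetical. Events are kept only for a limited time (one hour by default, controlled by the kube-apiserver --event-ttl flag), so check them soon after the alert fires. show_kubelet_log must run on the node itself and assumes systemd with a unit named kubelet, which varies by distribution.

```shell
#!/usr/bin/env bash
# Hypothetical helper: events recorded for the node, oldest first.
show_node_events() {
  local node="$1"
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node}" \
    --sort-by=.lastTimestamp
}

# Hypothetical helper: kubelet log of the last 15 minutes.
# Assumes systemd and a unit called "kubelet"; run it on the node.
show_kubelet_log() {
  journalctl -u kubelet --since "15min ago" --no-pager
}
```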

Mitigation

In case of maintenance, make sure to cordon and drain the node.
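A maintenance window for a single node can be sketched as follows. The function names and the 10-minute timeouts are assumptions; the flag names match recent kubectl releases (older releases use --delete-local-data instead of --delete-emptydir-data). Note that --delete-emptydir-data discards data in the evicted pods' emptyDir volumes, and that kubectl drain also cordons the node, so the explicit cordon only mirrors the runbook step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: take a node out of service before maintenance.
start_node_maintenance() {
  local node="$1"
  # Mark the node unschedulable so no new pods are placed on it.
  kubectl cordon "$node" || return 1
  # Evict running pods (PodDisruptionBudgets are respected). DaemonSet pods
  # are left in place because their controller would recreate them anyway.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}

# Hypothetical helper: return the node to service after maintenance.
end_node_maintenance() {
  local node="$1"
  # Wait until the node reports Ready, then allow scheduling again.
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m &&
    kubectl uncordon "$node"
}
```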

In other cases, ensure storage and networking redundancy where applicable.

See also:

  • KubeNode
  • node problem detector
  • Watchdog timer