
[BUG] The taint node.openyurt.io/unschedulable can NOT be removed when the failure node recovers during yurt-controller-manager reboot. #1233

Closed · AndyEWang opened this issue Feb 13, 2023 · 3 comments · Fixed by #1337

@AndyEWang (Contributor)

What happened:
There is only a single yurt-controller-manager Pod in my cluster. When an edge node failed to reach the apiserver but could still communicate with other nodes, the node.openyurt.io/unschedulable taint was added successfully.
Somehow yurt-controller-manager failed to renew its lease and rebooted, and it spent ~12 minutes doing nothing. During that time the failed node recovered, but the taint could never be removed after the yurt-controller-manager reboot.

I0213 06:25:52.270685       1 poolcoordinator_controller.go:122] taint edgeworker01: key node.openyurt.io/unschedulable already exists, nothing to do
E0213 06:26:02.361335       1 poolcoordinator_controller.go:164] Operation cannot be fulfilled on nodes "edgeworker01": the object has been modified; please apply your changes to the latest version and try again
E0213 06:37:51.646804       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/yurt-controller-manager?timeout=10s": context deadline exceeded
I0213 06:37:51.647067       1 leaderelection.go:283] failed to renew lease kube-system/yurt-controller-manager: timed out waiting for the condition
F0213 06:37:51.647111       1 controllermanager.go:248] leaderelection lost
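
For context, the "Operation cannot be fulfilled on nodes ... the object has been modified" line above is the apiserver's standard optimistic-concurrency conflict, which is normally handled by re-fetching the object and retrying. Below is a minimal client-go sketch of removing the taint that way; it is not the controller's actual code, and the package and helper names are hypothetical.

package taintutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

const unschedulableTaintKey = "node.openyurt.io/unschedulable"

// RemoveUnschedulableTaint strips the taint from the named node,
// re-fetching and retrying if the node's resourceVersion is stale.
func RemoveUnschedulableTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
		for _, t := range node.Spec.Taints {
			if t.Key != unschedulableTaintKey {
				kept = append(kept, t)
			}
		}
		if len(kept) == len(node.Spec.Taints) {
			return nil // taint not present, nothing to do
		}
		node.Spec.Taints = kept
		_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}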

What you expected to happen:
The taint "node.openyurt.io/unschedulable" should be removed when failure node recovers no matter whether yurt-controller-manager reboots or not.

How to reproduce it (as minimally and precisely as possible):
Rare case.

Anything else we need to know?:
From pkg/controller/poolcoordinator/delegatelease/poolcoordinator_controller.go, the counter never gets a chance to increment after the reboot once the failed node has recovered, so the taint is never cleaned up (see the sketch below).
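
A simplified sketch of that failure mode follows. It is not the actual controller code; the names delegateHeartbeats, onDelegatedLease, and onDirectLease are made up for illustration. The point is only that a counter kept in process memory is wiped by a restart, so a node that recovers afterwards never satisfies the taint-removal condition.

package main

import "fmt"

// delegateHeartbeats counts lease renewals delegated through the
// pool-coordinator, per node. It lives only in process memory, so a
// yurt-controller-manager restart resets it.
var delegateHeartbeats = map[string]int{}

// onDelegatedLease models a node renewing its lease via the
// pool-coordinator because it cannot reach the apiserver directly.
func onDelegatedLease(node string) {
	delegateHeartbeats[node]++
}

// onDirectLease models a node renewing its lease with the apiserver
// directly, i.e. it has recovered. The taint is only removed if a prior
// delegation was counted, and that count is gone after a reboot.
func onDirectLease(node string) {
	if delegateHeartbeats[node] > 0 {
		fmt.Printf("removing node.openyurt.io/unschedulable from %s\n", node)
		delete(delegateHeartbeats, node)
		return
	}
	// After a restart the map is empty, so a recovered node that still
	// carries the taint never hits the removal branch above.
	fmt.Printf("no delegation recorded for %s, taint left in place\n", node)
}

func main() {
	// Before the restart: the node fails, the taint is added, the counter increments.
	onDelegatedLease("edgeworker01")

	// yurt-controller-manager restarts: all in-memory state is lost.
	delegateHeartbeats = map[string]int{}

	// The node recovers and renews directly, but the taint is never removed.
	onDirectLease("edgeworker01")
}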

Environment:

  • OpenYurt version: v1.2
  • Kubernetes version (use kubectl version): 1.22

others

/kind bug

@AndyEWang added the kind/bug label on Feb 13, 2023
@rambohe-ch (Member)

@AndyEWang Thanks for raising this issue. It seems we need to improve the delegatelease controller; would you like to take on this work?

@AndyEWang (Contributor, Author)

Yes, I'd like to.

@rambohe-ch (Member)

/assign @AndyEWang
