
[BUG] The taint node.openyurt.io/unschedulable can NOT be removed when the failure node recovers during yurt-controller-manager reboot. #1233

Closed · AndyEWang opened this issue Feb 13, 2023 · 3 comments · Fixed by #1337

@AndyEWang (Contributor)

What happened:
There is only a single yurt-controller-manager Pod in my cluster. When an edge node failed to reach the apiserver but could still communicate with other nodes, the node.openyurt.io/unschedulable taint was added successfully.
Somehow yurt-controller-manager failed to renew its lease and rebooted, and it spent ~12 minutes doing nothing. During that time the failed node recovered, but the taint could never be removed after the yurt-controller-manager reboot.

I0213 06:25:52.270685       1 poolcoordinator_controller.go:122] taint edgeworker01: key node.openyurt.io/unschedulable already exists, nothing to do
E0213 06:26:02.361335       1 poolcoordinator_controller.go:164] Operation cannot be fulfilled on nodes "edgeworker01": the object has been modified; please apply your changes to the latest version and try again
E0213 06:37:51.646804       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/yurt-controller-manager?timeout=10s": context deadline exceeded
I0213 06:37:51.647067       1 leaderelection.go:283] failed to renew lease kube-system/yurt-controller-manager: timed out waiting for the condition
F0213 06:37:51.647111       1 controllermanager.go:248] leaderelection lost
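
For context, the "Operation cannot be fulfilled on nodes ... the object has been modified" line above is the apiserver's standard optimistic-concurrency conflict, which is normally handled by re-fetching the object and retrying. Below is a minimal client-go sketch of removing the taint that way; it is not the controller's actual code, and the package and helper names are hypothetical.

package taintutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

const unschedulableTaintKey = "node.openyurt.io/unschedulable"

// RemoveUnschedulableTaint strips the taint from the named node,
// re-fetching and retrying if the node's resourceVersion is stale.
func RemoveUnschedulableTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
		for _, t := range node.Spec.Taints {
			if t.Key != unschedulableTaintKey {
				kept = append(kept, t)
			}
		}
		if len(kept) == len(node.Spec.Taints) {
			return nil // taint not present, nothing to do
		}
		node.Spec.Taints = kept
		_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}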

What you expected to happen:
The taint "node.openyurt.io/unschedulable" should be removed when failure node recovers no matter whether yurt-controller-manager reboots or not.

How to reproduce it (as minimally and precisely as possible):
Rare case.

Anything else we need to know?:
From pkg/controller/poolcoordinator/delegatelease/poolcoordinator_controller.go, the counter never gets a chance to increment after the reboot once the failed node has recovered, so the taint is never cleaned up (see the sketch below).
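
A simplified sketch of that failure mode follows. It is not the actual controller code; the names delegateHeartbeats, onDelegatedLease, and onDirectLease are made up for illustration. The point is only that a counter kept in process memory is wiped by a restart, so a node that recovers afterwards never satisfies the taint-removal condition.

package main

import "fmt"

// delegateHeartbeats counts lease renewals delegated through the
// pool-coordinator, per node. It lives only in process memory, so a
// yurt-controller-manager restart resets it.
var delegateHeartbeats = map[string]int{}

// onDelegatedLease models a node renewing its lease via the
// pool-coordinator because it cannot reach the apiserver directly.
func onDelegatedLease(node string) {
	delegateHeartbeats[node]++
}

// onDirectLease models a node renewing its lease with the apiserver
// directly, i.e. it has recovered. The taint is only removed if a prior
// delegation was counted, and that count is gone after a reboot.
func onDirectLease(node string) {
	if delegateHeartbeats[node] > 0 {
		fmt.Printf("removing node.openyurt.io/unschedulable from %s\n", node)
		delete(delegateHeartbeats, node)
		return
	}
	// After a restart the map is empty, so a recovered node that still
	// carries the taint never hits the removal branch above.
	fmt.Printf("no delegation recorded for %s, taint left in place\n", node)
}

func main() {
	// Before the restart: the node fails, the taint is added, the counter increments.
	onDelegatedLease("edgeworker01")

	// yurt-controller-manager restarts: all in-memory state is lost.
	delegateHeartbeats = map[string]int{}

	// The node recovers and renews directly, but the taint is never removed.
	onDirectLease("edgeworker01")
}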

Environment:

  • OpenYurt version: v1.2
  • Kubernetes version (use kubectl version): 1.22

others

/kind bug

@AndyEWang added the kind/bug label on Feb 13, 2023
@rambohe-ch (Member)

@AndyEWang Thanks for raising this issue. It seems we need to improve the delegatelease controller; would you like to take on this work?

@AndyEWang (Contributor, Author)

Yes, I'd like to.

@rambohe-ch (Member)

/assign @AndyEWang
