Edge case where Node Deletion is missed if machine 'node' label is not present #875

Closed
elankath opened this issue Nov 30, 2023 · 3 comments · Fixed by #887
Assignees
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@elankath
Contributor

elankath commented Nov 30, 2023

How to categorize this issue?

/area robustness
/kind bug
/priority 2

What happened:

When a Node is never associated with its Machine, i.e. the machine object never gets machine.Labels[v1alpha1.NodeLabelKey] set after machine creation, the Node object is not deleted during the deletion flow. (The label update can be missed if the machine object update transiently fails.)

Then after some time, the dangling Node object gets the NotManagedByMCM annotation.
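
The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not the actual MCM code: `Machine`, `Cluster`, and `deleteNodeForMachine` are hypothetical stand-ins, and `nodeLabelKey` is assumed to be `"node"` based on the log output later in this issue. The sketch only shows how a deletion step that relies solely on the label leaves the Node behind when the label is absent.

```go
package main

import "fmt"

// nodeLabelKey stands in for v1alpha1.NodeLabelKey (assumed value: "node").
const nodeLabelKey = "node"

// Machine is a pared-down, hypothetical stand-in for the MCM Machine object.
type Machine struct {
	Name   string
	Labels map[string]string
}

// Cluster is a hypothetical in-memory stand-in for the set of Node objects.
type Cluster struct {
	Nodes map[string]bool
}

// deleteNodeForMachine mimics a deletion step that finds the Node only via
// the machine's node label. It reports whether a Node was deleted.
func deleteNodeForMachine(c *Cluster, m Machine) bool {
	nodeName := m.Labels[nodeLabelKey] // reading a nil map yields ""
	if nodeName == "" {
		// No label: nothing ties the machine to a Node, so the Node is skipped.
		return false
	}
	if !c.Nodes[nodeName] {
		return false
	}
	delete(c.Nodes, nodeName)
	return true
}

func main() {
	c := &Cluster{Nodes: map[string]bool{"worker-1": true}}
	m := Machine{Name: "worker-1"} // node label was never set

	deleted := deleteNodeForMachine(c, m)
	fmt.Println("node deleted:", deleted, "nodes left:", len(c.Nodes))
}
```

In this sketch the Node `worker-1` survives the machine's deletion, which corresponds to the dangling Node described in this report.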

What you expected to happen:
The Node object should always be deleted prior to VM instance termination and Machine object deletion, even if the association was missed during instance creation.

How to reproduce it (as minimally and precisely as possible):

  1. Launch a Machine and then remove its node label.
  2. Then delete the machine, triggering the delete flow.
  3. After machine object is deleted, the corresponding Node is still present.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): any
  • Cloud provider or hardware configuration: any
  • Others:
@elankath elankath added the kind/bug Bug label Nov 30, 2023
@elankath elankath self-assigned this Nov 30, 2023
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Nov 30, 2023
@gardener-robot

@elankath You have mentioned internal references in the public. Please check.


@elankath
Contributor Author

Tested the fix on GCP (for this provider the node label is the same as the machine name). Removed the node label and initiated machine deletion. The node label is now set again prior to drain and deletion, so node deletion succeeds even when the node label is missing.

I1226 10:08:31.120914   86315 machine.go:128] reconcileClusterMachine: Start for "shoot--i034796--g1-w1-z1-788d9-hlgnx" with phase:"Terminating", description:"Set machine status to termination. Now, getting VM Status"
I1226 10:08:33.742750   86315 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:08:33.926901   86315 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx
I1226 10:08:41.699711   86315 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
...
I1226 10:11:05.334091   86315 machine_controller.go:131] VM "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" for Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" was terminated succesfully
I1226 10:11:10.681394   86315 machine_util.go:1357] Deleting node "shoot--i034796--g1-w1-z1-788d9-hlgnx" associated with machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:11:16.055535   86315 machine.go:648] Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" with providerID "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" and nodeName "shoot--i034796--g1-w1-z1-788d9-hlgnx" deleted successfully
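
The log sequence above suggests that the label is restored before drain and node deletion. A rough sketch of that idea follows; it is an illustration under stated assumptions, not the code merged for this issue. `ensureNodeLabel` and the `lookup` callback are hypothetical names: `lookup` stands for whatever provider query yields the node name (on GCP, per the comment above, the node name equals the machine name), and `nodeLabelKey` is assumed to be `"node"`.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeLabelKey stands in for v1alpha1.NodeLabelKey (assumed value: "node").
const nodeLabelKey = "node"

// Machine is a pared-down, hypothetical stand-in for the MCM Machine object.
type Machine struct {
	Name   string
	Labels map[string]string
}

// ensureNodeLabel sets the node label if it is missing, using the
// hypothetical lookup callback to resolve the node name for the machine.
// It reports whether the label was updated.
func ensureNodeLabel(m *Machine, lookup func(machineName string) (string, error)) (bool, error) {
	if m.Labels[nodeLabelKey] != "" {
		return false, nil // association already present, nothing to do
	}
	nodeName, err := lookup(m.Name)
	if err != nil {
		return false, fmt.Errorf("resolving node for machine %q: %w", m.Name, err)
	}
	if nodeName == "" {
		return false, errors.New("lookup returned an empty node name")
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[nodeLabelKey] = nodeName
	return true, nil
}

func main() {
	m := &Machine{Name: "worker-1"} // node label missing

	// Assumption for this sketch: node name equals machine name, as on GCP.
	sameAsMachine := func(name string) (string, error) { return name, nil }

	updated, err := ensureNodeLabel(m, sameAsMachine)
	fmt.Println(updated, err, m.Labels[nodeLabelKey])
}
```

Running such a step at the start of the deletion flow would let a label-based node lookup find the Node again, which matches the order of the "Updating" and "Deleting node" log lines.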

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 27, 2023