Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add nodeVolumeDetachTimeout property to Machine #6413

Conversation

furkatgofurov7
Copy link
Member

@furkatgofurov7 furkatgofurov7 commented Apr 13, 2022

What this PR does / why we need it:

  • introduces a new timeout called nodeVolumeDetachTimeout to the machine and other CRDs, which can be used to wait for volume detachment to happen within the timeout limit
  • introduces a new annotation called ExcludeWaitForNodeVolumeDetachAnnotation to be able to skip shouldWaitForNodeVolumes check entirely if needed
  • moves out shouldWaitForNodeVolumes from node drain logic (isNodeDrainAllowed) and adds new check only for volume related checks called isNodeVolumeDetachingAllowed to to keep a cleaner separation for API
  • adds unit tests
  • documents introduced new timeout nodeVolumeDetachTimeout and annotation ExcludeWaitForNodeVolumeDetachAnnotation as a breaking changes in the provider's migration v1.2 to v1.3 document
  • Renames VolumeDetachSucceededCondition to VolumeDetachFinishedCondition so it is clearer for the user, because we have to mark VolumeDetachSucceededCondition as True when volumeDetachTimeoutExceeded condition met even though in fact volumes are not detached. For that reason, I renamed it to VolumeDetachFinishedCondition (open to suggestions on this) to make it a bit generic and less confusing for the user.
  • Adds VolumeDetachTimedOutCondition to indicate volume detaching timeout exceeded
  • Adds WaitingForVolumeDetachTimeoutReason to indicate the reason for detach timeout

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #6285

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 13, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @furkatgofurov7. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 13, 2022
@fabriziopandini
Copy link
Member

/hold
while discussing requirements and solution on the issue

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@furkatgofurov7
Copy link
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2022
@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from 6a80f0b to bf36f49 Compare July 18, 2022 18:41
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2022
Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: for node draining, we have the machine.cluster.x-k8s.io/exclude-node-draining annotation that allows us to skip the process entirely. Should we have something similar for volumeDetach?

internal/controllers/machine/machine_controller.go Outdated Show resolved Hide resolved
@fabriziopandini
Copy link
Member

/hold cancel
@furkatgofurov7 sorry for getting late at this PR, it would be great if we can get this moving

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 4, 2022
@fabriziopandini
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 4, 2022
@furkatgofurov7
Copy link
Member Author

Q: for node draining, we have the machine.cluster.x-k8s.io/exclude-node-draining annotation that allows us to skip the process entirely. Should we have something similar for volumeDetach?

@fabriziopandini great idea, I was not aware of that annotation, since it could simplify the logic a lot and not affect the current behavior. I will make changes accordingly hopefully soon (sorry for the late reply btw, I am on vacation now)

@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from bf36f49 to 1a64aeb Compare August 15, 2022 12:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2022
@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from 1a0f416 to fd85fbe Compare August 15, 2022 14:01
@furkatgofurov7
Copy link
Member Author

furkatgofurov7 commented Aug 15, 2022

Q: for node draining, we have the machine.cluster.x-k8s.io/exclude-node-draining annotation that allows us to skip the process entirely. Should we have something similar for volumeDetach?

@fabriziopandini I have updated the PR accordingly to add exclude volumeDetach annotation, PTAL. I am also investigating verify main test failure

@furkatgofurov7
Copy link
Member Author

furkatgofurov7 commented Aug 16, 2022

I can not reproduce this test failure locally (make verify-gen is not throwing any error? tried both 1.17 and v1.18 Go versions )

/test pull-cluster-api-verify-main

@adilGhaffarDev adilGhaffarDev force-pushed the feature/add-volume-detach-timeout-property branch from fd85fbe to 02d905c Compare August 16, 2022 12:08
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 16, 2022
@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from 566ba4e to 60b3ef4 Compare September 16, 2022 19:53
@furkatgofurov7
Copy link
Member Author

Created #7233 for tracking related work on Cluster.Topology

Thank you, most probably you will be working on it, but in case not, I am also happy to work on this issue. This looks very much like 83036e1 and 67080e1

@ykakarap
Copy link
Contributor

I can take care of doing the work in Cluster.Topology.
I am assigned to the issue. I will get started on that once this PR is merged.

@fabriziopandini
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2022
@furkatgofurov7
Copy link
Member Author

/cc @sbueringer @enxebre hi folks, if you could take another look and help to merge this, thanks!

@sbueringer
Copy link
Member

/lgtm

Giving Alberto some time to take a look as well

@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from a543dd6 to 3514f1b Compare September 19, 2022 10:27
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2022
@furkatgofurov7
Copy link
Member Author

/lgtm

Giving Alberto some time to take a look as well

Sure, anyways I had to rebase it after bac8e27 and it got removed again 😀

@sbueringer
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2022
@fabriziopandini
Copy link
Member

Oh, I just noticed that commits are not squashed 😅

@furkatgofurov7 furkatgofurov7 force-pushed the feature/add-volume-detach-timeout-property branch from 3514f1b to a27f71c Compare September 19, 2022 14:19
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2022
@furkatgofurov7
Copy link
Member Author

Oh, I just noticed that commits are not squashed sweat_smile

@fabriziopandini done 😃

@enxebre
Copy link
Member

enxebre commented Sep 20, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2022
@sbueringer
Copy link
Member

/approve

@furkatgofurov7 Thx for all your work and patience on this!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2022
@k8s-ci-robot k8s-ci-robot merged commit 1a56bc5 into kubernetes-sigs:main Sep 20, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.3 milestone Sep 20, 2022
@furkatgofurov7 furkatgofurov7 deleted the feature/add-volume-detach-timeout-property branch October 17, 2022 14:18
// After node draining is completed, and if isNodeVolumeDetachingAllowed returns True, make sure all
// volumes are detached before proceeding to delete the Node.
if r.isNodeVolumeDetachingAllowed(m) {
log.Info("Waiting for node volumes to be detached", "Node", klog.KRef("", m.Status.NodeRef.Name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log message duplicates the one shown in line 374. Would be good to remove one of them. If we keep this one, we should add a second log entry when volumes are detached. Otherwise it makes debugging harder - caused a lot of confusion for us just now.
I can move this to an issue if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good if you can either move it to an issue or open a PR if you want to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CAPI waiting forever for the volume to be detached
10 participants