Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exponential backoff retries to CreateSnapshot #153

Closed

Conversation

ggriffiths
Copy link
Member

@ggriffiths ggriffiths commented Aug 3, 2019

Signed-off-by: Grant Griffiths [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:

  • Adds exponential retries to the external snapshotter

Which issue(s) this PR fixes:
Fixes #134

Special notes for your reviewer:
This PR is modeled after @jsafrane's retry work on the external-provisioner:

Ready for review now.

Does this PR introduce a user-facing change?:

external-snapshotter will now do exponential backoff retries after a failure instead of timing out.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 3, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ggriffiths
To complete the pull request process, please assign saad-ali
You can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Hi @ggriffiths. Thanks for your PR.

I'm waiting for a kubernetes-csi or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2019
@ggriffiths
Copy link
Member Author

@jsafrane Can you take a look to see if I'm on the right track here? I added a few notes in the PR description for tasks I still need to do.

@msau42
Copy link
Collaborator

msau42 commented Aug 3, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 3, 2019
pkg/snapshotter/snapshotter.go Outdated Show resolved Hide resolved
pkg/snapshotter/snapshotter.go Outdated Show resolved Hide resolved
pkg/snapshotter/snapshotter.go Show resolved Hide resolved
pkg/controller/csi_handler.go Outdated Show resolved Hide resolved
pkg/controller/csi_handler.go Outdated Show resolved Hide resolved
pkg/snapshotter/snapshotter.go Outdated Show resolved Hide resolved
@ggriffiths ggriffiths force-pushed the snapshotter-retries branch 3 times, most recently from e432cae to c86ff40 Compare August 9, 2019 23:25
@jsafrane
Copy link
Contributor

Is it ready for review? I see failing tests. And please don't add cmd/csi-snapshotter/csi-snapshotter binary to git :-).

@ggriffiths
Copy link
Member Author

Is it ready for review? I see failing tests. And please don't add cmd/csi-snapshotter/csi-snapshotter binary to git :-).

Yeah it's not quite ready yet. Still need to fix tests. Will mark it ready for review when it is :-)

And thanks I'll remove it - added it accidentally

@ggriffiths ggriffiths force-pushed the snapshotter-retries branch 2 times, most recently from 5023849 to e4949cb Compare August 12, 2019 18:53
@ggriffiths ggriffiths force-pushed the snapshotter-retries branch 3 times, most recently from 1a085e1 to 9bf26a9 Compare August 12, 2019 20:55
@ggriffiths ggriffiths changed the title WIP: Add exponential backoff retries to CreateSnapshot Add exponential backoff retries to CreateSnapshot Aug 14, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2019
@ggriffiths ggriffiths force-pushed the snapshotter-retries branch from 9bf26a9 to 9a7c8ce Compare August 20, 2019 21:04
@ggriffiths ggriffiths force-pushed the snapshotter-retries branch from 82dfc90 to 7bc228d Compare August 21, 2019 20:05
@ggriffiths
Copy link
Member Author

ggriffiths commented Aug 21, 2019

One thing to note is that I haven't added a snapshotInformer/snapshotIndexer similar to the provisioner's claimInformer/claimIndexer:

I didn't think I would need to add this, but now I realize I'm not really accessing snapshotsInProgress as a result (see https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L963 for how this is used in the provisioner. Currently I'm only adding/removing it based on the SnapshottingState. @msau42

I can add the snapshotInformer/indexer, but am open to other ideas before continuing on that work.

@msau42
Copy link
Collaborator

msau42 commented Aug 23, 2019

I don't think you need to follow external-provisioner exactly. I think the main thing that's missing is in snapshotWorker(): if you can't find the object from the informer (because it's been deleted), then try finding it from snapshotsInProgress.

Also cc @wongma7 since he is familiar with the provisioner design.

@ggriffiths ggriffiths force-pushed the snapshotter-retries branch from 7bc228d to 9595e5a Compare August 23, 2019 05:53
@ggriffiths
Copy link
Member Author

Sounds good, I've updated the PR to check snapshotsInProgress if we don't find one from the informer.

ctrl.updateSnapshot(newSnapshot)
}

ctrl.snapshotQueue.Forget(keyObj)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to forget and delete in this case. We should probably follow the whole L230-235 section. Can we inject the snapshotsInProgressCheck before L230 so they both go through the same codepath?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, updated to remove the forget/delete and reworked the informer/snapshotsInProgress checks.

@ggriffiths ggriffiths force-pushed the snapshotter-retries branch from 9595e5a to 192fbca Compare August 23, 2019 18:28
@ggriffiths ggriffiths force-pushed the snapshotter-retries branch 3 times, most recently from 308ca7d to 8766952 Compare September 5, 2019 20:01
@ggriffiths
Copy link
Member Author

ggriffiths commented Nov 14, 2019

@xing-yang should I re-do this PR for a future snapshotter release? Now that the split controller work is done.

@xing-yang
Copy link
Collaborator

should I re-do this PR for a future snapshotter release? Now that the split controller work is done.

Yes, please do this after the 1.17 release. I'm actually still working on improving the controller logic.

@ggriffiths
Copy link
Member Author

should I re-do this PR for a future snapshotter release? Now that the split controller work is done.

Yes, please do this after the 1.17 release. I'm actually still working on improving the controller logic.

Sounds good, will do 👍
/assign ggriffiths

@k8s-ci-robot
Copy link
Contributor

@ggriffiths: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-csi-external-snapshotter-1-17-on-kubernetes-1-17 fd0e055 link /test pull-kubernetes-csi-external-snapshotter-1-17-on-kubernetes-1-17

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot
Copy link
Contributor

@ggriffiths: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2019
@humblec
Copy link
Contributor

humblec commented Feb 12, 2020

@ggriffiths are you working on this rebase ? would like to see this with 1.18 :)

@xing-yang
Copy link
Collaborator

Hi @humblec, I think this PR is not needed any more. In the Beta version of external-snapshotter, we made some changes. I tested myself and don't see this issue any more. That's why I asked the person who opened the issue to re-test.

@humblec
Copy link
Contributor

humblec commented Feb 12, 2020

Hi @humblec, I think this PR is not needed any more. In the Beta version of external-snapshotter, we made some changes. I tested myself and don't see this issue any more. That's why I asked the person who opened the issue to re-test.

Thanks @xing-yang for clarifying it.. Then lets double confirm and close this . 👍

@ggriffiths
Copy link
Member Author

Closing this it was put on hold for the beta API changes last release. If this is still needed after testing #134 with the beta API, I'll submit a separate PR.

@ggriffiths ggriffiths closed this Feb 13, 2020
xing-yang added a commit to xing-yang/external-snapshotter that referenced this pull request Jul 20, 2021
c0a4fb1 Merge pull request kubernetes-csi#164 from anubha-v-ardhan/patch-1
9c6a6c0 Master to main cleanup
682c686 Merge pull request kubernetes-csi#162 from pohly/pod-name-via-shell-command
36a29f5 Merge pull request kubernetes-csi#163 from pohly/remove-bazel
68e43ca prow.sh: remove Bazel build support
c5f59c5 prow.sh: allow shell commands in CSI_PROW_SANITY_POD
71c810a Merge pull request kubernetes-csi#161 from pohly/mock-test-fixes
9e438f8 prow.sh: fix mock testing
d7146c7 Merge pull request kubernetes-csi#160 from pohly/kind-update
4b6aa60 prow.sh: update to KinD v0.11.0
7cdc76f Merge pull request kubernetes-csi#159 from pohly/fix-deployment-selection
ef8bd33 prow.sh: more flexible CSI_PROW_DEPLOYMENT, part II
204bc89 Merge pull request kubernetes-csi#158 from pohly/fix-deployment-selection
61538bb prow.sh: more flexible CSI_PROW_DEPLOYMENT
2b0e6db Merge pull request kubernetes-csi#157 from humblec/csi-release
a2fcd6d Adding myself to csi reviewers group
f325590 Merge pull request kubernetes-csi#149 from pohly/cluster-logs
4b03b30 Merge pull request kubernetes-csi#155 from pohly/owners
a6453c8 owners: introduce aliases
ad83def Merge pull request kubernetes-csi#153 from pohly/fix-image-builds
5561780 build.make: fix image publishng
29bd39b Merge pull request kubernetes-csi#152 from pohly/bump-csi-test
bc42793 prow.sh: use csi-test v4.2.0
b546baa Merge pull request kubernetes-csi#150 from mauriciopoppe/windows-multiarch-args
bfbb6f3 add parameter base_image and addon_image to BUILD_PARAMETERS
2d61d3b Merge pull request kubernetes-csi#151 from humblec/cm
48e71f0 Replace `which` command ( non standard)  with `command -v` builtin
feb20e2 prow.sh: collect cluster logs
7b96bea Merge pull request kubernetes-csi#148 from dobsonj/add-checkpathcmd-to-prow
2d2e03b prow.sh: enable -csi.checkpathcmd option in csi-sanity
09d4151 Merge pull request kubernetes-csi#147 from pohly/mock-testing
74cfbc9 prow.sh: support mock tests
4a3f110 prow.sh: remove obsolete test suppression
6616a6b Merge pull request kubernetes-csi#146 from pohly/kubernetes-1.21
510fb0f prow.sh: support Kubernetes 1.21
c63c61b prow.sh: add CSI_PROW_DEPLOYMENT_SUFFIX
51ac11c Merge pull request kubernetes-csi#144 from pohly/pull-jobs
dd54c92 pull-test.sh: test importing csi-release-tools into other repo
7d2643a Merge pull request kubernetes-csi#143 from pohly/path-setup
6880b0c prow.sh: avoid creating paths unless really running tests

git-subtree-dir: release-tools
git-subtree-split: c0a4fb1
xing-yang pushed a commit to xing-yang/external-snapshotter that referenced this pull request Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems dealing with snapshot create requests timing out
6 participants