Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AnnVolumeSnapshotBeingCreated #261

Merged
merged 10 commits into from
Apr 3, 2020

Conversation

xing-yang
Copy link
Collaborator

@xing-yang xing-yang commented Feb 27, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR adds a VolumeSnapshotBeingCreated annotation on VolumeSnapshotContent
when the CreateSnapshot request is sent to the CSI driver.
The annotation will be removed when driver responds with success or failure.
If timeout happens, it will retry and annotation remains.

If the VolumeSnapshotContent has the VolumeSnapshotBeingCreated annotation
set, the csi-snapshotter sidecar will not delete the snapshot resource on the storage system.

Which issue(s) this PR fixes:
Fixes #134

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes a create snapshot timeout issue.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 27, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 27, 2020
@xing-yang
Copy link
Collaborator Author

/assign @yuxiangqian

Copy link
Contributor

@yuxiangqian yuxiangqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, WIP

pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/utils/util.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 5, 2020
@xing-yang xing-yang changed the title WIP: Add AnnVolumeSnapshotBeingCreated Add AnnVolumeSnapshotBeingCreated Mar 11, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2020
@xing-yang
Copy link
Collaborator Author

/hold

Wait for @ShyamsundarR to test this PR.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2020
@ShyamsundarR
Copy link

Wait for @ShyamsundarR to test this PR.

Test results updated in the issue.

@xing-yang
Copy link
Collaborator Author

@ShyamsundarR addressed your comments.

@k8s-ci-robot k8s-ci-robot removed the release-note-none Denotes a PR that doesn't merit a release note. label Mar 30, 2020
Copy link
Contributor

@yuxiangqian yuxiangqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP


// scheduleOperation starts given asynchronous operation on given snapshot. It
// makes sure there is no running operation with the same operationName
func (ctrl *csiSnapshotSideCarController) scheduleOperation(operationName string, operation func() error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned with this refactoring for potential racing conditions.
the original logic uses a goroutine map to manage whether there is an existing goroutine for a specific operation on an object. This refactoring seems move that protection away?
Though the idempotence nature of CSI interfaces could still relief the problem, but it might still cause API updating failures.

Copy link
Collaborator Author

@xing-yang xing-yang Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without removing the scheduleOperation, I'll have to add a createSnapshotWrapper to be called by both createSnapshotOperation and checkandUpdateContentStatusOperation.

Another problem with the scheduleOperation is that it won't return any error code from a failed create snapshot operation. Without the error code, we couldn't add a failed item back to the worker queue. Retry will still happen with the resync period, but there was a review comment during the beta review to do the requeue with proper throttling/backoff (#214). Removing the scheduleOperation and return error code is the first step towards fixing the requeue logic. The next step is to revive this PR: #230

Copy link
Collaborator Author

@xing-yang xing-yang Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduleOperation code is based on implementation of the in-tree PV/PVC controller. I see that the in-tree PV/PVC controller also does not do requeue.

The other in-tree controllers such as attach/detach controller, expand controller, and pv/pvc protection controller don't have scheduleOperation and all have the reuque logic.
The out-of-tree provisioner (https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go) which is used together with external-provisioner also does not use scheduleOperation and it does have requeue logic. It handles create/delete/update as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xref: kubernetes/kubernetes#85661

Imo, relying on a resync to requeue is not a good design pattern to continue. It then requires you have a frequent resync period to increase performance, which generates more unnecessary churn especially as the number of objects scale.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. That confirms that we should get rid of scheduleOperation so that we can properly do requeue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also snapshot controller and csi-snapshotter sidecar are separate controllers. Normally when you use kubebuilder, it would generate a controller that goes with the CRDs but that would be for the snapshot controller. For the sidecar controller, we still need to write a separate one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?
again, I am not suggesting kubebuilder is the way to go :), its just an alternative could be considered.
my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?

What I meant to say is that we may not be able to use kubebuilder for sidecar anyway. The scheduleOperation code is only in the sidecar. The common controller does not use scheduleOperation (needs to clean up unused code).

my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning code?

We could look into that but that could make the code look very ugly. Normally requeue happens in snapshot_controller_base.go. If we do requeue inside that operation, we would need to do requeue in snapshot_controller.go.

How about this? I'll remove the code that removes scheduleOperation from this PR and do that together with this other requeue PR later. But that means I'll have to add a createSnapshotWrapper so that createSnapshotOperation won't be called by checkandUpdateContentStatusOperation.

On another note, we probably don't need checkandUpdateContentStatusOperation any way. Even in the in-tree PV/PVC controller, I only see operation being used for create/delete, but not on update.

Copy link
Contributor

@yuxiangqian yuxiangqian Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?

What I meant to say is that we may not be able to use kubebuilder for sidecar anyway. The scheduleOperation code is only in the sidecar. The common controller does not use scheduleOperation (needs to clean up unused code).

It's possible to use kubebuilder for sidecar as long as the CRD/API is in a separated repo.

my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning code?

We could look into that but that could make the code look very ugly. Normally requeue happens in snapshot_controller_base.go. If we do requeue inside that operation, we would need to do requeue in snapshot_controller.go.

you are right, that's not good. lets not moving towards that.

How about this? I'll remove the code that removes scheduleOperation from this PR and do that together with this other requeue PR later. But that means I'll have to add a createSnapshotWrapper so that createSnapshotOperation won't be called by checkandUpdateContentStatusOperation.

SGTM, this PR is already complex, and I am a bit hesitate to include too many changes right before release cut which is an optimization not critical bug fix. WDYT?

On another note, we probably don't need checkandUpdateContentStatusOperation any way. Even in the in-tree PV/PVC controller, I only see operation being used for create/delete, but not on update.

I complete agree. I think the current controller implementation could be improved by given more time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll break this PR into several PRs.

  • This PR will only include the fix for create snapshot timeout. This means I'll need to add that createSnapshotWrapper back:). We want this fix for this release.
  • I'll move the check for PVC is being used by the current snapshot to a separate PR as it is fixing the problem of removing the PVC finalizer. We want this fix for this release.
  • The changes related to scheduleOperation will be moved to a future PR as it is an optimization.

}

// Get snapshot source which is a PVC
pvc, err := ctrl.getClaimFromVolumeSnapshot(snapshot)
if err != nil {
klog.Infof("cannot get claim from snapshot [%s]: [%v] Claim may be deleted already. No need to remove finalizer on the claim.", snapshot.Name, err)
return nil
return false, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its a temp API error to get PVC from API server, this could leak a potential finalizer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We could add a retry? I don't know what is a better way to fix this.

pkg/common-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/common-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/common-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/sidecar-controller/snapshot_controller.go Show resolved Hide resolved
@xing-yang xing-yang force-pushed the create_timeout branch 2 times, most recently from 844cfdc to 5491852 Compare April 1, 2020 03:34
@xing-yang
Copy link
Collaborator Author

Removed the change to remove scheduleOperation, but still need to add createSnapshotWrapper change back.

@xing-yang
Copy link
Collaborator Author

Added createSnapshotWrapper. This only contains the fix for create snapshot timeout now.

@yuxiangqian It's ready for review again.

pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/sidecar-controller/snapshot_controller.go Show resolved Hide resolved
return false
}
switch st.Code() {
case codes.Canceled, // gRPC: Client Application cancelled the request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the documented RPC error code here
It looks only three types have been documented by CSI spec. ResourceExhausted, Aborted -> may retry later, Already_Existed -> should fail.
are we sure the rest are supported? If yes, should the CSI spec be fixed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSI spec only documents error codes directly returned by the CSI driver. Error codes such as DeadlineExceeded comes from the gRPC layer when timeout happens, not directly returned by the CSI driver. That's why not all gRPC error codes are documented in CSI spec.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are concerns regarding the gRPC error codes being re-used in the CSI spec. See issue here: container-storage-interface/spec#419. That is a bigger design issue. We can modify these error codes when they are changed in the CSI spec.

pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
contentClone := content.DeepCopy()
metav1.SetMetaDataAnnotation(&contentClone.ObjectMeta, utils.AnnVolumeSnapshotBeingCreated, "yes")

updatedContent, err := ctrl.clientset.SnapshotV1beta1().VolumeSnapshotContents().Update(contentClone)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I saw this section seems to be a repeating pattern to update a content in API server and local cache, might worth to extract this piece out to a separate function. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to make this change, but realized that I also have to log different messages at each place it is called for easy debugging. It does not save too much. If you insist, I can make the change. Otherwise, I'd prefer to keep it this way.

pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
Copy link
Contributor

@yuxiangqian yuxiangqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two nits and LGTM

pkg/sidecar-controller/snapshot_controller.go Outdated Show resolved Hide resolved
}
// This is a wrapper function for the snapshot creation process.
func (ctrl *csiSnapshotSideCarController) createSnapshotWrapper(content *crdv1.VolumeSnapshotContent) (*crdv1.VolumeSnapshotContent, error) {
klog.Infof("createSnapshotWrapper: Creating snapshot for content %s through the plugin ...", content.Name)

class, snapshotterCredentials, err := ctrl.getCSISnapshotInput(content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added error status check in createSnapshotWrapper. Can you check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#r397544373" does not seem to be the spot that you are referring to. My understanding is that you want me to add error status check in createSnapshotWrapper. If it is not the case, please clarify.

@xing-yang
Copy link
Collaborator Author

@yuxiangqian addressed your comments. PTAL.

@yuxiangqian
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2020
@k8s-ci-robot k8s-ci-robot merged commit cdbae99 into kubernetes-csi:master Apr 3, 2020
huww98 added a commit to huww98/external-snapshotter that referenced this pull request Nov 22, 2024
e31de525b Merge pull request kubernetes-csi#261 from huww98/golang
fd153a9e2 Bump golang to 1.23.1
a8b3d0504 pull-test.sh: fix "git subtree pull" errors
227577e00 Merge pull request kubernetes-csi#258 from gnufied/enable-race-detection
e1ceee287 Always enable race detection while running tests

git-subtree-dir: release-tools
git-subtree-split: e31de525b39d99e762c2b0fc825576ec623fa1a6
AndrewSirenko pushed a commit to AndrewSirenko/external-snapshotter that referenced this pull request Dec 10, 2024
e31de525b Merge pull request kubernetes-csi#261 from huww98/golang
fd153a9e2 Bump golang to 1.23.1
a8b3d0504 pull-test.sh: fix "git subtree pull" errors
227577e00 Merge pull request kubernetes-csi#258 from gnufied/enable-race-detection
e1ceee287 Always enable race detection while running tests

git-subtree-dir: release-tools
git-subtree-split: e31de525b39d99e762c2b0fc825576ec623fa1a6
AndrewSirenko added a commit to AndrewSirenko/external-snapshotter that referenced this pull request Dec 10, 2024
98f23071 Merge pull request kubernetes-csi#260 from TerryHowe/update-csi-driver-version
e9d8712d Merge pull request kubernetes-csi#259 from stmcginnis/deprecated-kind-kube-root
faf79ff6 Remove --kube-root deprecated kind argument
734c2b95 Merge pull request kubernetes-csi#265 from Rakshith-R/consider-main-branch
f95c855b Merge pull request kubernetes-csi#262 from huww98/golang-toolchain
3c8d966f Treat main branch as equivalent to master branch
e31de525 Merge pull request kubernetes-csi#261 from huww98/golang
fd153a9e Bump golang to 1.23.1
a8b3d050 pull-test.sh: fix "git subtree pull" errors
6b05f0fc use new GOTOOLCHAIN env to manage go version
18b6ac6d chore: update CSI driver version to 1.15
227577e0 Merge pull request kubernetes-csi#258 from gnufied/enable-race-detection
e1ceee28 Always enable race detection while running tests

git-subtree-dir: release-tools
git-subtree-split: 98f23071d946dd3de3188a7e1f84679067003162
AndrewSirenko added a commit to AndrewSirenko/external-snapshotter that referenced this pull request Dec 10, 2024
406a79ac Merge pull request kubernetes-csi#267 from huww98/gomodcache
9cec273d Set GOMODCACHE to avoid re-download toolchain
98f23071 Merge pull request kubernetes-csi#260 from TerryHowe/update-csi-driver-version
e9d8712d Merge pull request kubernetes-csi#259 from stmcginnis/deprecated-kind-kube-root
faf79ff6 Remove --kube-root deprecated kind argument
734c2b95 Merge pull request kubernetes-csi#265 from Rakshith-R/consider-main-branch
f95c855b Merge pull request kubernetes-csi#262 from huww98/golang-toolchain
3c8d966f Treat main branch as equivalent to master branch
e31de525 Merge pull request kubernetes-csi#261 from huww98/golang
fd153a9e Bump golang to 1.23.1
a8b3d050 pull-test.sh: fix "git subtree pull" errors
6b05f0fc use new GOTOOLCHAIN env to manage go version
18b6ac6d chore: update CSI driver version to 1.15
227577e0 Merge pull request kubernetes-csi#258 from gnufied/enable-race-detection
e1ceee28 Always enable race detection while running tests

git-subtree-dir: release-tools
git-subtree-split: 406a79acf021b5564108afebeea7d0ed44648d3f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems dealing with snapshot create requests timing out
5 participants