Add AnnVolumeSnapshotBeingCreated #261
Conversation
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: xing-yang
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/assign @yuxiangqian
some comments, WIP
/hold Wait for @ShyamsundarR to test this PR.
Test results updated in the issue.
@ShyamsundarR addressed your comments.
WIP
// scheduleOperation starts given asynchronous operation on given snapshot. It
// makes sure there is no running operation with the same operationName
func (ctrl *csiSnapshotSideCarController) scheduleOperation(operationName string, operation func() error) {
I am concerned with this refactoring for potential race conditions.
The original logic uses a goroutine map to track whether there is an existing goroutine for a specific operation on an object. This refactoring seems to move that protection away?
Though the idempotent nature of the CSI interfaces could still relieve the problem, it might still cause API update failures.
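For reference, a minimal sketch of the kind of goroutine-map guard the original scheduleOperation relied on (modeled loosely on the in-tree goroutinemap utility; the names here are illustrative, not the project's actual code): at most one goroutine may be in flight per operation name, so duplicate work on the same object is dropped.

package main

import (
	"fmt"
	"sync"
	"time"
)

// operationGuard is a stripped-down stand-in for the goroutine map: it allows
// at most one in-flight goroutine per operation name.
type operationGuard struct {
	mu      sync.Mutex
	running map[string]bool
}

func newOperationGuard() *operationGuard {
	return &operationGuard{running: map[string]bool{}}
}

// Run launches operation in a new goroutine unless one with the same name is
// already running, in which case the duplicate request is dropped.
func (g *operationGuard) Run(name string, operation func() error) {
	g.mu.Lock()
	if g.running[name] {
		g.mu.Unlock()
		return
	}
	g.running[name] = true
	g.mu.Unlock()

	go func() {
		defer func() {
			g.mu.Lock()
			delete(g.running, name)
			g.mu.Unlock()
		}()
		if err := operation(); err != nil {
			fmt.Printf("operation %q failed: %v\n", name, err)
		}
	}()
}

func main() {
	g := newOperationGuard()
	for i := 0; i < 3; i++ {
		g.Run("create-content-snapcontent-1", func() error {
			fmt.Println("creating snapshot") // printed once; duplicates are skipped
			time.Sleep(100 * time.Millisecond)
			return nil
		})
	}
	time.Sleep(200 * time.Millisecond)
}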
Without removing scheduleOperation, I'll have to add a createSnapshotWrapper to be called by both createSnapshotOperation and checkandUpdateContentStatusOperation.
Another problem with scheduleOperation is that it won't return any error code from a failed create snapshot operation. Without the error code, we can't add a failed item back to the worker queue. Retry will still happen with the resync period, but there was a review comment during the beta review asking us to do the requeue with proper throttling/backoff (#214). Removing scheduleOperation and returning the error code is the first step toward fixing the requeue logic. The next step is to revive this PR: #230
The scheduleOperation code is based on the implementation of the in-tree PV/PVC controller. I see that the in-tree PV/PVC controller also does not do requeue.
The other in-tree controllers, such as the attach/detach controller, the expand controller, and the PV/PVC protection controller, don't have scheduleOperation and all have the requeue logic.
The out-of-tree provisioner (https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go), which is used together with external-provisioner, also does not use scheduleOperation and it does have requeue logic. It handles create/delete/update as well.
Xref: kubernetes/kubernetes#85661
Imo, relying on a resync to requeue is not a good design pattern to continue. It then requires a frequent resync period to keep retries timely, which generates more unnecessary churn, especially as the number of objects scales.
ACK. That confirms that we should get rid of scheduleOperation so that we can properly do requeue.
Also, the snapshot controller and the csi-snapshotter sidecar are separate controllers. Normally when you use kubebuilder, it generates a controller that goes with the CRDs, but that would be the snapshot controller. For the sidecar controller, we still need to write a separate one.
true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?
again, I am not suggesting kubebuilder is the way to go :), it's just an alternative that could be considered.
my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning the error code?
true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?
What I meant to say is that we may not be able to use kubebuilder for the sidecar anyway. The scheduleOperation code is only in the sidecar. The common controller does not use scheduleOperation (the unused code needs to be cleaned up).
my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning the error code?
We could look into that, but that could make the code look very ugly. Normally requeue happens in snapshot_controller_base.go. If we do requeue inside that operation, we would need to do requeue in snapshot_controller.go.
How about this? I'll remove the code that removes scheduleOperation from this PR and do that together with the other requeue PR later. But that means I'll have to add a createSnapshotWrapper so that createSnapshotOperation won't be called by checkandUpdateContentStatusOperation.
On another note, we probably don't need checkandUpdateContentStatusOperation anyway. Even in the in-tree PV/PVC controller, I only see operations being used for create/delete, but not for update.
true, and wasn't that the plan to separate sidecar/snapshot controller into two repos anyways?
What I meant to say is that we may not be able to use kubebuilder for the sidecar anyway. The scheduleOperation code is only in the sidecar. The common controller does not use scheduleOperation (the unused code needs to be cleaned up).
It's possible to use kubebuilder for the sidecar as long as the CRD/API is in a separate repo.
my main concern here is really the potential racing I listed initially. can requeue happen inside of createSnapshotOperation? that would solve the problem of returning the error code?
We could look into that, but that could make the code look very ugly. Normally requeue happens in snapshot_controller_base.go. If we do requeue inside that operation, we would need to do requeue in snapshot_controller.go.
you are right, that's not good. let's not move towards that.
How about this? I'll remove the code that removes scheduleOperation from this PR and do that together with the other requeue PR later. But that means I'll have to add a createSnapshotWrapper so that createSnapshotOperation won't be called by checkandUpdateContentStatusOperation.
SGTM. This PR is already complex, and I am a bit hesitant to include too many changes right before the release cut for what is an optimization rather than a critical bug fix. WDYT?
On another note, we probably don't need checkandUpdateContentStatusOperation anyway. Even in the in-tree PV/PVC controller, I only see operations being used for create/delete, but not for update.
I completely agree. I think the current controller implementation could be improved given more time.
I'll break this PR into several PRs.
- This PR will only include the fix for the create snapshot timeout. This means I'll need to add that createSnapshotWrapper back :). We want this fix for this release.
- I'll move the check for whether the PVC is being used by the current snapshot to a separate PR, as it fixes the problem of removing the PVC finalizer. We want this fix for this release.
- The changes related to scheduleOperation will be moved to a future PR as they are an optimization.
}

// Get snapshot source which is a PVC
pvc, err := ctrl.getClaimFromVolumeSnapshot(snapshot)
if err != nil {
	klog.Infof("cannot get claim from snapshot [%s]: [%v] Claim may be deleted already. No need to remove finalizer on the claim.", snapshot.Name, err)
	return false, nil
If it's a temporary API error getting the PVC from the API server, couldn't this leak the finalizer?
Yes. We could add a retry? I don't know of a better way to fix this.
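One possible direction (a sketch only; the helper name is an assumption for illustration): distinguish a confirmed NotFound from any other API error, and return the error in the latter case so the item is requeued and retried instead of silently skipping the finalizer removal.

package controller

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// shouldSkipFinalizerRemoval sketches the idea: a confirmed NotFound means the
// claim is already gone and there is nothing to do, while any other error is
// surfaced so the caller can retry later rather than leaking the finalizer.
func shouldSkipFinalizerRemoval(getErr error) (skip bool, err error) {
	if getErr == nil {
		return false, nil // claim exists; continue with the normal checks
	}
	if apierrors.IsNotFound(getErr) {
		return true, nil // claim already deleted; no finalizer to remove
	}
	return false, getErr // temporary API error; let the caller requeue and retry
}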
Removed the change to remove scheduleOperation, but still need to add the createSnapshotWrapper change back.
Added createSnapshotWrapper. This only contains the fix for the create snapshot timeout now. @yuxiangqian It's ready for review again.
	return false
}
switch st.Code() {
case codes.Canceled, // gRPC: Client Application cancelled the request
I checked the documented RPC error codes here.
It looks like only three types have been documented by the CSI spec: ResourceExhausted and Aborted -> may retry later; AlreadyExists -> should fail.
Are we sure the rest are supported? If yes, should the CSI spec be fixed?
The CSI spec only documents error codes directly returned by the CSI driver. Error codes such as DeadlineExceeded come from the gRPC layer when a timeout happens and are not directly returned by the CSI driver. That's why not all gRPC error codes are documented in the CSI spec.
There are concerns regarding the gRPC error codes being re-used in the CSI spec. See issue here: container-storage-interface/spec#419. That is a bigger design issue. We can modify these error codes when they are changed in the CSI spec.
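For readers of this thread, a standalone sketch (assumed function name; the set of codes is illustrative, not the PR's authoritative list) of the classification being discussed: codes like DeadlineExceeded are produced by the gRPC layer on timeout, so they are treated as "the operation may still be in progress, retry", while other codes are treated as final failures.

package controller

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// mayBeInProgress sketches the check: true means the CreateSnapshot call might
// still be running (or is retriable) on the driver side, so the sidecar should
// keep the AnnVolumeSnapshotBeingCreated annotation and try again later.
func mayBeInProgress(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		// Not a gRPC status error; treat it as final.
		return false
	}
	switch st.Code() {
	case codes.Canceled, // client cancelled the request
		codes.DeadlineExceeded,  // gRPC timeout; the driver may still be working
		codes.Unavailable,       // transient transport failure
		codes.ResourceExhausted, // CSI spec: caller may retry later
		codes.Aborted:           // CSI spec: caller may retry later
		return true
	}
	return false
}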
contentClone := content.DeepCopy()
metav1.SetMetaDataAnnotation(&contentClone.ObjectMeta, utils.AnnVolumeSnapshotBeingCreated, "yes")

updatedContent, err := ctrl.clientset.SnapshotV1beta1().VolumeSnapshotContents().Update(contentClone)
nit: this section seems to be a repeating pattern of updating a content in the API server and the local cache; it might be worth extracting this piece into a separate function. WDYT?
I started to make this change, but realized that I would also have to log a different message at each call site for easy debugging, so it does not save much. If you insist, I can make the change. Otherwise, I'd prefer to keep it this way.
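For the record, the kind of helper being suggested might look roughly like this (a sketch only; the helper name and log handling are assumptions, and the msg argument is one way the per-call-site logging could be preserved):

// updateContentAndStore is a hypothetical helper: update the content in the API
// server, then refresh the controller's local cache, with a per-call-site message
// so each caller can still log context that is useful for debugging.
func (ctrl *csiSnapshotSideCarController) updateContentAndStore(content *crdv1.VolumeSnapshotContent, msg string) (*crdv1.VolumeSnapshotContent, error) {
	updated, err := ctrl.clientset.SnapshotV1beta1().VolumeSnapshotContents().Update(content)
	if err != nil {
		klog.V(4).Infof("%s: failed to update content %s in API server: %v", msg, content.Name, err)
		return content, err
	}
	// Keep the local cache in sync with what the API server returned.
	if _, err := ctrl.storeContentUpdate(updated); err != nil {
		klog.Errorf("%s: failed to store updated content %s in cache: %v", msg, updated.Name, err)
		return updated, err
	}
	return updated, nil
}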
two nits and LGTM
}

// This is a wrapper function for the snapshot creation process.
func (ctrl *csiSnapshotSideCarController) createSnapshotWrapper(content *crdv1.VolumeSnapshotContent) (*crdv1.VolumeSnapshotContent, error) {
	klog.Infof("createSnapshotWrapper: Creating snapshot for content %s through the plugin ...", content.Name)

	class, snapshotterCredentials, err := ctrl.getCSISnapshotInput(content)
https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#r397544373
this has not been updated.
Added error status check in createSnapshotWrapper. Can you check?
This "https://github.com/kubernetes-csi/external-snapshotter/pull/261/files#r397544373" does not seem to be the spot that you are referring to. My understanding is that you want me to add error status check in createSnapshotWrapper. If it is not the case, please clarify.
@yuxiangqian addressed your comments. PTAL.
/lgtm
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR adds a VolumeSnapshotBeingCreated annotation on the VolumeSnapshotContent
when the CreateSnapshot request is sent to the CSI driver.
The annotation is removed when the driver responds with success or failure.
If a timeout happens, the request is retried and the annotation remains.
While the VolumeSnapshotContent has the VolumeSnapshotBeingCreated annotation
set, the csi-snapshotter sidecar will not delete the snapshot resource on the storage system.
Which issue(s) this PR fixes:
Fixes #134
Special notes for your reviewer:
Does this PR introduce a user-facing change?: