feat(provisioning): enable pod rescheduling due to insufficient node capacity #23
Conversation
Force-pushed from 029596b to 6e6368c
"strconv" | ||
|
||
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
|
why this change?
	}
	return err
The status is not updated here; won't it block the controller forever?
Yes, it is intentional. As of now, we set the failed state only in the case of insufficient capacity, where we don't want to retry the volume provisioning. In other cases, the controller loop will keep retrying the volume provisioning.
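For illustration, here is a self-contained Go sketch of the behaviour described above (the state values, struct fields, and error matching are assumptions for the example, not the actual lvm-localpv code): only an insufficient-capacity error marks the volume as failed; any other error is returned so the controller work queue keeps retrying.

package main

import (
	"errors"
	"fmt"
	"strings"
)

// Illustrative stand-in for the LVMVolume status; the real fields differ.
type volume struct {
	State   string
	Message string
}

// reconcileCreate sketches the node-agent decision: mark the volume Failed
// only when the volume group has insufficient capacity, otherwise return the
// error so the controller loop requeues and retries provisioning.
func reconcileCreate(vol *volume, createErr error) error {
	switch {
	case createErr == nil:
		vol.State = "Ready"
		return nil
	case strings.Contains(createErr.Error(), "insufficient free space"):
		// Terminal on this node: record the failure so the controller can
		// surface it and let the volume be rescheduled elsewhere.
		vol.State = "Failed"
		vol.Message = createErr.Error()
		return nil
	default:
		// Possibly intermittent error: keep the state unchanged and retry.
		return createErr
	}
}

func main() {
	vol := &volume{State: "Pending"}
	err := reconcileCreate(vol, errors.New(`Volume group "lvmvg" has insufficient free space`))
	fmt.Println(vol.State, err) // prints: Failed <nil>
}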
pkg/driver/params.go (Outdated)
	// parse bool params
	boolParams := map[string]*bool{
		"wait": &params.WaitForProvision,
do we need this parameter? we will always wait for the provisioning.
Yep, removed.
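For context, a minimal self-contained sketch of the boolean-parameter parsing pattern used in the diff above (the parameter keys and struct fields here are illustrative assumptions, not the exact driver code); this is also why the strconv import shows up:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Illustrative parameter struct; the real driver params differ.
type VolumeParams struct {
	Shared        bool
	ThinProvision bool
}

// parseBoolParams walks a map of parameter-name -> destination-field pointers
// and fills each field from the storage-class parameters via strconv.ParseBool.
func parseBoolParams(scParams map[string]string) (*VolumeParams, error) {
	params := &VolumeParams{}
	boolParams := map[string]*bool{
		"shared":        &params.Shared,
		"thinprovision": &params.ThinProvision,
	}
	for key, dst := range boolParams {
		val, ok := scParams[strings.ToLower(key)]
		if !ok {
			continue // absent parameter keeps its zero value (false)
		}
		parsed, err := strconv.ParseBool(val)
		if err != nil {
			return nil, fmt.Errorf("invalid value %q for parameter %q: %v", val, key, err)
		}
		*dst = parsed
	}
	return params, nil
}

func main() {
	p, err := parseBoolParams(map[string]string{"shared": "true"})
	fmt.Printf("%+v, err=%v\n", p, err)
}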
@@ -89,6 +112,7 @@ func CreateVolume(vol *apis.LVMVolume) error {
	out, err := cmd.CombinedOutput()

	if err != nil {
		err = newExecError(out, err)
should we use newExecError in the previous return at line 103?
It's not required. As the name suggests, newExecError only wraps the error returned by the exec pkg after running the cmd. Wrapping other errors will not provide any extra information, since the output will be empty.
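A minimal sketch of what such an exec-error wrapper could look like (the struct and its fields here are assumptions; the actual newExecError in the repo may differ). It only adds value for errors coming out of exec, because that is where the captured command output carries extra context:

package main

import (
	"fmt"
	"os/exec"
)

// execError pairs the underlying exec error with the command's combined output.
type execError struct {
	output []byte // combined stdout/stderr of the failed command
	err    error  // error returned by the exec package
}

func (e *execError) Error() string {
	return fmt.Sprintf("%v: %s", e.err, string(e.output))
}

// newExecError wraps an exec failure together with its output; for a nil
// error it returns nil so callers can wrap unconditionally.
func newExecError(output []byte, err error) error {
	if err == nil {
		return nil
	}
	return &execError{output: output, err: err}
}

func main() {
	out, err := exec.Command("lvcreate", "--badflag").CombinedOutput()
	if err != nil {
		err = newExecError(out, err)
	}
	fmt.Println(err)
}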
pkg/driver/controller.go (Outdated)
	if volErr := vol.Status.Error; volErr != nil {
		errMsg = volErr.Message
		if volErr.Code == lvmapi.InsufficientCapacity {
			reschedule = true
Can we make it reschedule every time there is a failure on the node, no matter the reason? If it fails on a particular node, we would want to try the volume creation on some other node.
I think it may not be a good idea to reschedule the volume in case of any (possibly intermittent) error. It'll result in sparse & unpredictable scheduling of storage resources. IMO, if there is any error related to a particular node, it's better to inspect & fix that error (by cordoning the node) rather than masking it with rescheduling.
Force-pushed from 6e6368c to 30b5feb
@iyashu there is a conflict in the file pkg/driver/controller.go. Can you rebase and push?
capacity. It also involves some refactoring of CSI param parsing and the volume provisioning workflow, improving timeouts, error handling & idempotency. Signed-off-by: Yashpal Choudhary <[email protected]>
Force-pushed from 30b5feb to 3fab614
@pawanpraka1 done, ptal.
Codecov Report
@@            Coverage Diff            @@
##           master     #23      +/-   ##
=========================================
- Coverage    1.20%   1.16%   -0.04%
=========================================
  Files          11      12       +1
  Lines         830     856      +26
=========================================
  Hits           10      10
- Misses        820     846      +26

Continue to review full report at Codecov.
Signed-off-by: Yashpal Choudhary <[email protected]>
@@ -123,14 +123,28 @@ spec:
  description: VolStatus string that specifies the current state of the volume
    provisioning request.
  properties:
    error:
Do we need this now? Can we directly rely on the state being failed to take the decision?
Yeah, functionally, from the rescheduling point of view, it can be removed now. I had just kept it for better inspection & debugging as a user/maintainer. As we are propagating the error message to the CSI external provisioner (along with ResourceExhausted), the same error message is being added as events under the PVC resource, e.g.
➜ lvm-localpv git:(storage-capacity) ✗ k describe pvc claim-a-test-a-4
Name: claim-a-test-a-4
Namespace: default
StorageClass: openebs-ext4pv
Status: Pending
Volume:
Labels: app=test-a
Annotations: volume.beta.kubernetes.io/storage-provisioner: local.csi.openebs.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: test-a-4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 22s persistentvolume-controller waiting for first consumer to be created before binding
Normal Provisioning 5s (x4 over 22s) local.csi.openebs.io_openebs-lvm-controller-0_6f7c487b-6397-4d19-a5d4-b4d84819f831 External provisioner is provisioning volume for claim "default/claim-a-test-a-4"
Normal ExternalProvisioning 5s (x5 over 22s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "local.csi.openebs.io" or manually created by system administrator
Warning ProvisioningFailed 3s (x4 over 19s) local.csi.openebs.io_openebs-lvm-controller-0_6f7c487b-6397-4d19-a5d4-b4d84819f831 failed to provision volume with StorageClass "openebs-ext4pv": rpc error: code = ResourceExhausted desc = Volume group "lvmvg" has insufficient free space (3071 extents): 5120 required.
- exit status 5
Normal WaitForPodScheduled 3s (x5 over 19s) persistentvolume-controller waiting for pod test-a-4 to be scheduled
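For reference, a rough Go sketch of what the new error field on the volume status could look like (type, field, and constant names are assumptions based on this PR, not the exact repository definitions):

package main

import (
	"encoding/json"
	"fmt"
)

// VolumeErrorCode distinguishes a terminal insufficient-capacity failure from
// generic internal errors.
type VolumeErrorCode string

const (
	Internal             VolumeErrorCode = "Internal"
	InsufficientCapacity VolumeErrorCode = "InsufficientCapacity"
)

// VolumeError is propagated from the node agent to the controller via the
// LVMVolume status.
type VolumeError struct {
	Code    VolumeErrorCode `json:"code,omitempty"`
	Message string          `json:"message,omitempty"`
}

// VolStatus mirrors the CRD status block shown in the diff above: a state
// string plus an optional error.
type VolStatus struct {
	State string       `json:"state,omitempty"`
	Error *VolumeError `json:"error,omitempty"`
}

func main() {
	s := VolStatus{
		State: "Failed",
		Error: &VolumeError{
			Code:    InsufficientCapacity,
			Message: `Volume group "lvmvg" has insufficient free space (3071 extents): 5120 required.`,
		},
	}
	b, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(b))
}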
Looks good. Can you describe how you have tested it? Can you set up a cluster, request a size that only one node can serve, and see if the volume eventually gets tried/provisioned on that node?
	}
	return status.Errorf(codes.Internal,
		"lvm: destroy wait failed, not able to get the volume %s %s", volname, err.Error())

if reschedule {
We can move this block inside if volErr := vol.Status.Error; volErr != nil.
We can, since the reschedule var is only set to true in one place. I'm just keeping it here for readability, where we first take the decision on whether we want to reschedule or not, and then at the end run some cleanup actions based on whether rescheduling is required. I can move it if you think otherwise.
Exactly, see the pull request description above (If the changes in this PR are manually verified, list down the scenarios covered). In my case, I have a 4-node cluster (besides master nodes), each node having a storage capacity of 32 GiB. I've created 4 pods, each requesting storage capacity of 20 GiB. Now, I can see from the lvm node plugin logs that kube-scheduler first tries to schedule all 4 pods on the first 3 nodes, but eventually it schedules the 4th pod on the remaining node (which has sufficient capacity, since the other 3 nodes have only 12 GiB left). Another test that I've run is when there are no nodes available to satisfy the PVC claim. In that case, kube-scheduler keeps retrying the PVC across nodes until some node has enough capacity to fit the claim size.
Sounds good @iyashu. Thanks for this PR.
Why is this PR required? What issue does it fix?:
As of now, if the volume provisioning fails due to insufficient capacity on the nodes, the pod will remain in the Pending or ContainerCreating state forever (unless someone recreates it). See #18 for more details. Apart from this, the pull request also gracefully handles RPC timeouts & internal errors, ensuring idempotency in the provisioning workflow.
What this PR does?:
We introduce an additional field named VolumeError in the LVMVolume custom resource to propagate the error from the node agent to the controller. Additionally, the node agent will mark the LVM volume as failed if sufficient space is not available in the node's vg. The controller will inspect the errors populated by the node agent and accordingly return the ResourceExhausted error code, so that the external provisioner can reschedule the pod. In other cases, the controller returns status Aborted to treat the volume provisioning as in progress.

Note: The pull request currently doesn't handle rescheduling in case of implicit binding. It should be trivial & I'll create a separate PR for the same.
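To make the controller-side decision concrete, here is a hedged Go sketch of the mapping described above (type and constant names are assumptions; the actual controller code may differ): an InsufficientCapacity error becomes gRPC ResourceExhausted so the external provisioner reschedules, and everything else becomes Aborted so the same request is retried.

package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Minimal stand-in for the LVMVolume status error used in this sketch.
type volumeError struct {
	Code    string
	Message string
}

const insufficientCapacity = "InsufficientCapacity"

// provisioningStatus translates the error recorded by the node agent into a
// gRPC status for the CSI CreateVolume response.
func provisioningStatus(volErr *volumeError) error {
	if volErr != nil && volErr.Code == insufficientCapacity {
		// Terminal on the selected node: ResourceExhausted tells the external
		// provisioner to pick another topology/node and try again.
		return status.Error(codes.ResourceExhausted, volErr.Message)
	}
	// Anything else is treated as still in progress; Aborted makes the
	// provisioner retry the same request later.
	return status.Error(codes.Aborted, "volume provisioning is in progress")
}

func main() {
	err := provisioningStatus(&volumeError{
		Code:    insufficientCapacity,
		Message: `Volume group "lvmvg" has insufficient free space`,
	})
	fmt.Println(status.Code(err), err)
}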
Does this PR require any upgrade changes?:
No
If the changes in this PR are manually verified, list down the scenarios covered:
Consider a cluster where only one node has sufficient capacity to fulfill the PVC capacity request. Now, if you try to create a pod (with late binding), kube-scheduler will keep retrying scheduling whenever a node doesn't have enough capacity available in its volume group, until it finds the right node.
Any additional information for your reviewer? :
Mention if this PR is part of any design or a continuation of previous PRs
Checklist:
<type>(<scope>): <subject>