feat(provisioning) enable pod rescheduling due to insufficient node capacity #23

Merged
merged 2 commits into from
Mar 3, 2021

Conversation

iyashu
Contributor

@iyashu iyashu commented Feb 17, 2021

Why is this PR required? What issue does it fix?:
As of now, if volume provisioning fails due to insufficient capacity on the nodes, the pod will remain in Pending or ContainerCreating state forever (unless someone recreates it). See #18 for more details. Apart from this, the pull request also gracefully handles rpc timeouts & internal errors, ensuring idempotency in the provisioning workflow.

What this PR does?:
We introduce an additional field named VolumeError in the LVMVolume custom resource to propagate errors from the node agent to the controller. Additionally, the node agent will mark the LVM volume as failed if sufficient space is not available in the node's volume group. The controller will inspect the errors populated by the node agent and accordingly return the ResourceExhausted error code, so that the external provisioner can reschedule the pod. In other cases, the controller returns status Aborted to treat the volume provisioning as still in progress.

Note: The pull request currently doesn't handle rescheduling in case of implicit binding. It should be trivial & I'll create a separate PR for it.
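For illustration, here is a minimal sketch of how the new status error field could look. The type and constant names below are assumptions inferred from identifiers that appear later in this review (vol.Status.Error, volErr.Code, lvmapi.InsufficientCapacity), not the exact code from the PR.

// VolumeError propagates a provisioning failure from the node agent to the controller.
type VolumeError struct {
	Code    VolumeErrorCode `json:"code"`
	Message string          `json:"message"`
}

type VolumeErrorCode string

const (
	// Internal marks transient or internal failures that should be retried.
	Internal VolumeErrorCode = "Internal"
	// InsufficientCapacity marks a node volume group that cannot fit the request.
	InsufficientCapacity VolumeErrorCode = "InsufficientCapacity"
)

// VolStatus carries the current state of the provisioning request;
// Error stays nil unless provisioning failed on the selected node.
type VolStatus struct {
	State string       `json:"state"`
	Error *VolumeError `json:"error,omitempty"`
}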

Does this PR require any upgrade changes?:
No

If the changes in this PR are manually verified, list down the scenarios covered:
Consider a cluster where only one node has sufficient capacity to fulfill the PVC capacity request. Now, if you try to create a pod (with late binding), kube-scheduler will keep retrying scheduling whenever the selected node doesn't have enough capacity available in its volume group, until it finds the right node.

Any additional information for your reviewer? :
Mention if this PR is part of any design or a continuation of previous PRs

Checklist:

@iyashu iyashu force-pushed the lvm-error-handling branch from 029596b to 6e6368c on February 17, 2021 18:07
@pawanpraka1 pawanpraka1 added this to the v0.3 milestone Feb 18, 2021
pkg/mgmt/volume/volume.go (outdated diff, resolved)
"strconv"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

Contributor

why this change?

Contributor Author

I'm not sure if we are using gofmt for maintaining a consistent code format. goimports (being a superset of gofmt) separates standard library imports from third-party ones with a blank line. More details are here.
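For reference, this is the grouping goimports enforces for the imports shown in this hunk: standard library imports in the first block, third-party imports after a blank line.

import (
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)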

}
return err
Contributor

status is not updated here; will it block the controller forever?

Contributor Author

Yes, it is intentional. As of now, we are setting the failed state only in the case of insufficient capacity, where we don't want to retry the volume provisioning. In other cases, the controller loop will keep retrying the volume provisioning.
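To make the intent above concrete, here is a rough sketch of the node-agent side; the function and helper names are assumptions, not the PR's actual code. Only an insufficient-capacity failure is recorded as a terminal Failed state; every other error is returned so the workqueue retries.

func (c *VolController) provisionVolume(vol *apis.LVMVolume) error {
	err := lvm.CreateVolume(vol)
	if err == nil {
		// success: mark the volume ready
		return c.updateStatus(vol, apis.StatusReady, nil)
	}
	if isInsufficientCapacity(err) {
		// terminal for this node: record the error so the CSI controller
		// can return ResourceExhausted and let the pod be rescheduled
		volErr := &apis.VolumeError{
			Code:    apis.InsufficientCapacity,
			Message: err.Error(),
		}
		return c.updateStatus(vol, apis.StatusFailed, volErr)
	}
	// transient or internal error: leave the status untouched so the
	// controller loop keeps retrying the provisioning
	return err
}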


// parse bool params
boolParams := map[string]*bool{
"wait": &params.WaitForProvision,
Contributor

do we need this parameter? we will always wait for the provisioning.

Contributor Author

Yep, removed.
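For context on the hunk above, a minimal sketch of the map-of-pointers parsing pattern (the function name and the set of keys are illustrative, and per this thread the wait key itself was dropped):

import (
	"fmt"
	"strconv"
)

// parseBoolParams parses each recognised key with strconv.ParseBool and
// writes the result through its pointer, leaving defaults for absent keys.
func parseBoolParams(raw map[string]string, boolParams map[string]*bool) error {
	for key, target := range boolParams {
		val, ok := raw[key]
		if !ok {
			continue // keep the default value
		}
		parsed, err := strconv.ParseBool(val)
		if err != nil {
			return fmt.Errorf("invalid bool value %q for parameter %q: %v", val, key, err)
		}
		*target = parsed
	}
	return nil
}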

@@ -89,6 +112,7 @@ func CreateVolume(vol *apis.LVMVolume) error {
out, err := cmd.CombinedOutput()

if err != nil {
err = newExecError(out, err)
Contributor

should we use newExecError in the previous return at line 103?

Contributor Author

It's not required. As the name suggests, newExecError just wraps the error returned by the exec pkg after running the cmd. Wrapping other errors will not provide any extra information since the output will be empty.
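For illustration, a plausible shape for newExecError based on the explanation above (the PR's actual implementation may differ): it keeps the command's combined output alongside the exec error, so the failure reason printed by the lvm tooling (e.g. the insufficient free space message) isn't lost.

import "fmt"

type execError struct {
	output []byte
	err    error
}

func (e *execError) Error() string {
	// surface both the exec error and the command output
	return fmt.Sprintf("%v: %s", e.err, string(e.output))
}

func newExecError(output []byte, err error) error {
	if err == nil {
		return nil
	}
	return &execError{output: output, err: err}
}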

if volErr := vol.Status.Error; volErr != nil {
errMsg = volErr.Message
if volErr.Code == lvmapi.InsufficientCapacity {
reschedule = true
Contributor

can we make it reschedule every time there is a failure on the node, no matter for what reason? If it fails on a particular node, we would want to try the volume creation on some other node.

Contributor Author

I think it may not be a good idea to reschedule the volume in case of any (possibly intermittent) error. It'll result in sparse & unpredictable scheduling of storage resources. IMO, if there is an error related to a particular node, it's better to inspect & fix that error (by cordoning the node) rather than masking it with rescheduling.

@iyashu iyashu force-pushed the lvm-error-handling branch from 6e6368c to 30b5feb on February 28, 2021 15:27
@pawanpraka1
Contributor

@iyashu there is a conflict in the file pkg/driver/controller.go. Can you rebase and push?

capacity

It also involves some refactoring of csi param parsing and the volume provisioning workflow, improving timeouts, error handling & idempotency.

Signed-off-by: Yashpal Choudhary <[email protected]>
@iyashu iyashu force-pushed the lvm-error-handling branch from 30b5feb to 3fab614 on March 1, 2021 13:57
@iyashu
Contributor Author

iyashu commented Mar 1, 2021

@pawanpraka1 done, ptal.

@codecov-io

Codecov Report

Merging #23 (3fab614) into master (7effd08) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #23      +/-   ##
=========================================
- Coverage    1.20%   1.16%   -0.04%     
=========================================
  Files          11      12       +1     
  Lines         830     856      +26     
=========================================
  Hits           10      10              
- Misses        820     846      +26     
Impacted Files Coverage Δ
pkg/driver/controller.go 0.76% <0.00%> (-0.02%) ⬇️
pkg/driver/params.go 0.00% <0.00%> (ø)
pkg/driver/schd_helper.go 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7effd08...3fab614. Read the comment docs.

@iyashu iyashu requested a review from pawanpraka1 March 2, 2021 11:51
@@ -123,14 +123,28 @@ spec:
description: VolStatus string that specifies the current state of the volume
provisioning request.
properties:
error:
Contributor

do we need this now? we can directly rely on the state being failed to take the decision?

Contributor Author

@iyashu iyashu Mar 2, 2021

Yeah, functionally from a rescheduling point of view it can be removed now. I had just kept it for the purpose of better inspection & debugging as a user/maintainer. As we are propagating the error message to the csi external provisioner (along with ResourceExhausted), the same error message is being added as events under the pvc resource, e.g.

➜  lvm-localpv git:(storage-capacity) ✗ k describe pvc claim-a-test-a-4
Name:          claim-a-test-a-4
Namespace:     default
StorageClass:  openebs-ext4pv
Status:        Pending
Volume:
Labels:        app=test-a
Annotations:   volume.beta.kubernetes.io/storage-provisioner: local.csi.openebs.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       test-a-4
Events:
  Type     Reason                Age               From                                                                                Message
  ----     ------                ----              ----                                                                                -------
  Normal   WaitForFirstConsumer  22s               persistentvolume-controller                                                         waiting for first consumer to be created before binding
  Normal   Provisioning          5s (x4 over 22s)  local.csi.openebs.io_openebs-lvm-controller-0_6f7c487b-6397-4d19-a5d4-b4d84819f831  External provisioner is provisioning volume for claim "default/claim-a-test-a-4"
  Normal   ExternalProvisioning  5s (x5 over 22s)  persistentvolume-controller                                                         waiting for a volume to be created, either by external provisioner "local.csi.openebs.io" or manually created by system administrator
  Warning  ProvisioningFailed    3s (x4 over 19s)  local.csi.openebs.io_openebs-lvm-controller-0_6f7c487b-6397-4d19-a5d4-b4d84819f831  failed to provision volume with StorageClass "openebs-ext4pv": rpc error: code = ResourceExhausted desc =   Volume group "lvmvg" has insufficient free space (3071 extents): 5120 required.
 - exit status 5
  Normal  WaitForPodScheduled  3s (x5 over 19s)  persistentvolume-controller  waiting for pod test-a-4 to be scheduled

Contributor

@pawanpraka1 pawanpraka1 left a comment

Looks good. Can you describe how you have tested it? Can you have a setup and request a size which only one node can serve, and see if it is eventually tried/provisioned on that node?

}
return status.Errorf(codes.Internal,
"lvm: destroy wait failed, not able to get the volume %s %s", volname, err.Error())
if reschedule {
Contributor

we can move this block inside if volErr := vol.Status.Error; volErr != nil .

Contributor Author

We can, since the reschedule var is only set to true in one place. I'm just keeping it here for readability, so that we first take the decision on whether we want to reschedule or not, and then at the end run some cleanup actions based on whether rescheduling is required. I can move it if you think otherwise.
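To make the structure being discussed explicit, here is a simplified sketch of this part of the controller's CreateVolume handler; the cleanup helper and variable names are assumptions. It first decides whether the failure warrants rescheduling, then runs cleanup and maps the result to a gRPC code, as described in the PR summary.

	errMsg := "volume provisioning is still in progress"
	reschedule := false
	if volErr := vol.Status.Error; volErr != nil {
		errMsg = volErr.Message
		if volErr.Code == lvmapi.InsufficientCapacity {
			reschedule = true
		}
	}
	if reschedule {
		// clean up the failed LVMVolume so the external provisioner can
		// retry the claim on another node, then report ResourceExhausted
		if err := deleteFailedVolume(volName); err != nil {
			return nil, status.Errorf(codes.Internal, "failed to clean up volume %s: %v", volName, err)
		}
		return nil, status.Error(codes.ResourceExhausted, errMsg)
	}
	// any other pending or errored state is treated as still in progress
	return nil, status.Error(codes.Aborted, errMsg)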

@iyashu
Contributor Author

iyashu commented Mar 3, 2021

Looks good. Can you describe how you have tested it? Can you have a setup and request a size which only one node can serve, and see if it is eventually tried/provisioned on that node?

Exactly, see the pull request description above (If the changes in this PR are manually verified, list down the scenarios covered). In my case, I have a 4-node cluster (besides the master nodes), each node having a storage capacity of 32 GiB. I've created 4 pods, each requesting a storage capacity of 20 GiB. Now, I can see from the lvm node plugin logs that kube-scheduler first tries to schedule all 4 pods on the first 3 nodes, but eventually it schedules the 4th pod on the remaining node (which has sufficient capacity, since the other 3 nodes have only 12 GiB left).

Another test that I've run is when there are no nodes available to satisfy the PVC claim. In that case, kube-scheduler keeps retrying the PVC across nodes until some node has enough capacity to fit the claim size.

@pawanpraka1
Contributor

pawanpraka1 commented Mar 3, 2021

Sounds good @iyashu. Thanks for this PR.

@pawanpraka1 pawanpraka1 merged commit e965f93 into openebs:master Mar 3, 2021

Successfully merging this pull request may close these issues.

Add ability to reschedule the volume provisioning in case selected node doesn't have enough capacity
3 participants