Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: resolve invalid claim states #11890

Merged
merged 2 commits into from
Jan 27, 2022
Merged

CSI: resolve invalid claim states #11890

merged 2 commits into from
Jan 27, 2022

Conversation

tgross
Copy link
Member

@tgross tgross commented Jan 20, 2022

Partial fix for #10927 #10052 #8734

It's currently possible (because of either bugs which we still need to resolve, or simply from node failures) for CSI volumes to be claimed by allocations that don't exist. This changeset is intended to assert a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volume watcher to trigger the unpublishing workflow for those claims.


Note this won't entirely fix the issue:

@tgross tgross self-assigned this Jan 20, 2022
@vercel vercel bot temporarily deployed to Preview – nomad January 20, 2022 18:42 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from e436fe5 to ea44d74 Compare January 20, 2022 20:15
@vercel vercel bot temporarily deployed to Preview – nomad January 20, 2022 20:15 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from ea44d74 to 6353ea7 Compare January 24, 2022 18:26
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 18:26 Inactive
@tgross tgross added this to the 1.2.5 milestone Jan 24, 2022
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 20:06 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from e5057ab to fb83d13 Compare January 24, 2022 20:09
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 20:09 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from fb83d13 to 098cb22 Compare January 24, 2022 20:11
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 20:11 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from 098cb22 to 4c66b90 Compare January 24, 2022 20:38
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 20:38 Inactive
@tgross tgross force-pushed the csi-known-bad-state branch from 4c66b90 to dcb2cc1 Compare January 24, 2022 20:49
@vercel vercel bot temporarily deployed to Preview – nomad January 24, 2022 20:49 Inactive
@tgross tgross marked this pull request as ready for review January 24, 2022 20:49
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross merged commit d0624fc into main Jan 27, 2022
@tgross tgross deleted the csi-known-bad-state branch January 27, 2022 14:30
@tgross
Copy link
Member Author

tgross commented Jan 27, 2022

I've checked that this patch should be mergeable to 1.1.x and 1.0.x so we should be able to simply cherry-pick this commit for backport.

tgross added a commit that referenced this pull request Jan 28, 2022
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
tgross added a commit that referenced this pull request Jan 28, 2022
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
tgross added a commit that referenced this pull request Jan 28, 2022
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants