-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSI: resolve invalid claim states #11890
Conversation
e436fe5
to
ea44d74
Compare
ea44d74
to
6353ea7
Compare
e5057ab
to
fb83d13
Compare
fb83d13
to
098cb22
Compare
098cb22
to
4c66b90
Compare
4c66b90
to
dcb2cc1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I've checked that this patch should be mergeable to 1.1.x and 1.0.x so we should be able to simply cherry-pick this commit for backport. |
* csi: resolve invalid claim states on read It's currently possible for CSI volumes to be claimed by allocations that no longer exist. This changeset asserts a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volumewatcher to trigger the unpublishing workflow for those claims. * csi: make feasibility check errors more understandable When the feasibility checker finds we have no free write claims, it checks to see if any of those claims are for the job we're currently scheduling (so that earlier versions of a job can't block claims for new versions) and reports a conflict if the volume can't be scheduled so that the user can fix their claims. But when the checker hits a claim that has a GCd allocation, the state is recoverable by the server once claim reaping completes and no user intervention is required; the blocked eval should complete. Differentiate the scheduler error produced by these two conditions.
* csi: resolve invalid claim states on read It's currently possible for CSI volumes to be claimed by allocations that no longer exist. This changeset asserts a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volumewatcher to trigger the unpublishing workflow for those claims. * csi: make feasibility check errors more understandable When the feasibility checker finds we have no free write claims, it checks to see if any of those claims are for the job we're currently scheduling (so that earlier versions of a job can't block claims for new versions) and reports a conflict if the volume can't be scheduled so that the user can fix their claims. But when the checker hits a claim that has a GCd allocation, the state is recoverable by the server once claim reaping completes and no user intervention is required; the blocked eval should complete. Differentiate the scheduler error produced by these two conditions.
* csi: resolve invalid claim states on read It's currently possible for CSI volumes to be claimed by allocations that no longer exist. This changeset asserts a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volumewatcher to trigger the unpublishing workflow for those claims. * csi: make feasibility check errors more understandable When the feasibility checker finds we have no free write claims, it checks to see if any of those claims are for the job we're currently scheduling (so that earlier versions of a job can't block claims for new versions) and reports a conflict if the volume can't be scheduled so that the user can fix their claims. But when the checker hits a claim that has a GCd allocation, the state is recoverable by the server once claim reaping completes and no user intervention is required; the blocked eval should complete. Differentiate the scheduler error produced by these two conditions.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Partial fix for #10927 #10052 #8734
It's currently possible (because of either bugs which we still need to resolve, or simply from node failures) for CSI volumes to be claimed by allocations that don't exist. This changeset is intended to assert a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volume watcher to trigger the unpublishing workflow for those claims.
Note this won't entirely fix the issue:
nomad/core_sched.go
about moving theTerminalStatus
check into the denormalize volume logic.