Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: ensure that PastClaims are populated with correct mode #11932

Merged
merged 1 commit into from
Jan 27, 2022

Conversation

tgross
Copy link
Member

@tgross tgross commented Jan 25, 2022

Fixes #10927 (comment).

In the client's (*csiHook) Postrun() method, we make an unpublish RPC that includes a claim in the CSIVolumeClaimStateUnpublishing state and using the mode from the client. But then in the (*CSIVolume) Unpublish RPC handler, we query the volume from the state store (because we only get an ID from the client). And when we make the client RPC for the node unpublish step, we use the current volume's view of the mode. If the volume's mode has been changed before the old allocations can have their claims released, then we end up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the claim? Because the claim written by the GC job in (*CoreScheduler) csiVolumeClaimGC doesn't have a mode. Instead it just writes a claim in the unpublishing state to ensure the volumewatcher detects a "past claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the CSIVolumeDenormalize creates past claims for all nil allocations with a correct access mode set. This ends up being a fairly large refactoring of the method because otherwise fixing the mode ends up repeating the past claim construction in 4 places and balloons the size of the method.

@tgross tgross added this to the 1.2.5 milestone Jan 25, 2022
@tgross tgross changed the title csi: ensure that PastClaims are populated with correct mode CSI: ensure that PastClaims are populated with correct mode Jan 25, 2022
@tgross tgross requested review from jrasell, lgfa29 and shoenig January 26, 2022 20:58
@tgross
Copy link
Member Author

tgross commented Jan 26, 2022

I've tagged reviewers on this but it currently includes all the code in #11890 so y'all may want to wait till that's merged to review this one.

@tgross tgross marked this pull request as ready for review January 26, 2022 20:59
@tgross tgross self-assigned this Jan 26, 2022
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state and using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client). And when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations can have their claims released, then we end
up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the `CSIVolumeDenormalize` creates past
claims for all nil allocations with a correct access mode set.
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tgross tgross merged commit b588a7b into main Jan 27, 2022
@tgross tgross deleted the csi-postrun-mode branch January 27, 2022 15:05
tgross added a commit that referenced this pull request Jan 28, 2022
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state and using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client). And when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations can have their claims released, then we end
up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the `CSIVolumeDenormalize` creates past
claims for all nil allocations with a correct access mode set.
tgross added a commit that referenced this pull request Jan 28, 2022
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state and using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client). And when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations can have their claims released, then we end
up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the `CSIVolumeDenormalize` creates past
claims for all nil allocations with a correct access mode set.
@tgross
Copy link
Member Author

tgross commented Jan 28, 2022

Removed the backport/1.0 tag on this one because these modes don't exist in the CSIVolumeClaim struct in the 1.0.x branch.

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter
3 participants