Backport of CSI: failed allocation should not block its own controller unpublish into release/1.2.x #14507

hc-github-team-nomad-core · 2022-09-08T17:30:42Z

Backport

This PR is auto-generated from #14484 to be assessed for backporting due to the inclusion of the label backport/1.2.x.

WARNING automatic cherry-pick of commits failed. Commits will require human attention.

The below text is copied from the body of the original PR.

A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check had a bug: it incorrectly counted the allocation belonging to the claim being freed. During a normal allocation stop (for a job stop or a new version of the job), that allocation is already terminal, so the bug has no effect. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host, and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. It also includes a test case covering this path.
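
The corrected conditional can be illustrated with a minimal Go sketch. This is not the code from the changeset: the `Allocation` and `Claim` types, the `Terminal` field, and the `nodeStillClaimsVolume` helper are hypothetical, simplified stand-ins, and treating a missing allocation as non-blocking is an assumption made for the example.

```go
package main

import "fmt"

// Allocation is a hypothetical, pared-down stand-in for an allocation record.
type Allocation struct {
	ID       string
	Terminal bool // true once the allocation has been marked terminal
}

// Claim is a hypothetical, pared-down stand-in for a CSI volume claim.
type Claim struct {
	AllocationID string
	NodeID       string
}

// nodeStillClaimsVolume reports whether any allocation other than the one
// being freed still holds a non-terminal claim on the same node. When it
// returns true, the controller unpublish would be skipped.
func nodeStillClaimsVolume(freed Claim, claims []Claim, allocs map[string]*Allocation) bool {
	for _, c := range claims {
		if c.NodeID != freed.NodeID {
			continue // claim lives on a different node
		}
		if c.AllocationID == freed.AllocationID {
			// The claim's own allocation must not block its own unpublish,
			// even if it has not been marked terminal yet.
			continue
		}
		alloc, ok := allocs[c.AllocationID]
		if !ok || alloc.Terminal {
			// Assumption for this sketch: a missing or terminal allocation
			// does not block the unpublish.
			continue
		}
		return true
	}
	return false
}

func main() {
	// A failed allocation that is not yet marked terminal.
	allocs := map[string]*Allocation{
		"failed": {ID: "failed", Terminal: false},
	}
	freed := Claim{AllocationID: "failed", NodeID: "node1"}
	claims := []Claim{freed}

	// Only the claim being freed exists on the node, so nothing blocks.
	fmt.Println(nodeStillClaimsVolume(freed, claims, allocs))

	// A second, running allocation on the same node does block.
	allocs["running"] = &Allocation{ID: "running", Terminal: false}
	claims = append(claims, Claim{AllocationID: "running", NodeID: "node1"})
	fmt.Println(nodeStillClaimsVolume(freed, claims, allocs))
}
```

Without the `AllocationID` comparison, the first call would report that the node still claims the volume, which is the failure mode described above.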

This PR includes two other tiny bug fixes that were going to be a pain if I had to backport 3 different PRs. They're in their own commits:

* Fix missing copies in the volume unpublish workflow. Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps.
* The list stub object for volumes in nomad/structs did not match the stub object in api. The api package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.
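
The copy-before-alter rule from the first of these two fixes can be sketched in Go as follows. Everything here is a hypothetical illustration rather than Nomad's actual code: the `Volume` and `StateStore` types, the `ReadClaims` map, and the `releaseClaim` helper are assumptions made up for the example.

```go
package main

import "fmt"

// Volume is a hypothetical, pared-down stand-in for a CSI volume record.
type Volume struct {
	ID         string
	ReadClaims map[string]string // allocation ID -> node ID
}

// Copy returns a deep copy so callers can alter the result without
// touching the object held by the state store.
func (v *Volume) Copy() *Volume {
	out := &Volume{ID: v.ID, ReadClaims: make(map[string]string, len(v.ReadClaims))}
	for allocID, nodeID := range v.ReadClaims {
		out.ReadClaims[allocID] = nodeID
	}
	return out
}

// StateStore is a hypothetical in-memory store that hands out shared pointers.
type StateStore struct {
	volumes map[string]*Volume
}

// VolumeByID returns the stored object itself, not a copy.
func (s *StateStore) VolumeByID(id string) *Volume {
	return s.volumes[id]
}

// releaseClaim copies the volume once at the top of the workflow and then
// alters only the copy, leaving the stored object unchanged.
func releaseClaim(store *StateStore, volID, allocID string) *Volume {
	vol := store.VolumeByID(volID)
	if vol == nil {
		return nil
	}
	vol = vol.Copy() // copy before altering anything read from the store
	delete(vol.ReadClaims, allocID)
	return vol
}

func main() {
	store := &StateStore{volumes: map[string]*Volume{
		"vol1": {ID: "vol1", ReadClaims: map[string]string{"alloc1": "node1"}},
	}}

	updated := releaseClaim(store, "vol1", "alloc1")

	fmt.Println("claims on updated copy:", len(updated.ReadClaims))
	fmt.Println("claims in state store:", len(store.VolumeByID("vol1").ReadClaims))
}
```

In this sketch the stored volume keeps its claim until the altered copy is written back through whatever update path the store provides, so concurrent readers never observe a half-altered object.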

…14484)

A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps.
* The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.

github-actions · 2023-01-07T02:14:33Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

hc-github-team-nomad-core requested a review from tgross September 8, 2022 17:30

hc-github-team-nomad-core force-pushed the backport/b-csi-controller-unpublish/usually-enormous-ox branch from 5c24fda to 301739b Compare September 8, 2022 17:30

tgross force-pushed the backport/b-csi-controller-unpublish/usually-enormous-ox branch from 301739b to 677a0c0 Compare September 8, 2022 18:43

vercel bot deployed to Preview – nomad September 8, 2022 18:46 View deployment

vercel bot had a problem deploying to Preview – nomad-storybook-and-ui September 8, 2022 18:46 Failure

tgross force-pushed the backport/b-csi-controller-unpublish/usually-enormous-ox branch from 677a0c0 to fab7e9a Compare September 8, 2022 18:54

vercel bot deployed to Preview – nomad September 8, 2022 18:57 View deployment

vercel bot had a problem deploying to Preview – nomad-storybook-and-ui September 8, 2022 18:57 Failure

tgross merged commit 3912d93 into release/1.2.x Sep 8, 2022

tgross deleted the backport/b-csi-controller-unpublish/usually-enormous-ox branch September 8, 2022 19:43

github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2023
