CSI plugin is not deregistered when a new version of the plugin job changes the CSI plugin id #20225

Closed
Jamesits opened this issue Mar 26, 2024 · 4 comments · Fixed by #20555
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/storage type/bug

@Jamesits

Nomad version

1.7.6

Operating system and Environment details

Ubuntu 22.04

Issue

When a new version of a CSI plugin job changes the CSI plugin id, Nomad does not cleanly unregister the storage plugin that was registered under the old id.

Reproduction steps

  1. Submit a working CSI storage plugin job
  2. Submit a new version of the job with job.group[].task[].csi_plugin.id changed in place

Expected Result

Web UI -> Storage -> Plugins shows only the new id

Actual Result

Web UI -> Storage -> Plugins shows both the old id and the new id, and there is no way to remove the old one

I've tried changing the CSI plugin id back, stopping all the plugin jobs, purging the related jobs completely. It worked for one cluster, but did not on another cluster of the same configuration (created with nearly identical Terraform plans). The failure on the second cluster might be caused by a separate, unknown issue.

Screenshot from 2024-03-26 17-41-04

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

Related #7306

@jrasell
Member

jrasell commented Apr 2, 2024

Hi @Jamesits and thanks for raising this issue. It looks like we are not correctly handling the ID update, so the destructive actions are never performed on the original CSI ID. I'll add this to our roadmap.

@jrasell jrasell added theme/storage stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Apr 2, 2024
@tgross tgross self-assigned this May 9, 2024
@tgross
Member

tgross commented May 9, 2024

I've had a look at this and some of the behavior we're seeing here is intentional. We don't clean up the plugin until its jobs are completely deleted, because we need to account for the delay between allocations coming up and down and the plugin being fingerprinted by the node. That being said, what should be working is garbage collection. But in this case, we detect the plugin as still having the job even though it has no allocations for the job and there's a more current version of the job.

I'm working on a patch that will update the logic for GC to account for that.

What you've done here as a workaround should work:

I've tried changing the CSI plugin id back, stopping all the plugin jobs, purging the related jobs completely

So if you didn't see that work on one cluster, there may be an unrelated issue where the plugin thinks it's still in use. I'd need to see the plugin status / logs for more details.

@tgross
Member

tgross commented May 10, 2024

PR is up here: #20555

tgross added a commit that referenced this issue May 10, 2024
When a job that implements a plugin is updated to have a new plugin ID, the old
version of the plugin is never deleted. We want to delay deleting plugins until
garbage collection to avoid race conditions between a plugin being registered
and its allocations being marked healthy.

Add logic to the state store's `DeleteCSIPlugin` method (used only by GC) to
check whether any of the jobs associated with the plugin have no allocations and
either have been purged or have been updated to no longer implement that plugin
ID.

This changeset also updates the CSI plugin lifecycle tests in the state store to
use `shoenig/test` over `testify`, and removes a spurious error log that was
happening on every periodic plugin GC attempt.

Fixes: #20225
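
The check described in the commit message above can be illustrated with a minimal, self-contained sketch. This is not Nomad's actual `DeleteCSIPlugin` code: the `Job` and `Plugin` types, their fields, and the `pruneStaleJobs` function are hypothetical simplifications introduced only to show the rule that a job stops counting against a plugin once it has no allocations for it and has either been purged or no longer implements that plugin ID.

```go
package main

import "fmt"

// Job is a hypothetical, simplified stand-in for the current version of a job.
type Job struct {
	ID        string
	PluginIDs []string // CSI plugin IDs implemented by the current job version
}

// Plugin is a hypothetical, simplified stand-in for a CSI plugin record.
type Plugin struct {
	ID          string
	JobIDs      map[string]struct{} // jobs recorded as providing this plugin
	AllocsByJob map[string]int      // live plugin allocations per job ID
}

// implements reports whether the current job version still provides pluginID.
func (j *Job) implements(pluginID string) bool {
	for _, id := range j.PluginIDs {
		if id == pluginID {
			return true
		}
	}
	return false
}

// pruneStaleJobs removes jobs from the plugin record when they have no
// allocations for the plugin and have either been purged (absent from jobs)
// or updated so they no longer implement the plugin ID. It returns true when
// no jobs remain, meaning the plugin record would be eligible for deletion.
func pruneStaleJobs(p *Plugin, jobs map[string]*Job) bool {
	for jobID := range p.JobIDs {
		if p.AllocsByJob[jobID] > 0 {
			continue // allocations still exist, so the plugin is in use
		}
		job, found := jobs[jobID]
		if !found || !job.implements(p.ID) {
			delete(p.JobIDs, jobID) // deleting during range is safe in Go
		}
	}
	return len(p.JobIDs) == 0
}

func main() {
	// The job was resubmitted with its plugin ID changed from old-id to new-id.
	jobs := map[string]*Job{
		"csi-job": {ID: "csi-job", PluginIDs: []string{"new-id"}},
	}
	old := &Plugin{
		ID:          "old-id",
		JobIDs:      map[string]struct{}{"csi-job": {}},
		AllocsByJob: map[string]int{},
	}
	fmt.Println("old plugin can be garbage collected:", pruneStaleJobs(old, jobs))
}
```

In this sketch the decision is made only at collection time, which mirrors the stated intent of deferring deletion to GC so that a plugin is not removed while its allocations are still coming up.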
tgross added a commit that referenced this issue May 16, 2024
@tgross tgross added this to the 1.8.0 milestone May 16, 2024
@tgross
Member

tgross commented May 16, 2024

#20555 has been merged and will ship in Nomad 1.8.0 (with backports to supported versions).
