CSI plugin is not deregistered when a new version of the plugin job changes the CSI plugin id #20225
Comments
Hi @Jamesits and thanks for raising this issue. It looks like we are not correctly handling the ID update, so the destructive cleanup of the original CSI plugin ID never happens. I'll add this to our roadmap.
I've had a look at this and some of the behavior we're seeing here is intentional. We don't clean up the plugin until its jobs are completely deleted, because we need to account for the delay between allocations coming up and down and the plugin being fingerprinted by the node. That being said, what should be working is garbage collection. But in this case, we detect the plugin as still having the job even though it has no allocations for the job and there's a more current version of the job. I'm working on a patch that will update the logic for GC to account for that. What you've done here as a workaround should work:
So if you didn't see that work on one cluster, there may be an unrelated issue where the plugin thinks it's still in use. I'd need to see the plugin status / logs for more details.
PR is up here: #20555
When a job that implements a plugin is updated to have a new plugin ID, the old version of the plugin is never deleted. We want to delay deleting plugins until garbage collection to avoid race conditions between a plugin being registered and its allocations being marked healthy. Add logic to the state store's `DeleteCSIPlugin` method (used only by GC) to check whether any of the jobs associated with the plugin have no allocations and either have been purged or have been updated to no longer implement that plugin ID. This changeset also updates the CSI plugin lifecycle tests in the state store to use `shoenig/test` over `testify`, and removes a spurious error log that was happening on every periodic plugin GC attempt. Fixes: #20225
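The check described in this commit message can be illustrated with a minimal, self-contained sketch. This is not the actual `DeleteCSIPlugin` code from Nomad's state store: the `jobInfo` and `plugin` types, their fields, and the `canDeletePlugin` helper are hypothetical names invented for illustration. It only models the stated rule that a job stops holding a plugin once it has no allocations for it and has either been purged or been updated to no longer implement that plugin ID.

```go
package main

import "fmt"

// jobInfo is a hypothetical, simplified view of a job as seen by GC.
// It is NOT a Nomad type; the fields are assumptions for this sketch.
type jobInfo struct {
	Purged       bool     // the job has been purged from state
	PluginIDs    []string // plugin IDs implemented by the job's current version
	PluginAllocs int      // allocations of this job still serving the plugin
}

// plugin is a hypothetical stand-in for a registered CSI plugin and
// the IDs of the jobs that have been associated with it.
type plugin struct {
	ID     string
	JobIDs []string
}

// implements reports whether the job's current version still declares pluginID.
func implements(j jobInfo, pluginID string) bool {
	for _, id := range j.PluginIDs {
		if id == pluginID {
			return true
		}
	}
	return false
}

// jobHoldsPlugin reports whether a job should keep the plugin alive.
// A job releases the plugin only when it has no allocations for it AND
// it was either purged or updated to no longer implement the plugin ID.
func jobHoldsPlugin(j jobInfo, pluginID string) bool {
	if j.PluginAllocs > 0 {
		return true
	}
	if j.Purged {
		return false
	}
	return implements(j, pluginID)
}

// canDeletePlugin reports whether GC may delete the plugin: no associated
// job may still hold it. Jobs missing from the map are treated as purged
// (an assumption of this sketch).
func canDeletePlugin(p plugin, jobs map[string]jobInfo) bool {
	for _, jobID := range p.JobIDs {
		j, ok := jobs[jobID]
		if !ok {
			continue
		}
		if jobHoldsPlugin(j, p.ID) {
			return false
		}
	}
	return true
}

func main() {
	// The job was updated from plugin "old-id" to "new-id" and has no
	// allocations left for the old plugin.
	jobs := map[string]jobInfo{
		"csi-job": {PluginIDs: []string{"new-id"}, PluginAllocs: 0},
	}
	oldPlugin := plugin{ID: "old-id", JobIDs: []string{"csi-job"}}
	newPlugin := plugin{ID: "new-id", JobIDs: []string{"csi-job"}}

	fmt.Println("old plugin deletable:", canDeletePlugin(oldPlugin, jobs))
	fmt.Println("new plugin deletable:", canDeletePlugin(newPlugin, jobs))
}
```

In this model, a job that still implements the plugin ID but has no allocations yet keeps the plugin alive, which reflects the stated goal of avoiding races between plugin registration and allocations being marked healthy.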
#20555 has been merged and will ship in Nomad 1.8.0 (with backports to supported versions).
Nomad version
1.7.6
Operating system and Environment details
Ubuntu 22.04
Issue
When a CSI job has a new version with a different CSI plugin id, Nomad is unable to unregister the storage plugin with the old id cleanly.
Reproduction steps
`job.group[].task[].csi_plugin.id` changed directly
Expected Result
Web UI -> Storage -> Plugins shows only the new id
Actual Result
Web UI -> Storage -> Plugins shows both the old id and the new id, and there is no way to remove the old one
I've tried changing the CSI plugin id back, stopping all the plugin jobs, and purging the related jobs completely. It worked for one cluster, but did not on another cluster of the same configuration (created with near identical Terraform plans). This might be caused by an unknown issue.
Job file (if appropriate)
N/A
Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
N/A
Related #7306