[cinder-csi-plugin:bug] Volume Snapshot deletion is not detecting errors #2294
Comments
@josedev-union see a similar issue in Lirt/velero-plugin-for-openstack#78
@kayrus we don't have a plan to use velero.
@josedev-union I'm not asking you to use velero. I'm just sharing an issue that has the same consequences.
The volume can be cloned with the
I think the key ask here is to wait until the snapshot is really deleted, just like we
wait for the snapshot to become ready. Because the API returns a 202 response (https://docs.openstack.org/api-ref/block-storage/v3/index.html?expanded=delete-a-snapshot-detail#delete-a-snapshot), I think we should consider waiting for the snapshot to really be deleted (with a timeout, of course).
@jichenjc if the snapshot is in an error state, the API won't allow deleting it. The logic must be a bit cleverer and more complicated, e.g. reset the snapshot status, then try to delete it. I faced the same issue in the velero plugin and am testing this fix in my env. If the tests are good, we can onboard this logic into CSI.
That's a good suggestion, I will read the PR you provided :) Thanks for the great info!
I agree that it will be a bit complicated. But I think we need to keep in mind that resetting the status requires admin permission. We run cinder-csi-plugin using a tenant-scoped service account with the member role. (Downstream clusters will not normally have such a wide permission.)
@josedev-union we have two kinds of permissions in our openstack env: admin and cloud admin. Admin permissions allow resetting the status; cloud admin can force delete the resource. If you also have such roles, then the status reset can be triggered. Nevertheless, this reset option should be toggleable in the CSI config.
/assign |
https://docs.openstack.org/keystone/latest/admin/service-api-protection.html#admin so in your context (the right side is the definition above)?
I think in this case we MUST NOT wait in the controller's reconcile loop for the resource to be deleted. Depending on the order of deletion this may take an arbitrarily long time. It may never be deleted if the user doesn't also delete the dependent resources. We must exit the reconcile loop and allow the manager to call us again with exponential backoff.
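The non-blocking alternative can be sketched as follows: issue the delete, probe once, and return an error if the snapshot is still present so the caller (e.g. the external-snapshotter sidecar) retries with its own exponential backoff. The `issueDelete`/`exists` callbacks are hypothetical stand-ins for the real API calls:

```go
package main

import "fmt"

// deleteSnapshotNonBlocking issues the delete and then, instead of polling,
// fails fast if the snapshot is still present. The caller is expected to
// requeue the request with exponential backoff.
func deleteSnapshotNonBlocking(id string, issueDelete func(string) error, exists func(string) (bool, error)) error {
	if err := issueDelete(id); err != nil {
		return err
	}
	present, err := exists(id)
	if err != nil {
		return err
	}
	if present {
		// Not an error in OpenStack terms (the API answered 202), but
		// returning non-nil here makes the caller requeue the request.
		return fmt.Errorf("snapshot %s still present, requeueing", id)
	}
	return nil
}

func main() {
	gone := false
	issueDelete := func(string) error { gone = true; return nil }
	exists := func(string) (bool, error) { return !gone, nil }
	fmt.Println(deleteSnapshotNonBlocking("snap-1", issueDelete, exists))
}
```

The key design point is that no time is spent sleeping inside the reconcile path; the retry schedule is owned entirely by the manager.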
Yeah, that could also be another solution.
I just noticed that manila CSI doesn't have a finalizer, and during my tests I had a manila pvc which was successfully deleted from the k8s api, but still stuck in openstack.
Agreed, this might be another solution. It seems like the better option, but the goal is the same: check the deletion instead of taking 202 as complete.
Hi, is anyone still working on this issue? @kayrus If you haven't started this work, could you leave it to me? I think that we should introduce a finalizer to resolve the issue. When a user creates a volume from a snapshot,
Or alternatively each individual volume created from the snapshot would add its own finalizer. That would have the advantage of being its own reference count, and also giving the user a clue as to specifically which volumes are blocking the deletion of the snapshot. I can't think of a case where I've seen this pattern used before, though. We should take a moment to consider if there's a reason for that.
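The per-volume finalizer idea above can be sketched with plain slices; in the real plugin the finalizers would live on the VolumeSnapshotContent's ObjectMeta, and the finalizer name format here is a made-up example, not an existing convention:

```go
package main

import "fmt"

// meta is a pared-down stand-in for a Kubernetes object's metadata.
type meta struct {
	Finalizers []string
}

// volumeFinalizer builds a hypothetical per-volume finalizer name: one entry
// per volume cloned from the snapshot, so the list doubles as a reference
// count and shows the user exactly which volumes block deletion.
func volumeFinalizer(volumeID string) string {
	return "cinder.csi.openstack.org/volume-" + volumeID
}

// addFinalizer appends f unless it is already present (idempotent).
func addFinalizer(m *meta, f string) {
	for _, existing := range m.Finalizers {
		if existing == f {
			return
		}
	}
	m.Finalizers = append(m.Finalizers, f)
}

// removeFinalizer drops f, keeping the rest in order.
func removeFinalizer(m *meta, f string) {
	out := m.Finalizers[:0]
	for _, existing := range m.Finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	m.Finalizers = out
}

func main() {
	snap := &meta{}
	addFinalizer(snap, volumeFinalizer("vol-a"))
	addFinalizer(snap, volumeFinalizer("vol-b"))
	fmt.Println(len(snap.Finalizers)) // two volumes still reference the snapshot
	removeFinalizer(snap, volumeFinalizer("vol-a"))
	fmt.Println(snap.Finalizers) // only vol-b's finalizer remains
}
```

The snapshot becomes deletable only when the finalizer list is empty, which is exactly the reference-count behavior described above.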
Hi @mdbooth We all agree that if a volume is created from a snapshot, the snapshot shouldn't be deleted before the volume, right? But I haven't understood the sentence:
You mean that we add … For more information about …
I was referring to this finalizer ☝️
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten |
/lifecycle stale
/remove-lifecycle rotten |
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen |
@stephenfin: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle rotten |
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened:
I am using the Delete policy for both the PV class and the VS (VolumeSnapshot) class.
We use TrilioVault for Kubernetes as our K8s backup tool and run the TVK preflight plugin as part of the preflight check of cluster provisioning.
This TVK preflight check plugin creates a PVC and a VS from this source PVC, then creates a restore PVC from the snapshot.
Once all checks are finished, it deletes all test resources.
In OpenStack, if a volume was created from a volume snapshot, that snapshot cannot be deleted before its child volume is deleted.
So the first attempt to delete the VS fails because it takes a bit of time to delete the restored volume. Of course, the source volume deletion also fails, because it can only be deleted once the VS and the restored volume are deleted.
But the OpenStack snapshot API is async, which means it responds 202 (return value 0) to the DELETE snapshot request every time, even if the deletion failed. I.e., the
cs.Cloud.DeleteSnapshot
func will never fail in our scenario, so the k8s VS and VSC (VolumeSnapshotContent) objects are deleted without any issue and the request is not requeued even though the OpenStack resources are still there. (See cloud-provider-openstack/pkg/csi/cinder/controllerserver.go, line 415 in 9ed6d96.)
This results in garbage resources (volumes of the source PVC, and volume snapshots of the VS) on the OpenStack cloud.
What you expected to happen:
I think this could be considered an OpenStack API issue, more or less. But I think we can make a workaround at the plugin level.
The CSI plugin will check the volume snapshot status after requesting the deletion. If the volume snapshot is removed, it will return OK. If the volume snapshot status changes to ERROR, or its status remains Available (we can set a timeout for this check), it will return an error so the request can be requeued.
How to reproduce it:
It can be reproduced by using these example resources: https://github.com/kubernetes/cloud-provider-openstack/tree/master/examples/cinder-csi-plugin/snapshot
Anything else we need to know?:
Only volume snapshot deletion is the problem.
Environment:
openstack-cloud-controller-manager(or other related binary) version:
We just use cinder-csi-plugin. The versions are:
csiplugin: registry.k8s.io/provider-os/cinder-csi-plugin:v1.25.6
snapshotter: k8s.gcr.io/sig-storage/csi-snapshotter:v6.2.1
provisioner: k8s.gcr.io/sig-storage/csi-provisioner:v3.1.1
attacher: k8s.gcr.io/sig-storage/csi-attacher:v3.5.1
OpenStack version: ocata, wallaby (I think all versions, even Antelope, will have the same issue)
Others:
K8s: 1.25.9