Creating and deleting a large number of snapshots leaves stale snapshots in the backend #446
Comments
@ShyamsundarR @humblec PTAL
There are a few things happening here, primarily because the requests are timing out. The snapshotter sidecar does NOT retry taking snapshots; it tries once, and if the call times out, the backend snapshot can be left behind with Kubernetes never recording a success.

The lock fix patch (#443) alleviates the problem, as it speeds up the operations (including snapshots), but there are still corner cases that can leak snapshots.

BTW, there can similarly be corner cases in PVC->PV creates and deletes that leak images, because Kubernetes never recorded a success from the plugin. I have seen this happen when working on the performance improvements with the locks. It needs some more analysis, but there are cases where, with large RPC response times, we may leak an image.

Looking at the logs you provided, there are some discrepancies in the call numbers (i.e. Create/Delete/RPC success etc.), which explains the leaks, but to me it also looks like you attempted to invoke delete before the snapshots were created. In my test, I attempted creating 25 snapshots using the attached script, which waits for each snapshot to be reported ready before proceeding. BUT, even with the above and without the lock fix patch, only 9 snapshots were marked ready and had …. Also, I tried an experiment adding a sleep to the ….

We need to understand what to do next, and also possibly raise this with the CSI/kube folks to understand expectations here and how to handle not losing state.
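For reference, here is a minimal sketch of a test along the lines described above (not the attached script itself): it creates a number of VolumeSnapshots, waits for each to report ready, then deletes them. The namespace, PVC and snapshot class names, and the status field path are assumptions and may differ per cluster and snapshot API version.

```python
#!/usr/bin/env python3
# Hypothetical test sketch: create N VolumeSnapshots, wait for each to
# report ready, then delete them. All names and the API version below are
# placeholders and may differ in your environment.
import subprocess
import time

NAMESPACE = "default"                    # assumed namespace
PVC_NAME = "rbd-pvc"                     # assumed source PVC
SNAP_CLASS = "csi-rbdplugin-snapclass"   # assumed VolumeSnapshotClass
COUNT = 25

SNAPSHOT_TEMPLATE = """apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: {name}
  namespace: {ns}
spec:
  snapshotClassName: {snapclass}
  source:
    name: {pvc}
    kind: PersistentVolumeClaim
"""

def kubectl(*args, stdin=None):
    return subprocess.run(["kubectl", *args], input=stdin,
                          capture_output=True, text=True)

def wait_ready(name, timeout=300):
    """Poll the snapshot until status.readyToUse is true (the field name
    may differ across snapshot API versions)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = kubectl("get", "volumesnapshot", name, "-n", NAMESPACE,
                      "-o", "jsonpath={.status.readyToUse}")
        if out.stdout.strip() == "true":
            return True
        time.sleep(2)
    return False

for i in range(COUNT):
    name = f"snap-{i}"
    manifest = SNAPSHOT_TEMPLATE.format(name=name, ns=NAMESPACE,
                                        snapclass=SNAP_CLASS, pvc=PVC_NAME)
    kubectl("apply", "-f", "-", stdin=manifest)
    if not wait_ready(name):
        print(f"{name}: not ready before timeout")

# Delete only after all creates have been attempted.
for i in range(COUNT):
    kubectl("delete", "volumesnapshot", f"snap-{i}", "-n", NAMESPACE)
```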
We possibly need the snapshotter sidecar to implement a more robust timeout and retry mechanism like the provisioner does: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling
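As a rough illustration (not the sidecar's actual code) of the provisioner-style handling linked above, the idea is to retry the call with exponential backoff and a growing per-call timeout instead of giving up after one attempt; this is only safe if the plugin's calls are idempotent. All names and defaults here are illustrative.

```python
# Rough sketch of provisioner-style CSI error/timeout handling: retry the
# call with exponential backoff and a growing per-call timeout.
import time

def call_with_retry(rpc, initial_timeout=10.0, max_timeout=300.0,
                    initial_delay=1.0, max_delay=60.0, max_attempts=10):
    """`rpc` is any callable taking a timeout in seconds and raising
    TimeoutError on timeout; names and defaults are illustrative."""
    timeout, delay = initial_timeout, initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(timeout)
        except TimeoutError:
            # The backend may still have completed the operation, so the
            # retried call must be idempotent (reconcile, don't duplicate).
            print(f"attempt {attempt} timed out after {timeout:.0f}s; retrying")
            time.sleep(delay)
            timeout = min(timeout * 2, max_timeout)
            delay = min(delay * 2, max_delay)
    raise RuntimeError("operation did not succeed within the retry budget")
```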
On further thought, snapshots may not be able to be retried endlessly (owing to concepts like freezing/thawing workloads using the volume prior to and after the snapshot; IO to the volume cannot be held back for long durations). It may hence require some other form of fix in the Kubernetes snapshotter. Will start a discussion there.
@ShyamsundarR can you point me to the snapshot discussion, if it has already started?
Describe the bug
Creating and deleting a large number of snapshots leaves stale snapshots in the backend.
Environment details
rbd plugin logs
rbd.log
snapshotter logs
snap.log
Steps to reproduce
Steps to reproduce the behavior:
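The original reproduction steps were not captured here; based on the actual results below, a rough approximation (with hypothetical names and API version) is to create a large number of snapshots and delete them immediately, without waiting for them to become ready:

```python
# Hypothetical reproduction sketch: rapidly create and then delete 50
# VolumeSnapshots without waiting for readiness. The PVC, snapshot class
# and API version are placeholders.
import subprocess

MANIFEST = """apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: snap-{i}
spec:
  snapshotClassName: csi-rbdplugin-snapclass
  source:
    name: rbd-pvc
    kind: PersistentVolumeClaim
"""

for i in range(50):
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=MANIFEST.format(i=i), text=True)

for i in range(50):
    subprocess.run(["kubectl", "delete", "volumesnapshot", f"snap-{i}",
                    "--wait=false"])
```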
Actual results
Created 50 snapshots and deleted all 50; around 14 stale snapshots remain in the backend.
Expected behavior
Once the Kubernetes snapshots are deleted, there should not be any stale snapshots left in the backend.
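For reference, a rough way to compare what Kubernetes still tracks against what exists on the backend is sketched below; the pool/image names are placeholders, and the `rbd` command must be run somewhere the Ceph cluster is reachable.

```python
# Hypothetical stale-snapshot check: count VolumeSnapshotContents still
# known to Kubernetes versus snapshots present on the backend RBD image.
import subprocess

def lines(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [l for l in out.stdout.splitlines() if l.strip()]

# VolumeSnapshotContents still tracked by Kubernetes.
k8s_contents = lines(["kubectl", "get", "volumesnapshotcontent",
                      "--no-headers"])

# Snapshots present on the backend image (skip the header line).
backend_snaps = lines(["rbd", "snap", "ls", "rbd-pool/pvc-image"])[1:]

print(f"kubernetes snapshot contents: {len(k8s_contents)}")
print(f"backend rbd snapshots:        {len(backend_snaps)}")
print(f"potentially stale:            "
      f"{max(0, len(backend_snaps) - len(k8s_contents))}")
```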