Can't delete PVC, finalizer pvc-as-source-protection does not finish #2670

Open · kaitimmer opened this issue Nov 25, 2024 · 5 comments

@kaitimmer

What happened:
When deleting a PVC, the deletion process is "stuck".

The finalizer snapshot.storage.kubernetes.io/pvc-as-source-protection never completes.

If I patch the PVC in "Terminating" state and remove the finalizer, everything works as expected.
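Concretely, the workaround is roughly the following (pvc-something-0 stands in for the real PVC name; note that this clears all finalizers on the object, not only the snapshot one):

```sh
# Clear the finalizers on the stuck PVC so the pending delete can complete.
# Use with care: this bypasses whatever protection the finalizer provides.
kubectl patch pvc pvc-something-0 -p '{"metadata":{"finalizers":null}}'
```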

I've seen this behavior sporadically in multiple clusters, but in the current one it has been persisting for a couple of weeks already.

What you expected to happen:
The finalizer completes and I can delete the PVC without having to patch it first.

How to reproduce it:

kubectl delete pvc pvc-something-0

It does not matter which StorageClass or SKU is behind the PVC; when a cluster is affected, it affects all PVCs in it.
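To confirm that the PVC really is waiting on this finalizer (and not on something else), the object can be inspected like this:

```sh
# Show the deletion timestamp and remaining finalizers of the stuck PVC
kubectl get pvc pvc-something-0 -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'
```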

Anything else we need to know?:

While this error is present, I also cannot get any VolumeSnapshot into "ReadyToUse". It looks like everything that interacts with snapshots is broken in this cluster.
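A quick way to see whether snapshot handling is stuck cluster-wide is to list the ready state of all VolumeSnapshots, for example:

```sh
# On an affected cluster the READY column stays "false" (or empty) for new snapshots
kubectl get volumesnapshot -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.readyToUse'
```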

Environment:

  • CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.30.4
  • Kubernetes version (kubectl version): Client v1.31.2, Kustomize v5.4.2, Server v1.30.3
@andyzhangx
Member

It sounds like snapshot creation is stuck there. I could help troubleshoot if you could provide the AKS cluster FQDN.

@kaitimmer
Author

@andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.

However, seeing this got me thinking:

How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we hit this problem again? Do you have any numbers on that?
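For reference, a rough way to check how many snapshot objects a cluster is carrying:

```sh
# Count VolumeSnapshots (namespaced) and VolumeSnapshotContents (cluster-scoped)
kubectl get volumesnapshot -A --no-headers 2>/dev/null | wc -l
kubectl get volumesnapshotcontent --no-headers 2>/dev/null | wc -l
```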

@monotek
Member

monotek commented Nov 28, 2024

The downside of removing the finalizer is that we now have to delete the actual Azure snapshots manually from the Azure portal, because the CSI driver did not do it.

"az delete snapshot" seems to be rather slow for this (even with using --now-wait=true), needing about 5 seconds for every snapshot delete command.

I'll check again whether we can delete all snapshots at once, but my first try failed: we have ~20000 snapshots to delete and bash complained about too many arguments :D
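Something along these lines might work around the argument-list limit (MY_RESOURCE_GROUP is a placeholder, and this assumes all snapshots live in a single resource group):

```sh
# List all managed-disk snapshot IDs in the resource group and delete them
# in batches of 50, so no single command line becomes too long.
# --no-wait queues each delete instead of blocking ~5s on it.
az snapshot list -g MY_RESOURCE_GROUP --query "[].id" -o tsv \
  | xargs -n 50 az snapshot delete --no-wait --ids
```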

@andyzhangx
Member

> @andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.
>
> However, seeing this got me thinking:
>
> How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we hit this problem again? Do you have any numbers on that?

@kaitimmer as long as the snapshot container is working fine, that's OK. Recently we found that the memory limit of the snapshot container is too small when there are lots of snapshots, and eventually the snapshot container gets OOM-killed. So the question really comes down to the number of snapshots versus the memory limit of the snapshot container, i.e. how fast the CSI driver can process snapshots so that VolumeSnapshotContents don't accumulate. Just let me know when your cluster is stuck creating snapshots and I can increase the memory limit immediately. We will raise the memory limit by default later on, since the Azure service is in CCOA now.
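To check whether the snapshot controller is being OOM-killed, something like the following should work (the pod name is a placeholder; the exact deployment and label names may differ per cluster):

```sh
# Find the snapshot controller pod(s) and look at the last termination reason;
# "OOMKilled" indicates the memory limit was hit.
kubectl -n kube-system get pods | grep -i snapshot
kubectl -n kube-system describe pod <snapshot-controller-pod-name> | grep -i -A 3 "Last State"
```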

@kaitimmer
Author

Hi @andyzhangx,

One of our clusters is again in the state where the finalizer does not work. I will send you the ID and URI via email.

Since we cleaned up all the VolumeSnapshots, the sheer number is not the problem this time. I assume we are again in the state in which the problem originally started.
