
The datavolume pointing to an existing virtual machine was deleted by the garbage collector #3134

Closed
anyushun opened this issue Mar 19, 2024 · 6 comments
What happened:
We have a VirtualMachine and a DataVolume. The VM YAML is shown below:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  creationTimestamp: "2023-01-29T06:44:25Z"
  name: evm-cfb1bef9upshljjib6ng
  namespace: evm-1460
  selfLink: /apis/kubevirt.io/v1/namespaces/evm-1460/virtualmachines/evm-cfb1bef9upshljjib6ng
  uid: 9500ddab-5ef2-406f-9414-28efb506d5c6
spec:
  dataVolumeTemplates:
  - metadata:
      annotations:
      creationTimestamp: null
      name: dv-cfb1bfn9upsqt09rpct0
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: csi-rbd-sc
        volumeMode: Block
      source:
        http:
          url: http://image-cache/image.qcow2
  runStrategy: RerunOnFailure
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: evm-cfb1bef9upshljjib6ng
    spec:
      volumes:
      - dataVolume:
          name: dv-cfb1bfn9upsqt09rpct0
        name: cd-cfb1bfn9upsqt09rpctg

The DV YAML is:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  creationTimestamp: "2024-02-20T18:38:01Z"
  labels:
    kubevirt.io/created-by: 9500ddab-5ef2-406f-9414-28efb506d5c6
  name: dv-cfb1bfn9upsqt09rpct0
  namespace: evm-1460
  ownerReferences:
  - apiVersion: kubevirt.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachine
    name: evm-cfb1bef9upshljjib6ng
    uid: 9500ddab-5ef2-406f-9414-28efb506d5c6
  selfLink: /apis/cdi.kubevirt.io/v1beta1/namespaces/evm-1460/datavolumes/dv-cfb1bfn9upsqt09rpct0
  uid: 93c10278-6dbb-4970-bd1d-ae2301970d9f
spec:
  pvc:
    accessModes:
    - ReadWriteMany
    resources:
      requests:
        storage: 100Gi
    storageClassName: csi-rbd-sc
    volumeMode: Block
  source:
    http:
      url: http://image-cache/image.qcow2
status:
  phase: Succeeded
  progress: 100.0%

Everything was normal until 2024-02-21 02:38:00, when the VM evm-cfb1bef9upshljjib6ng failed: the DataVolume owned by the VM had been deleted by the garbage collector. kube-controller-manager log:

kube-controller-manager.log.INFO.20240217-023659.3106557:I0221 02:37:59.055712 3106557 garbagecollector.go:409] processing item [cdi.kubevirt.io/v1beta1/DataVolume, namespace: evm-1460, name: dv-cfb1bfn9upsqt09rpct0, uid: 36bba869-279a-4eee-9675-3102e25a5a25]
kube-controller-manager.log.INFO.20240217-023659.3106557:I0221 02:38:00.055702 3106557 garbagecollector.go:522] delete object [cdi.kubevirt.io/v1beta1/DataVolume, namespace: evm-1460, name: dv-cfb1bfn9upsqt09rpct0, uid: 36bba869-279a-4eee-9675-3102e25a5a25] with propagation policy Background
kube-controller-manager.log.INFO.20240217-023659.3106557:I0221 02:39:01.356342 3106557 garbagecollector.go:409] processing item [v1/PersistentVolumeClaim, namespace: evm-1460, name: dv-cfb1bfn9upsqt09rpct0, uid: b59960ea-43d1-4376-ab0b-bf1812a156a9]
kube-controller-manager.log.INFO.20240217-023659.3106557:I0221 02:39:02.354953 3106557 garbagecollector.go:522] delete object [v1/PersistentVolumeClaim, namespace: evm-1460, name: dv-cfb1bfn9upsqt09rpct0, uid: b59960ea-43d1-4376-ab0b-bf1812a156a9] with propagation policy Background
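For context on the second pair of log lines: CDI creates the PVC with an ownerReference pointing back to the DV, so once the DV is deleted the GC cascades to the PVC. A sketch of what that PVC metadata looks like (not captured from the cluster; the DV uid below is taken from the GC log above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dv-cfb1bfn9upsqt09rpct0
  namespace: evm-1460
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: dv-cfb1bfn9upsqt09rpct0
    uid: 36bba869-279a-4eee-9675-3102e25a5a25  # DV uid as reported by the garbage collector log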

I tried to find the cause of the problem from several directions:

  1. Virtual machine migration caused the DataVolume to be deleted? At 2024-02-20 16:24:44 a migration of the VM failed due to lack of resources, and at 2024-02-20 18:20:07 a migration succeeded. The DV was deleted at 2024-02-21 02:38:00, and virtual machine migration does not delete DVs, so this can be ruled out.

  2. The virtual machine was deleted? To this day the virtual machine is still running normally, so this can be ruled out as well.

  3. kube-controller-manager deleted the DV by mistake? See Kubernetes issues 98471 and 92743. Both are ruled out: the DV and the VM are in the same namespace, and kubectl-check-ownerreferences reports "No invalid ownerReferences found".

What you expected to happen:
While the virtual machine exists, a DataVolume whose ownerReference points to that virtual machine must not be deleted.

How to reproduce it (as minimally and precisely as possible):
Reproducing it does not seem to be easy; at least, we have not found a reliable way.

Additional context:
Before the DV was deleted, a kube-controller-manager leader election took place:

I0104 02:34:19.786027 edge-231-2.cn_b36c7470-70c8-4e02-b84c-dccc18784b88 became leader
I0106 02:40:02.562808 edge-231-3.cn_4a54b5ad-1f34-4625-ba6b-d8c86f9c7e94 became leader
I0108 02:35:19.678290 edge-231-5.cn_51b780a3-4cfd-4334-9850-77aeefbc9636 became leader
I0123 02:37:34.077279 edge-231-3.cn_65bd9e0c-7064-4105-bd8d-f9061a44c3de became leader
I0125 02:42:54.600166 edge-231-5.cn_2e3d2088-b009-43ab-a19f-4edf98cc1b5a became leader
I0210 21:12:30.235067 edge-231-1.cn_de6aca2a-691c-4f5f-8890-4a828721a9e2 became leader
I0210 21:20:12.480774 edge-231-3.cn_9897f5d7-4b40-43ca-b957-f849de225921 became leader
I0217 02:37:10.526154 edge-231-1.cn_311d6bc7-2591-44e7-aa10-7ffa8fcabe40 became leader
I0220 02:37:30.213190 edge-231-5.cn_b343ad18-f7f8-496a-acf6-bfc8c0cff83a became leader
I0221 02:37:18.519236 edge-231-3.cn_f5959e0a-3fd1-48b8-b3d8-d727e117ebcc became leader

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.41.0
  • Kubernetes version (use kubectl version): v1.18.19
  • DV specification: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)
  • Kernel (e.g. uname -a): 4.18.0-3.3.el7.x86_64
  • Install tools: N/A
  • Others: N/A
@awels (Member) commented Mar 21, 2024:

Since it is the garbage collector deleting the DV, it will also delete the PVC, because the DV owns it; that part makes sense. But for the garbage collector to delete the DV, it must think the owner (the VM) is being deleted as well. Can you check whether the deletionTimestamp on the VM resource is set? That is the only reason I can think of that could cause the GC to delete the DV.
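For reference, a VM that is actually being deleted would carry the field in question, roughly like this sketch (illustrative values only; the timestamp below is made up):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: evm-cfb1bef9upshljjib6ng
  namespace: evm-1460
  # Set by the API server once a delete has been requested; if this field is
  # absent, the VM is not being deleted and the GC has no reason to collect its dependents.
  deletionTimestamp: "2024-02-21T02:37:59Z"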

@anyushun (Author) commented:

Very happy to receive your reply. The deletionTimestamp on the VM evm-cfb1bef9upshljjib6ng is not set.

@awels (Member) commented Mar 22, 2024:

Okay, that exhausts my ideas. Note that the versions of Kubernetes and CDI you are running are very old; we really only support n-2 releases, which currently means CDI 1.58 down to 1.56, though for critical issues we will sometimes go back a little further.

It seems to me you hit a bug in Kubernetes here, as I am not aware of anything on our end that would delete DataVolumes in the background. What version of KubeVirt are you running?

@anyushun (Author) commented:

CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.41.0
Kubernetes version (use kubectl version): v1.18.19
Kubevirt version: v0.51.0

@akalenyu (Collaborator) commented:

One thing you could try is to set up audit logging to see the exact sequence of events:
https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/
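A minimal audit policy sketch along those lines (field names follow the audit.k8s.io/v1 Policy API; adjust resources and levels to taste), which would record exactly which client issued the delete on the DV and PVC:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log full request and response bodies for mutations of the objects involved.
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "cdi.kubevirt.io"
    resources: ["datavolumes"]
  - group: "kubevirt.io"
    resources: ["virtualmachines"]
  - group: ""
    resources: ["persistentvolumeclaims"]
# Drop everything else to keep the audit log small.
- level: None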

And let me echo the concern about the old k8s/CDI versions: as @awels mentioned, these versions are no longer supported.

@anyushun (Author) commented:

Thank you for the helpful discussion @awels @akalenyu. Our team has identified the root cause of the problem:

We use a cluster-scoped CR, MigrationPolicy, to control the live-migration process, and we set an ownerReference on it pointing to the namespaced VM (kubevirt.io/v1alpha3/VirtualMachine). This triggered the bug described in Kubernetes issue 98471.

We have changed how we use MigrationPolicy: the ownerReference has been removed, so it is no longer collected by the kube-controller-manager's garbage collector, and a controller we developed ourselves is now responsible for managing the MigrationPolicy lifecycle.
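For anyone hitting the same symptom, this is a sketch of the pattern that triggered it for us (the MigrationPolicy apiVersion and the policy name below are illustrative, not copied from our cluster): a cluster-scoped object carrying an ownerReference to a namespaced VM, which is exactly the kind of invalid cross-scope reference described in Kubernetes issue 98471.

# Anti-pattern: ownerReferences must not cross scopes. A cluster-scoped object
# may only reference other cluster-scoped objects, and a namespaced object may
# only reference objects in its own namespace.
apiVersion: migrations.kubevirt.io/v1alpha1  # assumed API version
kind: MigrationPolicy                        # cluster-scoped
metadata:
  name: policy-evm-cfb1bef9upshljjib6ng      # illustrative name
  ownerReferences:
  - apiVersion: kubevirt.io/v1alpha3
    kind: VirtualMachine                     # namespaced owner -> invalid for a cluster-scoped dependent
    name: evm-cfb1bef9upshljjib6ng
    uid: 9500ddab-5ef2-406f-9414-28efb506d5c6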
