Description
We find that zookeeper-operator can accidentally delete the PVCs of the ZooKeeper pods when the operator experiences a restart and talks to a stale apiserver in an HA k8s cluster. After several rounds of inspection, we find that the root cause is the staleness of the events fed from the apiserver to the controller.
More concretely, the controller performs reconcile() according to the events watched from the apiserver, and the watch is served from the apiserver's local cache without any strong consistency guarantee. If that apiserver's watch cache is stale (e.g., due to a network partition), the controller will perform reconcile() according to the stale events, which can lead to failures.
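For illustration, here is a minimal client-go sketch (not taken from the operator's code) of the consistency semantics involved: a list or watch that specifies ResourceVersion "0" may be answered from the apiserver's watch cache rather than a quorum read from etcd, so the returned data, like the watch events the controller consumes, can be stale.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// ResourceVersion "0" means "any version is acceptable": the apiserver is
	// allowed to answer from its watch cache instead of performing a quorum
	// read from etcd, so the result (like watch events) may lag behind the
	// latest state.
	pvcs, err := cs.CoreV1().PersistentVolumeClaims("default").List(
		context.TODO(), metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("observed %d PVCs (possibly from a stale cache)\n", len(pvcs.Items))
}
```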
In particular, if the controller receives a stale update event indicating "ZookeeperCluster has a non-zero deletion timestamp" from the stale apiserver, it believes a deletion is in progress and decides to delete the PVCs of that ZookeeperCluster. Note that such a stale event actually comes from the deletion of a previous ZookeeperCluster that has a different UID from the currently running one. However, the controller lists all related PVCs using only the name and namespace, roughly as sketched below:
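The following is a hedged sketch of the name/namespace-only lookup (the label key and client wiring are assumptions, not the operator's verbatim code): every PVC labelled with the cluster name is selected, regardless of which ZookeeperCluster instance (i.e., which UID) created it.

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPVCsByName selects PVCs purely by namespace and a name-based label,
// so PVCs created by a newer, same-named ZookeeperCluster also match.
func listPVCsByName(ctx context.Context, c client.Client, name, namespace string) (*corev1.PersistentVolumeClaimList, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	err := c.List(ctx, pvcs,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": name}, // name-based selector only, no UID
	)
	return pvcs, err
}
```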
If the currently running ZookeeperCluster uses the same name as the previously deleted one (e.g., we delete and recreate a ZookeeperCluster), the controller will mistakenly conclude that the listed PVCs belong to the "deleted ZookeeperCluster" and invoke deletePVC to delete them.
We list the concrete reproduction steps below:
1. Run the controller in an HA k8s cluster and create a ZookeeperCluster zkc. The controller is talking to apiserver1 (which is not stale).
2. Delete zkc. apiserver1 sends update events with a non-zero deletion timestamp to the controller, and the controller deletes the related PVCs of zkc during reconcile. Meanwhile, apiserver2 is partitioned, so its watch cache stops at the moment zkc is tagged with the deletion timestamp.
3. Create a ZookeeperCluster with the same name again. The ZookeeperCluster and its PVCs come back. However, apiserver2 still holds the stale view that zkc has a non-zero deletion timestamp and is about to be deleted.
4. The controller crashes due to a node failure and restarts (or a standby replica becomes the leader). This time the controller talks to the stale apiserver2 and receives the stale update events indicating that zkc has a deletion timestamp. Since the controller lists PVCs using only the name and namespace, all the PVCs belonging to the newly created ZookeeperCluster are listed and deleted.
Importance
blocker: This bug makes the controller perform unexpected deletions when it reads stale data from the apiserver, which can further lead to data loss or availability issues.
Location
zookeepercluster_controller.go
Suggestions for an improvement
We are willing to issue a PR to help fix this.
The bug can be fixed by tagging each PVC with the UID of the owning ZookeeperCluster in MakeStatefulSet and listing PVCs by that UID. Each ZookeeperCluster instance has a unique UID even when it reuses the same name, so PVCs belonging to the current ZookeeperCluster will not be deleted in response to events about the old one. We can issue a PR that adds the UID as a label on each PVC; a rough sketch follows.
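As a rough sketch of the proposed fix (the label keys and helper names below are illustrative, not the final PR): MakeStatefulSet would stamp the owning ZookeeperCluster's UID onto each PVC template, and the cleanup path would then select PVCs by that UID in addition to the name.

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// addUIDLabel would be called from MakeStatefulSet for every PVC template so
// that each PVC records which ZookeeperCluster instance created it.
func addUIDLabel(owner metav1.Object, tmpl *corev1.PersistentVolumeClaim) {
	if tmpl.Labels == nil {
		tmpl.Labels = map[string]string{}
	}
	tmpl.Labels["uid"] = string(owner.GetUID()) // hypothetical label key
}

// listPVCsForInstance selects only the PVCs carrying the current instance's
// UID, so stale events about an older, same-named ZookeeperCluster no longer
// match the new instance's PVCs.
func listPVCsForInstance(ctx context.Context, c client.Client, owner metav1.Object) (*corev1.PersistentVolumeClaimList, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	err := c.List(ctx, pvcs,
		client.InNamespace(owner.GetNamespace()),
		client.MatchingLabels{
			"app": owner.GetName(),        // existing name-based label (assumed)
			"uid": string(owner.GetUID()), // new UID label
		},
	)
	return pvcs, err
}
```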