[BUG] PVC can be mistakenly deleted when reading from stale apiserver #312

Closed
srteam2020 opened this issue Mar 15, 2021 · 2 comments · Fixed by #313

srteam2020 (Contributor) commented Mar 15, 2021

Description

We find that zookeeper-operator can mistakenly delete the PVC of a zookeeper pod when the operator restarts and talks to a stale apiserver in an HA k8s cluster. After several rounds of inspection, we find that the root cause is the staleness of the events fed from the apiserver to the controller.

More concretely, the controller performs reconcile() according to the events it watches from the apiserver, and the watch is served from the apiserver's local cache without any strong consistency guarantee. If that apiserver's watch cache is stale (e.g., due to a network partition), the controller will perform reconcile() based on the stale events, which can lead to failures.

In particular, if the controller receives from the stale apiserver a stale update event indicating that a ZookeeperCluster has a non-zero deletion timestamp, it will believe a deletion is in progress and decide to delete the PVCs of that ZookeeperCluster. Note that such a stale event actually comes from the deletion of a previous ZookeeperCluster that has a different UID than the currently running one. However, the controller lists all related PVCs using only the name and namespace:

// Note: the selector below matches PVCs only by the name-based "app" label and
// the namespace; the owning ZookeeperCluster's UID is not taken into account.
func (r *ReconcileZookeeperCluster) getPVCList(instance *zookeeperv1beta1.ZookeeperCluster) (pvList corev1.PersistentVolumeClaimList, err error) {
	selector, err := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{
		MatchLabels: map[string]string{"app": instance.GetName()},
	})
	pvclistOps := &client.ListOptions{
		Namespace:     instance.Namespace,
		LabelSelector: selector,
	}
	pvcList := &corev1.PersistentVolumeClaimList{}
	err = r.client.List(context.TODO(), pvcList, pvclistOps)
	return *pvcList, err
}

If the currently running ZookeeperCluster uses the same name as the previously deleted one (e.g., we delete and recreate a ZookeeperCluster), the controller will mistakenly treat the listed PVCs as belonging to the "deleted" ZookeeperCluster and invoke deletePVC on them.
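
For context, the deletion path has roughly the following shape (a simplified sketch based on the description above, not the operator's exact code; the wrapper name cleanupPVCsOnDelete and the deletePVC signature are illustrative):

// Simplified, illustrative sketch of the deletion path described above.
func (r *ReconcileZookeeperCluster) cleanupPVCsOnDelete(instance *zookeeperv1beta1.ZookeeperCluster) error {
	if instance.DeletionTimestamp.IsZero() {
		return nil // no deletion in progress
	}
	// Matches by name and namespace only, as shown above.
	pvcList, err := r.getPVCList(instance)
	if err != nil {
		return err
	}
	for i := range pvcList.Items {
		// Under a stale event for an old ZookeeperCluster of the same name,
		// these PVCs actually belong to the newly created cluster.
		r.deletePVC(&pvcList.Items[i]) // illustrative signature
	}
	return nil
}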

We list the concrete reproduction steps below:

  1. Run the controller in an HA k8s cluster and create a ZookeeperCluster zkc. The controller talks to apiserver1 (which is not stale).
  2. Delete zkc. apiserver1 sends update events with a non-zero deletion timestamp to the controller, and the controller deletes the related PVCs of zkc during reconcile. Meanwhile, apiserver2 is partitioned, so its watch cache stops at the moment zkc is tagged with the deletion timestamp.
  3. Create a ZookeeperCluster with the same name again. The ZookeeperCluster and its PVCs come back. However, apiserver2 still holds the stale view that zkc has a non-zero deletion timestamp and is about to be deleted.
  4. The controller crashes due to a node failure and restarts (or its follower becomes the leader), and this time it talks to the stale apiserver2. The restarted controller receives the stale update event from apiserver2 indicating that zkc has a deletion timestamp. Since the controller uses only the name and namespace to list PVCs, all PVCs belonging to the newly created ZookeeperCluster are listed and deleted.

Importance

blocker: This bug causes the controller to perform unexpected deletions when reading stale data from the apiserver, which can further lead to data loss or availability issues.

Location

zookeepercluster_controller.go

Suggestions for an improvement

We are willing to issue a PR to help fix this.
The bug can be fixed by tagging each PVC with the UID of its ZookeeperCluster in MakeStatefulSet and listing PVCs by UID. Each ZookeeperCluster instance always has a different UID, even when it reuses the same name, so PVCs belonging to the current ZookeeperCluster will not be deleted in response to events for the old one. We can issue a PR that adds the UID as a label on each PVC, as sketched below.
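
A minimal sketch of that direction, assuming a label key named "uid" (the key name and the exact code are illustrative, not the final PR):

// Illustrative sketch of the proposed fix; the "uid" label key is an assumption.

// 1) In MakeStatefulSet, stamp the owning ZookeeperCluster's UID onto the labels
//    used for the PVC templates (alongside the existing "app" label).
pvcLabels := map[string]string{
	"app": instance.GetName(),
	"uid": string(instance.UID), // unique per ZookeeperCluster instance, even when the name is reused
}

// 2) In getPVCList, select on the UID label as well, so a stale deletion event for
//    an old ZookeeperCluster with the same name can no longer match the PVCs of
//    the currently running one.
selector, err := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{
	MatchLabels: map[string]string{
		"app": instance.GetName(),
		"uid": string(instance.UID),
	},
})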

anishakj (Contributor) commented

@srteam2020 Thanks for finding the issue. Please go ahead and raise a PR, as you know how to solve this issue.

srteam2020 (Contributor, Author) commented

@anishakj Yes, thank you for confirming the issue. We will raise a PR for it.
