Zookeeper operator is unable to recover the broken zk statefulset due to an issue in statefulSet controller #547

hoyhbx · 2023-04-07T21:16:56Z

Description

Hi, we found that the operator is unable to recover a broken statefulSet after a misoperation.
For example, if we set the image of the zookeeper cluster to a wrong image, the statefulSet is updated accordingly, and the rolling update leaves one pod crashing repeatedly with an ImagePull error.
After noticing the error, we manually rolled back to the correct image. However, the pod keeps crashing even though the statefulSet has been updated.

We think the root cause is that the operator uses OrderedReady as the podManagementPolicy, and there is a known problem in statefulSet, kubernetes/kubernetes#67250, which prevents the statefulSet from rolling back even after the template is updated. zookeeper-operator is affected by it.

The workaround is to manually delete the crashed pod so that the statefulSet controller can proceed. As far as we know, there is an open KEP to fix this issue, kubernetes/enhancements#3562, but it is still at a very early stage. The best thing for the operator to do here is probably to delete the pod when it can recognize that the pod is stuck. If the KEP is implemented and merged, this problem will be much easier to deal with.

Importance

(Indicate the importance of this issue to you (blocker, must-have, should-have, nice-to-have))
must-have

Location

kubernetes/kubernetes#67250
kubernetes/enhancements#3562

Suggestions for an improvement

Force-restart the pod if the operator can recognize that the pod is in an unhealthy state, so that the statefulSet pods can be updated.
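
One possible way to recognize a stuck pod, sketched here in Python for brevity rather than in the operator's own code: treat a pod as stuck when it is not Ready and its `controller-revision-hash` label differs from the statefulSet's `status.updateRevision`. The function names below are hypothetical, the inputs are assumed to be plain dicts shaped like the JSON the Kubernetes API returns, and the heuristic itself is an assumption, not something the operator does today.

```python
# Label that the statefulSet controller puts on each pod it creates.
REVISION_LABEL = "controller-revision-hash"


def is_ready(pod):
    """Return True if the pod reports a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def pods_to_force_delete(statefulset, pods):
    """Return names of pods that are not Ready and run a stale revision.

    Hypothetical helper: a pod qualifies only if both conditions hold, so
    healthy pods that simply have not been rolled yet are left alone.
    """
    target = statefulset.get("status", {}).get("updateRevision")
    if not target:
        # Without a known target revision we cannot tell stale from current.
        return []
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        revision = (meta.get("labels") or {}).get(REVISION_LABEL)
        if revision != target and not is_ready(pod):
            stuck.append(meta.get("name"))
    return stuck
```

The operator would then delete the returned pods so that the statefulSet controller recreates them from the current template. A real implementation would probably also wait for a grace period before deleting, to avoid acting on pods that are merely slow to start.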
