Zookeeper operator is unable to recover the broken zk statefulset due to an issue in statefulSet controller #547

Open
hoyhbx opened this issue Apr 7, 2023 · 0 comments

hoyhbx commented Apr 7, 2023

Description

Hi, we found that the operator is unable to recover a broken statefulSet after a misoperation.
For example, if we set the image of the zookeeper cluster to a wrong image, the operator updates the statefulSet, and the rolling update leaves one pod crashing repeatedly with an image pull error.
After noticing the mistake, we manually rolled back to the correct image. However, the pod keeps crashing even though the statefulSet has been updated.

We think the root cause is that the operator uses OrderedReady as the podManagementPolicy, and statefulSet has a known problem, kubernetes/kubernetes#67250, which prevents a statefulSet from rolling back even after the template is updated. zookeeper-operator is affected by it.
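
To make the failure mode concrete, here is a toy model of one reconcile pass. It is only a sketch of our understanding of the behaviour described in kubernetes/kubernetes#67250, not the real controller code; the `Pod` class, the `next_pod_to_replace` function, and the revision names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    revision: str  # revision of the template this pod was created from
    ready: bool


def next_pod_to_replace(pods, update_revision, ordered_ready=True):
    """Return the name of the pod a simplified controller would replace next.

    Returns None when the controller would do nothing in this pass.
    """
    # Simplified OrderedReady gate: the controller waits for every existing
    # pod to be ready before it reaches the rolling-update step.
    if ordered_ready and any(not pod.ready for pod in pods):
        return None

    # Simplified rolling update: walk pods from the highest ordinal down and
    # replace the first one that is still on an outdated revision.
    for pod in reversed(pods):
        if pod.revision != update_revision:
            return pod.name
        if not pod.ready:
            return None  # wait for the already-updated pod to become ready
    return None


# zk-2 was rolled to the bad image (rev-2) and never became ready.
pods = [
    Pod("zk-0", "rev-1", True),
    Pod("zk-1", "rev-1", True),
    Pod("zk-2", "rev-2", False),
]

# The template has been rolled back to rev-1, but nothing happens:
stuck = next_pod_to_replace(pods, "rev-1")
# Without the readiness gate the broken pod would be replaced:
ungated = next_pod_to_replace(pods, "rev-1", ordered_ready=False)
```

In this model the broken pod can never become ready on its own, so the gate is never passed and the corrected template is never applied to it.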

The workaround is to manually delete the crashing pod so that the statefulSet controller can proceed. As far as we know, there is a KEP open to fix this issue, kubernetes/enhancements#3562, but it is still at a very early stage. The best thing for the operator to do here is probably to delete the pod when it can recognize that the pod is stuck. If the KEP is implemented and merged, this problem will be much easier to deal with.

Importance

must-have

Location

kubernetes/kubernetes#67250
kubernetes/enhancements#3562

Suggestions for an improvement

Force-restart the pod when the operator can recognize that it is in an unhealthy state, so that the statefulSet pods can be updated.
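
One possible way to recognize such a pod is sketched below. It works on plain dicts shaped like the JSON from `kubectl get statefulset -o json` and `kubectl get pod -o json`. The function names and the `STUCK_REASONS` set are assumptions made for illustration, not existing operator code, and the right set of reasons would need to be decided by the maintainers.

```python
# Waiting reasons treated as "stuck" in this sketch (an assumption).
STUCK_REASONS = {"ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff"}


def is_ready(pod):
    """True if the pod reports the Ready condition with status "True"."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def waiting_reasons(pod):
    """Collect the waiting reasons of all init and regular containers."""
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get(
        "containerStatuses", []
    )
    reasons = set()
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            reasons.add(waiting.get("reason", ""))
    return reasons


def find_stuck_pods(statefulset, pods):
    """Return names of pods that are outdated, not ready, and failing to start."""
    update_revision = statefulset.get("status", {}).get("updateRevision")
    if not update_revision:
        return []
    stuck = []
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        outdated = labels.get("controller-revision-hash") != update_revision
        if outdated and not is_ready(pod) and waiting_reasons(pod) & STUCK_REASONS:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

Restricting the check to pods whose `controller-revision-hash` label differs from the statefulSet's `status.updateRevision` means a pod is only deleted when recreating it would actually pick up a different template.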
