K8SSAND-1423 ⁃ cass-operator becomes partially inoperable if replaceNodes has a wrong pod name #315
Comments
I also just encountered this issue. The workaround is to edit the status and remove the invalid data using this plugin.
@kien-truong What caused you to need a manual node replace?
I actually don't remember; it was quite a while ago. I think we wanted to move one Cassandra pod to a new node. We followed some steps mentioned in this issue: #78. It took some trial and error, so we ended up with an invalid IP in
Thanks @kien-truong for providing the workaround. @burmanm I think the operator should do a sanity check before moving the
Hey team! Please add your planning poker estimate with ZenHub @burmanm @jsanda |
* Verify the ReplaceNode pod name exists in the Datacenter before replacing it, fixes #315
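A minimal sketch of such a check, in Go since that is what cass-operator is written in (illustrative only; `validateReplaceNodes` and the pod-lookup shape are assumptions for this sketch, not the operator's actual API):

```go
package main

import "fmt"

// validateReplaceNodes filters the requested replacement names down to pods
// that actually exist in the Datacenter, so a typo never reaches
// status.NodeReplacements. Hypothetical sketch, not cass-operator's real code.
func validateReplaceNodes(requested []string, existingPods map[string]bool) (valid, rejected []string) {
	for _, name := range requested {
		if existingPods[name] {
			valid = append(valid, name)
		} else {
			rejected = append(rejected, name)
		}
	}
	return valid, rejected
}

func main() {
	pods := map[string]bool{
		"cluster1-dc1-default-sts-0": true,
		"cluster1-dc1-default-sts-1": true,
	}
	valid, rejected := validateReplaceNodes(
		[]string{"cluster1-dc1-default-sts-0", "no-such-pod"}, pods)
	fmt.Println("valid:", valid, "rejected:", rejected)
}
```

Rejected names could then be dropped (or surfaced as an event) instead of being copied into `status.NodeReplacements`.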
What happened?

cass-operator loses part of its functionality after the user mistakenly supplies a wrong pod name under the `spec.replaceNodes` field. For example, the operator can no longer decommission nodes or perform a rolling restart. This is because, in the reconciliation loop, the operator re-queues the request whenever `status.NodeReplacements` is non-empty, before reconciling many other functions. Since the pod name in `status.NodeReplacements` does not exist, the operator can never clear `status.NodeReplacements`, and it becomes partially nonfunctional.

Did you expect to see something different?

The operator should be robust and still able to reconcile other functionality even when the user submits a wrong pod name for `spec.replaceNodes`. Alternatively, the operator should sanity-check `spec.replaceNodes` to prevent itself from getting stuck.

How to reproduce it (as minimally and precisely as possible):
1. Create the Datacenter: `kubectl apply -f sample.yaml`
2. Add a wrong pod name under `spec.replaceNodes`, then re-apply: `kubectl apply -f sample.yaml`
3. Make any other change (for example, one that requires a rolling restart) and apply again: `kubectl apply -f sample.yaml` — the operator never reconciles it.

Environment
* Operator image: `docker.io/k8ssandra/cass-operator:v1.9.0`
* Kubernetes cluster: kind
Anything else we need to know?:

The bug occurs because the operator keeps re-queueing the request while `status.NodeReplacements` cannot be cleared due to the wrong pod name. We suggest sanitizing the `spec.replaceNodes` field to make the operator robust.

┆Issue is synchronized with this Jira Bug by Unito
┆affectedVersions: cass-operator-1.9.0
┆friendlyId: K8SSAND-1423
┆priority: Medium
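The failure mode described above can be sketched as follows (a hypothetical simplification of the reconcile loop, not the operator's actual code):

```go
package main

import "fmt"

// reconcile is a simplified sketch of the loop described in this issue: while
// status.NodeReplacements is non-empty the reconciler returns early and
// re-queues, so later steps (decommission, rolling restart, ...) never run.
func reconcile(nodeReplacements []string, podExists func(string) bool) (requeue bool) {
	for _, name := range nodeReplacements {
		if !podExists(name) {
			// The named pod never appears, so the entry is never
			// cleared and every pass ends here with a re-queue.
			return true
		}
	}
	// Decommissioning, rolling restarts, and the rest of reconciliation
	// would run here.
	return false
}

func main() {
	exists := func(name string) bool { return name == "cluster1-dc1-default-sts-0" }
	fmt.Println(reconcile([]string{"bad-pod-name"}, exists)) // stuck: always re-queues
	fmt.Println(reconcile(nil, exists))                      // empty list: reconciliation proceeds
}
```

With a wrong name the early return fires on every pass, which is why all functionality gated behind that check stalls at once.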