
K8SSAND-1423 ⁃ cass-operator becomes partially inoperable if replaceNodes has a wrong pod name #315

Closed
sync-by-unito bot opened this issue Apr 4, 2022 · 5 comments · Fixed by #326
sync-by-unito bot commented Apr 4, 2022

What happened?
cass-operator loses part of its functionality after the user mistakenly supplies a wrong pod name in the spec.replaceNodes field. For example, the operator can no longer decommission nodes or perform a rolling restart. In the reconciliation loop, the operator re-queues the request whenever status.NodeReplacements is non-empty, before reconciling many other features. Since the pod name listed in status.NodeReplacements does not exist, the operator can never clear status.NodeReplacements, and it becomes partially nonfunctional.
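A minimal, illustrative Go sketch of the pattern described above (this is not the actual cass-operator source; the type, field, and variable names here are made up for illustration):

```go
// Sketch of a reconcile loop that handles node replacements before anything else.
// If the pod named in status.NodeReplacements never exists, the loop requeues
// forever and the remaining reconcile steps are never reached.
package main

import (
	"fmt"
	"time"
)

// dcStatus stands in for the CassandraDatacenter status; field names are illustrative.
type dcStatus struct {
	NodeReplacements []string
}

// existingPods stands in for a lookup of pods that actually exist in the datacenter.
var existingPods = map[string]bool{
	"cluster1-cassandra-datacenter-default-sts-0": true,
}

func reconcile(status *dcStatus) (requeue bool, err error) {
	// Replacement handling runs before restarts and decommissions.
	if len(status.NodeReplacements) > 0 {
		for _, pod := range status.NodeReplacements {
			if !existingPods[pod] {
				// Pod not found: requeue and try again later.
				// With a typo such as "rtiisajufx" this branch is taken forever.
				return true, nil
			}
		}
		// ...perform the replacement, then clear status.NodeReplacements...
		status.NodeReplacements = nil
	}

	// Rolling restart, scale down, etc. would only be reconciled here.
	fmt.Println("reconciling remaining features")
	return false, nil
}

func main() {
	status := &dcStatus{NodeReplacements: []string{"rtiisajufx"}}
	for i := 0; i < 3; i++ {
		requeue, _ := reconcile(status)
		if !requeue {
			break
		}
		time.Sleep(10 * time.Millisecond) // stands in for the controller's requeue delay
	}
}
```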

Did you expect to see something different?
The operator should be robust and still able to reconcile other features even when the user submits a wrong pod name for spec.replaceNodes. Alternatively, the operator should do a sanity check on spec.replaceNodes to prevent itself from getting stuck.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the cass-operator
  2. Deploy the CassandraDatacenter using this YAML, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
```
  3. Provide an invalid pod name for spec.replaceNodes, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
```
  4. Request a rolling restart, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
  rollingRestartRequested: true
```
  5. Observe that the cluster does not restart.

Environment

  • Cass Operator version:

docker.io/k8ssandra/cass-operator:v1.9.0

  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

kind

  • Manifests:
See the YAML manifests in the reproduction steps above.

Anything else we need to know?:
The bug occurs because the operator keeps re-queueing the request while status.NodeReplacements cannot be cleared due to the wrong pod name. We suggest validating the spec.replaceNodes field to make the operator robust.

┆Issue is synchronized with this Jira Bug by Unito
┆affectedVersions: cass-operator-1.9.0
┆friendlyId: K8SSAND-1423
┆priority: Medium

@kien-truong

I also just encountered this issue. The workaround is to edit the status and remove the invalid data using this plugin:

https://github.com/ulucinar/kubectl-edit-status

burmanm commented Apr 6, 2022

@kien-truong What caused you to need to do a manual nodeReplace?

@kien-truong

I actually don't remember; it was quite a while ago. I think we wanted to move one Cassandra pod to a new node.

We followed some steps mentioned in this issue: #78. It took some trial and error, so we ended up with an invalid IP in status.NodeReplacements.

hoyhbx commented Apr 6, 2022

Thanks @kien-truong for providing the workaround.

@burmanm I think the operator should do a sanity check before moving spec.replaceNodes into the status, especially given that it's not easy for users to modify fields in the status. I think a simple client.Get should suffice for the check, along the lines of the sketch below.
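A rough sketch of what such a check could look like with a controller-runtime client. The function and variable names are illustrative assumptions, not the code that eventually landed in #326:

```go
// Package validation sketches a pre-flight check for spec.replaceNodes:
// only names that resolve to an existing pod are allowed to be copied into
// status.NodeReplacements.
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// validReplaceNodes returns the names from spec.replaceNodes that match an
// existing pod in the datacenter's namespace, and an error if a name cannot
// be checked at all.
func validReplaceNodes(ctx context.Context, c client.Client, namespace string, requested []string) ([]string, error) {
	valid := make([]string, 0, len(requested))
	for _, name := range requested {
		pod := &corev1.Pod{}
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pod)
		if apierrors.IsNotFound(err) {
			// Skip (or reject) names that match no pod instead of letting them
			// get stuck in status.NodeReplacements.
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("checking replaceNodes entry %q: %w", name, err)
		}
		valid = append(valid, name)
	}
	return valid, nil
}
```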

@bradfordcp

Hey team! Please add your planning poker estimate with ZenHub @burmanm @jsanda

@burmanm burmanm self-assigned this Apr 26, 2022
burmanm added a commit to burmanm/cass-operator that referenced this issue Apr 26, 2022
burmanm added a commit to burmanm/cass-operator that referenced this issue Apr 26, 2022
burmanm added a commit that referenced this issue May 12, 2022
* Verify the ReplaceNode pod name exists in the Datacenter before replacing it, fixes #315
burmanm added a commit that referenced this issue May 12, 2022
* Verify the ReplaceNode pod name exists in the Datacenter before replacing it, fixes #315

(cherry picked from commit 57e178a)