
K8SSAND-1423 ⁃ cass-operator becomes partially inoperable if replaceNodes has a wrong pod name #315

Closed
sync-by-unito bot opened this issue Apr 4, 2022 · 5 comments · Fixed by #326
sync-by-unito bot commented Apr 4, 2022

What happened?
cass-operator loses part of its functionality after the user mistakenly supplies a wrong pod name in the spec.replaceNodes field. For example, the operator can no longer decommission nodes or perform a rolling restart. In the reconciliation loop, the operator re-queues the request whenever status.NodeReplacements is non-empty, before reconciling many other features. Since the pod name listed in status.NodeReplacements does not exist, the operator can never clear status.NodeReplacements, and it becomes partially nonfunctional.
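A minimal, illustrative Go sketch of the pattern described above (this is not the actual cass-operator source; the type, field, and variable names here are made up for illustration):

```go
// Sketch of a reconcile loop that handles node replacements before anything else.
// If the pod named in status.NodeReplacements never exists, the loop requeues
// forever and the remaining reconcile steps are never reached.
package main

import (
	"fmt"
	"time"
)

// dcStatus stands in for the CassandraDatacenter status; field names are illustrative.
type dcStatus struct {
	NodeReplacements []string
}

// existingPods stands in for a lookup of pods that actually exist in the datacenter.
var existingPods = map[string]bool{
	"cluster1-cassandra-datacenter-default-sts-0": true,
}

func reconcile(status *dcStatus) (requeue bool, err error) {
	// Replacement handling runs before restarts and decommissions.
	if len(status.NodeReplacements) > 0 {
		for _, pod := range status.NodeReplacements {
			if !existingPods[pod] {
				// Pod not found: requeue and try again later.
				// With a typo such as "rtiisajufx" this branch is taken forever.
				return true, nil
			}
		}
		// ...perform the replacement, then clear status.NodeReplacements...
		status.NodeReplacements = nil
	}

	// Rolling restart, scale down, etc. would only be reconciled here.
	fmt.Println("reconciling remaining features")
	return false, nil
}

func main() {
	status := &dcStatus{NodeReplacements: []string{"rtiisajufx"}}
	for i := 0; i < 3; i++ {
		requeue, _ := reconcile(status)
		if !requeue {
			break
		}
		time.Sleep(10 * time.Millisecond) // stands in for the controller's requeue delay
	}
}
```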

Did you expect to see something different?
The operator should be robust and still able to reconcile other features even when the user submits a wrong pod name for spec.replaceNodes. Alternatively, the operator should do a sanity check on spec.replaceNodes to prevent itself from getting stuck.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the cass-operator
  2. Deploy the CassandraDatacenter using this YAML, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
```
  3. Provide an invalid pod name for spec.replaceNodes, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
```
  4. Request a rolling restart, kubectl apply -f sample.yaml:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
  rollingRestartRequested: true
```
  5. Observe that the cluster does not restart.

Environment

  • Cass Operator version:

docker.io/k8ssandra/cass-operator:v1.9.0

  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

kind

  • Manifests:
See the YAML manifests in the reproduction steps above.

Anything else we need to know?:
The bug occurs because the operator keeps re-queueing the request while status.NodeReplacements cannot be cleared due to the wrong pod name. We suggest validating the spec.replaceNodes field to make the operator robust.

┆Issue is synchronized with this Jira Bug by Unito
┆affectedVersions: cass-operator-1.9.0
┆friendlyId: K8SSAND-1423
┆priority: Medium

@kien-truong

I also just encountered this issue. The workaround is to edit the status and remove the invalid data using this plugin:

https://github.com/ulucinar/kubectl-edit-status

burmanm commented Apr 6, 2022

@kien-truong What caused you to need to do a manual nodeReplace?

@kien-truong

I actually don't remember; it was quite a while ago. I think we wanted to move one Cassandra pod to a new node.

We followed some steps mentioned in this issue: #78. It took some trial and error, so we ended up with an invalid IP in status.NodeReplacements.

hoyhbx commented Apr 6, 2022

Thanks @kien-truong for providing the workaround.

@burmanm I think the operator should do a sanity check before moving spec.replaceNodes into the status, especially given that it's not easy for users to modify fields in the status. I think a simple client.Get should suffice for the check, along the lines of the sketch below.
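A rough sketch of what such a check could look like with a controller-runtime client. The function and variable names are illustrative assumptions, not the code that eventually landed in #326:

```go
// Package validation sketches a pre-flight check for spec.replaceNodes:
// only names that resolve to an existing pod are allowed to be copied into
// status.NodeReplacements.
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// validReplaceNodes returns the names from spec.replaceNodes that match an
// existing pod in the datacenter's namespace, and an error if a name cannot
// be checked at all.
func validReplaceNodes(ctx context.Context, c client.Client, namespace string, requested []string) ([]string, error) {
	valid := make([]string, 0, len(requested))
	for _, name := range requested {
		pod := &corev1.Pod{}
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pod)
		if apierrors.IsNotFound(err) {
			// Skip (or reject) names that match no pod instead of letting them
			// get stuck in status.NodeReplacements.
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("checking replaceNodes entry %q: %w", name, err)
		}
		valid = append(valid, name)
	}
	return valid, nil
}
```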

@bradfordcp

Hey team! Please add your planning poker estimate with ZenHub @burmanm @jsanda

@burmanm burmanm self-assigned this Apr 26, 2022
burmanm added a commit to burmanm/cass-operator that referenced this issue Apr 26, 2022
burmanm added a commit to burmanm/cass-operator that referenced this issue Apr 26, 2022
burmanm added a commit that referenced this issue May 12, 2022
* Verify the ReplaceNode pod name exists in the Datacenter before replacing it, fixes #315
burmanm added a commit that referenced this issue May 12, 2022
* Verify the ReplaceNode pod name exists in the Datacenter before replacing it, fixes #315

(cherry picked from commit 57e178a)