
K8SSAND-619 ⁃ No way to gracefully decommission an entire datacenter #125

Closed
arianvp opened this issue Jun 23, 2021 · 4 comments · Fixed by #250
Assignees
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@arianvp (Contributor)

arianvp commented Jun 23, 2021

What happened?

I was in the process of migrating an existing Cassandra cluster to k8ssandra using the instructions described here: https://docs.k8ssandra.io/tasks/migrate/

After successfully adding the k8ssandra datacenter to my existing cluster, I wanted to abort the procedure and decommission the k8ssandra cluster again. I tried scaling the CassandraDatacenter down to size 0, but cass-operator threw an error saying that the size needs to be at least 1.

I then decided to delete the CassandraDatacenter, hoping that would gracefully decommission the datacenter. Instead, it just deleted all the pods and PVCs at once and left all the nodes in a DN state. Luckily I hadn't streamed any data between the datacenters yet and the k8ssandra datacenter was not in use, but it was an awkward state to end up in: I then had to manually `nodetool assassinate` each node in the k8ssandra datacenter from the old Cassandra datacenter.

Did you expect to see something different?

I expected it to be possible to set size to 0 so I can decommission an entire datacenter before deleting it from the cluster.

OR: deleting a CassandraDatacenter that is part of a larger cluster should gracefully decommission all of its nodes on deletion.

How to reproduce it (as minimally and precisely as possible):

Set size to 0 and see that it gets rejected, then delete the CassandraDatacenter afterwards.
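The two reproduction steps can be sketched roughly as follows (the resource name `dc2` is illustrative and would differ in your environment):

```shell
# 1. Attempt to scale the datacenter to zero; cass-operator rejects this
#    because spec.size must be at least 1.
kubectl patch cassandradatacenter dc2 --type merge -p '{"spec": {"size": 0}}'

# 2. Delete the CassandraDatacenter instead; the pods and PVCs are removed
#    immediately, leaving the nodes in a DN state in the surviving datacenter.
kubectl delete cassandradatacenter dc2
```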

Environment

  • Cass Operator version:

    Insert image tag or Git SHA here

  • Kubernetes version information: `kubectl version`

  • Kubernetes cluster kind:

    insert how you created your cluster: kops, bootkube, etc.

  • Manifests:

    insert manifests relevant to the issue

  • Cass Operator Logs:

    insert Cass Operator logs relevant to the issue here


Anything else we need to know?



┆Issue is synchronized with this [Jira Feature](https://k8ssandra.atlassian.net/browse/K8SSAND-619) by [Unito](https://www.unito.io)
┆epic: Decomission Datacenters
┆fixVersions: k8ssandra-operator-v1.0.0
┆friendlyId: K8SSAND-619
┆priority: Medium
@arianvp arianvp added the bug Something isn't working label Jun 23, 2021
@burmanm (Contributor)

burmanm commented Jun 27, 2021

If you set the CassandraDatacenter to "Stopped" (`stopped: true`), it will drain all the nodes and set the pod count to 0. Would that simplify what you want to achieve?
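For reference, `stopped` is a field in the CassandraDatacenter spec. A minimal sketch of what this might look like (cluster, datacenter, and version values here are illustrative, not from this issue):

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc2
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.10"
  size: 3
  # Draining flag: cass-operator drains each node and scales the
  # underlying StatefulSets down to 0 pods, keeping the PVCs.
  stopped: true
```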

@arianvp (Contributor, Author)

arianvp commented Jun 28, 2021

Hmm maybe!

After I stopped the datacenter, I should then be able to run `nodetool removenode` instead of `nodetool assassinate` to remove those stopped nodes gracefully?

This could work. I'll experiment with this solution!

@ErickRamirezAU
ErickRamirezAU commented Jun 28, 2021

After I stopped the datacenter, I should then be able to run `nodetool removenode` instead of `nodetool assassinate` to remove those stopped nodes gracefully?

I haven't tested it, but I don't think you'd need to do a removenode. In any case, only use assassinate as a last resort when all else has failed. Try every other trick in the book to decommission a C* node before you even think about assassinating it, because it's a dirty hack. Cheers!

@arianvp (Contributor, Author)

arianvp commented Jun 28, 2021

@ErickRamirezAU I tried that, and while it did seem to drain the nodes, it did not decommission them. (They were left in a DN state.)

{"level":"info","ts":1624910946.51477,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-0"}
{"level":"info","ts":1624910948.8240023,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-1"}
{"level":"info","ts":1624910951.0394273,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-2"}
{"level":"info","ts":1624910953.314612,"logger":"reconciliation_handler","msg":"rack drains done","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","rack":"default","nodesDrained":3,"nodeDrainErrors":0}
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.3.103  138.62 KiB  1            100.0%            6f5d98e6-0d32-4c5b-be64-e9910b6a91fa  default
UN  172.17.1.63   138.9 KiB  1            100.0%            f39f44f2-43c9-4041-84c4-d45b7f0ce900  default
UN  172.17.2.46   133.29 KiB  1            100.0%            98f77f96-3fda-4453-88c9-40b2dbd1db10  default
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
DN  172.17.2.180  121.92 KiB  1            0.0%              22199d0a-a128-47ff-bd85-f953c75ee0e3  default
DN  172.17.3.134  103.97 KiB  1            0.0%              4e757d6b-a0ba-45b7-b55f-8da0112d91e4  default
DN  172.17.1.239  128.37 KiB  1            0.0%              1b5f9d28-80c2-44f9-8633-f8df4a0ad1f0  default

Because they were drained, `nodetool removenode` did work, though.

So this seems like a nice graceful workaround to the problem: `stopped: true` in combination with `nodetool removenode`. However, the point still stands that there is no way to decommission a datacenter (a la https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html).
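Putting the workaround together, the sequence looks roughly like this (resource names are from this thread's example, and the host IDs are the dc2 entries from the `nodetool status` output above; both would differ in practice):

```shell
# 1. Stop the datacenter: cass-operator drains every node and scales
#    the pods down to 0.
kubectl patch cassandradatacenter dc2 --type merge -p '{"spec": {"stopped": true}}'

# 2. From a node in the surviving datacenter (dc1), remove each stopped
#    node by its Host ID as reported by `nodetool status`.
nodetool removenode 22199d0a-a128-47ff-bd85-f953c75ee0e3
nodetool removenode 4e757d6b-a0ba-45b7-b55f-8da0112d91e4
nodetool removenode 1b5f9d28-80c2-44f9-8633-f8df4a0ad1f0

# 3. Once the ring no longer references dc2, delete the resource.
kubectl delete cassandradatacenter dc2
```

Note that, unlike a true `nodetool decommission`, `removenode` does not stream data off the leaving nodes; it only worked safely here because no data had been streamed to dc2 yet.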

@burmanm burmanm self-assigned this Jun 29, 2021
@sync-by-unito sync-by-unito bot changed the title No way to gracefully decommission an entire datacenter K8SSAND-619 ⁃ No way to gracefully decommission an entire datacenter Nov 17, 2021
@burmanm burmanm added the enhancement New feature or request label Dec 23, 2021