
K8SSAND-619 ⁃ No way to gracefully decommission an entire datacenter #125

Closed
arianvp opened this issue Jun 23, 2021 · 4 comments · Fixed by #250
Assignees
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@arianvp (Contributor)

arianvp commented Jun 23, 2021

What happened?

I was in the process of migrating an existing Cassandra cluster to k8ssandra using the instructions described here: https://docs.k8ssandra.io/tasks/migrate/

After successfully adding the k8ssandra datacenter to my existing cluster, I wanted to abort the procedure and decommission the k8ssandra cluster again. I tried scaling the CassandraDatacenter down to size 0, but cass-operator threw an error saying that the size needs to be at least 1.

I then decided to delete the CassandraDatacenter, hoping that would gracefully decommission the datacenter. Instead, it just deleted all the pods and PVCs at once and left all the nodes in a DN state. Luckily I hadn't streamed any data between the datacenters yet and the k8ssandra datacenter was not in use, but it was an awkward state to end up in: I then had to manually `nodetool assassinate` each node in the k8ssandra datacenter from the old Cassandra datacenter.

Did you expect to see something different?

I expected it to be possible to set size to 0 so I can decommission an entire datacenter before deleting it from the cluster.

OR: deleting a CassandraDatacenter that is part of a larger cluster should gracefully decommission all of its nodes on deletion.

How to reproduce it (as minimally and precisely as possible):

Set size to 0 and see that it gets rejected, then delete the CassandraDatacenter afterwards.
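The two reproduction steps can be sketched roughly as follows (the resource name `dc2` is illustrative and would differ in your environment):

```shell
# 1. Attempt to scale the datacenter to zero; cass-operator rejects this
#    because spec.size must be at least 1.
kubectl patch cassandradatacenter dc2 --type merge -p '{"spec": {"size": 0}}'

# 2. Delete the CassandraDatacenter instead; the pods and PVCs are removed
#    immediately, leaving the nodes in a DN state in the surviving datacenter.
kubectl delete cassandradatacenter dc2
```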

Environment

  • Cass Operator version:

    Insert image tag or Git SHA here

  • Kubernetes version information: `kubectl version`

  • Kubernetes cluster kind:

    insert how you created your cluster: kops, bootkube, etc.

  • Manifests:

    insert manifests relevant to the issue

  • Cass Operator Logs:

    insert Cass Operator logs relevant to the issue here


Anything else we need to know?



┆Issue is synchronized with this [Jira Feature](https://k8ssandra.atlassian.net/browse/K8SSAND-619) by [Unito](https://www.unito.io)
┆epic: Decomission Datacenters
┆fixVersions: k8ssandra-operator-v1.0.0
┆friendlyId: K8SSAND-619
┆priority: Medium
@arianvp arianvp added the bug Something isn't working label Jun 23, 2021
@burmanm (Contributor)

burmanm commented Jun 27, 2021

If you set the CassandraDatacenter to "Stopped" (`stopped: true`), it will drain all the nodes and set the pod count to 0. Would that simplify what you want to achieve?
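For reference, `stopped` is a field in the CassandraDatacenter spec. A minimal sketch of what this might look like (cluster, datacenter, and version values here are illustrative, not from this issue):

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc2
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.10"
  size: 3
  # Draining flag: cass-operator drains each node and scales the
  # underlying StatefulSets down to 0 pods, keeping the PVCs.
  stopped: true
```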

@arianvp (Contributor, Author)

arianvp commented Jun 28, 2021

Hmm maybe!

After I stopped the datacenter, I should then be able to run `nodetool removenode` instead of `nodetool assassinate` to remove those stopped nodes gracefully?

This could work. I'll experiment with this solution!

@ErickRamirezAU
ErickRamirezAU commented Jun 28, 2021

After I stopped the datacenter, I should then be able to run `nodetool removenode` instead of `nodetool assassinate` to remove those stopped nodes gracefully?

I haven't tested it, but I don't think you'd need to do a removenode. In any case, only use assassinate as a last resort when all else has failed. Try every other trick in the book to decommission a C* node before you even think about assassinating it, because it's a dirty hack. Cheers!

@arianvp (Contributor, Author)

arianvp commented Jun 28, 2021

@ErickRamirezAU I tried that, and while it did seem to drain the nodes, it did not decommission them. (They were left in a DN state.)

{"level":"info","ts":1624910946.51477,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-0"}
{"level":"info","ts":1624910948.8240023,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-1"}
{"level":"info","ts":1624910951.0394273,"logger":"reconciliation_handler","msg":"calling Management API drain node - POST /api/v0/ops/node/drain","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","pod":"cluster1-dc2-default-sts-2"}
{"level":"info","ts":1624910953.314612,"logger":"reconciliation_handler","msg":"rack drains done","requestNamespace":"default","requestName":"dc2","loopID":"e647192d-0e60-44c2-86d5-1b3977553c6f","namespace":"default","datacenterName":"dc2","clusterName":"cluster1","rack":"default","nodesDrained":3,"nodeDrainErrors":0}
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.3.103  138.62 KiB  1            100.0%            6f5d98e6-0d32-4c5b-be64-e9910b6a91fa  default
UN  172.17.1.63   138.9 KiB  1            100.0%            f39f44f2-43c9-4041-84c4-d45b7f0ce900  default
UN  172.17.2.46   133.29 KiB  1            100.0%            98f77f96-3fda-4453-88c9-40b2dbd1db10  default
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
DN  172.17.2.180  121.92 KiB  1            0.0%              22199d0a-a128-47ff-bd85-f953c75ee0e3  default
DN  172.17.3.134  103.97 KiB  1            0.0%              4e757d6b-a0ba-45b7-b55f-8da0112d91e4  default
DN  172.17.1.239  128.37 KiB  1            0.0%              1b5f9d28-80c2-44f9-8633-f8df4a0ad1f0  default

Because they were drained, `nodetool removenode` did work, though.

So this seems like a nice graceful workaround to the problem: `stopped: true` in combination with `nodetool removenode`. However, the point still stands that there is no way to decommission a datacenter (a la https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html).
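Putting the workaround together, the sequence looks roughly like this (resource names are from this thread's example, and the host IDs are the dc2 entries from the `nodetool status` output above; both would differ in practice):

```shell
# 1. Stop the datacenter: cass-operator drains every node and scales
#    the pods down to 0.
kubectl patch cassandradatacenter dc2 --type merge -p '{"spec": {"stopped": true}}'

# 2. From a node in the surviving datacenter (dc1), remove each stopped
#    node by its Host ID as reported by `nodetool status`.
nodetool removenode 22199d0a-a128-47ff-bd85-f953c75ee0e3
nodetool removenode 4e757d6b-a0ba-45b7-b55f-8da0112d91e4
nodetool removenode 1b5f9d28-80c2-44f9-8633-f8df4a0ad1f0

# 3. Once the ring no longer references dc2, delete the resource.
kubectl delete cassandradatacenter dc2
```

Note that, unlike a true `nodetool decommission`, `removenode` does not stream data off the leaving nodes; it only worked safely here because no data had been streamed to dc2 yet.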

@burmanm burmanm self-assigned this Jun 29, 2021
@sync-by-unito sync-by-unito bot changed the title No way to gracefully decommission an entire datacenter K8SSAND-619 ⁃ No way to gracefully decommission an entire datacenter Nov 17, 2021
@burmanm burmanm added the enhancement New feature or request label Dec 23, 2021