add script to update the statefulset service name #104

Closed
wants to merge 6 commits

Conversation

jsanda
Contributor

@jsanda jsanda commented May 26, 2021

What this PR does:
Adds a shell script that recreates the CassandraDatacenter and underlying StatefulSets. Note that there is no downtime or data loss. Deletes do not cascade to the Cassandra pods. They remain up and running.

Which issue(s) this PR fixes:
Fixes #103

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

┆Issue is synchronized with this Jira Bug by Unito

@jdonenine
Contributor

jdonenine commented May 26, 2021

I just tried running the script after getting my cluster into the upgraded state and I ended up with an error:

% ./patch-cassdc-sts-svc.sh --operator=cass-operator --datacenter=test
The --operator option is required and should specify the name of the cass-operator deployment
jeffdinoto@jdinoto-rmbp16 cass-operator-upgrade % ./patch-cassdc-sts-svc.sh --operator cass-operator --datacenter test
deployment.apps/cass-operator scaled
Waiting for cass-operator scale down to complete
cass-operator is scaled down to 0 replicas
Removing finalizer from CassandraDatacenter test
cassandradatacenter.cassandra.datastax.com/test patched
Deleting CassandraDatacenter test
Error: invalid argument "orphan" for "--cascade" flag: strconv.ParseBool: parsing "orphan": invalid syntax
See 'kubectl delete --help' for usage.

cass-operator is gone, but the C* pod did stay up and running, so that's good 😄

% k get all
NAME                          READY   STATUS    RESTARTS   AGE
pod/test-test-default-sts-0   1/1     Running   0          72m

NAME                                          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                        AGE
service/cass-operator-metrics                 ClusterIP   10.96.51.27   <none>        8383/TCP,8686/TCP                              90m
service/cassandradatacenter-webhook-service   ClusterIP   10.96.36.59   <none>        443/TCP                                        91m
service/test-seed-service                     ClusterIP   None          <none>        <none>                                         72m
service/test-test-all-pods-service            ClusterIP   None          <none>        9042/TCP,8080/TCP,9103/TCP                     72m
service/test-test-service                     ClusterIP   None          <none>        9042/TCP,9142/TCP,8080/TCP,9103/TCP,9160/TCP   72m

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cass-operator   0/0     0            0           91m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cass-operator-5dfcdc46f8   0         0         0       91m
replicaset.apps/cass-operator-7675b65744   0         0         0       21m

NAME                                     READY   AGE
statefulset.apps/test-test-default-sts   1/1     72m

Maybe it's a version thing?

If I look at the help for delete on my system I get:

Options:
      --all=false: Delete all resources, including uninitialized ones, in the namespace of the specified resource types.
  -A, --all-namespaces=false: If present, list the requested object(s) across all namespaces. Namespace in current
context is ignored even if specified with --namespace.
      --cascade=true: If true, cascade the deletion of the resources managed by this resource (e.g. Pods created by a
ReplicationController).  Default true.

Clearly here, --cascade doesn't support much.

Here's what I'm running on (a local kind cluster):

% k version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-22T22:54:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

@jdonenine
Contributor

So I've confirmed after talking to @burmanm and some quick testing that the problem I saw has to do with the version of kubectl. It looks like the "orphan" value for --cascade was first introduced in v1.20, so as it stands the script will fail on anything prior to that; my first attempt was on v1.19.

After I upgraded to v1.21 (latest), the script worked.
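
One way to make the script tolerate older clients would be to pick the cascade flag based on the kubectl client version, something along these lines (a sketch, not what the script currently does):

# Assumption: fall back to the pre-1.20 boolean form, where --cascade=false also orphans dependents
minor=$(kubectl version --client -o json | sed -n 's/.*"minor": *"\([0-9]*\)".*/\1/p')
if [ "${minor:-0}" -ge 20 ]; then
  cascade_flag="--cascade=orphan"
else
  cascade_flag="--cascade=false"
fi
kubectl delete cassandradatacenter test $cascade_flag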

However, after I ran the script, I saw that all of the C* pods terminated; I expected either no changes at all or a rolling restart.

Every 2.0s: kubectl get pods        jdinoto-rmbp16: Wed May 26 15:20:35 2021

NAME                             READY   STATUS     RESTARTS   AGE
cass-operator-7675b65744-s7rwp   1/1     Running    0          31s
test-test-default-sts-0          0/1     Init:0/1   0          6s
test-test-default-sts-1          0/1     Init:0/1   0          6s
test-test-default-sts-2          0/1     Init:0/1   0          7s

They then started coming back up:

Every 2.0s: kubectl get pods        jdinoto-rmbp16: Wed May 26 15:23:01 2021

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-s7rwp   1/1     Running   0          2m57s
test-test-default-sts-0          1/1     Running   0          2m32s
test-test-default-sts-1          0/1     Running   0          2m32s
test-test-default-sts-2          1/1     Running   0          2m33s

@jsanda
Contributor Author

jsanda commented May 27, 2021

@jdonenine Thanks for the testing and for the script changes!

However, after I ran the script, I saw that all of the C* pods terminated; I expected either no changes at all or a rolling restart.

I have started investigating this. My expectation (or hope) is that there would be no restart of pods. StatefulSets have a podManagementPolicy property that dictates how pods will be started. Accepted values are OrderedReady and Parallel. cass-operator uses the latter.

Note that even though the pods are started in parallel, Cassandra nodes are still started serially. This is because the management-api starts/stops Cassandra and cass-operator invokes the start operation on the management-api serially.
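
For example, you can confirm which policy the StatefulSet is using with something like:

kubectl get statefulset test-test-default-sts -o jsonpath='{.spec.podManagementPolicy}'
# Parallel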

I did a quick test. I created a CassandraDatacenter. After it became ready I scaled the cass-operator deployment down to 0. Then I deleted the CassandraDatacenter without cascading the delete. I then did a non-cascading delete of the StatefulSet. Then I recreated the StatefulSet. My Cassandra pods were not terminated. I think cass-operator is driving the pod restarts. I will investigate some more to see if I can track down what is triggering the pod terminations.
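
In shell terms the test looked roughly like this (a sketch; the resource names match the example cluster earlier in this thread):

kubectl scale deployment cass-operator --replicas=0
kubectl delete cassandradatacenter test --cascade=orphan
# Save the StatefulSet spec before the non-cascading delete
# (strip status, metadata.uid, metadata.resourceVersion, etc. before re-creating)
kubectl get statefulset test-test-default-sts -o yaml > sts.yaml
kubectl delete statefulset test-test-default-sts --cascade=orphan
kubectl apply -f sts.yaml    # the existing Cassandra pods are left untouched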

@jsanda
Contributor Author

jsanda commented May 27, 2021

I figured out why pods are being recreated. cass-operator initially creates the StatefulSet with replicas: 0. Because the labels of the Cassandra pods match the StatefulSet's selector, I believe the pods are immediately considered part of the StatefulSet. The pods are then deleted since the replica count is initially set to zero. I verified this behavior in my test environment:

Events:
  Type    Reason            Age   From                    Message
  ----    ------            ----  ----                    -------
  Normal  SuccessfulDelete  19m   statefulset-controller  delete Pod labels-labels-default-sts-0 in StatefulSet labels-labels-default-sts successful
  Normal  SuccessfulCreate  19m   statefulset-controller  create Pod labels-labels-default-sts-0 in StatefulSet labels-labels-default-sts successful

The above is output from kubectl describe statefulset. I was testing with a single C* node cluster.
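
To see the adoption path for yourself, you can compare the recreated StatefulSet's selector with the labels on the Cassandra pod (names are from my test cluster above); if they match, the new StatefulSet adopts the pod as soon as it is created:

kubectl get statefulset labels-labels-default-sts -o jsonpath='{.spec.selector.matchLabels}'
kubectl get pod labels-labels-default-sts-0 --show-labels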

@jdonenine We won't be able to avoid a restart without non-trivial changes to cass-operator.

@jsanda jsanda closed this Jul 15, 2021
Development

Successfully merging this pull request may close these issues.

K8SSAND-483 ⁃ Updating Statefulsets is broken when upgrading to 1.7.0