add script to update the statefulset service name #104

Closed
wants to merge 6 commits

Conversation

jsanda
Contributor

@jsanda jsanda commented May 26, 2021

What this PR does:
Adds a shell script that recreates the CassandraDatacenter and underlying StatefulSets. Note that there is no downtime or data loss. Deletes do not cascade to the Cassandra pods. They remain up and running.

Which issue(s) this PR fixes:
Fixes #103

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

┆Issue is synchronized with this Jira Bug by Unito

@jdonenine
Contributor

jdonenine commented May 26, 2021

I just tried running the script after getting my cluster into the upgraded state and I ended up with an error:

% ./patch-cassdc-sts-svc.sh --operator=cass-operator --datacenter=test
The --operator option is required and should specify the name of the cass-operator deployment
jeffdinoto@jdinoto-rmbp16 cass-operator-upgrade % ./patch-cassdc-sts-svc.sh --operator cass-operator --datacenter test
deployment.apps/cass-operator scaled
Waiting for cass-operator scale down to complete
cass-operator is scaled down to 0 replicas
Removing finalizer from CassandraDatacenter test
cassandradatacenter.cassandra.datastax.com/test patched
Deleting CassandraDatacenter test
Error: invalid argument "orphan" for "--cascade" flag: strconv.ParseBool: parsing "orphan": invalid syntax
See 'kubectl delete --help' for usage.

cass-operator is gone, but the C* pod did stay up and running, so that's good 😄

% k get all
NAME                          READY   STATUS    RESTARTS   AGE
pod/test-test-default-sts-0   1/1     Running   0          72m

NAME                                          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                        AGE
service/cass-operator-metrics                 ClusterIP   10.96.51.27   <none>        8383/TCP,8686/TCP                              90m
service/cassandradatacenter-webhook-service   ClusterIP   10.96.36.59   <none>        443/TCP                                        91m
service/test-seed-service                     ClusterIP   None          <none>        <none>                                         72m
service/test-test-all-pods-service            ClusterIP   None          <none>        9042/TCP,8080/TCP,9103/TCP                     72m
service/test-test-service                     ClusterIP   None          <none>        9042/TCP,9142/TCP,8080/TCP,9103/TCP,9160/TCP   72m

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cass-operator   0/0     0            0           91m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cass-operator-5dfcdc46f8   0         0         0       91m
replicaset.apps/cass-operator-7675b65744   0         0         0       21m

NAME                                     READY   AGE
statefulset.apps/test-test-default-sts   1/1     72m

Maybe it's a version thing?

If I look at the help for delete on my system I get:

Options:
      --all=false: Delete all resources, including uninitialized ones, in the namespace of the specified resource types.
  -A, --all-namespaces=false: If present, list the requested object(s) across all namespaces. Namespace in current
context is ignored even if specified with --namespace.
      --cascade=true: If true, cascade the deletion of the resources managed by this resource (e.g. Pods created by a
ReplicationController).  Default true.

Clearly here, --cascade doesn't support much.

Here's what I'm running on (a local kind cluster):

% k version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-22T22:54:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

@jdonenine
Contributor

So I've confirmed after talking to @burmanm and some quick testing that the problem I saw has to do with the version of kubectl. It looks like the "orphan" value for --cascade was first introduced in v1.20, so as it stands the script will fail on anything prior to that; my first attempt was on v1.19.

After I upgraded to v1.21 (latest), the script worked.
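
One way to make the script tolerate older clients would be to pick the cascade flag based on the kubectl client version, something along these lines (a sketch, not what the script currently does):

# Assumption: fall back to the pre-1.20 boolean form, where --cascade=false also orphans dependents
minor=$(kubectl version --client -o json | sed -n 's/.*"minor": *"\([0-9]*\)".*/\1/p')
if [ "${minor:-0}" -ge 20 ]; then
  cascade_flag="--cascade=orphan"
else
  cascade_flag="--cascade=false"
fi
kubectl delete cassandradatacenter test $cascade_flag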

However, after I ran the script, I saw that all of the C* pods terminated; I expected either no changes at all or a rolling restart.

Every 2.0s: kubectl get pods        jdinoto-rmbp16: Wed May 26 15:20:35 2021

NAME                             READY   STATUS     RESTARTS   AGE
cass-operator-7675b65744-s7rwp   1/1     Running    0          31s
test-test-default-sts-0          0/1     Init:0/1   0          6s
test-test-default-sts-1          0/1     Init:0/1   0          6s
test-test-default-sts-2          0/1     Init:0/1   0          7s

They then started coming back up:

Every 2.0s: kubectl get pods        jdinoto-rmbp16: Wed May 26 15:23:01 2021

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-s7rwp   1/1     Running   0          2m57s
test-test-default-sts-0          1/1     Running   0          2m32s
test-test-default-sts-1          0/1     Running   0          2m32s
test-test-default-sts-2          1/1     Running   0          2m33s

@jsanda
Contributor Author

jsanda commented May 27, 2021

@jdonenine Thanks for the testing and for the script changes!

However, after I ran the script, I saw that all of the C* pods terminated; I expected either no changes at all or a rolling restart.

I have started investigating this. My expectation (or hope) is that there would be no restart of pods. StatefulSets have a podManagementPolicy property that dictates how pods will be started. Accepted values are OrderedReady and Parallel. cass-operator uses the latter.

Note that even though the pods are started in parallel, Cassandra nodes are still started serially. This is because the management-api starts/stops Cassandra and cass-operator invokes the start operation on the management-api serially.
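
For example, you can confirm which policy the StatefulSet is using with something like:

kubectl get statefulset test-test-default-sts -o jsonpath='{.spec.podManagementPolicy}'
# Parallel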

I did a quick test. I created a CassandraDatacenter. After it became ready I scaled the cass-operator deployment down to 0. Then I deleted the CassandraDatacenter without cascading the delete. I then did a non-cascading delete of the StatefulSet. Then I recreated the StatefulSet. My Cassandra pods were not terminated. I think cass-operator is driving the pod restarts. I will investigate some more to see if I can track down what is triggering the pod terminations.
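
In shell terms the test looked roughly like this (a sketch; the resource names match the example cluster earlier in this thread):

kubectl scale deployment cass-operator --replicas=0
kubectl delete cassandradatacenter test --cascade=orphan
# Save the StatefulSet spec before the non-cascading delete
# (strip status, metadata.uid, metadata.resourceVersion, etc. before re-creating)
kubectl get statefulset test-test-default-sts -o yaml > sts.yaml
kubectl delete statefulset test-test-default-sts --cascade=orphan
kubectl apply -f sts.yaml    # the existing Cassandra pods are left untouched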

@jsanda
Contributor Author

jsanda commented May 27, 2021

I figured out why pods are being recreated. cass-operator initially creates the StatefulSet with replicas: 0. Because the labels of the Cassandra pods match the StatefulSet's selector, I believe the pods are immediately considered part of the StatefulSet. The pods are then deleted since the replica count is initially set to zero. I verified this behavior in my test environment:

Events:
  Type    Reason            Age   From                    Message
  ----    ------            ----  ----                    -------
  Normal  SuccessfulDelete  19m   statefulset-controller  delete Pod labels-labels-default-sts-0 in StatefulSet labels-labels-default-sts successful
  Normal  SuccessfulCreate  19m   statefulset-controller  create Pod labels-labels-default-sts-0 in StatefulSet labels-labels-default-sts successful

The above is output from kubectl describe statefulset. I was testing with a single C* node cluster.
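
To see the adoption path for yourself, you can compare the recreated StatefulSet's selector with the labels on the Cassandra pod (names are from my test cluster above); if they match, the new StatefulSet adopts the pod as soon as it is created:

kubectl get statefulset labels-labels-default-sts -o jsonpath='{.spec.selector.matchLabels}'
kubectl get pod labels-labels-default-sts-0 --show-labels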

@jdonenine We won't be able to avoid a restart without non-trivial changes to cass-operator.

@jsanda jsanda closed this Jul 15, 2021
Development

Successfully merging this pull request may close these issues.

K8SSAND-483 ⁃ Updating Statefulsets is broken when upgrading to 1.7.0