K8SSAND-1349 ⁃ Hostname lookups on Cassandra pods fail #304

Closed
jsanda opened this issue Mar 24, 2022 · 4 comments · Fixed by #305 or #309
Assignees: jsanda
Labels: bug (Something isn't working), zh:To-Do

Comments

jsanda (Contributor) commented Mar 24, 2022

What happened?
I created a CassandraDatacenter with a cluster name of test, no racks, and 3 C* nodes. This results in the following pods:

  • test-dc1-default-sts-0
  • test-dc1-default-sts-1
  • test-dc1-default-sts-2

As per the Kubernetes docs, there should be a DNS record for each pod, so the following hostname should be resolvable (assuming the CassandraDatacenter is created in the cass-operator namespace):

test-dc1-default-sts-0.test-dc1-all-pods-service.cass-operator.svc.cluster.local

Note that test-dc1-all-pods-service is the name of the headless service with which cass-operator configures the StatefulSet. Specifically, cass-operator sets the StatefulSet.Spec.ServiceName field to the name of the all-pods service.

The DNS lookup fails because the StatefulSet.Spec.ServiceName is set to the empty string.

Did you expect to see something different?
The StatefulSet.Spec.ServiceName property should be set to the all-pods service name so that hostname lookups work.
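
For illustration, here is a minimal Go sketch of what the fix amounts to. The builder function and its arguments are hypothetical, not the actual cass-operator code; the key detail is that ServiceName must name the headless all-pods service so the per-pod DNS records get created.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newStatefulSet is a hypothetical helper, not the real cass-operator builder.
// ServiceName must name the headless all-pods service (e.g.
// "test-dc1-all-pods-service"); leaving it empty is the bug described above.
func newStatefulSet(name, namespace, allPodsService string, replicas int32,
	labels map[string]string, podSpec corev1.PodSpec) *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace, Labels: labels},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: allPodsService, // empty on affected versions, so per-pod DNS records are missing
			Replicas:    &replicas,
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec:       podSpec,
			},
		},
	}
}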

How to reproduce it (as minimally and precisely as possible):
Create a CassandraDatacenter, e.g.,

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: test
  config:
    jvm-server-options:
      initial_heap_size: 512M
      max_heap_size: 512M
  serverType: cassandra
  serverVersion: 4.0.3
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: standard

Wait for the operator to create the StatefulSet and then see that the ServiceName property is not set.
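
One way to confirm this programmatically is sketched below using the controller-runtime client; the StatefulSet name comes from the example above, and the cass-operator namespace is assumed as in the DNS example earlier.

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// serviceNameOf fetches the generated StatefulSet and returns spec.serviceName.
// On affected cass-operator versions this returns "" for the example above.
func serviceNameOf(ctx context.Context, c client.Client) (string, error) {
	var sts appsv1.StatefulSet
	key := client.ObjectKey{Namespace: "cass-operator", Name: "test-dc1-default-sts"}
	if err := c.Get(ctx, key, &sts); err != nil {
		return "", err
	}
	return sts.Spec.ServiceName, nil
}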

Environment

  • Cass Operator version:
I believe this affects versions 1.7.1 to 1.10.1

┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1349
┆priority: Medium

jsanda added the bug (Something isn't working) label Mar 24, 2022
jsanda self-assigned this Mar 24, 2022
sync-by-unito bot changed the title from "Hostname lookups on Cassandra pods fail" to "K8SSAND-1349 ⁃ Hostname lookups on Cassandra pods fail" Mar 24, 2022
jsanda reopened this Mar 28, 2022

jsanda (Contributor, Author) commented Mar 28, 2022

I am reopening this since #305 does not fix the problem for upgrade scenarios. With the changes in my PR, the operator tries to update the ServiceName property of the StatefulSet, which fails with an error like this:

1.6484380208310935e+09 ERROR controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller Reconciler error {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "test-upgrade-operator", "error": "StatefulSet.apps \"cluster1-dc1-r1-sts\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden"}

Since the ServiceName property is immutable, we have to recreate the StatefulSet. Simply deleting the StatefulSet would cause the operator to recreate it with ServiceName set correctly, but it would also delete the Cassandra pods, which is undesirable. There is a workaround for this.

We remove the StatefulSet owner reference from all Cassandra pods and replace it with an owner reference to the CassandraDatacenter. Then we delete the StatefulSet; the pods remain intact (and so do the PVCs). After deleting the StatefulSet, we remove the CassandraDatacenter owner reference from the pods so that, when the StatefulSet is recreated, the pods end up with the correct owner references. This additional logic needs to be done in the CheckRackCreation method.
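
A rough Go sketch of the core of this workaround, using the controller-runtime client. The function name and structure are illustrative, not actual cass-operator code, and for brevity it skips the temporary CassandraDatacenter owner reference (a follow-up comment below notes that step turns out to be unnecessary).

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateStatefulSet orphans the pods and deletes the StatefulSet so that the
// normal reconcile path (CheckRackCreation in this issue) can recreate it with
// spec.serviceName set, without taking down the Cassandra pods or their PVCs.
func recreateStatefulSet(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	// 1. Strip the StatefulSet owner reference from each Cassandra pod so that
	//    deleting the StatefulSet does not cascade to the pods.
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(sts.Namespace),
		client.MatchingLabels(sts.Spec.Selector.MatchLabels)); err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		kept := pod.OwnerReferences[:0]
		for _, ref := range pod.OwnerReferences {
			if ref.UID != sts.UID {
				kept = append(kept, ref)
			}
		}
		pod.OwnerReferences = kept
		if err := c.Update(ctx, pod); err != nil {
			return err
		}
	}

	// 2. Delete the StatefulSet; the now-ownerless pods (and their PVCs) stay put.
	return c.Delete(ctx, sts)
}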

burmanm (Contributor) commented Mar 28, 2022

Similar to #103

burmanm (Contributor) commented Mar 28, 2022

Unless you have already started working on this, I could grab it. There are also potential pre-1.7.1 users who still have the old ServiceName, so we might want to get rid of that if statement and simply update the StatefulSet in every case.

Also, we need to check in the upgrade_operator test that we don't accidentally delete the cluster completely, since the test would probably still pass if the pods were deleted and the newly created StatefulSet created new pods. That's not acceptable.

jsanda (Contributor, Author) commented Mar 29, 2022

I did some more testing and want to note a couple of things. First, it is unnecessary to add a CassandraDatacenter owner reference to the pods. It is sufficient to simply remove the StatefulSet owner reference. This is what happens with kubectl delete --cascade=orphan.

Secondly, adding back an owner reference to the new StatefulSet causes the pods to be recreated. I spent some time reviewing the StatefulSet controller code to see if I could track down what triggers the update, but I came up short. While it would be nice to avoid a rolling restart of Cassandra, it's manageable.
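
For reference on the first point, here is a minimal sketch of the in-code equivalent of kubectl delete --cascade=orphan with the controller-runtime client; the helper name is made up.

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteOrphaningPods deletes the StatefulSet with the orphan propagation policy,
// which leaves its pods (and PVCs) running, like `kubectl delete --cascade=orphan`.
func deleteOrphaningPods(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	return c.Delete(ctx, sts, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}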
