K8SSAND-483 ⁃ Updating Statefulsets is broken when upgrading to 1.7.0 #103

Closed
jsanda opened this issue May 25, 2021 · 9 comments · Fixed by #105
Labels
bug Something isn't working

@jsanda
Contributor

jsanda commented May 25, 2021

What happened?
I created a CassandraDatacenter with cass-operator 1.6.0. I then updated to 1.7.0. cass-operator fails to apply StatefulSet changes. cass-operator logs this error:

{"level":"error","ts":1621959974.8570168,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"cassandradatacenter-controller","request":"default/labels","error":"StatefulSet.apps \"labels-labels-default-sts\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}

Note this part in particular:

Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden

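This regression is due to the changes in #18, which change the ServiceName property of the StatefulSet. ServiceName is not one of the fields that can be updated on an existing StatefulSet, so the update is rejected.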

cass-operator logs this error and does not continue with the reconciliation process. The Cassandra pods will remain running. If you change the CassandraDatacenter spec in a way that would result in a change to the StatefulSet, the changes won't be applied.
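
As a rough illustration of the rule quoted in the error message, here is a minimal Python sketch that reports which top-level spec fields differ between two StatefulSet specs and are not in the allowed set. It is a simplified model for this discussion, not the API server's actual validation code, and the two service names are hypothetical placeholders.

```python
# Simplified model of the StatefulSet update rule quoted in the error message.
# Illustration only; this is not the Kubernetes API server's validation code.

# Fields named as updatable in the error message above.
MUTABLE_SPEC_FIELDS = {"replicas", "template", "updateStrategy"}


def forbidden_spec_changes(old_spec, new_spec):
    """Return the top-level spec fields that differ but may not be updated."""
    changed = {
        key
        for key in old_spec.keys() | new_spec.keys()
        if old_spec.get(key) != new_spec.get(key)
    }
    return sorted(changed - MUTABLE_SPEC_FIELDS)


# Hypothetical service names, chosen only for this example.
existing = {"replicas": 1, "serviceName": "example-headless-service"}
desired = {"replicas": 3, "serviceName": "example-all-pods-service"}

# The replicas change is allowed; the serviceName change is what gets reported.
print(forbidden_spec_changes(existing, desired))
```

Changing only `replicas` yields an empty list in this model, which matches the observation that the failure appears only once the operator tries to write a different ServiceName.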

Did you expect to see something different?

How to reproduce it (as minimally and precisely as possible):

  1. Deploy cass-operator 1.6.0
  2. Create a CassandraDatacenter
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: labels
spec:
  clusterName: labels
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: standard
  serverType: cassandra
  serverVersion: 3.11.10
  serverImage: k8ssandra/cass-management-api:3.11.10-v0.1.24
  disableSystemLoggerSidecar: true
  dockerImageRunsAsCassandra: true
  podTemplateSpec:
    metadata:
      labels:
        env: dev
    spec:
      containers: []
  3. Wait for the CassandraDatacenter to become ready
  4. Upgrade the cass-operator deployment to 1.7.0
  5. Check the cass-operator logs; you should see the above error message
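
For the log check in the last step, a small Python sketch like the following could pick out the relevant entries from the operator's JSON logs. The function name is hypothetical, and the sample entry is an abbreviated copy of the log line shown earlier (timestamp and stack trace omitted).

```python
import json

# Substring taken from the error message reported in this issue.
FORBIDDEN_MARKER = "Forbidden: updates to statefulset spec"


def find_forbidden_updates(log_lines):
    """Yield (request, error) for error entries caused by a rejected StatefulSet update."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not log objects
        if entry.get("level") == "error" and FORBIDDEN_MARKER in entry.get("error", ""):
            yield entry.get("request"), entry["error"]


# Abbreviated copy of the log entry from this issue, built with json.dumps
# to avoid hand-escaping the quotes.
sample = json.dumps({
    "level": "error",
    "logger": "controller-runtime.controller",
    "msg": "Reconciler error",
    "controller": "cassandradatacenter-controller",
    "request": "default/labels",
    "error": 'StatefulSet.apps "labels-labels-default-sts" is invalid: spec: '
             "Forbidden: updates to statefulset spec for fields other than "
             "'replicas', 'template', and 'updateStrategy' are forbidden",
})

for request, error in find_forbidden_updates([sample, "not a json line"]):
    print(request, "->", error)
```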

Environment

  • Cass Operator version:

    v1.7.0

Anything else we need to know?

The error occurs in the `CheckRackPodTemplate` function in `reconcile_racks.go`. This will impact any existing CassandraDatacenter when cass-operator is upgraded. The bug will not impact new CassandraDatacenters installed with 1.7.0. I am inclined to say that we need to revert the changes in #18 (Allow dns lookup by pod name for all pods); however, doing so will introduce this problem for users who created new CassandraDatacenters with 1.7.0 and then go to upgrade. Given that, we need to carefully consider how best to resolve this.

┆Issue is synchronized with this Jira Bug by Unito
┆fixVersions: k8ssandra-1.2.0,cass-operator-1.7.1
┆friendlyId: K8SSAND-483
┆priority: Highest

@jsanda jsanda added the bug Something isn't working label May 25, 2021
@jsanda
Contributor Author

jsanda commented May 25, 2021

As I mentioned in the description, users upgrading to 1.7.0 will hit this issue. If we revert the change, users who have created new CassandraDatacenters with 1.7.0 will hit it when they upgrade. Either way there is an upgrade problem.

We can provide a script to resolve the upgrade issue. The script will do the following:

  • Scale cass-operator deployment to 0.
  • Remove the finalizer from the CassandraDatacenter
  • Delete the CassandraDatacenter and do not cascade the delete to the StatefulSet
    • This could be done with kubectl delete cassdc dc1 --cascade=orphan
  • Delete the StatefulSet and do not cascade the delete to the Cassandra pods
  • Recreate the CassandraDatacenter
    • In my initial testing I had a YAML manifest for my CassandraDatacenter available, which made it easy. It might be a bit more involved if you installed from Helm, for example.

I can work on creating this script, and we can use it whether we decide to revert #18 and do a 1.7.1 release or just provide it as a workaround for users upgrading to 1.7.0.
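
The steps listed above can be sketched roughly as follows. This is not the script that was eventually published; it is a minimal Python sketch that only builds and prints the `kubectl` invocations so they can be reviewed before anything is run. The function name, the manifest path `cassdc.yaml`, and the assumption that the operator deployment is named `cass-operator` and lives in the same namespace as the CassandraDatacenter are all illustrative.

```python
import shlex


def build_workaround_commands(namespace, dc_name, statefulsets, manifest_path,
                              operator_deployment="cass-operator"):
    """Return the kubectl commands for the workaround steps, in order.

    Assumes (for illustration) that the operator deployment and the
    CassandraDatacenter are in the same namespace.
    """
    ns = ["--namespace", namespace]
    commands = [
        # Scale the cass-operator deployment to 0.
        ["kubectl", "scale", "deployment", operator_deployment, "--replicas=0", *ns],
        # Remove the finalizer from the CassandraDatacenter.
        ["kubectl", "patch", "cassdc", dc_name, "--type=merge",
         "--patch", '{"metadata":{"finalizers":null}}', *ns],
        # Delete the CassandraDatacenter without cascading to the StatefulSet.
        ["kubectl", "delete", "cassdc", dc_name, "--cascade=orphan", *ns],
    ]
    # Delete each StatefulSet without cascading to the Cassandra pods.
    for sts in statefulsets:
        commands.append(["kubectl", "delete", "statefulset", sts, "--cascade=orphan", *ns])
    # Recreate the CassandraDatacenter from a saved manifest.
    commands.append(["kubectl", "apply", "-f", manifest_path, *ns])
    return commands


# Names taken from the reproduction in this issue; the manifest path is hypothetical.
for cmd in build_workaround_commands(
    namespace="default",
    dc_name="labels",
    statefulsets=["labels-labels-default-sts"],
    manifest_path="cassdc.yaml",
):
    print(shlex.join(cmd))
```

The sketch covers only the steps in the list and deliberately runs nothing; each printed command would still need to be checked against the actual cluster.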

@rchernobelskiy
Contributor

Just to add some context:
The purpose of the ServiceName field of a StatefulSet is to specify the name of the service under which DNS subdomain records are created for each pod: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id

The operator makes two services for dse pods, one with PublishNotReadyAddresses set to true and one with it set to false.

Setting the ServiceName to the service which has PublishNotReadyAddresses set to true allows dns lookups for pods that are not ready yet.
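
To make the effect of ServiceName concrete, here is a minimal Python sketch of how a StatefulSet pod's DNS name is composed according to the Kubernetes documentation linked above. The service name is a hypothetical placeholder, and `cluster.local` is assumed as the cluster domain.

```python
def pod_dns_name(statefulset, ordinal, service_name, namespace,
                 cluster_domain="cluster.local"):
    """Compose the per-pod DNS name for a StatefulSet pod.

    The pod name is <statefulset>-<ordinal>, and the record is created under
    the service named by the StatefulSet's serviceName field.
    """
    pod = f"{statefulset}-{ordinal}"
    return f"{pod}.{service_name}.{namespace}.svc.{cluster_domain}"


# StatefulSet name from this issue; the service name is a hypothetical placeholder.
print(pod_dns_name("labels-labels-default-sts", 0, "example-all-pods-service", "default"))
```

Because the service name is part of every pod's DNS name, pointing ServiceName at a different service changes which records exist for the pods.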

This can be useful in a number of scenarios, one of which is having the pods on an overlay network with stable IPs, where pod to pod communication happens via dns names: https://www.datastax.com/blog/2021/05/how-connect-stateful-workloads-across-kubernetes-clusters

(Also added the above to the original PR #18)

@talonx

talonx commented May 27, 2021

I am still encountering this with 1.7.1
{"level":"error","ts":1622117781.4414082,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"cassandradatacenter-controller","request":"cass-operator/dc1","error":"StatefulSet.apps \"apptuit-dc1-rack1-sts\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}

@jdonenine
Contributor

Hi @talonx, we're currently working on a script that can remedy the problem on systems that were previously upgraded from 1.6 to 1.7.0. Unfortunately, just upgrading to 1.7.1 won't address the issue.

We're hoping to have the script and a blog post with some details on the issue available very soon.

@talonx

talonx commented May 27, 2021

Hi @jdonenine, in my case I upgraded from my previous version (not 1.6 or 1.7) to 1.7.1 and started seeing this in the logs. Is that expected?

@jdonenine
Contributor

No, I would not have expected that if you hadn't already gone to 1.7.0, @talonx.

A couple of questions...

  1. What version of cass-operator did you upgrade from?
  2. Can you share the output of a describe on the CassandraDatacenter resource you have deployed? Particularly events being reported?

@talonx

talonx commented May 27, 2021

  1. This is the output from the (previous) operator pod logs - {"level":"info","ts":1622118203.5764365,"logger":"cmd","msg":"Operator version","operatorVersion":"datastax/cass-operator:1.3.0-release.5e7d316ade03be8ee0da0792257ca9b8ca6ed6bd"}
  2. There are no events in the describe
default-token-c5v8d:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c5v8d
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  pool-name=cassandra
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

@jsanda
Contributor Author

jsanda commented May 27, 2021

@talonx Can you show the output of kubectl describe deployment <cass-operator>?

@burmanm
Contributor

burmanm commented May 27, 2021

And would it be possible to see the labels of the CassandraDatacenter object?

@sync-by-unito sync-by-unito bot changed the title Updating Statefulsets is broken when upgrading to 1.7.0 K8SSAND-483 ⁃ Updating Statefulsets is broken when upgrading to 1.7.0 Apr 4, 2022
5 participants