
Deleting a single cluster causes other non-affected clusters to abort all PODs & restart #615

Closed
andrey-dubnik opened this issue Jul 7, 2022 · 10 comments
Labels: bug (Something isn't working), done (Issues in the state 'done')

Comments

andrey-dubnik commented Jul 7, 2022

What happened?

Hi,

We have 2 clusters running in a single k8s namespace. We deleted one cluster, which caused the other cluster to abort all of its pods at the same time and restart, triggering a complete loss of service.

Did you expect to see something different?

We expected the cluster that was not targeted by the delete operation to remain completely unaffected.

How to reproduce it (as minimally and precisely as possible):

1. Create 2 clusters in the same namespace (if relevant)
2. Delete one of the clusters
3. The remaining cluster terminates all of its pods, which then go into a restart loop

Environment

  • K8ssandra Operator version: v1.1.1
  • Kubernetes version information: 1.22.6
  • Kubernetes cluster kind: AKS

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-189

@andrey-dubnik andrey-dubnik added the bug Something isn't working label Jul 7, 2022
@sync-by-unito sync-by-unito bot changed the title Deleting a single cluster causes other non-affected clusters to abort all PODs & restart K8SSAND-1634 ⁃ Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Jul 7, 2022
andrey-dubnik commented

It may be related to the fact that we have 2 C* clusters in the same k8s namespace (spanning 2 k8s clusters), but we named the datacenters in both the same way - primary and secondary.

This resulted in a configuration overlap: the CassandraDatacenter name is scoped per namespace, so the operator was manipulating the same DC object on behalf of 2 different clusters...
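
For illustration, here is a minimal sketch of the overlap, assuming the operators key their lookups by {namespace, name} (the values mirror the setup above):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// With no cluster prefix, both K8ssandraClusters resolve their
	// "primary" datacenter to the same object key in the namespace.
	fromCluster1 := types.NamespacedName{Namespace: "cassandra", Name: "primary"}
	fromCluster2 := types.NamespacedName{Namespace: "cassandra", Name: "primary"}

	// Prints true: both reconcile loops read and write the same
	// CassandraDatacenter, so deleting one K8ssandraCluster tears
	// down the pods belonging to the other.
	fmt.Println(fromCluster1 == fromCluster2)
}
```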

jsanda commented Jul 7, 2022

@andrey-dubnik can you please provide the K8ssandraCluster manifests so we can test and try to reproduce?

andrey-dubnik commented

Sure - here is the template below; just use the same template to create cluster1 and, after cluster1 is online, create cluster2.

I have tested the behaviour with different DC names for cluster1 and cluster2 and there is no mass restart. I suspect this is down to both clusters pointing to the same DC name (visible via `kubectl get cassandradatacenters`), which I think should somehow be prevented - the {cluster, dc} pair should not be allowed to overlap.

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: control-plane-cluster
    kustomize.toolkit.fluxcd.io/namespace: temporal-state
  name: cluster1
  namespace: cassandra
spec:
  auth: true
  cassandra:
    additionalSeeds:
    datacenters:
    - config:
        jvmOptions:
          gc: G1GC
          gc_g1_max_gc_pause_ms: 300
          gc_g1_rset_updating_pause_time_percent: 5
          heapSize: 512M
      jmxInitContainerImage:
        name: busybox
        registry: docker.io
        tag: 1.34.1
      metadata:
        annotations:
          prometheus.io/port: "9103"
          prometheus.io/scrape: "true"
        labels:
          app: temporal
          env: dev
          k8s_cluster: orch-dev-admin-westeurope-01
          product_id: service-composition
          provider: azure
          region: westeurope
        name: primary
      racks:
      - name: az-1
      - name: az-2
      - name: az-3
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      size: 3
      stopped: false
      storageConfig:
        cassandraDataVolumeClaimSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
          storageClassName: maersk-cassandra-csi-zrs
      telemetry:
        prometheus:
          enabled: true
    jmxInitContainerImage:
      name: busybox
      registry: docker.io
      tag: 1.34.1
    serverVersion: 4.0.3
    superuserSecretRef:
      name: control-plane-superuser
  medusa:
    cassandraUserSecretRef:
      name: control-plane-medusa
    storageProperties:
      bucketName: cassandra-backups
      concurrentTransfers: 1
      maxBackupAge: 0
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      storageProvider: azure_blobs
      storageSecretRef:
        name: control-plane-medusa-azure-credentials
      transferMaxBandwidth: 50MB/s
  reaper:
    ServiceAccountName: default
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    cassandraUserSecretRef:
      name: control-plane-reaper-cql
    containerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 3.1.1
    deploymentMode: PER_DC
    heapSize: 2Gi
    initContainerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 3.1.1
    jmxUserSecretRef:
      name: control-plane-reaper-jmx
    keyspace: reaper_db
    uiUserSecretRef:
      name: control-plane-reaper-ui
```

jsanda commented Jul 7, 2022

@andrey-dubnik Thanks for sharing your spec. It was easy enough to reproduce the behavior. You are 100% correct that there is a naming collision. We need to adopt a naming convention for the CassandraDatacenters like we do with other resources, which is to prefix their names with the K8ssandraCluster name. For your cluster, we would wind up with a CassandraDatacenter named cluster1-primary.
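
A minimal sketch of that convention, assuming plain string concatenation (the helper name is hypothetical, and a real implementation would likely also need to sanitize names for Kubernetes DNS rules):

```go
package main

import "fmt"

// cassDcObjectName is a hypothetical helper showing the proposed
// convention: prefix the CassandraDatacenter object name with the
// owning K8ssandraCluster name so two K8ssandraClusters in one
// namespace can both declare a DC called "primary".
func cassDcObjectName(k8ssandraClusterName, dcName string) string {
	return fmt.Sprintf("%s-%s", k8ssandraClusterName, dcName)
}

func main() {
	fmt.Println(cassDcObjectName("cluster1", "primary")) // cluster1-primary
	fmt.Println(cassDcObjectName("cluster2", "primary")) // cluster2-primary: no collision
}
```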

The DC name is specified in /etc/cassandra/cassandra-rackdc.properties. It is also stored in the system.local table.

If we want Cassandra to still recognize and store the DC name as primary (i.e., as specified in the manifest), there will be a bit more work involved. It is going to require a small change in cass-operator. I will create an issue for that.
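
For illustration, the cass-operator change could take roughly this shape - an override field that decouples the Cassandra-visible DC name from the Kubernetes object name (a sketch only; the field name, placement, and semantics are assumptions):

```go
package v1beta1

// Sketch only: an excerpt of what a spec-level override could look
// like in cass-operator's CassandraDatacenter API.
type CassandraDatacenterSpec struct {
	// ...existing fields elided...

	// DatacenterName, when set, would be written to
	// cassandra-rackdc.properties (and thus end up in system.local)
	// in place of metadata.name. The object could then be named
	// "cluster1-primary" while Cassandra still sees "primary".
	DatacenterName string `json:"datacenterName,omitempty"`
}
```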

jsanda commented Jul 7, 2022

Here is a scaled-down example that I used to reproduce the issue on my local kind cluster.

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: tes1
spec:
  cassandra:
    serverVersion: "4.0.3"
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: standard
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 1Gi
    networking:
      hostNetwork: true
    datacenters:
      - metadata:
          name: dc1
        size: 1
```

and

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: tes2
spec:
  cassandra:
    serverVersion: "4.0.3"
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: standard
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 1Gi
    networking:
      hostNetwork: true
    datacenters:
      - metadata:
          name: dc1
        size: 1
```

jsanda commented Jul 8, 2022

Some extra care will have to be taken with existing K8ssandraClusters to avoid C* downtime and data loss. During the reconciliation loop the K8ssandraCluster controller fetches the CassandraDatacenter and creates it if it isn't found.

That logic will need to be updated to also look up the CassandraDatacenter by its old name when it isn't found under the new naming scheme. If we find it under the old name, we want to recreate it with the new naming format. We can do a non-cascading delete and then recreate it; this way the StatefulSets, pods, etc. will remain intact.
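
A minimal sketch of that fallback lookup, assuming controller-runtime and the cass-operator API types (the function and variable names are illustrative, not the operator's actual code):

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
)

// fetchDcWithFallback tries the new prefixed name first, then falls
// back to the old name. The bool result reports whether the object was
// found under the old (pre-convention) name and so needs migrating.
func fetchDcWithFallback(ctx context.Context, c client.Client, newKey, oldKey client.ObjectKey) (*cassdcapi.CassandraDatacenter, bool, error) {
	dc := &cassdcapi.CassandraDatacenter{}
	err := c.Get(ctx, newKey, dc)
	if err == nil {
		return dc, false, nil // already on the new naming scheme
	}
	if !apierrors.IsNotFound(err) {
		return nil, false, err
	}
	// Not found under the new name: look for the legacy name. A NotFound
	// here means the caller should create the datacenter from scratch.
	if err := c.Get(ctx, oldKey, dc); err != nil {
		return nil, false, err
	}
	return dc, true, nil // found under old name: recreate under newKey.Name
}
```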

burmanm commented Jul 11, 2022

I think the real fix requires some additional steps as well. The references (if not a Kubernetes controller reference, then something in the Status of the K8ssandraCluster) should always point explicitly to the objects that have been created (including the cluster each one resides in), rather than relying on a naming convention and hoping names won't clash - we could even go as far as generating the names with a UUID.

As for the "non-cascading delete", that step requires uninstalling cass-operator without removing the CRDs before removing the CassandraDatacenter. Otherwise, cass-operator will try to delete the underlying PVCs and remove secret annotations, and if that fails it will hold on to the finalizer and block deletion of the CassandraDatacenter.

Also, we might want to reconsider the "create if not found" policy. If a resource we expect to create is already found, perhaps we should abort at that point? Otherwise a K8ssandraCluster could overwrite an existing CassandraDatacenter elsewhere as well.
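
To illustrate the suggestion about explicit references, the K8ssandraCluster status could carry entries along these lines (a sketch; the type and all field names are hypothetical):

```go
package v1alpha1

// DatacenterReference is a hypothetical status entry recording exactly
// which object a K8ssandraCluster created, instead of re-deriving it
// from a naming convention at reconcile time.
type DatacenterReference struct {
	// K8sContext identifies the Kubernetes cluster holding the object.
	K8sContext string `json:"k8sContext,omitempty"`
	Namespace  string `json:"namespace"`
	// Name is the actual generated object name, e.g. "cluster1-primary"
	// or a UUID-based name.
	Name string `json:"name"`
}
```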

jsanda commented Jul 12, 2022

We cannot rely on controller references since the objects we're dealing with can span across multiple Kubernetes clusters. While I am not a huge fan of the datacenters map we currently have in the status (since it just copies the CassandraDatacenters' status verbatim), it does specify each CassandraDatacenter. I'm not sure why we would also need to store the cluster since we can get that from the spec.

I'd say that prefixing the name of the CassandraDatacenter with the name of the K8ssandraCluster or the Cassandra cluster is more than simply hoping that they won't clash. It will prevent collisions within a namespace-scoped deployment of the operator, even for multi-cluster, and it should be sufficient for cluster-scoped deployments of the operator as well. This is the approach taken for other child resources.

My suggestion about the non-cascading delete had a slight oversight 😅 What about adding an annotation to the CassandraDatacenter that tells cass-operator it is a non-cascading delete?
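
A sketch of how that could look from the K8ssandraCluster controller's side, assuming controller-runtime; the annotation key is entirely hypothetical:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
)

// orphanDeleteDc marks the CassandraDatacenter so that cass-operator's
// finalizer could skip PVC/secret cleanup (hypothetical annotation),
// then issues an orphan delete so StatefulSets and pods stay intact.
func orphanDeleteDc(ctx context.Context, c client.Client, dc *cassdcapi.CassandraDatacenter) error {
	patch := client.MergeFrom(dc.DeepCopy())
	metav1.SetMetaDataAnnotation(&dc.ObjectMeta, "k8ssandra.io/skip-cleanup", "true")
	if err := c.Patch(ctx, dc, patch); err != nil {
		return err
	}
	return c.Delete(ctx, dc, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}
```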

I'm not sure about aborting if the resource is found when we expect to create it. We always check first to see if the object exists, and then create it if it's not found. We don't do it the other way around, attempting to create the object first.

sync-by-unito bot commented Sep 3, 2024

➤ Michael Burman commented: (a verbatim re-sync of the Jul 11, 2022 comment above)

burmanm commented Sep 5, 2024

This is a stale ticket (name overrides have been available for a while), so I'm closing it even though the sync integration woke it up.

@burmanm burmanm closed this as completed Sep 5, 2024
@github-project-automation github-project-automation bot moved this to Done in K8ssandra Sep 5, 2024
@adejanovski adejanovski added the done Issues in the state 'done' label Sep 5, 2024
@sync-by-unito sync-by-unito bot changed the title K8SSAND-1634 ⁃ Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Oct 11, 2024