
Deleting a single cluster causes other non-affected clusters to abort all PODs & restart #615

Closed
andrey-dubnik opened this issue Jul 7, 2022 · 10 comments
Labels: bug (Something isn't working), done (Issues in the state 'done')

Comments

andrey-dubnik commented Jul 7, 2022

What happened?

Hi,

We have 2 clusters running in a single k8s namespace. We deleted one cluster, which caused the other cluster to abort all of its pods at the same time and restart, triggering a complete loss of service.

Did you expect to see something different?

We expected the cluster that was not targeted by the delete operation to remain completely unaffected.

How to reproduce it (as minimally and precisely as possible):

1. Create 2 clusters in the same namespace (if relevant)
2. Delete one of the clusters
3. The remaining cluster terminates all of its pods, which then go into a restart loop

Environment

  • K8ssandra Operator version: v1.1.1
  • Kubernetes version information: 1.22.6
  • Kubernetes cluster kind: AKS

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-189

@andrey-dubnik andrey-dubnik added the bug Something isn't working label Jul 7, 2022
@sync-by-unito sync-by-unito bot changed the title Deleting a single cluster causes other non-affected clusters to abort all PODs & restart K8SSAND-1634 ⁃ Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Jul 7, 2022
andrey-dubnik commented

It may be related to the fact that we have 2 C* clusters in the same k8s namespace (spanning 2 k8s clusters), but we named the datacenters in both the same way - primary and secondary.

This resulted in a configuration overlap: the CassandraDatacenter name is scoped per namespace, so the operator was manipulating the same DC object on behalf of 2 different clusters...
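
For illustration, here is a minimal sketch of the overlap, assuming the operators key their lookups by {namespace, name} (the values mirror the setup above):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// With no cluster prefix, both K8ssandraClusters resolve their
	// "primary" datacenter to the same object key in the namespace.
	fromCluster1 := types.NamespacedName{Namespace: "cassandra", Name: "primary"}
	fromCluster2 := types.NamespacedName{Namespace: "cassandra", Name: "primary"}

	// Prints true: both reconcile loops read and write the same
	// CassandraDatacenter, so deleting one K8ssandraCluster tears
	// down the pods belonging to the other.
	fmt.Println(fromCluster1 == fromCluster2)
}
```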

jsanda commented Jul 7, 2022

@andrey-dubnik can you please provide the K8ssandraCluster manifests so we can test and try to reproduce?

andrey-dubnik commented

Sure - here is the template below; just use the same template to create cluster1 and, after cluster1 is online, create cluster2.

I have tested the behaviour with different DC names for cluster1 and cluster2 and there is no mass restart. I suspect this is down to both clusters pointing to the same DC name (visible via `kubectl get cassandradatacenters`), which I think should somehow be prevented - the {cluster, dc} pair should not be allowed to overlap.

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: control-plane-cluster
    kustomize.toolkit.fluxcd.io/namespace: temporal-state
  name: cluster1
  namespace: cassandra
spec:
  auth: true
  cassandra:
    additionalSeeds:
    datacenters:
    - config:
        jvmOptions:
          gc: G1GC
          gc_g1_max_gc_pause_ms: 300
          gc_g1_rset_updating_pause_time_percent: 5
          heapSize: 512M
      jmxInitContainerImage:
        name: busybox
        registry: docker.io
        tag: 1.34.1
      metadata:
        annotations:
          prometheus.io/port: "9103"
          prometheus.io/scrape: "true"
        labels:
          app: temporal
          env: dev
          k8s_cluster: orch-dev-admin-westeurope-01
          product_id: service-composition
          provider: azure
          region: westeurope
        name: primary
      racks:
      - name: az-1
      - name: az-2
      - name: az-3
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      size: 3
      stopped: false
      storageConfig:
        cassandraDataVolumeClaimSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
          storageClassName: maersk-cassandra-csi-zrs
      telemetry:
        prometheus:
          enabled: true
    jmxInitContainerImage:
      name: busybox
      registry: docker.io
      tag: 1.34.1
    serverVersion: 4.0.3
    superuserSecretRef:
      name: control-plane-superuser
  medusa:
    cassandraUserSecretRef:
      name: control-plane-medusa
    storageProperties:
      bucketName: cassandra-backups
      concurrentTransfers: 1
      maxBackupAge: 0
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      storageProvider: azure_blobs
      storageSecretRef:
        name: control-plane-medusa-azure-credentials
      transferMaxBandwidth: 50MB/s
  reaper:
    ServiceAccountName: default
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    cassandraUserSecretRef:
      name: control-plane-reaper-cql
    containerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 3.1.1
    deploymentMode: PER_DC
    heapSize: 2Gi
    initContainerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 3.1.1
    jmxUserSecretRef:
      name: control-plane-reaper-jmx
    keyspace: reaper_db
    uiUserSecretRef:
      name: control-plane-reaper-ui
```

jsanda commented Jul 7, 2022

@andrey-dubnik Thanks for sharing your spec. It was easy enough to reproduce the behavior. You are 100% correct that there is a naming collision. We need to adopt a naming convention for the CassandraDatacenters like we do with other resources, which is to prefix their names with the K8ssandraCluster name. For your cluster, we would wind up with a CassandraDatacenter named cluster1-primary.
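
A minimal sketch of that convention, assuming plain string concatenation (the helper name is hypothetical, and a real implementation would likely also need to sanitize names for Kubernetes DNS rules):

```go
package main

import "fmt"

// cassDcObjectName is a hypothetical helper showing the proposed
// convention: prefix the CassandraDatacenter object name with the
// owning K8ssandraCluster name so two K8ssandraClusters in one
// namespace can both declare a DC called "primary".
func cassDcObjectName(k8ssandraClusterName, dcName string) string {
	return fmt.Sprintf("%s-%s", k8ssandraClusterName, dcName)
}

func main() {
	fmt.Println(cassDcObjectName("cluster1", "primary")) // cluster1-primary
	fmt.Println(cassDcObjectName("cluster2", "primary")) // cluster2-primary: no collision
}
```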

The DC name is specified in /etc/cassandra/cassandra-rackdc.properties. It is also stored in the system.local table.

If we want Cassandra to still recognize and store the DC name as primary (i.e., as specified in the manifest), there will be a bit more work involved. It is going to require a small change in cass-operator. I will create an issue for that.
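
For illustration, the cass-operator change could take roughly this shape - an override field that decouples the Cassandra-visible DC name from the Kubernetes object name (a sketch only; the field name, placement, and semantics are assumptions):

```go
package v1beta1

// Sketch only: an excerpt of what a spec-level override could look
// like in cass-operator's CassandraDatacenter API.
type CassandraDatacenterSpec struct {
	// ...existing fields elided...

	// DatacenterName, when set, would be written to
	// cassandra-rackdc.properties (and thus end up in system.local)
	// in place of metadata.name. The object could then be named
	// "cluster1-primary" while Cassandra still sees "primary".
	DatacenterName string `json:"datacenterName,omitempty"`
}
```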

jsanda commented Jul 7, 2022

Here is a scaled-down example that I used to reproduce the issue on my local kind cluster.

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: tes1
spec:
  cassandra:
    serverVersion: "4.0.3"
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: standard
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 1Gi
    networking:
      hostNetwork: true
    datacenters:
      - metadata:
          name: dc1
        size: 1
```

and

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: tes2
spec:
  cassandra:
    serverVersion: "4.0.3"
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: standard
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 1Gi
    networking:
      hostNetwork: true
    datacenters:
      - metadata:
          name: dc1
        size: 1
```

jsanda commented Jul 8, 2022

Some extra care will have to be taken with existing K8ssandraClusters to avoid C* downtime and data loss. During the reconciliation loop the K8ssandraCluster controller fetches the CassandraDatacenter and creates it if it isn't found.

That logic will need to be updated to also look up the CassandraDatacenter by its old name when it isn't found under the new naming scheme. If we find it under the old name, we want to recreate it with the new naming format. We can do a non-cascading delete and then recreate it; this way the StatefulSets, pods, etc. will remain intact.
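
A minimal sketch of that fallback lookup, assuming controller-runtime and the cass-operator API types (the function and variable names are illustrative, not the operator's actual code):

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
)

// fetchDcWithFallback tries the new prefixed name first, then falls
// back to the old name. The bool result reports whether the object was
// found under the old (pre-convention) name and so needs migrating.
func fetchDcWithFallback(ctx context.Context, c client.Client, newKey, oldKey client.ObjectKey) (*cassdcapi.CassandraDatacenter, bool, error) {
	dc := &cassdcapi.CassandraDatacenter{}
	err := c.Get(ctx, newKey, dc)
	if err == nil {
		return dc, false, nil // already on the new naming scheme
	}
	if !apierrors.IsNotFound(err) {
		return nil, false, err
	}
	// Not found under the new name: look for the legacy name. A NotFound
	// here means the caller should create the datacenter from scratch.
	if err := c.Get(ctx, oldKey, dc); err != nil {
		return nil, false, err
	}
	return dc, true, nil // found under old name: recreate under newKey.Name
}
```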

burmanm commented Jul 11, 2022

I think the real fix requires some additional steps as well. The references (if not a Kubernetes controller reference, then something in the Status of the K8ssandraCluster) should always point explicitly to the objects that have been created (including the cluster each one resides in), rather than relying on a naming convention and hoping names won't clash - we could even go as far as generating the names with a UUID.

As for the "non-cascading delete", that step requires uninstalling cass-operator without removing the CRDs before removing the CassandraDatacenter. Otherwise, cass-operator will try to delete the underlying PVCs and remove secret annotations, and if that fails it will hold on to the finalizer and block deletion of the CassandraDatacenter.

Also, we might want to reconsider the "create if not found" policy. If a resource we expect to create is already found, perhaps we should abort at that point? Otherwise a K8ssandraCluster could overwrite an existing CassandraDatacenter elsewhere as well.
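
To illustrate the suggestion about explicit references, the K8ssandraCluster status could carry entries along these lines (a sketch; the type and all field names are hypothetical):

```go
package v1alpha1

// DatacenterReference is a hypothetical status entry recording exactly
// which object a K8ssandraCluster created, instead of re-deriving it
// from a naming convention at reconcile time.
type DatacenterReference struct {
	// K8sContext identifies the Kubernetes cluster holding the object.
	K8sContext string `json:"k8sContext,omitempty"`
	Namespace  string `json:"namespace"`
	// Name is the actual generated object name, e.g. "cluster1-primary"
	// or a UUID-based name.
	Name string `json:"name"`
}
```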

jsanda commented Jul 12, 2022

We cannot rely on controller references since the objects we're dealing with can span across multiple Kubernetes clusters. While I am not a huge fan of the datacenters map we currently have in the status (since it just copies the CassandraDatacenters' status verbatim), it does specify each CassandraDatacenter. I'm not sure why we would also need to store the cluster since we can get that from the spec.

I'd say that prefixing the name of the CassandraDatacenter with the name of the K8ssandraCluster or the Cassandra cluster is more than simply hoping that they won't clash. It will prevent collisions within a namespace-scoped deployment of the operator, even for multi-cluster, and it should be sufficient for cluster-scoped deployments of the operator as well. This is the approach taken for other child resources.

My suggestion about the non-cascading delete had a slight oversight 😅 What about adding an annotation to the CassandraDatacenter that tells cass-operator it is a non-cascading delete?
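
A sketch of how that could look from the K8ssandraCluster controller's side, assuming controller-runtime; the annotation key is entirely hypothetical:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
)

// orphanDeleteDc marks the CassandraDatacenter so that cass-operator's
// finalizer could skip PVC/secret cleanup (hypothetical annotation),
// then issues an orphan delete so StatefulSets and pods stay intact.
func orphanDeleteDc(ctx context.Context, c client.Client, dc *cassdcapi.CassandraDatacenter) error {
	patch := client.MergeFrom(dc.DeepCopy())
	metav1.SetMetaDataAnnotation(&dc.ObjectMeta, "k8ssandra.io/skip-cleanup", "true")
	if err := c.Patch(ctx, dc, patch); err != nil {
		return err
	}
	return c.Delete(ctx, dc, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}
```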

I'm not sure about aborting if the resource is found when we expect to create it. We always check first to see if the object exists, and then create it if it's not found. We don't do it the other way around, attempting to create the object first.

sync-by-unito bot commented Sep 3, 2024

➤ Michael Burman commented: (a verbatim re-sync of the Jul 11, 2022 comment above)

burmanm commented Sep 5, 2024

This is a stale ticket (name overrides have been available for a while), so I'm closing it even though the sync integration woke it up.

@burmanm burmanm closed this as completed Sep 5, 2024
@github-project-automation github-project-automation bot moved this to Done in K8ssandra Sep 5, 2024
@adejanovski adejanovski added the done Issues in the state 'done' label Sep 5, 2024
@sync-by-unito sync-by-unito bot changed the title K8SSAND-1634 ⁃ Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Deleting a single cluster causes other non-affected clusters to abort all PODs & restart Oct 11, 2024