
K8SSAND-1467 ⁃ The newly spawned pods do not use the up-to-date server-config-init under certain conditions #324

Closed
hoyhbx opened this issue Apr 21, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@hoyhbx

hoyhbx commented Apr 21, 2022

What did you do?

As mentioned in #150, the resource configuration of the server-config-init init container can be set via configBuilderResources. We realized that under certain conditions the new pods do not necessarily use the most up-to-date resource configuration. For example, as shown in the steps below, when we scale up the cluster and change the resource configuration at the same time, the newly spawned pod does not use the updated configBuilderResources.

We also found that even if we explicitly separate the configBuilderResources change and the scale-up into two steps, the StatefulSet does not get updated immediately if some pods are not yet ready between the two operations, and the newly spawned pods still use the old configBuilderResources.
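For reference, here is a minimal sketch of how we separated the two changes in the second experiment; the file names cr_resources_only.yaml and cr_scale_only.yaml are hypothetical stand-ins for a CR revision that only changes configBuilderResources and one that only changes size. The numbered steps below reproduce the simultaneous-change case.

# Hypothetical two-step variant: change configBuilderResources first ...
kubectl apply -f cr_resources_only.yaml
# ... wait for the StatefulSet rollout triggered by the resource change to settle ...
kubectl rollout status statefulset/cluster1-cassandra-datacenter-default-sts
# ... and only then scale up
kubectl apply -f cr_scale_only.yaml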

# Step1: Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.7.1/cert-manager.yaml

# Step2: Install operator
kubectl apply -f init.yaml
kubectl apply --force-conflicts --server-side -k 'github.com/k8ssandra/cass-operator/config/deployments/cluster?ref=v1.10.3'

# Step3: Apply custom resource
kubectl apply -f cr1_spec.yaml

# Step4: Check CR for the "Config Builder Resources" field => The field is the same as cr1_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter

#  Config Builder Resources:
#    Requests:
#      Cpu:     512m
#      Memory:  100m

# Step5: Check statefulset for the resource request config of server-config-init => Same as cr1_spec.yaml
kubectl describe statefulsets.apps cluster1-cassandra-datacenter-default-sts

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m

# Step6: Update cassandra-datacenter
kubectl apply -f cr2_spec.yaml

# Step7: Check CR for the "Config Builder Resources" field => The field is the same as cr2_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter

#  Config Builder Resources:
#    Requests:
#      Cpu:     1024m
#      Memory:  200m

# Step8: Check Pods for the resource request config of server-config-init => Not the same as cr2_spec.yaml
kubectl describe pod cluster1-cassandra-datacenter-default-sts-0

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m

kubectl describe pod cluster1-cassandra-datacenter-default-sts-1

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m
  • init.yaml
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   # Changing the name to server-storage is the only change we have made compared to upstream
   name: server-storage
 provisioner: rancher.io/local-path
 volumeBindingMode: WaitForFirstConsumer
 reclaimPolicy: Delete
  • cr1_spec.yaml
 apiVersion: cassandra.datastax.com/v1beta1
 kind: CassandraDatacenter
 metadata:
   name: cassandra-datacenter
 spec:
   clusterName: cluster1
   config:
     cassandra-yaml:
       authenticator: org.apache.cassandra.auth.PasswordAuthenticator
       authorizer: org.apache.cassandra.auth.CassandraAuthorizer
       role_manager: org.apache.cassandra.auth.CassandraRoleManager
     jvm-options:
       initial_heap_size: 800M
       max_heap_size: 800M
   configBuilderResources:
     requests:
       memory: 100m
       cpu: 512m
   managementApiAuth:
     insecure: {}
   serverType: cassandra
   serverVersion: 3.11.7
   size: 1
   storageConfig:
     cassandraDataVolumeClaimSpec:
       accessModes:
       - ReadWriteOnce
       resources:
         requests:
           storage: 3Gi
       storageClassName: server-storage
  • cr2_spec.yaml (Update size and configBuilderResources)
 apiVersion: cassandra.datastax.com/v1beta1
 kind: CassandraDatacenter
 metadata:
   name: cassandra-datacenter
 spec:
   clusterName: cluster1
   config:
     cassandra-yaml:
       authenticator: org.apache.cassandra.auth.PasswordAuthenticator
       authorizer: org.apache.cassandra.auth.CassandraAuthorizer
       role_manager: org.apache.cassandra.auth.CassandraRoleManager
     jvm-options:
       initial_heap_size: 800M
       max_heap_size: 800M
   configBuilderResources:
     requests:
       memory: 200m
       cpu: 1024m
   managementApiAuth:
     insecure: {}
   serverType: cassandra
   serverVersion: 3.11.7
   size: 2
   storageConfig:
     cassandraDataVolumeClaimSpec:
       accessModes:
       - ReadWriteOnce
       resources:
         requests:
           storage: 3Gi
       storageClassName: server-storage

Did you expect to see something different?

In Step8, I expected server-config-init to have the same resource request configuration as cr2_spec.yaml. As mentioned in #150, the resource configuration of server-config-init can be set via configBuilderResources. We updated the size field in cr2_spec.yaml, so the number of Pods increases from 1 to 2. The new Pod runs its init containers before the app containers (k8s doc), so at least the new Pod should reflect the new resource requests specified in cr2_spec.yaml (memory: 200m, cpu: 1024m).
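One way to check just the init container's requests, instead of scanning the full describe output, is a jsonpath query against the newly spawned pod (same pod name as in Step8):

# Print only the server-config-init resource requests of the new pod
kubectl get pod cluster1-cassandra-datacenter-default-sts-1 \
  -o jsonpath='{.spec.initContainers[?(@.name=="server-config-init")].resources.requests}'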

Environment

  • Cass Operator version:

    docker.io/k8ssandra/cass-operator:v1.10.3

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}


  • Kubernetes cluster kind:

minikube start --vm-driver=docker --cpus 4 --memory 4096 --kubernetes-version v1.21.0


Root cause

We did some investigation and found the following possible root cause. The function `ReconcileAllRacks` updates the replica count of the StatefulSet (i.e. `size` in cr1_spec.yaml and cr2_spec.yaml) before it updates the podTemplate in the StatefulSet. Hence, when we update the `size` and `configBuilderResources` fields at the same time, the new Pod is created with a stale podTemplate.

To elaborate, the function `ReconcileAllRacks` updates the replica count of the StatefulSet at [reconcile_racks.go#L2416](https://github.com/k8ssandra/cass-operator/blob/c9020efb832cbfad60142194659db230e1f6995d/pkg/reconciliation/reconcile_racks.go#L2416) and updates the podTemplate in the StatefulSet at [reconcile_racks.go#L2440](https://github.com/k8ssandra/cass-operator/blob/c9020efb832cbfad60142194659db230e1f6995d/pkg/reconciliation/reconcile_racks.go#L2440). In my opinion, swapping the order of L2416 and L2440 would cause new Pods to be spawned with the new podTemplate.
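A quick way to observe this from the cluster side is to compare the StatefulSet's template revisions with the revision each pod was actually created from; the cassandra.datastax.com/datacenter label selector below is an assumption about how the operator labels its pods, so adjust it if it differs:

# Show which template revisions the StatefulSet currently tracks
kubectl get statefulset cluster1-cassandra-datacenter-default-sts \
  -o jsonpath='{.status.currentRevision}{"\n"}{.status.updateRevision}{"\n"}'

# Show which revision each pod was created from (controller-revision-hash label)
kubectl get pods -l cassandra.datastax.com/datacenter=cassandra-datacenter \
  -L controller-revision-hash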



┆Issue is synchronized with this [Jira Task](https://k8ssandra.atlassian.net/browse/K8SSAND-1467) by [Unito](https://www.unito.io)
┆friendlyId: K8SSAND-1467
┆priority: Medium
@hoyhbx hoyhbx added the bug Something isn't working label Apr 21, 2022
@sync-by-unito sync-by-unito bot changed the title The newly spawned pods do not use the up-to-date server-config-init under certain conditions K8SSAND-1467 ⁃ The newly spawned pods do not use the up-to-date server-config-init under certain conditions Apr 21, 2022
@burmanm
Contributor

burmanm commented Apr 21, 2022

Swapping the lines you proposed would not work, as the reconcile would then modify the existing pods while trying to scale (among other things). The scaling behavior is intended to use the previously observed state of the CassandraDatacenter. Otherwise you would run into a weird state where changing the CassDc while scaling would, instead of finishing the scaling, suddenly start modifying and restarting the existing pods before continuing with the scaling (or down-scaling / decommission, etc.).

So at this point, the behavior is as intended. Scaling Cassandra up or down is usually a time- and resource-consuming process, and it probably shouldn't be combined with a large number of rolling restarts of existing pods unless one wants to endanger production.

@hoyhbx
Author

hoyhbx commented May 4, 2022

Thanks for the detailed explanation! I will close this issue for now.
