operator stuck and unable to recover the mongodb if the podTemplate is rejected by the statefulSet controller #1245

hoyhbx · 2023-03-22T18:20:39Z

What did you do to encounter the bug?
We first created a MongoDB cluster using a very standard CR.
We then tried to add ephemeral containers to the MongoDB Pods by specifying them in the spec.statefulSet.spec.template.ephemeralContainers field.
The MongoDB cluster became unhealthy because the ephemeralContainers spec was invalid, so we tried to recover by deleting spec.statefulSet.spec.template.ephemeralContainers. However, the operator cannot recover the cluster even after we manually revert the CR. It always waits for all members to reach the desired state before applying the update that removes the ephemeralContainers. But the StatefulSet will never become ready, because its current spec is rejected by the StatefulSet controller. The result is an infinite loop.
To fix this problem, we had to delete the cluster and redeploy it.

Note that the bug can be triggered by any invalid input that the StatefulSet controller rejects; it is not limited to ephemeralContainers.
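
The deadlock described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyStatefulSet`, `controller_step`, `reconcile_wait_first`, and `reconcile_update_first` are hypothetical names invented for illustration, and none of this is the operator's actual code. It contrasts a reconcile loop that waits for readiness before pushing the reverted template with one that pushes the template first.

```python
from dataclasses import dataclass


@dataclass
class ToyStatefulSet:
    """Hypothetical stand-in for a StatefulSet; not the real Kubernetes object."""
    template_valid: bool  # would Pods built from the stored template be accepted?
    replicas: int
    ready_replicas: int


def controller_step(sts: ToyStatefulSet) -> None:
    """Toy StatefulSet controller: recreate one missing Pod if the template is valid."""
    if sts.template_valid and sts.ready_replicas < sts.replicas:
        sts.ready_replicas += 1


def reconcile_wait_first(sts: ToyStatefulSet, desired_valid: bool) -> str:
    """Models the reported behavior: push the template only once all members are ready."""
    if sts.ready_replicas < sts.replicas:
        return "waiting"
    sts.template_valid = desired_valid
    return "updated"


def reconcile_update_first(sts: ToyStatefulSet, desired_valid: bool) -> str:
    """Hypothetical alternative: push the desired template before waiting on readiness."""
    sts.template_valid = desired_valid
    return "updated"


def run(reconcile, rounds: int = 10) -> ToyStatefulSet:
    # State after the bad CR was applied: one Pod deleted, stored template rejected.
    sts = ToyStatefulSet(template_valid=False, replicas=3, ready_replicas=2)
    for _ in range(rounds):
        reconcile(sts, desired_valid=True)  # the CR has been reverted to a valid spec
        controller_step(sts)
    return sts


stuck = run(reconcile_wait_first)
recovered = run(reconcile_update_first)
print(stuck.ready_replicas, recovered.ready_replicas)  # prints "2 3"
```

In the wait-first variant the readiness gate never opens, because the only change that could make the third Pod creatable is the one being held back. Whether pushing the template first is safe for a real MongoDB replica set is a separate question that this sketch does not address.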

Steps to reproduce the behavior:

Deploy the MongoDB cluster with spec:

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M

Add an ephemeralContainers entry to the StatefulSet template by applying:

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          ephemeralContainers:
          - name: ACTOCONTAINER
            resources:
              limits:
                cpu: 800m

What did you expect?
The operator should be able to recover the cluster after the manual revert.

What happened instead?

The operator is stuck and cannot make any progress even after manually reverting the CR.

  Normal   SuccessfulCreate  48m                   statefulset-controller  create Pod test-cluster-0 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-1 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster successful
  Normal   SuccessfulDelete  2m15s                 statefulset-controller  delete Pod test-cluster-2 in StatefulSet test-cluster successful
  Warning  FailedCreate      93s (x2 over 2m14s)   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
  Warning  FailedCreate      52s (x13 over 2m14s)  statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
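
The events above quote the regex used to validate the container name. As a minimal sketch of how such an entry could be checked before it is applied, the snippet below reuses that regex and mirrors three of the messages from the event. `is_rfc1123_label` and `check_ephemeral_container` are hypothetical helpers written for illustration, not part of Kubernetes or the operator, and the 63-character limit is the usual DNS label length rather than something shown in the event.

```python
import re
from typing import List

# Regex quoted in the FailedCreate event above.
RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_rfc1123_label(name: str) -> bool:
    """True if name matches the quoted regex and is at most 63 characters long."""
    return len(name) <= 63 and RFC1123_LABEL.fullmatch(name) is not None


def check_ephemeral_container(container: dict) -> List[str]:
    """Hypothetical helper: mirror some of the messages seen in the event."""
    problems = []
    if not is_rfc1123_label(container.get("name", "")):
        problems.append("name: must be a lowercase RFC 1123 label")
    if not container.get("image"):
        problems.append("image: Required value")
    if container.get("resources"):
        problems.append("resources: cannot be set for an Ephemeral Container")
    return problems


# The ephemeralContainers entry from the second manifest in the reproduction steps.
entry = {"name": "ACTOCONTAINER", "resources": {"limits": {"cpu": "800m"}}}
for problem in check_ephemeral_container(entry):
    print(problem)
```

Even an entry that passes these checks would still be rejected here, since the event also reports "spec.ephemeralContainers: Forbidden: cannot be set on create". The check only illustrates the individual field-level messages.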

Operator Information

Operator Version - 0.7.4
MongoDB Image used - 4.4.0

Kubernetes Cluster Information

$ kubectl version --short --output=yaml
clientVersion:
  buildDate: "2022-05-24T12:26:19Z"
  compiler: gc
  gitCommit: 3ddd0f45aa91e2f30c70734b175631bec5b5825a
  gitTreeState: clean
  gitVersion: v1.24.1
  goVersion: go1.18.2
  major: "1"
  minor: "24"
  platform: linux/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2021-05-21T23:01:33Z"
  compiler: gc
  gitCommit: 5e58841cce77d4bc13713ad2b91fa0d961e69192
  gitTreeState: clean
  gitVersion: v1.21.1
  goVersion: go1.16.4
  major: "1"
  minor: "21"
  platform: linux/amd64

github-actions · 2023-06-11T02:13:02Z

This issue is being marked stale because it has been open for 60 days with no activity. Please comment if this issue is still affecting you. If there is no change, this issue will be closed in 30 days.

github-actions · 2023-07-12T02:09:00Z

This issue was closed because it became stale and did not receive further updates. If the issue is still affecting you, please re-open it, or file a fresh Issue with updated information.

tylergu mentioned this issue Mar 22, 2023

[BUG] mongodb-kubernetes-operator: operator stuck and unable to recover the mongodb if the podTemplate is rejected by the statefulSet controller xlab-uiuc/acto#199

Closed

irajdeep added the triaged label Apr 11, 2023

github-actions bot added the stale label Jun 11, 2023

github-actions bot closed this as completed Jul 12, 2023
