operator stuck and unable to recover the mongodb if the podTemplate is rejected by the statefulSet controller #1245

Closed
hoyhbx opened this issue Mar 22, 2023 · 2 comments

Comments

@hoyhbx

hoyhbx commented Mar 22, 2023

What did you do to encounter the bug?
We first created a MongoDB cluster using a very standard CR.
Then we tried to add ephemeralContainers to the MongoDB Pods by specifying them in the spec.statefulSet.spec.template.spec.ephemeralContainers field.
The MongoDB cluster became unhealthy because the ephemeralContainers spec was invalid, so we tried to recover by deleting spec.statefulSet.spec.template.spec.ephemeralContainers. However, the operator is not able to recover the cluster after we manually revert the CR. It always waits for all members to reach the desired state before proceeding to update the ephemeralContainers. But the statefulSet is never going to become ready, because the spec is rejected by the statefulSet controller. The result is an infinite loop.
To fix this problem, we had to delete the cluster and redeploy it.

Note that the bug can be triggered by any invalid input that is rejected by the statefulSet controller, not only ephemeralContainers.
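
To make the loop described above concrete, here is a minimal Python simulation. It is only a sketch of the behaviour as I understand it from the outside, not the operator's actual code: `Cluster`, `statefulset_sync`, `reconcile_gated`, and `reconcile_ungated` are all hypothetical names, and the "wait for every member before updating" rule is an assumption inferred from what we observed.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model: the template stored in the StatefulSet plus a ready-pod count."""
    applied_template_valid: bool
    ready: int
    replicas: int = 3


def statefulset_sync(cluster: Cluster) -> None:
    # Missing pods can only be recreated when the stored template is valid.
    if cluster.applied_template_valid:
        cluster.ready = cluster.replicas


def reconcile_gated(cluster: Cluster, desired_valid: bool) -> str:
    # Assumed behaviour: do not touch the template until all members are ready.
    if cluster.ready < cluster.replicas:
        return "waiting"
    cluster.applied_template_valid = desired_valid
    return "updated"


def reconcile_ungated(cluster: Cluster, desired_valid: bool) -> str:
    # Alternative: always push the desired template, even when members are not ready.
    cluster.applied_template_valid = desired_valid
    return "updated"


def recovers(reconcile, rounds: int = 5) -> bool:
    # Start state: the invalid template is applied and one pod cannot be recreated.
    cluster = Cluster(applied_template_valid=False, ready=2)
    for _ in range(rounds):
        reconcile(cluster, True)  # the CR was reverted, so the desired template is valid
        statefulset_sync(cluster)
    return cluster.ready == cluster.replicas


if __name__ == "__main__":
    print("gated reconcile recovers:", recovers(reconcile_gated))
    print("ungated reconcile recovers:", recovers(reconcile_ungated))
```

In this model the gated variant never recovers, because the readiness check and the invalid template block each other, while the ungated variant recovers on the first round.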

Steps to reproduce the behavior:

  1. Deploy the MongoDB cluster with spec:
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
  2. Add an ephemeralContainer to the statefulSet template by applying:
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          ephemeralContainers:
          - name: ACTOCONTAINER
            resources:
              limits:
                cpu: 800m
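
A check like the following, run before applying the CR, could flag this kind of template. It is a rough Python sketch under stated assumptions: `template_problems` is a hypothetical helper, the CR is passed as an already-parsed dict, and the three rules are taken only from the "Forbidden" and "Required value" messages in the events further down, so it is nowhere near full Pod validation.

```python
from __future__ import annotations


def template_problems(cr: dict) -> list[str]:
    """Return a list of problems found in spec.statefulSet.spec.template.spec."""
    pod_spec = (
        cr.get("spec", {})
        .get("statefulSet", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    ephemeral = pod_spec.get("ephemeralContainers") or []
    problems = []
    if ephemeral:
        # Ephemeral containers cannot be part of a Pod at creation time.
        problems.append("spec.ephemeralContainers: cannot be set on create")
    for index, container in enumerate(ephemeral):
        if "resources" in container:
            problems.append(
                f"spec.ephemeralContainers[{index}].resources: "
                "cannot be set for an Ephemeral Container"
            )
        if not container.get("image"):
            problems.append(f"spec.ephemeralContainers[{index}].image: required value")
    return problems


# The relevant part of the CR from step 2, written as a dict.
bad_cr = {
    "spec": {
        "statefulSet": {
            "spec": {
                "template": {
                    "spec": {
                        "ephemeralContainers": [
                            {
                                "name": "ACTOCONTAINER",
                                "resources": {"limits": {"cpu": "800m"}},
                            }
                        ]
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    for problem in template_problems(bad_cr):
        print(problem)
```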

What did you expect?
The operator should be able to recover the cluster after the manual revert.

What happened instead?

The operator is stuck and cannot make any progress even after manually reverting the CR.

Normal   SuccessfulCreate  48m                   statefulset-controller  create Pod test-cluster-0 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-1 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster successful
  Normal   SuccessfulDelete  2m15s                 statefulset-controller  delete Pod test-cluster-2 in StatefulSet test-cluster successful
  Warning  FailedCreate      93s (x2 over 2m14s)   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
  Warning  FailedCreate      52s (x13 over 2m14s)  statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
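
For reference, the name check quoted in these events can be reproduced with a few lines of Python. This is a small sketch: the regex is copied from the event message, the 63-character limit is the usual RFC 1123 label length, and `is_rfc1123_label` is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the FailedCreate event message.
RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_rfc1123_label(name: str) -> bool:
    """True if name is a lowercase RFC 1123 label of at most 63 characters."""
    return len(name) <= 63 and RFC1123_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("ACTOCONTAINER", "ACTOKEY", "my-name", "123-abc"):
        print(candidate, is_rfc1123_label(candidate))
```

Upper-case names such as the ones in this report fail the check, while the examples from the message ('my-name', '123-abc') pass.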

Operator Information

  • Operator Version - 0.7.4
  • MongoDB Image used - 4.4.0

Kubernetes Cluster Information

$ kubectl version --short --output=yaml
clientVersion:
  buildDate: "2022-05-24T12:26:19Z"
  compiler: gc
  gitCommit: 3ddd0f45aa91e2f30c70734b175631bec5b5825a
  gitTreeState: clean
  gitVersion: v1.24.1
  goVersion: go1.18.2
  major: "1"
  minor: "24"
  platform: linux/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2021-05-21T23:01:33Z"
  compiler: gc
  gitCommit: 5e58841cce77d4bc13713ad2b91fa0d961e69192
  gitTreeState: clean
  gitVersion: v1.21.1
  goVersion: go1.16.4
  major: "1"
  minor: "21"
  platform: linux/amd64
@github-actions

This issue is being marked stale because it has been open for 60 days with no activity. Please comment if this issue is still affecting you. If there is no change, this issue will be closed in 30 days.

@github-actions github-actions bot added the stale label Jun 11, 2023
@github-actions

This issue was closed because it became stale and did not receive further updates. If the issue is still affecting you, please re-open it, or file a fresh Issue with updated information.
