[BUG] mongodb-kubernetes-operator: operator stuck and unable to recover the mongodb if the podTemplate is rejected by the statefulSet controller #199

Closed
taham0 opened this issue Feb 6, 2023 · 1 comment
Labels
bug Something isn't working

taham0 commented Feb 6, 2023

What did you do to encounter the bug?
We first created a MongoDB cluster using a standard CR.
We then tried to add ephemeral containers to the MongoDB Pods by setting the spec.statefulSet.spec.template.spec.ephemeralContainers field.
The MongoDB cluster became unhealthy because the ephemeralContainers spec was invalid, so we tried to recover by deleting spec.statefulSet.spec.template.spec.ephemeralContainers from the CR. However, the operator cannot recover the cluster after this manual revert. It always waits for all members to reach the desired state before it proceeds to update the ephemeralContainers field, but the StatefulSet never becomes ready because the Pod template is rejected when the StatefulSet controller tries to create Pods from it. The operator therefore waits forever.
To work around the problem, we had to delete the cluster and redeploy it.
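
The stuck state described above can be sketched with a toy model. This is not the operator's code: the `StatefulSet` class, `controller_step`, and `reconcile` are hypothetical names, and the gating rule in `reconcile` is only our reading of the observed behavior (no update is pushed until every member is ready).

```python
from dataclasses import dataclass


@dataclass
class StatefulSet:
    """Toy stand-in for the StatefulSet; all fields are hypothetical."""
    template_valid: bool  # would Pods created from the current template be accepted?
    ready_replicas: int
    replicas: int = 3


def controller_step(sts: StatefulSet) -> None:
    """Toy StatefulSet controller: recreate one missing Pod if the template is accepted."""
    if sts.ready_replicas < sts.replicas and sts.template_valid:
        sts.ready_replicas += 1


def reconcile(sts: StatefulSet, desired_template_valid: bool) -> str:
    """Toy operator reconcile that gates every update on full readiness (assumed)."""
    if sts.ready_replicas < sts.replicas:
        return "waiting"  # the reverted CR is never pushed to the StatefulSet
    sts.template_valid = desired_template_valid
    return "updated"


# State after the bad CR was applied: test-cluster-2 was deleted and cannot be recreated.
sts = StatefulSet(template_valid=False, ready_replicas=2)

# The user reverts the CR, so the desired template is valid again.
history = []
for _ in range(5):
    history.append(reconcile(sts, desired_template_valid=True))
    controller_step(sts)

print(history)
print(sts)
```

In this model every reconcile returns "waiting" and the StatefulSet keeps the invalid template, because each side is waiting on the other: the operator on readiness, the controller on a valid template.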

Steps to reproduce the behavior:

  1. Deploy the MongoDB cluster with spec:
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
  2. Add an ephemeral container to the StatefulSet template by applying:
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  namespace: mongodb
  name: test-cluster
spec:
  automationConfig:
      processes:
      - disabled: false
        name: test-cluster-1
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          - name: mongodb-agent
            resources:
              limits:
                cpu: '1'
                memory: 1000M
              requests:
                cpu: '1'
                memory: 1000M
          ephemeralContainers:
          - name: ACTOCONTAINER
            resources:
              limits:
                cpu: 800m

What did you expect?
The operator should be able to recover the cluster after the manual revert.

What happened instead?

The operator is stuck and cannot make any progress even after manually reverting the CR.

  Normal   SuccessfulCreate  48m                   statefulset-controller  create Pod test-cluster-0 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-1 in StatefulSet test-cluster successful
  Normal   SuccessfulCreate  46m                   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster successful
  Normal   SuccessfulDelete  2m15s                 statefulset-controller  delete Pod test-cluster-2 in StatefulSet test-cluster successful
  Warning  FailedCreate      93s (x2 over 2m14s)   statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
  Warning  FailedCreate      52s (x13 over 2m14s)  statefulset-controller  create Pod test-cluster-2 in StatefulSet test-cluster failed error: Pod "test-cluster-2" is invalid: [spec.ephemeralContainers[0][0].name: Invalid value: "ACTOKEY": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.ephemeralContainers[0][0].image: Required value, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ACTOKEY]: Invalid value: "ACTOKEY": must be a standard resource for containers, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource type or fully qualified, spec.ephemeralContainers[0][0].resources.requests[ilvwddmkyk]: Invalid value: "ilvwddmkyk": must be a standard resource for containers, spec.ephemeralContainers[0].resources: Forbidden: cannot be set for an Ephemeral Container, spec.ephemeralContainers: Forbidden: cannot be set on create]
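
For reference, the name check reported in these events can be reproduced locally. The sketch below is a minimal illustration, not Kubernetes code: the regex is copied from the event message, the 63-character limit is the standard DNS label length limit (it does not appear in the event), and `is_valid_container_name` is a hypothetical helper name.

```python
import re

# Pattern copied from the FailedCreate event; fullmatch anchors it to the whole name.
RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_container_name(name: str) -> bool:
    """Return True if name is a lowercase RFC 1123 label of at most 63 characters."""
    return len(name) <= 63 and RFC1123_LABEL.fullmatch(name) is not None


for candidate in ["ACTOKEY", "ACTOCONTAINER", "my-name", "123-abc", "-debug"]:
    print(candidate, is_valid_container_name(candidate))
```

This covers only the name error. According to the same events, the Pod would still be rejected with a valid name, because the image is missing, resources cannot be set for an ephemeral container, and spec.ephemeralContainers cannot be set on create.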

Operator Information

  • Operator Version - 0.7.4
  • MongoDB Image used - 4.4.0

Kubernetes Cluster Information

$ kubectl version --short --output=yaml
clientVersion:
  buildDate: "2022-05-24T12:26:19Z"
  compiler: gc
  gitCommit: 3ddd0f45aa91e2f30c70734b175631bec5b5825a
  gitTreeState: clean
  gitVersion: v1.24.1
  goVersion: go1.18.2
  major: "1"
  minor: "24"
  platform: linux/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2021-05-21T23:01:33Z"
  compiler: gc
  gitCommit: 5e58841cce77d4bc13713ad2b91fa0d961e69192
  gitTreeState: clean
  gitVersion: v1.21.1
  goVersion: go1.16.4
  major: "1"
  minor: "21"
  platform: linux/amd64
taham0 added the bug (Something isn't working) label Feb 6, 2023
tylergu changed the title from "[BUG] mongodb-kubernetes-operator: pod deleted and system unable to recover when invalid ephemeral container is specified" to "[BUG] mongodb-kubernetes-operator: operator stuck and unable to recover mongodb if the podTemplate is rejected by the statefulSet controller" Mar 22, 2023
tylergu changed the title from "[BUG] mongodb-kubernetes-operator: operator stuck and unable to recover mongodb if the podTemplate is rejected by the statefulSet controller" to "[BUG] mongodb-kubernetes-operator: operator stuck and unable to recover the mongodb if the podTemplate is rejected by the statefulSet controller" Mar 22, 2023

tylergu commented Mar 22, 2023

tylergu closed this as completed Jul 4, 2023