
Unsatisfiable Affinity rule causes the entire redis cluster to be down, and redis-operator fails to recover even when the CR is reverted to correct Affinity rule #552

Closed
hoyhbx opened this issue Jan 10, 2023 · 5 comments

hoyhbx commented Jan 10, 2023

Expected behaviour

redis-operator should avoid causing the entire redis cluster to be down when an unsatisfiable Affinity rule is specified.

After the users realize that the cluster is down and revert the Affinity rule back to a satisfiable one, redis-operator should successfully recover the cluster.

Actual behaviour

The entire redis cluster is down when an unsatisfiable Affinity rule is specified, and even when reverting back, redis-operator is not able to recover the redis cluster.

Steps to reproduce the behaviour


1. Deploy a redis cluster with the example CR:

```yaml
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: test-cluster
spec:
  redis:
    customConfig:
    - maxclients 100
    - hz 50
    - timeout 60
    - tcp-keepalive 60
    - client-output-buffer-limit normal 0 0 0
    - client-output-buffer-limit slave 1000000000 1000000000 0
    - client-output-buffer-limit pubsub 33554432 8388608 60
    exporter:
      enabled: true
    hostNetwork: false
    imagePullPolicy: IfNotPresent
    replicas: 3
```

2. Change the Affinity rule of redis to something unsatisfiable in the cluster at the moment:

```yaml
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: test-cluster
spec:
  redis:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - kind-worker
    customConfig:
    - maxclients 100
    - hz 50
    - timeout 60
    - tcp-keepalive 60
    - client-output-buffer-limit normal 0 0 0
    - client-output-buffer-limit slave 1000000000 1000000000 0
    - client-output-buffer-limit pubsub 33554432 8388608 60
    exporter:
      enabled: true
    hostNetwork: false
    imagePullPolicy: IfNotPresent
    replicas: 3
```

3. Observe that all three replicas of redis are down.
4. Revert the CR back to the one in step 1.
5. Observe that the redis cluster is still down.
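For context, here is a minimal sketch (in Python, with simplified data structures; this is not the Kubernetes scheduler's real code) of how a `requiredDuringSchedulingIgnoredDuringExecution` node-affinity term is evaluated. One way the rule above becomes unsatisfiable is when no node in the cluster carries the requested `kubernetes.io/hostname` label value, in which case every redis pod stays Pending:

```python
# Simplified model of required nodeAffinity evaluation: a pod is
# schedulable only if at least one node's labels satisfy at least one
# nodeSelectorTerm (only the "In" operator is modeled here).

def term_matches(node_labels: dict, match_expressions: list) -> bool:
    """True if the node satisfies every matchExpression in one term."""
    for expr in match_expressions:
        if expr["operator"] == "In":
            if node_labels.get(expr["key"]) not in expr["values"]:
                return False
    return True

def pod_is_schedulable(nodes: list, terms: list) -> bool:
    """True if any node matches any nodeSelectorTerm."""
    return any(term_matches(labels, t["matchExpressions"])
               for labels in nodes for t in terms)

# The affinity from step 2 of the reproduction:
terms = [{"matchExpressions": [{"key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": ["kind-worker"]}]}]

# A cluster whose nodes have different hostnames: no node can ever
# match, so all three redis replicas stay Pending.
nodes = [{"kubernetes.io/hostname": "node-1"},
         {"kubernetes.io/hostname": "node-2"}]
print(pod_is_schedulable(nodes, terms))  # False
```

If a node labeled `kind-worker` existed, the same rule would be satisfiable, which is why the CR looks harmless on a cluster that happens to have such a node.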

Environment


  • Redis Operator version: v1.1.0
  • Kubernetes version: v1.29

Logs

operator.log

From the log, we found that redis-operator realized there were no masters in the cluster, so it tries to assign the oldest pod as the master before updating the pods. However, since all pods are down, redis-operator always fails to establish a connection to that pod, so the master assignment always fails. And because redis-operator fails to assign a master, it never rolls out the updates to the pods, so the redis pods never become ready. This creates a circular dependency that prevents redis-operator from recovering the cluster.
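The circular dependency described above can be sketched as a toy reconcile loop (illustrative names and states only, not redis-operator's real API: `try_set_master`, `reconcile`, and the pod-state strings are assumptions made for this sketch):

```python
# Toy model of the deadlock: promoting a master requires a live
# connection to a pod, but every pod is down; rolling out the fixed
# pod spec is gated on a master existing, so no pod ever comes back.

def try_set_master(pod_state: str) -> bool:
    # Master assignment needs a reachable Redis instance on the pod.
    return pod_state == "ready"

def reconcile(pods: dict, has_master: bool) -> tuple:
    """One reconcile pass over pod name -> 'ready'/'down'."""
    if not has_master:
        oldest = sorted(pods)[0]
        if not try_set_master(pods[oldest]):
            return pods, False  # cannot promote a master: bail out
        has_master = True
    # Only reached once a master exists: roll out the (reverted) pod
    # spec, which would bring the pods back to 'ready'.
    return {name: "ready" for name in pods}, True

pods = {"rfr-test-cluster-0": "down",
        "rfr-test-cluster-1": "down",
        "rfr-test-cluster-2": "down"}
has_master = False
for _ in range(10):  # the loop never makes progress
    pods, has_master = reconcile(pods, has_master)
print(pods)  # all pods still 'down', no master assigned
```

Breaking the cycle would require the operator to roll out the pod update (or recreate a pod) even when no master can be contacted.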


samof76 commented Jan 20, 2023

@hoyhbx I tried to simulate the case... but I always see only one pod taken down, and that pod is in a Pending state because of the affinity; the operator has not really proceeded to do anything, which is what I expect. Am I missing something here?


hoyhbx commented Jan 30, 2023

@samof76, I tried to reproduce it with the latest version of the operator, and it seems that the latest version updates the pods one by one. This behavior prevents the entire cluster from going down.

However, I still observe that after changing the affinity rule back to a correct one, the operator still cannot recover; it is stuck with one pod Pending.

@github-actions

This issue is stale because it has been open for 45 days with no activity.

github-actions bot added the stale label on Mar 17, 2023

github-actions bot commented Apr 1, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as not planned on Apr 1, 2023
@hoyhbx

hoyhbx commented Apr 6, 2023

Hi @samof76, I can still encounter this issue. The only difference between the newest version and the old one is that the Redis cluster does not become entirely unavailable; instead, just one replica is unavailable.
The main issue is that even when I want to manually recover the cluster by changing the CR, the operator prevents the recovery.
