Unsatisfiable Affinity rule causes the entire redis cluster to be down, and redis-operator fails to recover even when the CR is reverted to a correct Affinity rule
#552 · Closed · hoyhbx opened this issue on Jan 10, 2023 · 5 comments
Expected behaviour
redis-operator should avoid causing the entire redis cluster to be down when an unsatisfiable Affinity rule is specified.
After the users realize that the cluster is down and revert the Affinity rule back to a satisfiable one, redis-operator should successfully recover the cluster.
Actual behaviour
What is happening? Are all the pieces created? Can you access the service?
The entire redis cluster is down when an unsatisfiable Affinity rule is specified, and even after the rule is reverted, redis-operator is not able to recover the redis cluster.
Steps to reproduce the behaviour
Describe step by step what you have done to get to this point
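The reproduction steps were not filled in here, but based on the description, the trigger is any required node affinity that no node in the cluster can satisfy. The following Go snippet is only an illustrative sketch (the label key and value are made up) of the kind of rule that would end up in the affinity field of the RedisFailover spec:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildUnsatisfiableAffinity returns a required node affinity that no node can
// satisfy, because it selects on a label that does not exist in the cluster.
// Pods created with this affinity stay Pending forever.
func buildUnsatisfiableAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{
					{
						MatchExpressions: []corev1.NodeSelectorRequirement{
							{
								// Made-up label: no node carries it, so scheduling can never succeed.
								Key:      "example.com/nonexistent-label",
								Operator: corev1.NodeSelectorOpIn,
								Values:   []string{"does-not-exist"},
							},
						},
					},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", buildUnsatisfiableAffinity())
}
```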
Kubernetes configuration used (eg: Is RBAC active?)
Logs
Please add the debugging logs. To gather them, add the -debug flag when running the operator. Attached: operator.log
From the log, we found that redis-operator detects that there are zero masters in the cluster, so it tries to assign the oldest pod as the master before updating the pods. However, since all pods are down, redis-operator can never establish a connection to that pod, so the master assignment always fails. And because the master assignment fails, redis-operator does not roll out the updates to the pods, so the redis pods never become ready. This creates a circular dependency that prevents redis-operator from recovering the cluster.
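To make the circular dependency easier to follow, here is a simplified, self-contained sketch of the control flow described above. The names and structure are invented for illustration and do not match the operator's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a minimal stand-in for a redis pod as the operator sees it.
type pod struct {
	name  string
	ready bool
}

var errNoConnection = errors.New("cannot connect to redis pod")

// setOldestAsMaster stands in for the operator promoting the oldest pod to
// master; it needs a live connection, so it fails while every pod is down.
func setOldestAsMaster(pods []pod) error {
	for _, p := range pods {
		if p.ready {
			return nil
		}
	}
	return errNoConnection
}

// reconcile mirrors the deadlock: the pod update that would bring the pods
// back is only rolled out after a master has been assigned, but a master can
// only be assigned once at least one pod is reachable.
func reconcile(pods []pod) error {
	if err := setOldestAsMaster(pods); err != nil {
		return fmt.Errorf("master assignment failed, skipping pod update: %w", err)
	}
	// The rollout of the corrected pod spec would happen here, but with all
	// pods down it is never reached.
	return nil
}

func main() {
	pods := []pod{{name: "rfr-redis-0"}, {name: "rfr-redis-1"}, {name: "rfr-redis-2"}}
	// Every reconcile loop ends with the same error, so the cluster never recovers.
	fmt.Println(reconcile(pods))
}
```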
@hoyhbx I tried to simulate the case... but I always see only one pod taken down, and it is in Pending state because of the affinity; the operator has not really proceeded to do anything, which is what is expected. Am I missing something here?
@samof76 , I tried to reproduce it with the latest version of the operator, and it seems that the latest version updates the pods one by one. This behavior prevents the entire cluster from going down.
However, I still observe that after changing the affinity rule back to the correct one, the operator still cannot recover; it stays stuck with one pod Pending.
Hi @samof76 , I can still encounter this issue. The only difference between the newest version and the old version is that the Redis cluster does not become entirely unavailable; instead, only one replica becomes unavailable.
The main issue is that even when I want to manually recover the cluster by changing the CR, the operator prevents the recovery.