Unsatisfiable Affinity rule causes the entire redis cluster to be down, and redis-operator fails to recover even when the CR is reverted to a correct Affinity rule
#552 · Closed · hoyhbx opened this issue on Jan 10, 2023 · 5 comments
Expected behaviour
redis-operator should avoid causing the entire redis cluster to be down when an unsatisfiable Affinity rule is specified.
After the users realize that the cluster is down and revert the Affinity rule back to a satisfiable one, redis-operator should successfully recover the cluster.
Actual behaviour
What is happening? Are all the pieces created? Can you access the service?
The entire redis cluster is down when an unsatisfiable Affinity rule is specified, and even after the rule is reverted, redis-operator is not able to recover the redis cluster.
Steps to reproduce the behaviour
Describe step by step what you have done to get to this point
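The reproduction steps were not filled in here, but based on the description, the trigger is any required node affinity that no node in the cluster can satisfy. The following Go snippet is only an illustrative sketch (the label key and value are made up) of the kind of rule that would end up in the affinity field of the RedisFailover spec:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildUnsatisfiableAffinity returns a required node affinity that no node can
// satisfy, because it selects on a label that does not exist in the cluster.
// Pods created with this affinity stay Pending forever.
func buildUnsatisfiableAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{
					{
						MatchExpressions: []corev1.NodeSelectorRequirement{
							{
								// Made-up label: no node carries it, so scheduling can never succeed.
								Key:      "example.com/nonexistent-label",
								Operator: corev1.NodeSelectorOpIn,
								Values:   []string{"does-not-exist"},
							},
						},
					},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", buildUnsatisfiableAffinity())
}
```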
Kubernetes configuration used (eg: Is RBAC active?)
Logs
Please add the debugging logs. To gather them, add the -debug flag when running the operator. Attached: operator.log
From the log, we found that redis-operator detects that there are zero masters in the cluster, so it tries to assign the oldest pod as the master before updating the pods. However, since all pods are down, redis-operator can never establish a connection to that pod, so the master assignment always fails. And because the master assignment fails, redis-operator does not roll out the updates to the pods, so the redis pods never become ready. This creates a circular dependency that prevents redis-operator from recovering the cluster.
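To make the circular dependency easier to follow, here is a simplified, self-contained sketch of the control flow described above. The names and structure are invented for illustration and do not match the operator's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a minimal stand-in for a redis pod as the operator sees it.
type pod struct {
	name  string
	ready bool
}

var errNoConnection = errors.New("cannot connect to redis pod")

// setOldestAsMaster stands in for the operator promoting the oldest pod to
// master; it needs a live connection, so it fails while every pod is down.
func setOldestAsMaster(pods []pod) error {
	for _, p := range pods {
		if p.ready {
			return nil
		}
	}
	return errNoConnection
}

// reconcile mirrors the deadlock: the pod update that would bring the pods
// back is only rolled out after a master has been assigned, but a master can
// only be assigned once at least one pod is reachable.
func reconcile(pods []pod) error {
	if err := setOldestAsMaster(pods); err != nil {
		return fmt.Errorf("master assignment failed, skipping pod update: %w", err)
	}
	// The rollout of the corrected pod spec would happen here, but with all
	// pods down it is never reached.
	return nil
}

func main() {
	pods := []pod{{name: "rfr-redis-0"}, {name: "rfr-redis-1"}, {name: "rfr-redis-2"}}
	// Every reconcile loop ends with the same error, so the cluster never recovers.
	fmt.Println(reconcile(pods))
}
```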
@hoyhbx I tried to simulate the case... but I always see only one pod taken down, and it is in Pending state because of the affinity; the operator has not really proceeded to do anything, which is what is expected. Am I missing something here?
@samof76 , I tried to reproduce it with the latest version of the operator, and it seems that the latest version updates the pods one by one. This behavior prevents the entire cluster from going down.
However, I still observe that after changing the affinity rule back to the correct one, the operator still cannot recover; it stays stuck with one pod Pending.
Hi @samof76 , I can still encounter this issue. The only difference between the newest version and the old version is that the Redis cluster does not become entirely unavailable; instead, only one replica becomes unavailable.
The main issue is that even when I want to manually recover the cluster by changing the CR, the operator prevents the recovery.