
[bitnami/redis] After failover, there will be two masters (with simple reproducible experiments included!) #5418

Closed
fzyzcjy opened this issue Feb 8, 2021 · 16 comments

@fzyzcjy

fzyzcjy commented Feb 8, 2021

Which chart:
redis

Describe the bug

Hi, thanks for the chart! I did find a bug: after failover, there will be two masters.

Steps:

  1. Install the helm chart. We will now see a master pod (pod A) and a slave pod (pod B).
  2. Kill (i.e. delete) the master pod A. Kubernetes will automatically recreate a new master pod (call it pod A2).
  3. Verify that pod A2 has a different IP than pod A. (This should be the usual case, since Kubernetes never guarantees static IPs.)
  4. Now both A2 and B think they are masters! (Verify by, for example, running INFO replication and checking the roles; or running SET k v on both instances and observing that they do not replicate; or looking at SENTINEL REPLICAS mymaster and seeing that the sentinels do not even know that pod A2 exists.)
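
A rough way to reproduce and verify the above from the command line (a sketch, not taken from the chart docs: the pod names redis-node-0/redis-node-1, the container names redis/sentinel, and the master set name mymaster are assumptions for a default sentinel-enabled install with usePassword disabled):

kubectl get pods -o wide                     # note the current master pod and its IP
kubectl delete pod redis-node-0              # kill the current master
kubectl get pods -o wide                     # the recreated pod should come back with a new IP

# Ask each Redis instance what role it thinks it has:
kubectl exec redis-node-0 -c redis -- redis-cli INFO replication | grep role
kubectl exec redis-node-1 -c redis -- redis-cli INFO replication | grep role

# Ask a sentinel which replicas it knows about; if the issue reproduces,
# the recreated pod's new IP will not be listed:
kubectl exec redis-node-1 -c sentinel -- redis-cli -p 26379 SENTINEL replicas mymaster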

My guess at why this happens:

When pod A2 is created, it is configured as a master by the bitnami startup scripts / pod spec. Moreover, it has a brand-new IP. To the rest of the deployment, pod A2 therefore looks like a separate, unrelated, brand-new Redis instance with no relationship to the original pod A, pod B, or the sentinels. So it is never recognized, and of course it does not become a slave of pod B.

Thanks

@migruiz4
Member

migruiz4 commented Feb 8, 2021

Hi @fzyzcjy,

When using Redis Sentinel, it is recommended to use at least 3 replicas to prevent split-brain (the behavior described above).
This can be done by setting cluster.slaveCount: 3 in your values.yaml.

If you'd like to find more information about this, please check Redis Sentinel documentation: https://redis.io/topics/sentinel

I think the main problem here is that a 2-node Redis Sentinel cluster is the default (because of how the cluster.slaveCount setting is used), so I will also create an internal task to fix it.

Thank you for reporting this. If you have any further questions, let me know.

@ricoberger
Contributor

Hi @migruiz4, we are seeing a similar problem with version 12.7.3 of the Helm chart and the following values:

image:
  registry: docker.io
  repository: bitnami/redis
  tag: 6.0.9-debian-10-r66
  pullPolicy: IfNotPresent

fullnameOverride: redis

cluster:
  enabled: true
  slaveCount: 3

sentinel:
  enabled: true
  usePassword: false
  masterSet: mymaster
  initialCheckTimeout: 5
  quorum: 2
  downAfterMilliseconds: 5000
  failoverTimeout: 18000
  parallelSyncs: 1

usePassword: false

tls:
  enabled: false

configmap: |-
  # Disable AOF https://redis.io/topics/persistence#append-only-file
  appendonly no

podDisruptionBudget:
  enabled: true
  minAvailable: 0
  maxUnavailable: 1

We have the following Pods running (master is 172.25.0.220):

NAME           READY   STATUS    RESTARTS   AGE     IP             NODE                              NOMINATED NODE   READINESS GATES
redis-node-0   3/3     Running   1          8m33s   172.25.9.114   worker-main-dev-dd554d99c-2f4sr   <none>           <none>
redis-node-1   3/3     Running   3          11m     172.25.0.220   worker-main-dev-dd554d99c-nbj6q   <none>           <none>
redis-node-2   3/3     Running   0          12m     172.25.5.187   worker-main-dev-dd554d99c-hwnf9   <none>           <none>

Now we run k rollout restart statefulset redis-node, which restarts all Pods, starting with redis-node-2.

When redis-node-2 (IP: 172.25.13.188) becomes ready, Redis reports two different masters:

  • redis-node-0 reports 172.25.0.220 as master
  • redis-node-1 is restarted multiple times
  • redis-node-2 reports 172.25.13.188 as master

After a few minutes redis-node-1 becomes ready and reports 172.25.13.188 as master; then redis-node-0 is restarted and reports 172.25.13.188 as master once it is ready.

@migruiz4
Member

migruiz4 commented Feb 8, 2021

Hi @ricoberger,

I'd say that is expected behavior when the master node fails, as Redis Sentinel needs at least 3 nodes to decide who the master is.

When the master node fails, both node 0 and node 1 decide they are the master (split-brain), and they won't agree on which of them is the master until a third node comes up and breaks the tie.

This is because Redis Sentinel uses a Quorum-based master election protocol.

As described in Redis Sentinel documentation:

  • The quorum is the number of Sentinels that need to agree about the fact the master is not reachable, in order to really mark the master as failing, and eventually start a failover procedure if possible.
  • However the quorum is only used to detect the failure. In order to actually perform a failover, one of the Sentinels needs to be elected leader for the failover and be authorized to proceed. This only happens with the vote of the majority of the Sentinel processes.

In this type of scenario, the quorum must always be (number of nodes/2 + 1), as you don't want your cluster to split into smaller independent clusters. Examples:

  • With 3 nodes and quorum=1, each node could decide it is the master.
  • With 6 nodes and quorum=2 or 3, there is a risk of splitting into two clusters of 3 or 2 nodes, respectively.

If you are looking for a Redis Sentinel cluster that is able to recover quickly from a master failure, I'd say you will need a 4-node cluster with quorum=3, or a bigger one, always using quorum=(number of nodes/2 + 1). Otherwise, you will have to wait for the third node to recover before the cluster is active again, which is the behavior you described.
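
For reference, a minimal sketch of installing with that layout, using only the values keys already shown in this thread (cluster.slaveCount, sentinel.quorum); the release and chart names here are just placeholders:

helm install my-redis bitnami/redis \
  --set cluster.enabled=true \
  --set cluster.slaveCount=4 \
  --set sentinel.enabled=true \
  --set sentinel.quorum=3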

@ricoberger
Contributor

Hi @migruiz4, thanks for your quick reply.

I don't think a setup of 3 Redis instances and 3 Sentinels should be a problem; we have been running the same setup on VMs for years. I also adjusted the values file to run a 4-node cluster with a quorum of 3, with the same result:

NAME           READY   STATUS    RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
redis-node-0   3/3     Running   0          110s    172.25.8.56     worker-main-dev-dd554d99c-46j9b   <none>           <none>
redis-node-1   3/3     Running   0          2m27s   172.25.0.173    worker-main-dev-dd554d99c-nbj6q   <none>           <none>
redis-node-2   3/3     Running   0          3m12s   172.25.13.209   worker-main-dev-dd554d99c-qwnzd   <none>           <none>
redis-node-3   3/3     Running   0          4m9s    172.25.9.160    worker-main-dev-dd554d99c-2f4sr   <none>           <none>

redis-node-0 (172.25.8.56) is the master. Then we trigger a rollout with k rollout restart statefulset redis-node, and everything looks good until redis-node-0 is restarted. It takes around 5 minutes before a new master is selected:

  • Mon Feb 8 22:03:55 CET 2021 172.25.8.56 --> redis-node-0 is restarted
  • Mon Feb 8 22:09:44 CET 2021 172.25.14.92 --> redis-node-1, redis-node-2, redis-node-3 selected a new master

While redis-node-1, redis-node-2, and redis-node-3 are still reporting redis-node-0 as master, redis-node-0 itself isn't able to start and keeps restarting until a new master is selected.

@fzyzcjy
Author

fzyzcjy commented Feb 8, 2021

@migruiz4 @ricoberger

When using Redis Sentinel, it is recommended to use at least 3 replicas to prevent split-brain (the behavior described above).

I do use 3 sentinels, I just forgot to mention it. (I used a slightly modified version of this chart, but the idea is the same.)

Thanks for the replies and experiments! IMHO, the biggest problem comes from the dynamic IP a Redis pod gets when it is restarted. Traditional Redis (and Redis Sentinel) was never designed for this case: if a Redis instance is killed and restarted, its IP changes, so judging by IP alone it looks like a brand-new Redis instance rather than an old instance that went down and came back up.

@fzyzcjy
Author

fzyzcjy commented Feb 8, 2021

There may be another problem: the persistent volume. Notice that bitnami uses emptyDir to store the config for both the Redis nodes and the Redis sentinels. However, the Sentinel documentation says:

This file will be used by the system in order to save the current state that will be reloaded in case of restarts

Thus, IMHO we should use a PersistentVolume instead of emptyDir to save the conf file.

As for whether to use emptyDir or a PersistentVolume for the main node, I am not sure. But note that emptyDir data only survives container restarts within the same pod; when the pod itself is deleted or rescheduled onto a different node, the data are lost. So this will lead to different behavior depending on how exactly the Redis pod gets restarted.
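
A quick way to see whether the sentinel state survives a pod restart is to compare the config before and after deleting a pod (a sketch; the pod and container names are assumptions, and the file path is the bitnami sentinel config path quoted later in this thread):

kubectl exec redis-node-1 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
kubectl delete pod redis-node-1
kubectl wait --for=condition=Ready pod/redis-node-1     # wait for the recreated pod
kubectl exec redis-node-1 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
# With emptyDir the myid/epoch state is typically regenerated from scratch after the pod
# is recreated; with a PersistentVolume it would be preserved across restarts.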

@migruiz4
Member

Hi @ricoberger and @fzyzcjy,

I have been trying to reproduce this issue using the latest version of the Redis chart, but I was unable to.

The pod changing its IP when deleted and the emptyDir causing problems when moving to a different K8s host are good candidates for the root cause of the problem; thank you for the suggestions.

I have created an internal task to further investigate this issue and the changes required to fix it; meanwhile, I will tag this as on-hold.

I will keep you updated.

@migruiz4 migruiz4 added the on-hold Issues or Pull Requests with this label will never be considered stale label Feb 11, 2021
@addtzg

addtzg commented Feb 11, 2021

To be honest, two Sentinels are enough to execute the failover procedure. It's even described in the official manual:
(screenshot from the official Redis Sentinel documentation)
And from my experiments this is true. Sure, it will take some more time, but a master will be chosen.

@robertb724

@fzyzcjy @migruiz4 I tested switching from emptyDir to persistentVolume, and that seems to have solved all of the issues we were having related to sentinel failover. We were using preemptible nodes in GKE, and when the master node was preempted, that node would come back up and try to establish itself as the master. That node would then go into CrashLoopBackOff, and eventually the rest of the nodes would follow.

@tomislater
Contributor

I have noticed that old IPs are not removed from sentinel.conf. I have the default sentinel setup on the newest version (3 nodes):

cat /opt/bitnami/redis-sentinel/etc/sentinel.conf:

dir "/tmp"
port 26379
sentinel monitor mymaster 10.110.37.122 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 18000

# User-supplied sentinel configuration:
# End of sentinel configuration
sentinel myid 80d3b99129c4f9bb62cdcec2399499f2c4666789

sentinel known-sentinel mymaster 10.110.59.80 26379 475427d0cefa44aff518ab6b888c3510314d882c

sentinel known-sentinel mymaster 10.110.17.137 26379 cae9f9b5ed951299f4bdc7de1da8e996e7b7197c

sentinel known-replica mymaster 10.110.51.41 6379

sentinel known-replica mymaster 10.110.59.80 6379
# Generated by CONFIG REWRITE
protected-mode no
user default on nopass ~* &* +@all
sentinel config-epoch mymaster 3
sentinel leader-epoch mymaster 3
sentinel current-epoch 3
sentinel known-replica mymaster 10.110.16.155 6379
sentinel known-replica mymaster 10.110.18.152 6379
sentinel known-replica mymaster 10.110.17.137 6379

10.110.37.122 is my current master and 10.110.59.80, 10.110.17.137 are my replicas.

And here are the problems with these IPs:

  • 10.110.51.41 - I do not know what it is
  • 10.110.16.155 - that was an old replica (I deleted one pod, just for fun)
  • 10.110.18.152 - that was also an old replica (I also deleted this pod, but not for fun; for tests)

It looks like the sentinel configuration is not updated properly when I (or my cluster, e.g. the autoscaler) delete pods.

It is even funnier! 🙈

I spawned a new Redis release (standalone, only one pod, in another namespace) and k8s gave it the 10.110.18.152 IP address, which as you can see is the same IP the old (deleted) replica had! And now sentinel "thinks" that this new Redis in another namespace is a replica :D and starts syncing with the master...

I think that we need something which is able to remove old IPs from sentinel.conf 🤔. I must study the chart, then I can help.
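
As a manual stop-gap (not a chart fix), stale state can be flushed with SENTINEL RESET, which makes a sentinel forget everything about the matching master and rediscover replicas and sentinels from scratch. A rough sketch, assuming the redis-node-* pod names used earlier in this thread, a sentinel container named sentinel, and no sentinel password:

for pod in redis-node-0 redis-node-1 redis-node-2; do
  kubectl exec "$pod" -c sentinel -- redis-cli -p 26379 SENTINEL RESET mymaster
done
# After the reset, each sentinel should only re-learn replicas and sentinels that actually respond.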

@robertb724

I have adjusted the script to use the headless service entries instead of the pod IPs, and it seems to have solved our issues.
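
For anyone curious what those headless-service entries look like: each StatefulSet pod gets a stable DNS name through the headless service, so the address survives pod IP changes. A quick sketch (the service name redis-headless and the default namespace are assumptions for this release):

kubectl exec redis-node-0 -c redis -- getent hosts redis-node-1.redis-headless.default.svc.cluster.local
kubectl exec redis-node-0 -c redis -- redis-cli -h redis-node-1.redis-headless.default.svc.cluster.local INFO replication | grep role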

@tomislater
Contributor

Hmm, here is what I found:

Before deleting the 10.110.51.41 pod:

cat /opt/bitnami/redis-sentinel/etc/sentinel.conf 
...
sentinel monitor mymaster 10.110.17.137 6379 2
...
sentinel known-sentinel mymaster 10.110.51.41 26379 80d3b99129c4f9bb62cdcec2399499f2c4666789
sentinel current-epoch 0
sentinel known-replica mymaster 10.110.37.122 6379
sentinel known-sentinel mymaster 10.110.37.122 26379 475427d0cefa44aff518ab6b888c3510314d882c
sentinel known-replica mymaster 10.110.51.41 6379

After deleting:

cat /opt/bitnami/redis-sentinel/etc/sentinel.conf 
...
sentinel monitor mymaster 10.110.17.137 6379 2
...
sentinel known-sentinel mymaster 10.110.52.28 26379 80d3b99129c4f9bb62cdcec2399499f2c4666789
sentinel current-epoch 0
sentinel known-replica mymaster 10.110.52.28 6379
sentinel known-sentinel mymaster 10.110.37.122 26379 475427d0cefa44aff518ab6b888c3510314d882c
sentinel known-replica mymaster 10.110.37.122 6379
sentinel known-replica mymaster 10.110.51.41 6379

The new pod has the IP 10.110.52.28, and it has been added as known-replica and known-sentinel.

The old pod (10.110.51.41) has been removed as a known-sentinel, but it still exists as a known-replica!

@tomislater
Contributor

@migruiz4
Member

migruiz4 commented Sep 8, 2021

Hi,

Thank you very much @tomislater for sharing your findings and submitting the PR.

Indeed the Redis sentinel DNS feature looks like a promising fix for the issue.

I will test it myself and get back to you as soon as possible.

@carrodher
Member

Unfortunately, this issue was created a long time ago and, although there is an internal task to fix it, it was not prioritized as something to address in the short/mid term. This is not for a technical reason but a matter of capacity, since we're a small team.

That being said, contributions via PRs are more than welcome in both repositories (containers and charts), just in case you would like to contribute.

Since then, there have been several releases of this asset, and it's possible the issue has been resolved as part of other changes. If that's not the case and you are still experiencing this issue, please feel free to reopen it and we will re-evaluate it.

@zkfopen

zkfopen commented Oct 20, 2022 via email

@github-actions github-actions bot added solved and removed on-hold Issues or Pull Requests with this label will never be considered stale labels Oct 20, 2022