
redis + sentinel master pod reschedule / deletion results in two masters #5543

Closed
aariacarterweir opened this issue Feb 18, 2021 · 18 comments

@aariacarterweir

Which chart:
bitnami/redis 12.7.4

Describe the bug
If the master pod is rescheduled or deleted manually, a new master is elected properly, but when the old master comes back online it elects itself as a master too.

To Reproduce
Steps to reproduce the behavior:

  1. Install the chart:
    helm install my-release bitnami/redis --set cluster.enabled=true,cluster.slaveCount=3,sentinel.enabled=true
    
  2. Delete the master pod.
  3. Observe the failover happening correctly and a new master being elected.
  4. When the deleted pod is recreated and comes back online, it thinks it is a master.
  5. Now there are two masters (see the verification sketch below).
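A quick way to confirm step 5 is to ask each Redis container for its replication role; a minimal sketch, assuming the default pod and container names produced by the install command above (my-release-redis-node-N, container redis) and no Redis password:

    for i in 0 1 2; do
      # in a healthy cluster, role:master should appear for exactly one node; with this bug it shows up twice
      kubectl exec my-release-redis-node-$i -c redis -- redis-cli info replication | grep role
    done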

Expected behavior
Expected the old master to rejoin as a slave.

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.5.0", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"dirty", GoVersion:"go1.15.6"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:15:04Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15", GitCommit:"73dd5c840662bb066a146d0871216333181f4b64", GitTreeState:"clean", BuildDate:"2021-01-22T22:45:59Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}


@aariacarterweir
Author

Note: this is on 12.2.3 because that's the only version of the chart I can get working that doesn't initialise all instances as masters, as per #5347.

@javsalgar
Contributor

Hi,

Thanks for reporting. Pinging @rafariossaa as he is looking into the Redis + Sentinel issues.

@rafariossaa
Contributor

Hi @aariacarterweir,
Could you indicate which Kubernetes cluster you are using?
Also, I need a bit of clarification: in the first message of this issue you indicated v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

@rafariossaa
Contributor

Hi,
A new version of the chart was released.
Could you give it a try and check whether it fixes the issue for you?

@aariacarterweir
Author

@rafariossaa Sorry I haven't gotten back to you. I will give this a shot soon, but:

Also, I need a bit of clarification: in the first message of this issue you indicated v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

Yup, that's correct. For now I'm using the dandydeveloper chart, as it works with pod deletion and also correctly promotes only one pod to master. I'll give this chart a spin again soon though and get back to you.

@GMartinez-Sisti

I'm having the same issue, with a different result. My problem is caused by the chart using {{ template "redis.fullname" . }}-node-0.{{ template "redis.fullname" . }}-headless... in the sentinel configuration here. If node-0 is killed, it never comes back, because it can't connect to itself on boot.
I think it should use the redis service to connect to a sentinel node, which would give it the information it needs to bootstrap.
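For reference, with a release named my-release in the default namespace, that template fragment would render to roughly the following monitor line in the generated sentinel config (the master set name mymaster, the port, and the quorum value below are assumed chart defaults, not taken from this thread):

    sentinel monitor mymaster my-release-redis-node-0.my-release-redis-headless.default.svc.cluster.local 6379 2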

Example below with kind:

→ kubectl logs redis-node-0 -c sentinel
 14:17:44.81 INFO  ==> redis-headless.default.svc.cluster.local has my IP: 10.244.0.72
 14:17:44.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.75
Could not connect to Redis at 10.244.0.75:26379: Connection refused
 14:17:49.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.74
1
 14:17:54.84 INFO  ==> Sentinels clean up done
Could not connect to Redis at 10.244.0.72:26379: Connection refused

→ kubectl get pods -o wide
NAME                            READY   STATUS             RESTARTS   AGE   IP         
redis-node-0                    1/2     CrashLoopBackOff   8          13m   10.244.0.72
redis-node-1                    2/2     Running            0          12m   10.244.0.74
redis-node-2                    0/2     CrashLoopBackOff   14         12m   10.244.0.75

→ kubectl get services
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP              23h
redis               ClusterIP   10.96.155.117   <none>        6379/TCP,26379/TCP   14m
redis-headless      ClusterIP   None            <none>        6379/TCP,26379/TCP   14m

@rafariossaa
Contributor

rafariossaa commented Mar 11, 2021

Hi @GMartinez-Sisti,
Could you enable debug and get the logs from the nodes that are in CrashLoopBackOff?

Regarding the node-0 config, take into account that the configmap generates a base config file that is then modified by the start scripts in configmap-scripts.yaml.
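If it helps, debug output from the Bitnami containers can usually be switched on at install/upgrade time; a sketch, assuming the chart exposes the standard Bitnami image.debug flag and that the release is named redis to match the pod names in the output above:

    helm upgrade redis bitnami/redis --reuse-values --set image.debug=true
    # then collect the previous run of the crashing container
    kubectl logs redis-node-0 -c sentinel --previous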

@qeternity

Bumping this... this is a really nasty bug and I cannot make sense of it.

The Bitnami redis sentinel setup is beyond unstable. I actually think this chart should be quarantined until this is resolved. I will continue to investigate and report back.

@qeternity

OK, so I have gotten to the bottom of this: if you lose the pod hosting both the leader sentinel and the leader redis, we end up in a situation where another sentinel is promoted to leader but continues to vote for the old redis leader, which is down. When the pod comes back online, start-sentinel.sh polls the quorum for the leader and attempts a connection, which, due to the above, points to its own IP.

This might be an issue with Redis itself, as it appears that if the leader sentinel goes down while it's failing over the leader redis to a follower, the follower sentinels are unaware of the change and can never converge back on a consistent state.
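One way to observe the stale vote is to ask each surviving sentinel who it currently believes the master is; a minimal sketch, assuming no sentinel password, the default master set name mymaster, and the pod names from the output earlier in this thread:

    # the stale entry keeps pointing at the dead pod's old IP
    kubectl exec redis-node-1 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
    kubectl exec redis-node-2 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster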

@rafariossaa
Contributor

Hi,
@GMartinez-Sisti, @qeternity: could you indicate which versions of the chart and container images you are using?
I would like to try to reproduce the issue.

@GMartinez-Sisti

GMartinez-Sisti commented Apr 14, 2021

Hi @rafariossaa, thanks for the follow-up.

I was testing with:

kind create cluster --name=redis-test
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-release bitnami/redis --set=usePassword=false --set=cluster.slaveCount=3 --set=sentinel.enabled=true --set=sentinel.usePassword=false

And then executing kubectl delete pod my-release-redis-node-0 to force a disruption in the cluster. After running this command I would see the behaviour described above. I can't remember the exact version I had, but it was somewhere along the 12.7.x line.

The good news is that I can't reproduce this problem any more (just tried with 13.0.1). It looks like #5603 and #5528 might have fixed the issues I was having.

@rafariossaa
Contributor

Hi,
Yes, there were some issues that have been fixed.
@qeternity, could you please also check your versions and see whether your issues were fixed as well?

@serkantul

Hi,

I was dealing with the same issue and I can confirm that it seems resolved in the most recent 14.1.0 version (commit #6080). I was observing the same problem with the 14.0.2 version. It was not always reproducible, and I was not able to find a workaround. The problem was that when the master Redis pod is restarted with the kubectl delete pod command, the sentinel containers in the other pods cannot choose a new master, and sentinel get-master-addr-by-name still returns the old master's IP address, which doesn't exist anymore.

@rafariossaa
Contributor

Hi @serkantul,
Is the case you observed in 14.0.2 solved for you in 14.1.0, or is it still happening in another deployment you have with 14.0.2?

@serkantul

Hi @rafariossaa,
I upgraded my deployment from 14.0.2 to 14.1.0 and I don't observe the issue anymore. I don't recall the exact versions, but I can say that the latest versions of 11.x, 12.x and 13.x have the same issue, too.

@rafariossaa
Contributor

Hi,
Yes, it could happen in those versions.
I am happy that this is fixed for you now.

@github-actions

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

The github-actions bot added the stale (15 days without activity) label on May 13, 2021.
@rafariossaa
Contributor

I am closing this issue.
Feel free to reopen it if needed or to create a new issue.
