
redis + sentinel master pod reschedule / deletion results in two masters #5543

Closed
aariacarterweir opened this issue Feb 18, 2021 · 18 comments

@aariacarterweir

Which chart:
bitnami/redis 12.7.4

Describe the bug
If the master pod is rescheduled or deleted manually, a new master is elected properly, but when the old master comes back online it elects itself as a master too.

To Reproduce
Steps to reproduce the behavior:

  1. Install the chart:
    helm install my-release bitnami/redis --set cluster.enabled=true,cluster.slaveCount=3,sentinel.enabled=true
    
  2. Delete the master pod.
  3. Observe the failover happening correctly and a new master being elected.
  4. When the deleted pod is recreated and comes back online, it thinks it is a master.
  5. Now there are two masters (see the verification sketch below).
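A quick way to confirm step 5 is to ask each Redis container for its replication role; a minimal sketch, assuming the default pod and container names produced by the install command above (my-release-redis-node-N, container redis) and no Redis password:

    for i in 0 1 2; do
      # in a healthy cluster, role:master should appear for exactly one node; with this bug it shows up twice
      kubectl exec my-release-redis-node-$i -c redis -- redis-cli info replication | grep role
    done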

Expected behavior
Expected the old master to rejoin as a slave.

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.5.0", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"dirty", GoVersion:"go1.15.6"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:15:04Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15", GitCommit:"73dd5c840662bb066a146d0871216333181f4b64", GitTreeState:"clean", BuildDate:"2021-01-22T22:45:59Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}


@aariacarterweir
Author

Note: this is on 12.2.3 because that's the only version of the chart I can get working that doesn't initialise all instances as masters, as per #5347.

@javsalgar
Contributor

Hi,

Thanks for reporting. Pinging @rafariossaa as he is looking into the Redis + Sentinel issues.

@rafariossaa
Contributor

Hi @aariacarterweir,
Could you indicate which Kubernetes cluster you are using?
Also, I need a bit of clarification: in the first message of this issue you indicated v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

@rafariossaa
Contributor

Hi,
A new version of the chart was released.
Could you give it a try and check whether it fixes the issue for you?

@aariacarterweir
Author

@rafariossaa Sorry I haven't gotten back to you. I will give this a shot soon, but:

Also, I need a bit of clarification: in the first message of this issue you indicated v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

Yup, that's correct. For now I'm using the dandydeveloper chart, as it works with pod deletion and also correctly promotes only one pod to master. I'll give this chart a spin again soon though and get back to you.

@GMartinez-Sisti

I'm having the same issue, with a different result. My problem is caused by the chart using {{ template "redis.fullname" . }}-node-0.{{ template "redis.fullname" . }}-headless... in the sentinel configuration here. If node-0 is killed, it never comes back, because it can't connect to itself on boot.
I think it should use the redis service to connect to a sentinel node, which would give it the information it needs to bootstrap.
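For reference, with a release named my-release in the default namespace, that template fragment would render to roughly the following monitor line in the generated sentinel config (the master set name mymaster, the port, and the quorum value below are assumed chart defaults, not taken from this thread):

    sentinel monitor mymaster my-release-redis-node-0.my-release-redis-headless.default.svc.cluster.local 6379 2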

Example below with kind:

→ kubectl logs redis-node-0 -c sentinel
 14:17:44.81 INFO  ==> redis-headless.default.svc.cluster.local has my IP: 10.244.0.72
 14:17:44.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.75
Could not connect to Redis at 10.244.0.75:26379: Connection refused
 14:17:49.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.74
1
 14:17:54.84 INFO  ==> Sentinels clean up done
Could not connect to Redis at 10.244.0.72:26379: Connection refused

→ kubectl get pods -o wide
NAME                            READY   STATUS             RESTARTS   AGE   IP         
redis-node-0                    1/2     CrashLoopBackOff   8          13m   10.244.0.72
redis-node-1                    2/2     Running            0          12m   10.244.0.74
redis-node-2                    0/2     CrashLoopBackOff   14         12m   10.244.0.75

→ kubectl get services
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP              23h
redis               ClusterIP   10.96.155.117   <none>        6379/TCP,26379/TCP   14m
redis-headless      ClusterIP   None            <none>        6379/TCP,26379/TCP   14m

@rafariossaa
Contributor

rafariossaa commented Mar 11, 2021

Hi @GMartinez-Sisti,
Could you enable debug and get the logs from the nodes that are in CrashLoopBackOff?

Regarding the node-0 config, take into account that the configmap generates a base config file that is then modified by the start scripts in configmap-scripts.yaml.
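If it helps, debug output from the Bitnami containers can usually be switched on at install/upgrade time; a sketch, assuming the chart exposes the standard Bitnami image.debug flag and that the release is named redis to match the pod names in the output above:

    helm upgrade redis bitnami/redis --reuse-values --set image.debug=true
    # then collect the previous run of the crashing container
    kubectl logs redis-node-0 -c sentinel --previous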

@qeternity

Bumping this... this is a really nasty bug and I cannot make sense of it.

The Bitnami redis sentinel setup is beyond unstable. I actually think this chart should be quarantined until this is resolved. I will continue to investigate and report back.

@qeternity

OK, so I have gotten to the bottom of this: if you lose the pod hosting both the leader sentinel and the leader redis, we end up in a situation where another sentinel is promoted to leader but continues to vote for the old redis leader, which is down. When the pod comes back online, start-sentinel.sh polls the quorum for the leader and attempts a connection, which, due to the above, points to its own IP.

This might be an issue with Redis itself, as it appears that if the leader sentinel goes down while it's failing over the leader redis to a follower, the follower sentinels are unaware of the change and can never converge back on a consistent state.
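One way to observe the stale vote is to ask each surviving sentinel who it currently believes the master is; a minimal sketch, assuming no sentinel password, the default master set name mymaster, and the pod names from the output earlier in this thread:

    # the stale entry keeps pointing at the dead pod's old IP
    kubectl exec redis-node-1 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
    kubectl exec redis-node-2 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster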

@rafariossaa
Contributor

Hi,
@GMartinez-Sisti, @qeternity: could you indicate which versions of the chart and container images you are using?
I would like to try to reproduce the issue.

@GMartinez-Sisti

GMartinez-Sisti commented Apr 14, 2021

Hi @rafariossaa, thanks for the follow-up.

I was testing with:

kind create cluster --name=redis-test
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-release bitnami/redis --set=usePassword=false --set=cluster.slaveCount=3 --set=sentinel.enabled=true --set=sentinel.usePassword=false

And then executing kubectl delete pod my-release-redis-node-0 to force a disruption in the cluster. After running this command I would see the behaviour described above. I can't remember the exact version I had, but it was somewhere along the 12.7.x line.

The good news is that I can't reproduce this problem any more (just tried with 13.0.1). It looks like #5603 and #5528 might have fixed the issues I was having.

@rafariossaa
Contributor

Hi,
Yes, there were some issues that have been fixed.
@qeternity, could you please also check your versions and see whether your issues were fixed as well?

@serkantul

Hi,

I was dealing with the same issue and I can confirm that it seems resolved in the most recent 14.1.0 version (commit #6080). I was observing the same problem with the 14.0.2 version. It was not always reproducible, and I was not able to find a workaround. The problem was that when the master Redis pod is restarted with the kubectl delete pod command, the sentinel containers in the other pods cannot choose a new master, and sentinel get-master-addr-by-name still returns the old master's IP address, which doesn't exist anymore.

@rafariossaa
Contributor

Hi @serkantul,
Is the case you observed in 14.0.2 solved for you in 14.1.0, or is it still happening in another deployment you have with 14.0.2?

@serkantul

Hi @rafariossaa,
I upgraded my deployment from 14.0.2 to 14.1.0 and I don't observe the issue anymore. I don't recall the exact versions, but I can say that the latest versions of 11.x, 12.x and 13.x have the same issue, too.

@rafariossaa
Contributor

Hi,
Yes, it could happen in those versions.
I am happy that this is fixed for you now.

@github-actions

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

The github-actions bot added the stale (15 days without activity) label on May 13, 2021.
@rafariossaa
Contributor

I am closing this issue.
Feel free to reopen it if needed or to create a new issue.
