
[bitnami/redis] all redis/sentinel pods become master at initial install #5347

Closed
abdularis opened this issue Jan 30, 2021 · 48 comments

@abdularis

bitnami/redis:
redis-12.7.0

Describe the bug
Please help. I tried to install bitnami/redis with sentinel.enabled=true and cluster.slaveCount=3. When it is successfully installed, all 3 deployed pods become Redis masters. When I scale the StatefulSet, for example to 5, the new pods become slaves of one of the 3 masters. I think only one master should exist and the rest should be slaves.
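For reference, a minimal way to reproduce this, assuming the standard Bitnami chart repository (the repo alias and release name are illustrative):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-redis bitnami/redis --version 12.7.0 \
  --set cluster.enabled=true \
  --set cluster.slaveCount=3 \
  --set sentinel.enabled=true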

Expected behavior
There should be only one master and the rest should be slaves.

@rafariossaa
Contributor

Hi,
Have you spotted any errors in your logs?

@addtzg

addtzg commented Feb 3, 2021

We had a similar problem: we wanted to deploy a 3-node Redis cluster with sentinels. We used the latest chart from the "master" branch and didn't make any changes. After deploying, two of the nodes were able to communicate with each other and choose a master. The third one didn't connect with the rest and remained standalone. I believe the problem is in the startup scripts that generate the config; there is probably a race condition somewhere. Our solution was to go back to an older commit, 2cd3ec6 ([bitnami/redis] Fix Sentinel Redis with TLS). It was chosen randomly, so I'm not sure which commit introduced the regression, probably the one that was supposed to fix sentinel synchronization.

@avadhanij

avadhanij commented Feb 4, 2021

I am seeing the same issue. Installing the chart with slaveCount: 3 brings up 3 pods, but each Redis instance thinks that it's the master.

127.0.0.1:26379> INFO sentinel 
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=172.17.0.6:6379,slaves=0,sentinels=1

Replication is not set up either.
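A quick way to confirm this, assuming the chart's default pod and container names (adjust the release prefix, and add -a $REDIS_PASSWORD if usePassword is enabled):

# print the replication role of each redis container, then the sentinel state on one pod
for i in 0 1 2; do
  kubectl exec my-redis-node-$i -c redis -- redis-cli info replication | grep ^role
done
kubectl exec my-redis-node-0 -c sentinel -- redis-cli -p 26379 info sentinel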

Manual workaround

After the chart is installed:

  1. Describe the service to find the pod IPs.
  2. Choose one instance as master yourself.
  3. kubectl exec (-c sentinel) into the other two pods and set up replication.
  4. While there, also set up sentinel; the original master entry must first be removed (see the sketch below).
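A rough sketch of those steps with redis-cli, where <master-ip> is the instance you picked, mymaster is the masterSet name from the output above, and the pod names are illustrative:

# point each of the other pods at the chosen master
kubectl exec my-redis-node-1 -c redis -- redis-cli replicaof <master-ip> 6379
# re-register the master in that pod's sentinel (quorum 2 is the chart default)
kubectl exec my-redis-node-1 -c sentinel -- redis-cli -p 26379 sentinel remove mymaster
kubectl exec my-redis-node-1 -c sentinel -- redis-cli -p 26379 sentinel monitor mymaster <master-ip> 6379 2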

@kzellag

kzellag commented Feb 4, 2021

Similarly, I am seeing the same issue with almost all 12.y.z versions.
I had been using version 10.8.1 for months without any issue, but since trying to upgrade to some 12.y.z versions, I started seeing 3 masters if "cluster.slaveCount" is equal to 3, and at least 3 masters if "cluster.slaveCount" is greater than or equal to 4.

By tuning some configuration under the 12.7.3 version, I managed to get 1 master pod and the rest of the pods as replicas (slaves), and everything seemed to work normally. The same worked for version 12.3.2 and could probably work for many other 12.y.z versions.

Here is the override configuration that worked for me:

master:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
  readinessProbe:
    enabled: true
    initialDelaySeconds: 30

slave:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
  readinessProbe:
    enabled: true
    initialDelaySeconds: 30

cluster:
  enabled: true
  slaveCount: 4


sentinel:
  enabled: true
  masterSet: redis-sentinel-master
  quorum: 2
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
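(For reference, the override above was applied with something like the following; the release and file names are illustrative.)

helm install myredis bitnami/redis --version 12.7.3 -f redis-values-override.yaml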

Increasing the value of "initialDelaySeconds" to 30 seconds gave the first pod enough time to register itself as "master", and the remaining pods then simply joined it as replicas (slaves). I tried with 10 seconds and then with 20 seconds, but I kept getting at least 2 masters out of 4 pods.

Although this is working for me now, I would love to hear your feedback and understand why it takes so long for the first pod to register itself as master under the recent releases (12.y.z) while it took seconds in 10.8.1.

Increasing "initialDelaySeconds" to 30 seconds solved the issue, but is there a better solution?

Thanks.

@rafariossaa
Contributor

Hi
Thank you for sharing your findings!
I am creating an internal task to look into this and into a possible regression.

@avadhanij

@kzellag, I can confirm that your suggested settings work. I tested them out with 3 replicas. They all came up with timed gaps, and one became master, with the other two becoming replicas, as expected. The sentinels were set up correctly too.

@kzellag

kzellag commented Feb 5, 2021

@avadhanij, thanks for trying my settings and confirming them.

I have another corner case, confirmed today, where those settings break:
If you start Redis with my proposed settings, where initialDelaySeconds is increased, you get only one master and the rest as replicas (slaves).
But if you reboot the machine (a single-node Kubernetes cluster), all Redis pods start at the same time, and you start seeing all pods as masters again.

@avadhanij

That's strange. I just tried this on my Minikube setup (minikube v1.17.1 on Darwin 10.15.7), stopping and then starting it, and the pods came back up with the replicas and sentinels still correctly set up.

@kzellag

kzellag commented Feb 5, 2021

My single node is an AWS EC2 instance bootstrapped with Kubernetes. Whenever the instance is rebooted (or stopped and then started), Kubernetes starts all pods at the same time, which explains why there are multiple redis masters.

This is in contrast to the initial deployment of redis, where the pods are deployed sequentially, which gives the first pod enough time to identify itself as master before the rest of the redis pods join it as replicas.

@rafariossaa
Contributor

Hi @avadhanij , @kzellag .
Just to double-check we are all on the same page, which version of the chart and images are you using?

@tzgg

tzgg commented Feb 5, 2021


We stumbled upon this problem again, even on commit 2cd3ec6 (see my earlier comment above). It does not happen every time, but it still happens. On the latest commit from the master branch the problem happens on every deployment. We are now trying to set "initialDelaySeconds" to 30s and this works, but we need to do more tests to be sure. We are deploying this chart on a 3-node Kubernetes 1.17 cluster. The cluster is somewhat unusual because 2 of the nodes are high-performance servers and the third one is a VM. The problem we encounter is that sometimes a high-performance node forms a cluster with the VM while the second high-performance node creates its own cluster, so we end up with two masters. One of the workarounds was to start with one Redis node and scale later. Because of that I suspect some problem with generating the configs during the startup phase.

Summary:
redis chart from master: every time we get 2 masters
redis chart, commit 2cd3ec6: most of the time it's fine, but sometimes we get 2 masters too
redis chart, commit 2cd3ec6 and 30s initial delay: we haven't seen the problem yet, but we are still testing

The only changed values are: sentinels enabled and slaveCount set to 3. (BTW, for a sentinel deployment, slaveCount should probably just be replaced with nodeCount because it is somewhat confusing.)

@kzellag

kzellag commented Feb 5, 2021

@rafariossaa, I tested with the following two configurations:

  • Config-1
    • bitnami/redis Chart : 12.7.3
    • Kubernetes : 1.18.10
    • under an AWS EC2 m5.xlarge with 4 vCPUs and 16 GB of Memory

and

  • Config-2
    • bitnami/redis Chart : 12.1.3
    • Kubernetes : 1.18.10
    • under an AWS EC2 m5.2xlarge with 8 vCPUs and 32 GB of Memory

@rafariossaa
Contributor

Hi,
Thanks for providing details on this.
I am adding this information to our current task to study this.

@avadhanij

@rafariossaa, I am using the following versions -

Chart - redis-12.7.3
App - 6.0.10
Kubernetes - 1.19.2

@rafariossaa added the on-hold label (Issues or Pull Requests with this label will never be considered stale) on Feb 10, 2021
@rafariossaa
Contributor

Thanks for your feedback

@aariacarterweir

I'm seeing this happen too, on a local kind cluster as well as on GKE.

@rafariossaa
Contributor

Hi,
Thanks for your feedback.

@rafariossaa
Contributor

Hi,
A new version of the chart was released.
Could you give it a try and check whether it fixes the issue for you?

@kzellag

kzellag commented Feb 24, 2021

Hi everyone,
Here are my observations after trying this fix (bitnami/redis 12.7.7, commit 2dc23f8).

I did the following tests:


Test-1

I deployed the chart with the default values for livenessProbe/readinessProbe under "master:", "slave:" and "sentinel:", with 4 replicas. 3 redis pods started, and only one of them stayed in the "Running" state (myns-redis-node-1), while the other two (myns-redis-node-0 and myns-redis-node-2) kept switching between "Running" and "CrashLoopBackOff".

kubectl -n myns get pods
NAME                                             READY   STATUS             RESTARTS   AGE
myns-redis-node-0                                1/2     CrashLoopBackOff   11         25m
myns-redis-node-1                                2/2     Running            0          25m
myns-redis-node-2                                1/2     CrashLoopBackOff   9          24m

By checking the errors in the "Events" for the failing pods (myns-redis-node-0 and myns-redis-node-2), I found that the liveness and readiness probes were failing.

Events:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  5m8s                  default-scheduler  Successfully assigned myns/myns-redis-node-0 to ip-172-31-78-208
  Normal   Pulled     5m9s                  kubelet            Container image "docker.io/bitnami/redis:6.0.11-debian-10-r0" already present on machine
  Normal   Created    5m9s                  kubelet            Created container redis
  Normal   Started    5m9s                  kubelet            Started container redis
  Warning  Unhealthy  4m44s (x3 over 5m4s)  kubelet            Readiness probe failed: 
Could not connect to Redis at localhost:6379: Connection refused
  Normal   Killing    4m41s                 kubelet  Container sentinel failed liveness probe, will be restarted
  Normal   Created    4m11s (x2 over 5m9s)  kubelet  Created container sentinel
  Normal   Started    4m11s (x2 over 5m8s)  kubelet  Started container sentinel
  Normal   Pulled     4m11s (x2 over 5m9s)  kubelet  Container image "docker.io/bitnami/redis-sentinel:6.0.10-debian-10-r36" already present on machine
  Warning  Unhealthy  4m6s (x6 over 5m1s)   kubelet  Liveness probe failed: 
Could not connect to Redis at localhost:26379: Connection refused
  Warning  Unhealthy  4m1s (x6 over 5m1s)  kubelet  Readiness probe failed: 
Could not connect to Redis at localhost:26379: Connection refused
  Warning  BackOff  2s (x19 over 3m49s)  kubelet  Back-off restarting failed container

Test-2

I explicitly set the livenessProbe/readinessProbe initialDelaySeconds values under "master:", "slave:" and "sentinel:" to 30 seconds, and then all 4 redis pods started properly, as shown below:

kubectl -n myns get pods
NAME                                             READY   STATUS    RESTARTS   AGE
myns-redis-node-0                                2/2     Running   0          2m30s
myns-redis-node-1                                2/2     Running   0          117s
myns-redis-node-2                                2/2     Running   0          84s
myns-redis-node-3                                2/2     Running   0          47s

However, when I rebooted the node (an EC2 instance), none of the redis pods managed to reach the "Running" state, as shown below:

kubectl -n myns get pods
NAME                                             READY   STATUS             RESTARTS   AGE
myns-redis-node-0                                0/2     CrashLoopBackOff   10         8m6s
myns-redis-node-1                                0/2     CrashLoopBackOff   10         7m30s
myns-redis-node-2                                0/2     CrashLoopBackOff   10         6m56s
myns-redis-node-3                                0/2     CrashLoopBackOff   10         6m22s

When I checked the "Events:" for one of the pods, I saw that it was also failing the liveness and readiness probes.

Events:

  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  8m36s  default-scheduler  Successfully assigned myns/myns-redis-node-0 to ip-172-31-78-208
  Normal   Pulled     8m36s  kubelet            Container image "docker.io/bitnami/redis:6.0.11-debian-10-r0" already present on machine
  Normal   Created    8m36s  kubelet            Created container redis
  Normal   Started    8m36s  kubelet            Started container redis
  Normal   Pulled     8m36s  kubelet            Container image "docker.io/bitnami/redis-sentinel:6.0.10-debian-10-r36" already present on machine
  Normal   Created    8m36s  kubelet            Created container sentinel
  Normal   Started    8m36s  kubelet            Started container sentinel
  Warning  Unhealthy  8m6s   kubelet            Readiness probe failed: 
Could not connect to Redis at localhost:26379: Connection refused
  Normal   SandboxChanged  5m12s                  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Started         5m11s                  kubelet  Started container sentinel
  Normal   Pulled          5m11s                  kubelet  Container image "docker.io/bitnami/redis:6.0.11-debian-10-r0" already present on machine
  Normal   Started         5m11s                  kubelet  Started container redis
  Normal   Pulled          5m11s                  kubelet  Container image "docker.io/bitnami/redis-sentinel:6.0.10-debian-10-r36" already present on machine
  Normal   Created         5m11s                  kubelet  Created container sentinel
  Normal   Created         5m11s                  kubelet  Created container redis
  Warning  Unhealthy       4m25s (x2 over 4m35s)  kubelet  Readiness probe failed: 
Could not connect to Redis at localhost:6379: Connection refused
  Warning  Unhealthy  4m24s (x2 over 4m34s)  kubelet  Liveness probe failed: 
Could not connect to Redis at localhost:6379: Connection refused
  Warning  BackOff    4m20s                  kubelet  Back-off restarting failed container
  Warning  Unhealthy  4m18s (x5 over 4m38s)  kubelet  Liveness probe failed: 
Could not connect to Redis at localhost:26379: Connection refused
  Normal   Killing    4m18s                  kubelet  Container sentinel failed liveness probe, will be restarted
  Warning  Unhealthy  4m10s (x7 over 4m40s)  kubelet  Readiness probe failed: 
Could not connect to Redis at localhost:26379: Connection refused

Maybe I am missing something, but those were my observations when testing (bitnami/redis 12.7.7, commit 2dc23f8).

Note that increasing the liveness/readiness probe delays to 30 seconds worked for me with some previous releases (like bitnami/redis 12.1.3, 544b7bc), but only for the initial deployment; after rebooting the node, all pods started in the "Running" state but all of them as "master".

Thank you!
Kamal.

@rafariossaa
Contributor

rafariossaa commented Feb 25, 2021

Hi @kzellag,
Thanks for your testing.

Regarding Test-1, you will need to increase the liveness probe delay, as you found in Test-2. You will need to tune those parameters depending on the cluster. I tried on GCE, Azure and minikube and found no issues with the default settings.

Regarding Test-2, could you share the logs of the sentinel and redis containers?
I tried in a google k8s cluster (3 nodes, n1-standard-2, 1.17.15-gke.800), and installed with:

helm install myredis -f values.yaml --set password=mypass   --set cluster.enabled=true   --set cluster.slaveCount=5   --set sentinel.enabled=true .

I got the 5 redis nodes up and running without issues. Then I resized the cluster to 2 nodes, and a couple of redis nodes needed to be redeployed on another k8s node, but it went without issues and I ended up with only 1 master and 4 slaves:

$ kubectl get pods,svc -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE                                           NOMINATED NODE   READINESS GATES
pod/myredis-node-0   2/2     Running   1          12m     10.168.0.7   gke-rrios-cluster-default-pool-d37ea21f-jp15   <none>           <none>
pod/myredis-node-1   2/2     Running   1          28m     10.168.1.4   gke-rrios-cluster-default-pool-d37ea21f-76vn   <none>           <none>
pod/myredis-node-2   2/2     Running   1          27m     10.168.0.5   gke-rrios-cluster-default-pool-d37ea21f-jp15   <none>           <none>
pod/myredis-node-3   2/2     Running   0          9m29s   10.168.1.7   gke-rrios-cluster-default-pool-d37ea21f-76vn   <none>           <none>
pod/myredis-node-4   2/2     Running   1          25m     10.168.1.5   gke-rrios-cluster-default-pool-d37ea21f-76vn   <none>           <none>

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)              AGE   SELECTOR
service/kubernetes         ClusterIP   10.171.240.1     <none>        443/TCP              19h   <none>
service/myredis            ClusterIP   10.171.242.201   <none>        6379/TCP,26379/TCP   29m   app=redis,release=myredis
service/myredis-headless   ClusterIP   None             <none>        6379/TCP,26379/TCP   29m   app=redis,release=myredis


$ kubectl exec -it myredis-node-0 -- redis-cli -h myredis -p 26379 -a $REDIS_PASSWORD sentinel get-master-addr-by-name mymaster
Defaulting container name to redis.
Use 'kubectl describe pod/myredis-node-0 -n default' to see all of the containers in this pod.
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) "10.168.0.5"
2) "6379"



$ kubectl exec -it myredis-node-2 -- redis-cli -h localhost -p 6379 -a $REDIS_PASSWORD role
Defaulting container name to redis.
Use 'kubectl describe pod/myredis-node-2 -n default' to see all of the containers in this pod.
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) "master"
2) (integer) 446183
3) 1) 1) "10.168.1.4"
      2) "6379"
      3) "445913"
   2) 1) "10.168.1.5"
      2) "6379"
      3) "445643"
   3) 1) "10.168.0.7"
      2) "6379"
      3) "446183"
   4) 1) "10.168.1.7"
      2) "6379"
      3) "446048"

I got some restarts because the PVC took its time to move from one node to the other.

@avadhanij

@rafariossaa, I pulled the latest chart and reinstalled it on my minikube cluster. I did not use the 30-second livenessProbe and readinessProbe values @kzellag provided as an initial workaround.

It works. Even on the first bring up, I can see that the master is correctly elected, the other two become replicas, and the sentinel info reflects it as well.

master0:name=my-master,status=ok,address=172.17.0.4:6379,slaves=2,sentinels=3

@kzellag

kzellag commented Feb 25, 2021

Hi again,
I did 2 additional tests: one works under a "KIND" cluster, while the other, under an EC2 instance, shows the same errors as in the aforementioned Test-1.


Under KIND

kubectl -n myns get pods
NAME             READY   STATUS    RESTARTS   AGE
myredis-node-0   2/2     Running   0          15m
myredis-node-1   2/2     Running   0          15m
myredis-node-2   2/2     Running   1          14m
myredis-node-3   2/2     Running   1          14m

and here is information about the master/replicas:


myredis-node-0 : 10.244.0.5
role:master
connected_slaves:3
master0:name=redis-sentinel-master,status=ok,address=10.244.0.5:6379,slaves=3,sentinels=4

myredis-node-1 : 10.244.0.6
role:slave
master_host:10.244.0.5
master0:name=redis-sentinel-master,status=ok,address=10.244.0.5:6379,slaves=3,sentinels=4

myredis-node-2 : 10.244.0.7
role:slave
master_host:10.244.0.5
master0:name=redis-sentinel-master,status=ok,address=10.244.0.5:6379,slaves=3,sentinels=4

myredis-node-3 : 10.244.0.8
role:slave
master_host:10.244.0.5
master0:name=redis-sentinel-master,status=ok,address=10.244.0.5:6379,slaves=3,sentinels=4


Under an EC2 instance (similar to Test-2)

kubectl -n myns get pods
NAME             READY   STATUS             RESTARTS   AGE
myredis-node-0   1/2     CrashLoopBackOff   7          10m
myredis-node-1   2/2     Running            0          10m
myredis-node-2   1/2     CrashLoopBackOff   6          9m57s

Logs under the "redis" container for the 3 nodes (myredis-node-0, myredis-node-1 and myredis-node-2)

kubectl -n myns logs myredis-node-0 -c redis
22:00:47.25 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:00:52.25 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:00:57.26 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:01:02.27 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:01:07.27 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:01:12.28 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.88
22:01:17.29 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.88
I am master
redis 22:01:17.30 INFO ==> ** Starting Redis **
1:C 25 Feb 2021 22:01:17.312 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Feb 2021 22:01:17.312 # Redis version=6.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Feb 2021 22:01:17.312 # Configuration loaded
1:M 25 Feb 2021 22:01:17.313 * Running mode=standalone, port=6379.
1:M 25 Feb 2021 22:01:17.313 # Server initialized
1:M 25 Feb 2021 22:01:17.314 * Ready to accept connections
1:M 25 Feb 2021 22:01:26.216 * Replica 192.168.0.89:6379 asks for synchronization
1:M 25 Feb 2021 22:01:26.216 * Full resync requested by replica 192.168.0.89:6379
1:M 25 Feb 2021 22:01:26.216 * Replication backlog created, my new replication IDs are 'c98e71d75bd82c1b20548bdccec0a2c97bd27e7c' and '0000000000000000000000000000000000000000'
1:M 25 Feb 2021 22:01:26.216 * Starting BGSAVE for SYNC with target: disk
1:M 25 Feb 2021 22:01:26.216 * Background saving started by pid 99
99:C 25 Feb 2021 22:01:26.219 * DB saved on disk
99:C 25 Feb 2021 22:01:26.219 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Feb 2021 22:01:26.233 * Background saving terminated with success
1:M 25 Feb 2021 22:01:26.233 * Synchronization with replica 192.168.0.89:6379 succeeded
1:M 25 Feb 2021 22:01:39.217 * Replica 192.168.0.90:6379 asks for synchronization
1:M 25 Feb 2021 22:01:39.217 * Full resync requested by replica 192.168.0.90:6379
1:M 25 Feb 2021 22:01:39.217 * Starting BGSAVE for SYNC with target: disk
1:M 25 Feb 2021 22:01:39.217 * Background saving started by pid 118
118:C 25 Feb 2021 22:01:39.220 * DB saved on disk
118:C 25 Feb 2021 22:01:39.220 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Feb 2021 22:01:39.275 * Background saving terminated with success
1:M 25 Feb 2021 22:01:39.275 * Synchronization with replica 192.168.0.90:6379 succeeded

kubectl -n myns logs myredis-node-1 -c redis
22:01:21.15 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.89
22:01:26.17 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.89
redis 22:01:26.20 INFO ==> ** Starting Redis **
1:C 25 Feb 2021 22:01:26.214 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Feb 2021 22:01:26.214 # Redis version=6.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Feb 2021 22:01:26.214 # Configuration loaded
1:S 25 Feb 2021 22:01:26.215 * Running mode=standalone, port=6379.
1:S 25 Feb 2021 22:01:26.215 # Server initialized
1:S 25 Feb 2021 22:01:26.216 * Ready to accept connections
1:S 25 Feb 2021 22:01:26.216 * Connecting to MASTER 192.168.0.88:6379
1:S 25 Feb 2021 22:01:26.216 * MASTER <-> REPLICA sync started
1:S 25 Feb 2021 22:01:26.216 * Non blocking connect for SYNC fired the event.
1:S 25 Feb 2021 22:01:26.216 * Master replied to PING, replication can continue...
1:S 25 Feb 2021 22:01:26.216 * Partial resynchronization not possible (no cached master)
1:S 25 Feb 2021 22:01:26.216 * Full resync from master: c98e71d75bd82c1b20548bdccec0a2c97bd27e7c:0
1:S 25 Feb 2021 22:01:26.233 * MASTER <-> REPLICA sync: receiving 176 bytes from master to disk
1:S 25 Feb 2021 22:01:26.233 * MASTER <-> REPLICA sync: Flushing old data
1:S 25 Feb 2021 22:01:26.233 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 25 Feb 2021 22:01:26.235 * Loading RDB produced by version 6.0.11
1:S 25 Feb 2021 22:01:26.235 * RDB age 0 seconds
1:S 25 Feb 2021 22:01:26.235 * RDB memory usage when created 1.87 Mb
1:S 25 Feb 2021 22:01:26.235 * MASTER <-> REPLICA sync: Finished with success
1:S 25 Feb 2021 22:01:26.235 * Background append only file rewriting started by pid 34
1:S 25 Feb 2021 22:01:26.258 * AOF rewrite child asks to stop sending diffs.
34:C 25 Feb 2021 22:01:26.258 * Parent agreed to stop sending diffs. Finalizing AOF...
34:C 25 Feb 2021 22:01:26.258 * Concatenating 0.00 MB of AOF diff received from parent.
34:C 25 Feb 2021 22:01:26.258 * SYNC append only file rewrite performed
34:C 25 Feb 2021 22:01:26.259 * AOF rewrite: 0 MB of memory used by copy-on-write
1:S 25 Feb 2021 22:01:26.316 * Background AOF rewrite terminated with success
1:S 25 Feb 2021 22:01:26.317 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1:S 25 Feb 2021 22:01:26.317 * Background AOF rewrite finished successfully

kubectl -n myns logs myredis-node-2 -c redis
22:01:34.17 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.90
22:01:39.18 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.90
redis 22:01:39.20 INFO ==> ** Starting Redis **
1:C 25 Feb 2021 22:01:39.214 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Feb 2021 22:01:39.214 # Redis version=6.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Feb 2021 22:01:39.214 # Configuration loaded
1:S 25 Feb 2021 22:01:39.215 * Running mode=standalone, port=6379.
1:S 25 Feb 2021 22:01:39.216 # Server initialized
1:S 25 Feb 2021 22:01:39.216 * Ready to accept connections
1:S 25 Feb 2021 22:01:39.216 * Connecting to MASTER 192.168.0.88:6379
1:S 25 Feb 2021 22:01:39.216 * MASTER <-> REPLICA sync started
1:S 25 Feb 2021 22:01:39.216 * Non blocking connect for SYNC fired the event.
1:S 25 Feb 2021 22:01:39.216 * Master replied to PING, replication can continue...
1:S 25 Feb 2021 22:01:39.216 * Partial resynchronization not possible (no cached master)
1:S 25 Feb 2021 22:01:39.217 * Full resync from master: c98e71d75bd82c1b20548bdccec0a2c97bd27e7c:1428
1:S 25 Feb 2021 22:01:39.275 * MASTER <-> REPLICA sync: receiving 177 bytes from master to disk
1:S 25 Feb 2021 22:01:39.275 * MASTER <-> REPLICA sync: Flushing old data
1:S 25 Feb 2021 22:01:39.275 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 25 Feb 2021 22:01:39.277 * Loading RDB produced by version 6.0.11
1:S 25 Feb 2021 22:01:39.277 * RDB age 0 seconds
1:S 25 Feb 2021 22:01:39.277 * RDB memory usage when created 1.93 Mb
1:S 25 Feb 2021 22:01:39.277 * MASTER <-> REPLICA sync: Finished with success
1:S 25 Feb 2021 22:01:39.277 * Background append only file rewriting started by pid 33
1:S 25 Feb 2021 22:01:39.301 * AOF rewrite child asks to stop sending diffs.
33:C 25 Feb 2021 22:01:39.301 * Parent agreed to stop sending diffs. Finalizing AOF...
33:C 25 Feb 2021 22:01:39.301 * Concatenating 0.00 MB of AOF diff received from parent.
33:C 25 Feb 2021 22:01:39.301 * SYNC append only file rewrite performed
33:C 25 Feb 2021 22:01:39.301 * AOF rewrite: 0 MB of memory used by copy-on-write
1:S 25 Feb 2021 22:01:39.316 * Background AOF rewrite terminated with success
1:S 25 Feb 2021 22:01:39.316 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1:S 25 Feb 2021 22:01:39.316 * Background AOF rewrite finished successfully

Logs under the "sentinel" container for the 3 nodes (myredis-node-0, myredis-node-1 and myredis-node-2)

kubectl -n myns logs myredis-node-0 -c sentinel
22:13:34.38 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.88
22:13:34.38 INFO ==> Cleaning sentinels in sentinel node: 192.168.0.90
Could not connect to Redis at 192.168.0.90:26379: Connection refused
22:13:39.39 INFO ==> Cleaning sentinels in sentinel node: 192.168.0.89
1
22:13:44.39 INFO ==> Sentinels clean up done
Could not connect to Redis at 192.168.0.88:26379: Connection refused

kubectl -n myns logs myredis-node-1 -c sentinel
22:01:21.31 WARN ==> myredis-headless.myns.svc.cluster.local does not contain the IP of this pod: 192.168.0.89
22:01:26.32 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.89
22:01:26.33 INFO ==> Cleaning sentinels in sentinel node: 192.168.0.88
1
22:01:31.33 INFO ==> Sentinels clean up done
1:X 25 Feb 2021 22:01:31.355 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 25 Feb 2021 22:01:31.355 # Redis version=6.0.10, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 25 Feb 2021 22:01:31.355 # Configuration loaded
1:X 25 Feb 2021 22:01:31.356 * Running mode=sentinel, port=26379.
1:X 25 Feb 2021 22:01:31.360 # Sentinel ID is 516350ba579097ca6c34c76fb627bf48d7052a56
1:X 25 Feb 2021 22:01:31.360 # +monitor master redis-sentinel-master 192.168.0.88 6379 quorum 2
1:X 25 Feb 2021 22:01:31.360 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:31.542 * +sentinel sentinel 795fb483800aa825b6cc05ac6f5f5a0a72b884fe 192.168.0.88 26379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:41.423 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:44.325 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:49.451 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:51.506 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:51.510 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:54.730 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:01:59.766 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:01.581 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:01.584 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:14.143 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:21.660 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:21.664 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:28.143 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:31.675 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:31.679 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:33.386 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:41.688 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:41.692 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:50.149 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:51.735 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:02:51.738 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:03:35.138 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:03:41.976 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:03:41.981 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:03:56.164 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:04:02.099 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:04:02.105 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:07.178 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:12.347 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:12.350 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:29.138 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:32.453 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:05:32.457 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:02.219 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:02.987 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:02.991 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:32.166 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:33.026 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:08:33.029 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:24.163 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:33.888 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:33.892 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:39.396 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:43.899 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:43.903 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:46.121 # +reset-master master redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:53.944 * +slave slave 192.168.0.89:6379 192.168.0.89 6379 @ redis-sentinel-master 192.168.0.88 6379
1:X 25 Feb 2021 22:13:53.947 * +slave slave 192.168.0.90:6379 192.168.0.90 6379 @ redis-sentinel-master 192.168.0.88 6379

kubectl -n myns logs myredis-node-2 -c sentinel
22:13:46.11 INFO ==> myredis-headless.myns.svc.cluster.local has my IP: 192.168.0.90
22:13:46.11 INFO ==> Cleaning sentinels in sentinel node: 192.168.0.89
1
22:13:51.12 INFO ==> Cleaning sentinels in sentinel node: 192.168.0.88
Could not connect to Redis at 192.168.0.88:26379: Connection refused
22:13:56.12 INFO ==> Sentinels clean up done
Could not connect to Redis at 192.168.0.88:26379: Connection refused

Here are my installation steps:

helm -n myns install myredis bitnami/redis --version 12.7.7 --values redisoverrides.yaml

where the file "redisoverrides.yaml" content is:

master:
  persistence:
    enabled: false

slave:
  persistence:
    enabled: false

cluster:
  enabled: true
  slaveCount: 4

sentinel:
  enabled: true
  masterSet: redis-sentinel-master

usePassword: false

metrics:
  enabled: false

I used the same "redisoverrides.yaml" config for both tests (against a KIND cluster and against an EC2 instance), but it works under the first and not under the second.

With the same config, when I revert to previous releases (like 12.1.3), the pods at least start, although with multiple masters; setting the livenessProbe/readinessProbe values to 30 seconds then brings up a single master with the rest as replicas.
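For what it's worth, the "does not contain the IP of this pod" check from the logs above can be reproduced by hand (assuming getent is available in the image, which it is in the Bitnami Debian-based images):

# DNS entries currently registered for the headless service
kubectl -n myns exec myredis-node-0 -c redis -- getent hosts myredis-headless.myns.svc.cluster.local
# IP of the pod itself, for comparison
kubectl -n myns get pod myredis-node-0 -o jsonpath='{.status.podIP}'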

@miguelaeh
Contributor

miguelaeh commented Feb 26, 2021

Hi guys,
Thank you very much for your detailed reports. @rafariossaa is working on it and will get back to you with news.

@kzellag

kzellag commented Mar 30, 2021

Any update on this issue?
We are still seeing this multi-master behavior on a single-node Kubernetes cluster when it is rebooted or stopped and then started (as an EC2 instance).

Thanks,
Kamal.

@miguelaeh
Contributor

Hi @kzellag ,
Sorry we missed updating this thread. @rafariossaa created #5603 after checking the issue; could you take a look at the changes and verify whether they solve your problem?

@kzellag

kzellag commented Mar 31, 2021

Hi @miguelaeh,
@rafariossaa already shared a proposed fix, #5603, on Feb 24, which I tested, and I am still seeing this issue.
I have already provided detailed logs in this thread after trying that fix. You can refer to them in:
#5347 (comment)

Thanks,
Kamal.

@marcosbc
Contributor

marcosbc commented Apr 1, 2021

I've just re-opened the internal task so we can further investigate these issues. We'll get back to you soon, but unfortunately I cannot give an ETA due to Easter holidays.

@juan-vg

juan-vg commented Jul 21, 2021

Any news here?

I'm getting the message redis-headless.test-redis-cluster.svc.cluster.local does not contain the IP of this pod: 10.X.Y.Z on a redis node that is trying to come up (without success).

My cluster has 3 nodes and it works fine (the master is reallocated when necessary) until I delete the master and one slave at the same time. Then the cluster never comes up again. One of the killed nodes tries to come up and warns with the above message.

I've tried setting initialDelaySeconds to 30, but it's not helping at all:

cluster:
  enabled: true
  slaveCount: 2
usePassword: false
nameoverride: "redis"

architecture: replication

master:
  persistence:
    size: 10Gi
  livenessProbe:
    initialDelaySeconds: 30
  readinessProbe:
    initialDelaySeconds: 30

replica:
  persistence:
    size: 10Gi
  livenessProbe:
    initialDelaySeconds: 30
  readinessProbe:
    initialDelaySeconds: 30

sentinel:
  enabled: true
  usePassword: false
  downAfterMilliseconds: 20000
  failoverTimeout: 18000
  cleanDelaySeconds: 5
  livenessProbe:
    initialDelaySeconds: 30
  readinessProbe:
    initialDelaySeconds: 30

auth:
  enabled: false
  sentinel: false
$ kubectl get po
NAME           READY   STATUS    RESTARTS   AGE
redis-client   1/1     Running   0          114m
redis-node-0   0/2     Running   0          31s
redis-node-2   2/2     Running   0          38m
$ kubectl logs redis-node-0 -c redis 
 14:18:02.80 WARN  ==> redis-headless.test-redis-cluster.svc.cluster.local does not contain the IP of this pod: 10.3.193.182
 14:18:07.82 WARN  ==> redis-headless.test-redis-cluster.svc.cluster.local does not contain the IP of this pod: 10.3.193.182

# ... after some time ...

$ kubectl get po
NAME           READY   STATUS    RESTARTS   AGE
redis-client   1/1     Running   0          117m
redis-node-0   0/2     Running   4          4m8s
redis-node-2   2/2     Running   0          42m

@rafariossaa
Contributor

Hi,
Currently the task is in our backlog.
Regarding your cluster: if you have 3 nodes and 2 of them fail, the remaining one doesn't "know" whether it is the only one left or whether it suffered a network issue and was isolated from the other nodes. If you want your cluster to stay alive with two failing nodes, you need to deploy a cluster of at least 5 nodes. This way you have (3+2) nodes, and the 3 remaining nodes, being the majority, can elect a new master.
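A sketch of the values that implies, using the same keys as the examples above (the quorum value is my assumption of a sensible majority for 5 nodes):

cluster:
  enabled: true
  slaveCount: 5    # 5 redis+sentinel pods in total
sentinel:
  enabled: true
  quorum: 3        # majority of 5, tolerates 2 failed nodes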

@juan-vg

juan-vg commented Jul 22, 2021

Hi @rafariossaa!

Thank you for your quick answer. I completely understand your point, and since the quorum is 2, the remaining node will never become master while it is alone. That is the desired behavior, no problem.

I don't want the cluster to work under those conditions, but I do want the cluster to come back up at some point. Once the 3 nodes are restored, I want the cluster to continue operating. What is actually happening is that the 2 killed nodes never come up again.

In the above example, I killed redis-node-0 (master) and redis-node-1 (slave). Then redis-node-0 tried to come up but was not able to. When I queried the logs for redis-node-0, the message was redis-headless.test-redis-cluster.svc.cluster.local does not contain the IP of this pod: 10.X.Y.Z
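For reference, the failure can be triggered with something like the following (pod names as above), deleting master and slave at once:

kubectl delete pod redis-node-0 redis-node-1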

@rafariossaa
Contributor

OK, I see your point.

Let me add that to the task.

@juan-vg

juan-vg commented Sep 10, 2021

@rafariossaa I've tested the previous scenario again using the latest chart and Redis versions (chart v15.3.2 and app v6.2.5). This seems to be fixed, and the cluster now comes up again as expected. Even when deleting all nodes it comes up again, a master is elected, and all the data is there (I have persistence enabled). From my side this is successfully fixed.

@abdularis how is it on your side?

PS: I bet the solution is among d50b1f7, 9559497 and 18ecfc2

@miguelaeh
Contributor

Hi @juan-vg ,
thank you very much for the confirmation!

@koo9

koo9 commented Nov 30, 2021

I had that issue with the Helm chart from an installation yesterday. I uninstalled it and then installed it again with the latest version, 6.2.6. Is the issue addressed in the latest version?

@miguelaeh
Contributor

Hi @koo9
It should be

@elucidsoft

Same issue.

@koo9

koo9 commented Dec 6, 2021


I tried the latest; so far so good here.

@elucidsoft

It only worked when I removed and reinstalled it. On the first install it did not behave properly, even though it said it installed correctly without errors.

@koo9

koo9 commented Dec 6, 2021


In my previous installation it worked, but after a few days there was an error in the log complaining that the local cluster does not contain the IP of the pod. After re-installing with the latest chart, it works so far.

@elucidsoft

Same here, it failed after 1 day of running. I tried to use this in the past, over a year ago, and had the same exact problem. It's simply still not stable.

@koo9

koo9 commented Dec 7, 2021

Ideally the sentinel and redis should run in separate containers; I'm not sure if that has anything to do with what we are seeing.

@javsalgar
Contributor

@koo9 do you mean separate pods or containers? Right now they run in separate containers inside the same pod.
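For example, the containers of one of the pods can be listed with (pod name illustrative):

kubectl get pod myredis-node-0 -o jsonpath='{.spec.containers[*].name}'
# -> redis sentinel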

@koo9

koo9 commented Dec 9, 2021

@javsalgar you are right, I meant separate pods.

@nvijayaraju

Hi all. I am facing this issue with the latest chart. Is there any workaround for now?

@rafariossaa
Contributor

Hi,
We are working on this and hope to have news soon.

@h0jeZvgoxFepBQ2C

Hmm, not really usable right now with these problems. I really think there should be a hint in the README noting that the sentinel integration is not stable.

@carrodher
Member

Hi, this issue should be solved in recent versions of the container and Helm chart. Please, feel free to reopen this ticket if something doesn't work as expected.

@carrodher removed the on-hold label (Issues or Pull Requests with this label will never be considered stale) on Mar 12, 2022
@koo9

koo9 commented Mar 12, 2022


Excellent! Thanks.
