No route to host error after Redis master re-spawned #1544
Comments
Hi, Alex. On each slave we define the environment variable
Being a stateful set, the network identifier is always the same, so when a new master pod is created, it will have the same network identifier. I've tried to reproduce your issue but I wasn't able to. This is what I did:
Looking at the logs of one of the slaves, you can see that the connection is lost and then recovered:
Regards, |
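(For context, the "stable network identifier" mentioned above is the per-pod DNS name a StatefulSet gets through its headless Service. The exact variable the chart sets is not shown in this thread; the following is only a hypothetical sketch of the idea, with assumed names:)

```yaml
# Hypothetical sketch (not the chart's actual template): a slave pod points at the
# master by its stable StatefulSet DNS name instead of a pod IP, so a re-created
# redis-master-0 keeps the same address even though its IP changes.
apiVersion: v1
kind: Pod
metadata:
  name: example-redis-slave
spec:
  containers:
    - name: redis
      image: bitnami/redis:6.0.6-debian-10-r9
      env:
        - name: REDIS_MASTER_HOST        # assumed variable name, for illustration only
          value: "redis-master-0.redis-headless.default.svc.cluster.local"
        - name: REDIS_MASTER_PORT_NUMBER
          value: "6379"
```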
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback. |
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary. |
We got the same behaviour and are investigating ^^ @sstaw feel free to re-open the issue. Failover works as expected, but there seems to be some race condition where Sentinel caches the old endpoint IP and doesn't communicate with the other Sentinels to get the refreshed IP mapping |
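(For anyone debugging this, one way to see which master address a given Sentinel is still holding on to is to ask it directly. A minimal sketch, assuming the default master set name `mymaster`, the default Sentinel port 26379, and a Sentinel container named `sentinel` in each pod:)

```bash
# Ask each Sentinel which address it currently advertises for the master.
# "mymaster", port 26379 and the container name are assumptions; adjust for your values.
for pod in redis-master-0 redis-slave-0 redis-slave-1; do
  echo "--- $pod ---"
  kubectl exec "$pod" -c sentinel -- \
    redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
done
```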
@alemorcuq the problem is that when a master fails, the newly elected master can be from the pool of the "slaves" StatefulSet |
Hi, @dntosas. Could you share more details about your investigation and the issue itself? Unfortunately I'm not a Redis expert so that would help me understand the problem. Regards, |
Hello @alemorcuq, I think this issue is related to #19059, which was resolved recently |
Does enabling the |
@alemorcuq it seems so, it has been 5 days now and I still can't reproduce this error ^ |
Thank you very much for letting us know, @dntosas ! |
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback. |
Closing as solved. |
This is still an issue. I just had this happen. It's very rare, but when it happens it's catastrophic and it doesn't recover from it on its own. |
Hi @elucidsoft , could you share with us what you did so we can try to reproduce the issue? |
@dani8art I can easily reproduce it in a microk8s installation. When I restart my laptop, redis-master-0 gets assigned another IP address, and the slaves keep trying to connect to the old one. Setting a staticID (helm/charts#19059) did not help. The only way to recover is to delete the slave pods. Chart 10.7.12, Redis 6.0.6-debian-10-r9.
Before slave-0 restart (kubectl delete pod):
After restart:
|
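(The actual before/after output referenced above isn't captured in this thread. A hedged sketch of commands that would show the stale state, i.e. a replica still pointing at the old master IP, assuming a Redis container named `redis` and no auth:)

```bash
# Which master IP is the replica actually replicating from, and is the link up?
kubectl exec redis-slave-0 -c redis -- \
  redis-cli INFO replication | grep -E 'master_host|master_link_status'

# Compare with the IP the re-created master pod actually has:
kubectl get pod redis-master-0 -o wide
```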
I've tried to reproduce it but I couldn't; even when my master pod gets assigned another IP, it still works.

$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-master-0 2/2 Running 0 98s 10.48.2.118 darteaga-tests-default-pool-c3dd3f10-xdfd <none> <none>
redis-slave-0 2/2 Running 2 7m9s 10.48.0.236 darteaga-tests-default-pool-c3dd3f10-pm16 <none> <none>
redis-slave-1 2/2 Running 0 6m23s 10.48.3.85 darteaga-tests-default-pool-c3dd3f10-xf66 <none> <none>

$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-master-0 2/2 Running 0 2m4s 10.48.2.119 darteaga-tests-default-pool-c3dd3f10-xdfd <none> <none>
redis-slave-0 2/2 Running 2 9m22s 10.48.0.236 darteaga-tests-default-pool-c3dd3f10-pm16 <none> <none>
redis-slave-1 2/2 Running 0 8m36s 10.48.3.85 darteaga-tests-default-pool-c3dd3f10-xf66 <none> <none> |
@dani8art Your environment is different :)

image:
  tag: 6.0.6-debian-10-r9
sentinel:
  enabled: true
  usePassword: false
  staticID: true
  image:
    tag: 6.0.6-debian-10-r11
serviceAccount:
  create: true
rbac:
  create: true |
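(For reference, values like these are simply passed at install or upgrade time. A minimal sketch, assuming the chart comes from the bitnami repo, the release is named `redis`, and the snippet above is saved as `values.yaml`:)

```bash
# Sketch only: apply the values shown above when installing/upgrading the chart.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install redis bitnami/redis \
  --version 10.7.12 \
  -f values.yaml
```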
Hi @weisdd I've also tried restarting all of them together and it worked as well.

$ kubectl delete pod redis-master-0 redis-slave-0 redis-slave-1
pod "redis-master-0" deleted
pod "redis-slave-0" deleted
pod "redis-slave-1" deleted

$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
redis-master-0 0/2 ContainerCreating 0 3s
redis-slave-0 0/2 ContainerCreating 0 3s
redis-slave-0 0/2 Error 0 11s
redis-slave-0 0/2 Error 1 12s
redis-master-0 0/2 Running 0 14s
redis-slave-0 0/2 CrashLoopBackOff 1 14s
redis-master-0 1/2 Running 0 19s
redis-master-0 2/2 Running 0 20s
redis-slave-0 1/2 CrashLoopBackOff 1 24s
redis-slave-0 1/2 Running 2 29s
redis-slave-0 2/2 Running 2 37s
redis-slave-1 0/2 Pending 0 0s
redis-slave-1 0/2 Pending 0 0s
redis-slave-1 0/2 ContainerCreating 0 0s
redis-slave-1 0/2 Running 0 11s
redis-slave-1 1/2 Running 0 17s
redis-slave-1 2/2 Running 0 20s

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
redis-master-0 2/2 Running 0 67s
redis-slave-0 2/2 Running 2 67s
redis-slave-1 2/2 Running 0 30s

We recently released a new version of this chart; please try it with the latest one.
|
Hi @dani8art, I've just tried the latest chart, the behaviour is the same - slaves keep on trying to connect to the old master's IP. This time I did:
|
@dani8art Yesterday, we decided to use Redis with Gravitee APIM3. The latter doesn't support sentinel / cluster, so I was looking for a universal solution with HAProxy + Sentinel out-of-the-box, and came across another chart: dandydev/redis-ha. What's interesting about that implementation is that pods use ClusterIP services for synchronization: |
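(The idea in that chart, as I understand it, is that each Redis pod is addressed through its own small ClusterIP Service, so replication and Sentinel announcements target a stable virtual IP rather than a pod IP. Roughly sketched below with illustrative names, not copied from dandydev/redis-ha:)

```yaml
# Rough sketch of the per-pod "announce" Service pattern (names are illustrative).
# The Service's ClusterIP stays fixed even when redis-node-0 is re-created with a new pod IP.
apiVersion: v1
kind: Service
metadata:
  name: redis-announce-0
spec:
  type: ClusterIP
  selector:
    statefulset.kubernetes.io/pod-name: redis-node-0   # selects exactly one StatefulSet pod
  ports:
    - name: redis
      port: 6379
    - name: sentinel
      port: 26379
```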
I just had this happen again in production using the latest version of everything. I don't understand what's causing it, but it seems related to cluster restarts. |
Hi @elucidsoft, could you share info about your deployment, please? |
It's sentinel based, with staticID enabled. The rest is the default configuration. It's also not repeatable; I have attempted to force it to happen by deleting the pods manually. I tried this for hours and it always recovered, every single time. I tried restarting the nodes and it still recovered. I had gone 4 months without this happening, but I just had it happen twice in two weeks. What's happening is EXACTLY what the original poster of this issue describes. I believe what's causing it is that I'm on GKE and Google automatically upgrades my nodes and cluster. When they do this is when it happens. The nodes get rebooted in a rolling style: node 1, wait until it comes back fully, node 2, wait until it comes back fully, then node 3. I have my master set to run on one node and both my slaves on their own nodes using taint policies, so the master and each slave are always on separate nodes. |
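(For what it's worth, the same "one Redis pod per node" layout can also be expressed with pod anti-affinity instead of taints. A hedged sketch of the relevant pod-spec fragment, with an illustrative `app: redis` label:)

```yaml
# Illustrative only: schedule pods carrying app=redis onto distinct nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname
```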
Hi @elucidsoft, thanks for the info. It seems to be very related to #5181; we will check if we can reproduce it in our local environments and figure out the issue. |
I think that issue is what I'm experiencing yes... |
Hi, we have opened an internal task to investigate this error a bit more; unfortunately, we cannot give you an ETA. BTW, any actions or research results are welcome if you are performing more tests on this. Thanks for the feedback! |
Hi, |
Hi there,
I am having some problems when using the Redis chart with Sentinel mode enabled. The startup of the chart works like a charm. However, the Redis slaves are not able to reconnect to the newly created Redis master after the Redis master crashes and re-spawns.
From the logs, I can see that the Redis slaves are still trying to connect to the Redis master using the IP address of the old pod instead of the new pod. I have looked through all the configs again and cannot find anything relevant to this.
The README states that the slaves should be able to reconnect to the master. Would you be able to explain how exactly that works? How would the Redis slave pods know the IP address of the new Redis master pod?
Cheers,
Alex
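(To sketch the mechanism being asked about: each pod also runs a Sentinel that monitors the master; when the Sentinels agree the master is down, they elect a leader, promote one replica, and reconfigure the remaining replicas to replicate from the newly promoted master, whose current address can always be queried with SENTINEL get-master-addr-by-name. A minimal, illustrative sentinel.conf, not the chart's exact generated file; the values below are examples only:)

```conf
port 26379
# Sentinel tracks the master by address (an IP in Redis < 6.2); the stable DNS
# name of redis-master-0 is typically resolved to an IP when this file is
# generated, which is how a stale IP can end up cached after the master pod
# is re-created.
sentinel monitor mymaster 10.48.2.118 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 18000
```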