replicaof no one hangs on replica when connectivity to master is broken #2907
Comments
Hi @safa-topal, thank you for reporting this. I will take a look and come back :) |
Hi @safa-topal, it doesn't seem to reproduce on my side so maybe I am doing something wrong? So:
|
Hi @kostasrim, I can reproduce it with roughly the same flow (except on the 3rd step I cut the connectivity with iptables config instead of killing the master process). Curious, are there any logs indicating that replica is trying to reconnect to master? I don't see any such logs during my tests. |
Hi @safa-topal I can retry with iptables config instead but I am quite packed for the rest of the day. I used p.s. I doubt that iptables config is the problem but I should verify |
I also don't think iptables will make a difference in this case. I have the |
Your logs should be filled:
It is suspicious that you don't get them. I will try with iptables and ping back |
Yes, looks suspicious. There's no activity in the logs of replica I have until my code runs
|
I am pretty sure there is a logical explanation about this. Let me get back to you :) |
@safa-topal what happens if you kill/stop the master without the iptables? |
@romange in that case, what happens is identical to what @kostasrim experiences:
|
When you use iptables, a replica does not know that the master is dead, and it relies on TCP keepalive settings to recognize a closed socket. And, I just checked: we do not actually configure TCP keepalive on our replication connections on the replica; instead we have the following comment https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225 |
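For readers unfamiliar with the mechanism discussed here, enabling TCP keepalive on a client socket generally looks like the sketch below. It is an illustration only: it is written in Python rather than Dragonfly's C++, the helper name enable_keepalive and the timing values are hypothetical, and the TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT options are platform-dependent (they exist on Linux), so the code guards for their absence.

```python
import socket


def enable_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Turn on TCP keepalive for sock (hypothetical helper, illustrative values)."""
    # SO_KEEPALIVE is portable; it asks the kernel to probe an idle peer.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are Linux-style options, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between consecutive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is reported as dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

With settings like these, a peer that silently disappears (for example because packets are dropped by a firewall rule) would be detected after roughly idle_s + interval_s * probes seconds instead of the much longer kernel defaults.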
@romange I remember Roy experimented with those... I can take a look tomorrow |
@kostasrim I do not think they will help with the original issue of "replica no one" being stuck. I would actually check at what step it is stuck when performing this command. My guess is |
This comment explained why we do not see the "reconnect" logs, but it does not explain why "replica no one" hangs |
After some digging with |
Hi @safa-topal , just to be 100% sure, you dropped the connection via:
I found the bug and patched it but I wanted to double check. |
Hi @kostasrim, yes, looks similar. These are the iptables directives I've used:
|
@safa-topal hmm I wonder if we see the same issue now because:
Both of them have side effects in this context: For (1) redis-cli will stop working, so issuing Also both of them will make the system unstable (on my ubuntu it crashes a few things) as it drops all kinds of packets. So in this context I think these two are improper (meaning that my patch won't fix them since the issue is with how iptables is used). Also both of them do not work with version 1.14.5 Now for the:
Doesn't actually have an effect because it modifies the
Let me know if you see something different. I just want to be 100% sure we are on the same page. |
Hi @kostasrim, yes, there's something different in my scenario compared to what you explain above. I run these iptables rules on the master node only, so the replica is not affected at all. Therefore the statements below are not true when requests are made against the replica; they would be true only if my code were making requests to the master node, which is not the case:
Once the master gets this iptables configuration, my failover code runs and the replica is promoted to master. There are no issues with the connection to the replica at this stage. |
Oh I see, if it's specific to the master instance then it has similar semantics with my config as well. I expect my patch to work :) |
Describe the bug
In a failover scenario to the replica instance, when a replica can't connect to its master anymore, the replicaof no one command fails to return OK and instead hangs for a relatively long time. Observed from another shell, the replica seems to have accepted the config change and reports itself as master, so the command actually works. A simple workaround would be to handle the timeout and ignore it; however, that might cause false negatives later when other timeouts happen for more genuine reasons.
To Reproduce
Steps to reproduce the behavior:
replicaof no one on the replica node
Expected behavior
The server must return an OK response to the client.
Environment :
Additional context
1.14.5 doesn't have the same issue.
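The workaround mentioned in the bug description (tolerate the timeout, but avoid blindly ignoring it) could be sketched roughly as follows. This is a minimal illustration over a raw socket, not the reporter's failover code: the helper names encode_command, send_command and promote_replica are hypothetical, and it assumes the server's INFO replication reply contains a role:master line once the promotion has taken effect, as it does in Redis.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def send_command(host, port, args, timeout_s):
    """Send one command; return the first chunk of the reply, or None on timeout."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(encode_command(*args))
        try:
            return conn.recv(4096)
        except socket.timeout:
            return None


def promote_replica(host, port, timeout_s=2.0):
    """Issue REPLICAOF NO ONE; if it times out, confirm the role separately."""
    reply = send_command(host, port, ("REPLICAOF", "NO", "ONE"), timeout_s)
    if reply is not None and reply.startswith(b"+OK"):
        return True
    # The command hung or returned something unexpected: rather than ignoring
    # the timeout, verify on a fresh connection that the node reports master.
    info = send_command(host, port, ("INFO", "replication"), timeout_s)
    return info is not None and b"role:master" in info
```

Checking the role after a timeout keeps the failover from reporting success on a timeout that has a more genuine cause, which is the false-negative risk the description points out.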