Document MasterSlave connection behavior on partial node failures #894
Comments
Thanks for this extensive bug report. Static configuration (without Sentinel) is meant to remain static, so we won't provide any means of automated refresh. For Redis Cluster and Sentinel, we have at least a notion of a topology source that we can query regularly; Sentinel and Cluster give us some reliability as to where to find that topology source. For Master/Replica, that's different, as changes to the static setup are not tied to any rules. Regarding your suggestions:
The assumption here is that connections to a partially-down setup should fail. We quite often face the requirement to allow connections to a partially-down Redis setup, and changing our behavior would raise the bar in terms of Redis availability.
This approach could work for a few users, but not for all. Some users might not want this behavior.
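For context, the distinction between the two setups shows up in how the connection is created: a static list of URIs gives lettuce nothing to re-query, while a Sentinel URI points at a topology source it can ask again. A rough sketch, assuming lettuce 5.x and hypothetical host names:

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

import java.util.Arrays;

public class TopologySourceExample {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create();

        // Static setup: the node list is fixed. There is no topology source to
        // re-query, so nodes are only discovered/verified when connecting.
        StatefulRedisMasterSlaveConnection<String, String> staticSetup =
                MasterSlave.connect(client, StringCodec.UTF8, Arrays.asList(
                        RedisURI.create("redis://redis-master.example:6379"),
                        RedisURI.create("redis://redis-replica.example:6379")));

        // Sentinel setup: the Sentinel endpoint is a topology source that can be
        // asked again, which is what makes ongoing topology updates possible.
        StatefulRedisMasterSlaveConnection<String, String> sentinelSetup =
                MasterSlave.connect(client, StringCodec.UTF8,
                        RedisURI.create("redis-sentinel://sentinel.example:26379#mymaster"));

        staticSetup.close();
        sentinelSetup.close();
        client.shutdown();
    }
}
```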
A couple extra details - in case they help 🙂
Thanks for the update. ElastiCache indeed has no publicly discoverable endpoints. IIRC there is a possibility to listen for changes via AWS SNS, but I'm not that involved with AWS.
For now, I'm going to test the connection state before marking it as active. I'm not sure if this is exactly the right strategy, but essentially it amounts to the following.
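A minimal sketch of that kind of check, assuming lettuce 5.x; `connectVerified` is a hypothetical helper name and the actual code may differ:

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

public class ConnectionHealthCheck {

    // Creates the master/slave connection and verifies it is actually usable before
    // handing it out. If the PING fails (for example because topology discovery
    // half-failed and the node list ended up empty), the connection is closed and
    // the failure surfaces immediately instead of on first use.
    public static StatefulRedisMasterSlaveConnection<String, String> connectVerified(
            RedisClient client, RedisURI uri) {

        StatefulRedisMasterSlaveConnection<String, String> connection =
                MasterSlave.connect(client, StringCodec.UTF8, uri);
        try {
            connection.sync().ping(); // throws if no usable node is known
            return connection;
        } catch (RuntimeException e) {
            connection.close();
            throw e;
        }
    }
}
```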
The second thing I am attempting is creating another connection. Again, I'm not sure if that's the right way to approach this, but that is the best way I can see to do it on our end 😄
The above-mentioned workflow (testing the connection) seems about right. I tried to reproduce a half-failed state where the specified endpoint does not react. Both scenarios work as they should.
Closing this ticket as there's nothing left to do here.
Just to be clear, did you start the …?
I really just want to make sure we're testing this the same way. I can possibly stitch a sample case together if it is too difficult to reproduce by hand.
Yes, I tried it with various scenarios (timeout of 1 second).
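For illustration, one scenario along those lines might look like the following; the host name, the timeout placement, and the external DEBUG SLEEP step are assumptions for the example:

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

import java.time.Duration;

public class PartialFailureScenario {

    public static void main(String[] args) {
        // Command timeout of 1 second so a stalled node fails fast.
        RedisURI uri = RedisURI.builder()
                .withHost("redis-master.example")
                .withPort(6379)
                .withTimeout(Duration.ofSeconds(1))
                .build();

        RedisClient client = RedisClient.create();

        // While this connect (or a later command) runs, stall the node externally,
        // e.g. with `redis-cli DEBUG SLEEP 10` against the master, to simulate a
        // node that accepts connections but does not respond in time.
        try (StatefulRedisMasterSlaveConnection<String, String> connection =
                     MasterSlave.connect(client, StringCodec.UTF8, uri)) {
            connection.sync().ping();
        } finally {
            client.shutdown();
        }
    }
}
```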
Bug Report
Current Behavior
Connecting to master/slave setups using autodiscovery can half-fail and leave the list of nodes empty, with no chance to recover. The problem is that creating the `StatefulRedisMasterSlaveConnection` will succeed, but every attempt to use it will fail. The error is produced at `io.lettuce.core.masterslave.MasterSlaveConnectionProvider#getMaster`.

I believe this happens because the initial topology refresh can fail. The first connection to the master may be fine, so we don't get an exception. But the refresh seems to happen out-of-band and appears to ping each discovered Redis node at `io.lettuce.core.masterslave.MasterSlaveTopologyRefresh#getNodes`, and more specifically `io.lettuce.core.masterslave.Connections#requestPing`.

To reproduce this, you can set a breakpoint (or a pause of your choice) at the call to `requestPing` and send a `DEBUG SLEEP [some interval]` to Redis, preferably to the master node. After this you can see the `knownNodes` list refresh; if you're in a single-master environment it will set itself to an empty list. If you're in a master/slave environment, any node that fails to ping at this exact point in time will be removed from the node list and never contacted again. I think this is probably by design, but it unfortunately produces at best unexpected results and at worst connections that can never work.

From here I don't think it can recover. The instance of the `StatefulRedisMasterSlaveConnection` stays alive, but it will throw every time you try to use it because there are no nodes in the list and the topology will never refresh. This makes it hard to detect.

Input Code
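Roughly, the connection is created with autodiscovery along these lines (the host name is a placeholder and the exact setup may differ):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

public class ReproSetup {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create();

        // Creating the connection succeeds even if the out-of-band topology
        // refresh drops nodes from knownNodes.
        StatefulRedisMasterSlaveConnection<String, String> connection =
                MasterSlave.connect(client, StringCodec.UTF8,
                        RedisURI.create("redis://redis.example:6379"));

        // If knownNodes ended up empty, every command throws from
        // MasterSlaveConnectionProvider#getMaster.
        connection.sync().get("some-key");

        connection.close();
        client.shutdown();
    }
}
```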
Expected behavior/code
I expected that if the connections were unreachable they would either:

- be retried later, giving the topology a chance to recover, or
- produce an error up front, so the failure can be handled when the connection is created.
Environment
Possible Solution
I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later. Clustered clients do this with periodic and adaptive refreshes. I think this should be possible using the initial discovery in non-clustered setups to produce more reliable long-term results and a chance to recover from this particular failure state.
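For comparison, a sketch of how the clustered client enables periodic and adaptive refreshes (using `ClusterTopologyRefreshOptions`; the refresh interval here is just an example):

```java
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

import java.time.Duration;

public class ClusterRefreshExample {

    public static void main(String[] args) {
        RedisClusterClient clusterClient =
                RedisClusterClient.create("redis://redis.example:6379");

        // Periodic refresh re-reads the topology on a schedule; adaptive refresh
        // reacts to triggers such as persistent connection failures.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        clusterClient.shutdown();
    }
}
```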
Another solution would be to simply throw an error if no nodes are discovered, so we can catch it before the `StatefulRedisMasterSlaveConnection` instance is kept (just like a normal connection failure would). It may also be worthwhile to expose the current list of nodes from the connection so it can be inspected for issues (e.g. only slave nodes being discovered, which may not be suitable for all cases).

Additional context
I hope this helps! Please let me know if I have left out any important details or if there is a workaround or expected behavior I'm clearly missing. I may be able to take suggestions and make them into a PR.