Question about detecting dead connection #1572
@wangkekekexili, on what OS are you testing this code? As far as I know, only linux/epoll handles keep-alive properly (netty/netty#9780). Did you enable keepAlive in the app that is facing this problem?
Edit: are you sure that's true? When a new instance is added to the cluster, a new IP can appear, but why would AWS change an existing instance's IP? Can you paste the stacktrace you see in your logs?
Thanks for your response.
Sorry, I may not have made it very clear, but by "scaling up" I mean modifying the node type (say, changing from cache.m5.large to cache.m5.2xlarge) to give the instance more memory (https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Scaling.RedisStandalone.ScaleUp.html). This action doesn't change the number of read replicas. (I can also confirm that just adding a replica doesn't cause errors, as we have done that in production.)
Image "adoptopenjdk/openjdk8:jdk8u252-b09" (https://hub.docker.com/layers/adoptopenjdk/openjdk8/jdk8u252-b09/images/sha256-daf9b6b24d0a0d2099900e6eeef15b37360edd1c1933673173729773741e53a9?context=explore) is used.
Yes, I have enabled "keepAlive"; I set it in the socket options:
I also noticed that I need to tweak some socket options to override default values. But the keep-alive feature doesn't really work in my case, so I didn't include this part in my question snippet.
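The actual snippet didn't survive the page extraction. For reference, a minimal sketch of enabling TCP keep-alive through Lettuce's SocketOptions (the endpoint URI is a placeholder; the exact options the poster used are not shown in this thread):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;

public class KeepAliveConfig {
    public static void main(String[] args) {
        // Placeholder endpoint; substitute your ElastiCache reader endpoint.
        RedisClient client = RedisClient.create("redis://localhost:6379");

        // Enable SO_KEEPALIVE on the transport socket. Note that the kernel
        // defaults (e.g. tcp_keepalive_time = 7200s on Linux) still apply
        // unless tuned at the OS or transport level, which is why plain
        // keepAlive(true) often detects a dead peer only after hours.
        client.setOptions(ClientOptions.builder()
                .socketOptions(SocketOptions.builder()
                        .keepAlive(true)
                        .build())
                .build());
    }
}
```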
The general motivation to use Lettuce is its built-in resiliency by trying to auto-reconnect. That being said, you should not see dead connections; rather, the way to think about it is to consider a connection temporarily unavailable because of a failover. Using a HA deployment where the endpoint (DNS name) gets updated with the active master or replica node is the right way to approach high availability.
I assume you're talking about AWS removing the node and reconfiguring the cluster. As long as the infrastructure puts back a node and updates the DNS name, everything is fine; if the DNS name itself changes, that is a different story. Moreover, if a peer goes away and stops responding (firewall change, server node gets killed), then keep-alive is a good choice to detect dead peers. With #1437, we will apply Keep-Alive customizations, basically what you've outlined in your comment #1572 (comment). Note that extended keep-alive requires either NIO sockets with Java 11 or newer, epoll sockets (native transport), or io_uring sockets (native transport).
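Following #1437, Lettuce 6.1+ exposes extended keep-alive settings (idle time, probe interval, probe count). A hedged sketch; builder names follow the 6.1 API, so verify against the version you run:

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;

public class ExtendedKeepAlive {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");

        // Extended keep-alive: these values take effect only with NIO on
        // Java 11+, epoll, or io_uring, as noted in the comment above.
        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofMinutes(2))     // TCP_KEEPIDLE
                        .interval(Duration.ofSeconds(5)) // TCP_KEEPINTVL
                        .count(3)                        // TCP_KEEPCNT
                        .build())
                .build();

        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
    }
}
```

With these example values, a dead peer would be declared after roughly 2 minutes of idle plus 3 failed probes 5 seconds apart, instead of the multi-hour OS default.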
Thank you @mp911de for your response.
Yes, during the AWS Redis scale-up process, AWS sets up a new server and updates the DNS record to switch to the new IP without changing the name. This matches how AWS support describes the process.
In my case, it causes timeout errors for some time during the process. Let me show a concrete example. During one scale-up test, I connected to the REDACTED.cache.amazonaws.com reader endpoint; it had IP 172.16.51.76 in the beginning and later switched to 172.16.51.138 during the scale-up. At some point the client started to show errors, and it's around this time that the DNS record was updated to the new IP address.
Lettuce logs show that it is still trying to talk to the old IP address.
Some time later, Lettuce notices the channel is inactive and successfully reconnects with the new IP address.
It looks to me that if Lettuce could notice the connection is unavailable at "2021-01-04T10:41:14.024Z" and try to reconnect at that point, it could recover sooner, so I'm wondering if that is possible.
Lettuce doesn't monitor DNS. If, during scaling, a new host gets in place first, the DNS gets updated, and then the old host goes away, then the reconnect at that time is the only trigger we have. Clearly, you can handle scaling events in your application by issuing a QUIT. Since there isn't anything beyond that we could do, I'd like to close this ticket.
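Since Lettuce itself doesn't monitor DNS, an application that wants faster recovery can poll the endpoint's resolution and react when the address changes (for example by closing or resetting the connection so Lettuce reconnects). A minimal sketch; DnsChangeWatcher is a hypothetical application-level helper, not Lettuce API, and the resolver is injected so the production variant would plug in InetAddress:

```java
import java.util.function.Function;

/** Polls a hostname's resolved address and reports when it changes.
 *  Hypothetical application-level helper, not part of Lettuce. */
class DnsChangeWatcher {
    private final String host;
    private final Function<String, String> resolver;
    private String lastIp;

    DnsChangeWatcher(String host, Function<String, String> resolver) {
        this.host = host;
        this.resolver = resolver;
        this.lastIp = resolver.apply(host);
    }

    /** Returns true when the resolved address differs from the previous
     *  poll. On true, the application could proactively reset its Redis
     *  connection so Lettuce reconnects against the new address. */
    boolean poll() {
        String ip = resolver.apply(host);
        boolean changed = !ip.equals(lastIp);
        lastIp = ip;
        return changed;
    }
}
```

A real resolver would be something like `h -> InetAddress.getByName(h).getHostAddress()` (wrapped to handle UnknownHostException), invoked from a scheduled executor every few seconds.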
@mp911de Thank you. I'm wondering if Lettuce can try to reconnect before the server responds to the QUIT command (since the server may not be available to send the response), or if we can manually tell Lettuce to reconnect?
No, that doesn't work. Another alternative could be reflectively obtaining the channel and closing it. Since the connection doesn't expect the channel to be closed, it will try to reconnect. However, reflection is tricky.
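As a sketch of that reflective workaround: the private field name "channel" is an assumption about Lettuce's internal endpoint implementation and can change between versions, which is exactly why reflection is tricky here. A generic helper might look like:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

/** Reflectively pulls a private "channel" field out of an object and
 *  invokes close() on it. Sketch only: it relies on assumed Lettuce
 *  internals and breaks easily across versions. */
class ChannelCloser {
    static boolean closeChannel(Object endpoint) {
        try {
            // Field name "channel" is an assumption about the internals.
            Field f = endpoint.getClass().getDeclaredField("channel");
            f.setAccessible(true);
            Object channel = f.get(endpoint);
            if (channel == null) {
                return false; // not currently connected; nothing to close
            }
            Method close = channel.getClass().getMethod("close");
            close.setAccessible(true);
            close.invoke(channel);
            return true; // watchdog should now observe the closed channel
        } catch (ReflectiveOperationException e) {
            return false; // internals differ; the trick does not apply
        }
    }
}
```

Closing the channel this way makes the connection look unexpectedly dropped, so the reconnect machinery kicks in, but any Lettuce upgrade can silently break the field lookup.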
I'm not sure whether this is a feature request, an issue on my end or just a simple question so please forgive me for not completely following the template.
Current Behavior
The issue we are encountering is that during the scale-up process of AWS Redis, we are seeing io.lettuce.core.RedisCommandTimeoutException errors. We are using non-cluster-mode Redis and connecting to the reader endpoint. When scaling up AWS Redis, the DNS domain name remains the same but the IP changes; that's when the client starts to show errors. After some time, ConnectionWatchdog seems to notice the channel is inactive; Lettuce reconnects and gets the updated IP address.
I think the timeout issue is caused by the client side still holding the existing connection after the peer disappears; it doesn't know the peer is gone and keeps sending requests over the existing connection. I'm wondering what I can do here to detect the dead connection? Could ConnectionWatchdog be updated to catch dead connections and try to reconnect?
Input Code
I'm using this simple code for testing the behavior:
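The snippet itself didn't survive the page extraction. A hedged reconstruction of what such a probe might look like (the endpoint, key, and timeout are placeholders, not the poster's actual values):

```java
import java.time.Duration;
import java.time.Instant;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.api.StatefulRedisConnection;

public class DeadConnectionProbe {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder endpoint; the real test pointed at the ElastiCache
        // reader endpoint described above.
        RedisClient client = RedisClient.create("redis://localhost:6379");
        client.setOptions(ClientOptions.builder()
                .socketOptions(SocketOptions.builder().keepAlive(true).build())
                .build());

        StatefulRedisConnection<String, String> connection = client.connect();
        connection.setTimeout(Duration.ofSeconds(1));

        // Issue a GET once per second; during the scale-up window this
        // surfaces RedisCommandTimeoutException until the watchdog notices
        // the inactive channel and reconnects.
        while (true) {
            try {
                connection.sync().get("test-key");
            } catch (Exception e) {
                System.out.println(Instant.now() + " " + e);
            }
            Thread.sleep(1000);
        }
    }
}
```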
Environment
Any suggestions would be greatly appreciated!