
Client cannot handle DNS updates #144

Closed
pulse00 opened this issue Oct 2, 2015 · 15 comments

Labels
type: feature A new feature
Milestone
Lettuce 3.4

Comments

@pulse00

pulse00 commented Oct 2, 2015

We're running a Disque cluster on AWS. Today we tested a complete failure of a single node. The setup looks like this:

  • node1.example.com running one disque node
  • node2.example.com running another disque node
  • client.example.com running the client (in our case a spinach client)

Both nodes were clustered using disque cluster meet ... .

Now the outage test looked like this:

  1. Terminate instance node1.example.com
  2. Re-provision instance node1.example.com (gets a new IP address on AWS)
  3. Join the cluster again using disque cluster meet node1.example.com 7711 on node2.example.com
  4. Terminate node2.example.com
  5. Re-provision node2.example.com (gets a new IP address on AWS)
  6. Join the cluster by running disque cluster meet node2.example.com 7711 on node1.example.com

From the Disque perspective this test was successful; however, spinach could not handle the DNS change:

Cannot connect: io.netty.channel.ConnectTimeoutException: connection timed out: node1.example.com/10.0.0.1:7711 [c.l.r.p.ConnectionWatchdog]
Reconnecting, last destination was node1.example.com/10.0.0.1:7711 [c.l.r.p.ConnectionWatchdog]

I'm creating this issue in this repository because I think the underlying ConnectionWatchdog handles the failover scenario.

@mp911de mp911de added the type: feature A new feature label Oct 2, 2015
@mp911de
Collaborator

mp911de commented Oct 2, 2015

You're totally right here. Hostname-to-IP resolution is handled by Java's built-in mechanisms, which means that once a hostname has been looked up, the address is cached for the duration of the JVM's uptime.

Custom DNS lookups are possible but require additional infrastructure, such as providing a DNS server address. Although I like the failover (and DNS lookup) idea, I see that it has natural limits. It's true that these failovers can be handled at the DNS level. However, I also see the following options:

  1. Maintain the cluster topology within the client (as for Redis Cluster). You then require at least one stable server from which the cluster view is obtained. Everything happens on the IP level (not hostnames), and we're fine.
  2. That functionality could be built into lettuce. My usual experience from the past was: then just restart the client. It's not nice, and it does not feel cloud native.
  3. The functionality could perhaps be covered at the RedisURI/DisqueURI level, since they cache the SocketAddress. You could provide one that performs the lookup just in time and handles caching on its own (a conceptual sketch follows this list).
  4. I also had an exposed SocketAddressProvider API in mind. It is an API that is already in use, and it could be made public so users could bring their own implementation.
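
For illustration, a conceptual sketch of options 3 and 4 (not lettuce API; the class name and the plain java.util.function.Supplier shape are illustrative only) that resolves the hostname just in time, so every (re)connect attempt sees the current DNS record:

import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.function.Supplier;

// Conceptual only: resolves the hostname on every call instead of caching the address.
public class JustInTimeAddressSupplier implements Supplier<SocketAddress> {

    private final String hostname;
    private final int port;

    public JustInTimeAddressSupplier(String hostname, int port) {
        this.hostname = hostname;
        this.port = port;
    }

    @Override
    public SocketAddress get() {
        // new InetSocketAddress(host, port) performs a fresh DNS lookup at construction
        // time, subject to the JVM's networkaddress.cache.ttl setting.
        return new InetSocketAddress(hostname, port);
    }
}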

We should collect our thoughts and discuss these.

@mp911de
Collaborator

mp911de commented Oct 9, 2015

Just checked: the netty DNS resolver is part of netty 4.1/5.0, not 4.0.

The Java VM allows tweaking DNS caching by setting -Dnetworkaddress.cache.ttl=0 or -Dsun.net.inetaddr.ttl=0; see http://docs.oracle.com/javase/6/docs/technotes/guides/net/properties.html#nct for details.

Update: Using system properties works only if no security manager is in use. If a security manager is in place, the following properties need to be set before the first TCP connection is initiated (otherwise they are ignored):

static {
    // Security properties take effect only when set before the first lookup/connection.
    // TTL (in seconds) for successful name lookups:
    java.security.Security.setProperty("networkaddress.cache.ttl", "60");
    // TTL (in seconds) for failed (negative) lookups:
    java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10");
}

This means not caching the SocketAddress could help to meet your needs.

@mp911de
Collaborator

mp911de commented Oct 14, 2015

@pulse00 Could you give it a try by subclassing DisqueURI.DisqueHostAndPort and overriding getResolvedAddress() to always return a new InetSocketAddress, together with the network-cache settings property? Maybe a variant of https://github.com/mp911de/spinach/blob/hello-cluster/src/main/java/biz/paluch/spinach/impl/SocketAddressSupplierFactory.java#L41 could also help to mitigate the DNS issue (dynamic topology updates still to be done).
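
A rough sketch of this suggestion, assuming DisqueURI.DisqueHostAndPort offers a (host, port) constructor, getHost()/getPort() accessors and an overridable getResolvedAddress() returning SocketAddress; names may differ between spinach versions:

import java.net.InetSocketAddress;
import java.net.SocketAddress;

import biz.paluch.spinach.DisqueURI;

public class ResolvingHostAndPort extends DisqueURI.DisqueHostAndPort {

    public ResolvingHostAndPort(String host, int port) {
        super(host, port); // assumed constructor
    }

    @Override
    public SocketAddress getResolvedAddress() {
        // Create a fresh InetSocketAddress on every call so the hostname is re-resolved;
        // combine with the networkaddress.cache.ttl setting mentioned above so the JVM
        // does not pin the old IP address.
        return new InetSocketAddress(getHost(), getPort());
    }
}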

@mp911de
Collaborator

mp911de commented Oct 22, 2015

@pulse00 Perhaps https://github.com/mp911de/spinach/wiki/SocketAddress-Supplier-API can help you. I just pushed code that allows updating the cluster view, so there is no more reliance on DNS once the initial connect is done.

@pulse00
Author

pulse00 commented Oct 23, 2015

Cool, thanks. We'll give that a try and let you know if it solves the issue.

@mp911de
Collaborator

mp911de commented Oct 25, 2015

To cope with the Java DNS resolution settings, I would change the client to no longer cache the InetSocketAddress. The RedisURI.getResolvedAddress method can still be overridden to implement a custom lookup if needed.

mp911de added a commit that referenced this issue Oct 25, 2015
The SocketAddress obtained by getResolvedAddress is no longer cached. This is to allow dynamic DNS updates at the time of establishing a connection. This change introduces the possibility to change the Redis connection point details because the connection point details are obtained directly from the RedisURI argument.
mp911de added a commit that referenced this issue Oct 25, 2015
The SocketAddress obtained by getResolvedAddress is no longer cached. This is to allow dynamic DNS updates at the time of establishing a connection. This change introduces the possibility to change the Redis connection point details because the connection point details are obtained directly from the RedisURI argument.
mp911de referenced this issue in mp911de/spinach Oct 25, 2015
The SocketAddress obtained by getResolvedAddress is no longer cached. This is to allow dynamic DNS updates at the time of establishing a connection. This change introduces the possibility to change the Disque connection point details because the connection point details are obtained directly from the DisqueURI argument.

Reference: mp911de/lettuce#144
@mp911de
Collaborator

mp911de commented Nov 30, 2015

@pulse00 any update on this topic from your side?

@pulse00
Author

pulse00 commented Dec 3, 2015

@mp911de Sorry for the delay. We'll be testing failover scenarios in our infrastructure over the next couple of days; I hope I can give you some feedback then.

@mp911de
Collaborator

mp911de commented Dec 3, 2015

np, thx for the update.

@mzapletal
Contributor

Hi Marc, we are taking up this issue again. Since spring-data-redis still relies on lettuce 3.3, we cannot switch to the latest lettuce 4.0 release that includes your change. Hence we subclassed RedisURI as you suggested. Since EpollProvider.newSocketAddress(getSocket()) has package-level access, we put the subclassed RedisURI into a com.lambdaworks.redis package :)
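
A minimal sketch of this workaround, assuming lettuce 3.3: placing the subclass in com.lambdaworks.redis gives it access to the package-private EpollProvider.newSocketAddress(...); the constructor signature and accessor names are assumptions and may differ between lettuce versions.

package com.lambdaworks.redis;

import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.concurrent.TimeUnit;

public class ResolvingRedisURI extends RedisURI {

    public ResolvingRedisURI(String host, int port, long timeout, TimeUnit unit) {
        super(host, port, timeout, unit); // assumed constructor
    }

    @Override
    public SocketAddress getResolvedAddress() {
        if (getSocket() != null) {
            // Unix domain sockets do not involve DNS; delegate to the epoll provider.
            return EpollProvider.newSocketAddress(getSocket());
        }
        // Re-resolve the hostname on every call instead of returning the address cached
        // at construction time.
        return new InetSocketAddress(getHost(), getPort());
    }
}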

I'll keep you posted as soon as we have conducted our outage test.

@mp911de
Collaborator

mp911de commented Jan 15, 2016

The fix is (will be) part of the 3.4 release (see 7a7cb47). Give Spring Data Redis with the 3.4-SNAPSHOT build a try.

@mzapletal
Contributor

Thanks Marc, we will do that

@mp911de mp911de added this to the Lettuce 3.4 milestone Jan 21, 2016
@oklahomer

I was thinking of adopting 3.4.1, but just noticed that this issue, labeled "Lettuce 3.4", was still open. I think this issue is already fixed, merged, and part of 3.4.1.Final.
Is that right?

@mp911de
Collaborator

mp911de commented Feb 19, 2016

Sorry for the confusion; I just didn't clean it up properly. The issue is fixed with 3.4, and the fix is also part of 3.4.1. There are, however, two bugs (#194, #196) that are quite ugly, so you may want to wait until 3.4.2, which I plan to release next week.

@oklahomer

Thanks for the updates, and I'm looking forward to having 3.4.2 😄
