Try every host when attempting to get the ip to hostname mapping if a failure is hit. #6014

Conversation
@@ -104,7 +105,8 @@ public CassandraService(
         this.blacklist = blacklist;
         this.poolMetrics = poolMetrics;

-        Supplier<Map<String, String>> hostnamesByIpSupplier = new HostnamesByIpSupplier(this::getRandomGoodHost);
+        Supplier<Map<String, String>> hostnamesByIpSupplier =
+                new HostnamesByIpSupplier(this::getAllHostsUnlessBlacklisted);
         this.hostnameByIpSupplier = Suppliers.memoizeWithExpiration(hostnamesByIpSupplier::get, 2, TimeUnit.MINUTES);
How does this 2 minute memoization factor into cassandra nodes coming into and leaving cluster, especially during rolling restarts and upgrades?
I don't have the historical context on why the 2 minutes was picked here, but from what I have seen, it takes roughly 2 (usually 3) minutes for a Cassandra node to leave/re-join a cluster, so this should be safe. Although, maybe we knock this down to 1 minute? Question for the Atlas folks!
No strong opinions on the memoisation length - it was probably picked arbitrarily, honestly. Yeah we can make this 1.
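For context, the behaviour being discussed can be sketched in plain Java. This is a minimal re-implementation of what Guava's Suppliers.memoizeWithExpiration provides (the PR uses the Guava utility directly); the class name is illustrative, not from the codebase:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of an expiring memoizer: the delegate is consulted at most
// once per TTL window, so cluster-membership changes surface only
// after the cached value expires.
final class ExpiringSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private volatile long expiresAtNanos;
    private volatile T cached;

    ExpiringSupplier(Supplier<T> delegate, long ttl, TimeUnit unit) {
        this.delegate = delegate;
        this.ttlNanos = unit.toNanos(ttl);
    }

    @Override
    public synchronized T get() {
        long now = System.nanoTime();
        if (cached == null || now - expiresAtNanos >= 0) {
            cached = delegate.get();
            expiresAtNanos = now + ttlNanos;
        }
        return cached;
    }
}
```

With a 1-minute TTL, as proposed above, a node leaving or joining the cluster would be reflected in the mapping within roughly a minute of the change becoming visible to the queried hosts.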
...b-cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/pool/CassandraService.java
                .map(Optional::get)
                .findFirst()
If we're grabbing the first result, should we filter out any empty responses to grab the first non-empty view of hostnamesByIp?
-                .map(Optional::get)
-                .findFirst()
+                .map(Optional::get)
+                .filter(hostnamesByIp -> !hostnamesByIp.isEmpty())
+                .findFirst()
I did this originally, but there's a (subtle) case where an empty map is a valid response: the mapping keyspace doesn't exist, the table doesn't exist, or there are no rows in the map.
As a result, you end up querying every single node, which isn't needed
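The distinction can be sketched in isolation: a host that fails outright yields Optional.empty() and is skipped, while a host that successfully answers with an empty map (keyspace or table absent) short-circuits the scan. Names and types here are illustrative simplifications, not the PR's actual code:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

final class HostnameMappingSketch {
    // Returns the first *successful* response, even if that response
    // is an empty map; only outright failures move on to the next host.
    static Map<String, String> firstSuccessfulMapping(
            List<Supplier<Optional<Map<String, String>>>> hosts) {
        return hosts.stream()
                .map(Supplier::get)
                .filter(Optional::isPresent) // a failed host yields Optional.empty()
                .map(Optional::get)          // an empty map is still a valid answer
                .findFirst()
                .orElseGet(Map::of);
    }
}
```

Adding a .filter(map -> !map.isEmpty()) after the unwrap would instead treat a valid "no mapping configured" answer as a failure and fall through to every remaining host, which is the extra querying described above.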
When does the mapping keyspace or table not exist for a given node (e.g. is it only on bootstrap)? Aside from duration, why wouldn't we want to continue querying nodes to try and find a non-empty view from other nodes?
Related to the duration and the 60 second timeout mentioned above, it feels like we might want something managing cluster state in a background thread, and possibly Refreshable mechanisms for updating and subscribing suppliers.
So the keyspace/table are an optional dependency, and are only needed in the case that AtlasDb is run in a split environment (i.e. in two separate VPCs). As a result, we can expect that sometimes these tables do not exist, and in those cases, the IP addresses should be sufficient for communication.
As for querying other nodes for non-empty views, the read itself is done at quorum, so there should be no difference between querying 1 or 2 nodes, as that is effectively what we're doing under the hood.
Related to the duration & 60 second timeout mentioned above, it feels like we might want to have something managing cluster state in background thread and possibly Refreshable mechanisms for updating & subscribing suppliers
Agreed -- this is already run in a background thread, so it doesn't block any critical read/write paths.
…cassandra/pool/CassandraService.java Co-authored-by: David Schlosnagle <[email protected]>
...sandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/pool/HostnamesByIpSupplier.java
Outdated
Show resolved
Hide resolved
            log.warn("Could not get hostnames by ip from Cassandra", e);
            return ImmutableMap.of();
        }
        return hosts.get().stream()
We should unroll this into a loop and track a best-effort timeout here. That is, if this takes over 60s to run, we should just fail, so that we don't prevent an actual update from occurring in cases where hostnames are not needed.
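A hedged sketch of that unrolled loop with a best-effort deadline, under the assumption (from this thread, not the PR diff itself) of a 60 second overall budget; names and types are illustrative simplifications:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

final class DeadlineQuerySketch {
    // Try hosts one at a time, but stop once the overall budget has
    // elapsed rather than letting per-host timeouts stack up to
    // timeout * N for the whole scan.
    static Optional<Map<String, String>> queryUntilDeadline(
            List<Supplier<Optional<Map<String, String>>>> hosts, Duration budget) {
        long deadlineNanos = System.nanoTime() + budget.toNanos();
        for (Supplier<Optional<Map<String, String>>> host : hosts) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                return Optional.empty(); // give up rather than block the update
            }
            Optional<Map<String, String>> result = host.get();
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

The deadline is checked before each host rather than interrupting an in-flight request, so it is best-effort: the loop can still overrun by up to one per-host timeout.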
Also @jeremyk-91 I decided to go through each host rather than a quorum of hosts, as the blacklist filter removes an unknown number of hosts.
Given the potential expense of resolving these, it seems like we might want this all to be updating & heartbeating in a background thread, as opposed to potentially impacting the regular Atlas transaction execution thread.
Good flag - I think this should be on a background thread already though, as following the places where this supplier is called, they all seem to point back to the background task that updates the token ring (unless this PR changed that, of course).
…cassandra/pool/HostnamesByIpSupplier.java Co-authored-by: David Schlosnagle <[email protected]>
@@ -367,6 +369,12 @@ public Optional<CassandraClientPoolingContainer> getRandomGoodHostForPredicate(
         return randomLivingHost.map(pools::get);
     }
This might be a separate consideration, but for com.palantir.atlasdb.keyvalue.cassandra.pool.CassandraService#getReachableProxies, should we be including the inputHost as one of the reachable proxies, as that may be the client-addressable hostname/IP? I was thinking something along the lines of:
    private Set<InetSocketAddress> getReachableProxies(String inputHost) throws UnknownHostException {
        int knownPort = getKnownPort();
        ImmutableSet.Builder<InetSocketAddress> proxies = ImmutableSet.builder();
        proxies.add(new InetSocketAddress(inputHost, knownPort));
        for (InetAddress address : InetAddress.getAllByName(inputHost)) {
            // It is okay to have reachable proxies that do not have a hostname
            proxies.add(new InetSocketAddress(address, knownPort));
        }
        return proxies.build();
    }
That seems fine to me, though not super aware of the network-related implications
...sandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/pool/HostnamesByIpSupplier.java
@@ -367,6 +369,14 @@ public Optional<CassandraClientPoolingContainer> getRandomGoodHostForPredicate(
         return randomLivingHost.map(pools::get);
     }

+    public List<PoolingContainer<CassandraClient>> getAllHostsUnlessBlacklisted() {
nit: getAllNonBlacklistedHosts? but yeah, the main part looks fine
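For reference, the shape of such a method might look like the following simplified sketch; the generic names stand in for the actual Cassandra host and pooling-container types, which aren't reproduced here:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class PoolFilterSketch {
    // Hypothetical simplified shape of getAllHostsUnlessBlacklisted:
    // keep the pooling container for every host the blacklist does
    // not reject. An unknown number of hosts may be filtered out,
    // which is why the caller iterates over all of them rather than
    // a fixed quorum.
    static <HostT, PoolT> List<PoolT> allNonBlacklistedHosts(
            Map<HostT, PoolT> pools, Predicate<HostT> isBlacklisted) {
        return pools.entrySet().stream()
                .filter(entry -> !isBlacklisted.test(entry.getKey()))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }
}
```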
…cassandra/pool/HostnamesByIpSupplier.java Co-authored-by: David Schlosnagle <[email protected]>
👍 from my end (no strong preference on the duration comparisons, though I think I agree with @schlosna on that one)
…cassandra/pool/HostnamesByIpSupplier.java Co-authored-by: David Schlosnagle <[email protected]>
Goals (and why):
==COMMIT_MSG==
Try every host when attempting to get the ip to hostname mapping if a failure is hit.
==COMMIT_MSG==
Implementation Description (bullets):
Testing (What was existing testing like? What have you done to improve it?):
Concerns (what feedback would you like?):
It could take timeout * N for the task to fail to update the addresses. This would be problematic on very large clusters running in Kubernetes (as the IP changes are significant there), restarting an entire rack at a time. However, this failure condition is so unlikely that I'd argue it should be overlooked (for now), as basically N * timeout > time to start all nodes in rack Y, but that assumes we are hitting the timeout exception rather than something else (e.g. connection refused, which is almost instant).

Where should we start reviewing?:
Priority (whenever / two weeks / yesterday):
Yesterday