This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

Remove token range checks, validate new servers on their topology #6317

Merged: 24 commits into develop, Nov 4, 2022

Conversation


@Sam-Kramer Sam-Kramer commented Oct 25, 2022

General

Before this PR:
On startup, and whenever a host was added or removed, we would perform a 'sanity check' that all hosts in the cluster had the same token ring. This call was flaky enough to warrant removal, and it did not actually prevent anything anyway.

After this PR:
Remove all token ring validation checks. Instead, validate that any new host added to our Cassandra pool sees all other hosts; this ensures we do not accidentally introduce user-led split brain, which would previously have been caught by the token ring validation checks.
==COMMIT_MSG==
Removes token ring validation checks and instead checks for user-led split brain by ensuring any host added to our connection pool has its topology validated.
==COMMIT_MSG==
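The validation described above can be sketched roughly as follows. This is an illustrative stand-in, not the real AtlasDB code: the class and method names (`TopologyValidationSketch`, `serversWithConsistentTopology`) are invented for this example, and the real implementation lives in `CassandraTopologyValidator`.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the idea: a newly discovered host is only admitted
// to the pool if the set of hosts it reports seeing matches the view the
// existing pool already agrees on. (Names here are invented for illustration.)
public final class TopologyValidationSketch {
    private TopologyValidationSketch() {}

    /** Returns the new servers whose reported cluster view matches the reference view. */
    public static Set<String> serversWithConsistentTopology(
            Set<String> referenceView, Map<String, Set<String>> reportedViews) {
        return reportedViews.entrySet().stream()
                .filter(entry -> entry.getValue().equals(referenceView))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

A host reporting a view that disagrees with the reference view (for example, one pointed at an entirely different cluster) is filtered out rather than added to the pool.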

Priority: P1

Concerns / possible downsides (what feedback would you like?):

  • CassandraClientPool is a mess, and this PR suffered somewhat as a result
  • API BREAKS: it doesn't look like the classes I changed are exposed anywhere, but I would love to be proved wrong :)

Is documentation needed?:
No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
Potentially -- we can easily retain our old APIs if needed, but this should be an internal API.

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
N/A

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Yes

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
Yes (requires https://github.com/palantir/cassandra/releases/tag/1.101.0), and this is achieved by not performing topology validation when the endpoint is not present (i.e. we programmatically handle this case). It may be worth bumping `minimumVersion = '3.31.0'`, but this hasn't been done in a very long time!

Does this PR need a schema migration?
No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
None! We assume Cassandra hosts may or may not support this endpoint, and we support both cases.

What was existing testing like? What have you done to improve it?:
Added tests to ensure we do not add nodes with invalid topologies; also added tests to CassandraClientPool to ensure we re-use cassandra containers.

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
N/A -- the code modifies state from within a synchronized method, and that is the only method called to modify/change the pool.

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
N/A

Execution

How would I tell this PR works in production? (Metrics, logs, etc.):
Metrics and logs. CassandraTopologyValidationMetrics can be used to determine how often this is failing fleet-wide, and the logs will contain the exact failures and differences.

Has the safety of all log arguments been decided correctly?:
Yes

Will this change significantly affect our spending on metrics or logs?:
No

How would I tell that this PR does not work in production? (monitors, etc.):
The service fails to start up during a blue/green upgrade; we should also create a monitor that fires if validationFailures ever exceeds 0.

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
Rollback!

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
No

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
No

Development Process

Where should we start reviewing?:
Now!

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

Please tag any other people who should be aware of this PR:
@jeremyk-91
@sverma30
@raiju


private MapUtils() {}

public static <K, V> Map<K, V> combineMaps(Map<K, V> primary, Map<K, V> secondary) {
@Sam-Kramer (author), Oct 25, 2022:

Couldn't find anything, but worth the reviewer double-checking / informing if there is something already :)

If there isn't, will write tests and add docs.

@mdaudali, Oct 31, 2022:

(Quick drive by before I actually review later) This looks to be shallow merging maps right? In that case you can use https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#merge-K-V-java.util.function.BiFunction-

@Sam-Kramer (author):

Yeah I saw that, the only issue with this is that it modifies the underlying map

@mdaudali, Oct 31, 2022:

Another option is ImmutableMap.putAll(map2).putAll(map1).buildKeepingLast()?
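For posterity, the trade-off discussed in this thread can be shown with plain `java.util` (the thread itself settles on Guava's `ImmutableMap` builder; this stdlib analogue is only illustrative): `Map.merge` mutates the map it is called on, whereas copying into a fresh map leaves both inputs untouched, and the put order decides which side wins on duplicate keys.

```java
import java.util.HashMap;
import java.util.Map;

public final class MapCombineSketch {
    private MapCombineSketch() {}

    /**
     * Non-mutating shallow merge in which entries from {@code primary} win on
     * duplicate keys. Unlike Map.merge, neither input map is modified.
     */
    public static <K, V> Map<K, V> combineMaps(Map<K, V> primary, Map<K, V> secondary) {
        Map<K, V> result = new HashMap<>(secondary); // copy secondary first...
        result.putAll(primary); // ...so primary overwrites on key conflicts
        return result;
    }
}
```

With Guava, `ImmutableMap.builder().putAll(secondary).putAll(primary).buildKeepingLast()` expresses the same last-put-wins rule without mutating either input.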

@@ -602,4 +561,9 @@ public enum StartupChecks {
RUN,
DO_NOT_RUN
}

@VisibleForTesting
CassandraAbsentHostTracker getAbsentHostTracker() {
@Sam-Kramer (author):

ugh this is ugly, but the alternative is to have this be passed in via constructor -- we can go down this path, but the constructor of this class is already bloated as it is.

@Sam-Kramer (author):

alternatively we could abandon this altogether, and just not test absent host tracking (which was the case before!!)

}

public void returnOrCreatePool(CassandraServer server, Optional<CassandraClientPoolingContainer> container) {
@Sam-Kramer (author):

My understanding is that users do not directly use this class.. therefore should not be an API break?

Reviewer:

internal code search tool says there's no use, so we should be good :)

@Sam-Kramer Sam-Kramer marked this pull request as ready for review October 25, 2022 16:57
@mdaudali left a comment:

looking good! Something to consider for resource management of containers and some nits to make this easier to read.

I'll validate ABI breaks in the next pass, but they should be easy to fix if they exist :)

@@ -376,7 +376,10 @@ private Set<CassandraServer> validateNewHostsTopologiesAndMaybeAddToPool(Set<Cas
getContainerForNewServers(serversToAdd);

Map<CassandraServer, CassandraClientPoolingContainer> allContainers =
MapUtils.combineMaps(getCurrentPools(), serversToAddContainers);
ImmutableMap.<CassandraServer, CassandraClientPoolingContainer>builder()
.putAll(getCurrentPools())
Reviewer:

This is different from what you previously had. With combineMaps, the code kept the entry from getCurrentPools in the case of a duplicate between getCurrentPools and serversToAddContainers. In the new code, you prefer serversToAddContainers entries in the case of a duplicate (hence why I mentioned map2 then map1, rather than the other way around!)

Reviewer:

In the current version, you're preferring new containers to old containers, and if the old ones are never marked in the absent host tracker, you're just going to be creating containers and then abandoning them.

@Sam-Kramer (author):

👍🏼 makes sense, will swap 'em!


serversToAdd,
allContainers,
HumanReadableDuration.seconds(5),
HumanReadableDuration.minutes(1));
Reviewer:

Note that this will change when you switch from HumanReadableDuration to Duration in the other PR

cassandra.refreshTokenRangesAndGetServers();
}

Preconditions.checkState(
!(getCurrentPools().isEmpty() && !serversToAdd.isEmpty()),
Reviewer:

Consider the equivalent via De Morgan's law: `!getCurrentPools().isEmpty() || serversToAdd.isEmpty()`, which has fewer negations and is, imo, easier to read.
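The suggested rewrite is a direct application of De Morgan's law, and the equivalence can be checked exhaustively over the four possible input combinations (a small verification sketch, not part of the PR):

```java
// Exhaustive check that !(poolEmpty && !toAddEmpty) is equivalent to
// (!poolEmpty || toAddEmpty), i.e. the rewrite suggested above.
public final class DeMorganCheck {
    private DeMorganCheck() {}

    public static boolean equivalentForAllInputs() {
        for (boolean poolEmpty : new boolean[] {false, true}) {
            for (boolean toAddEmpty : new boolean[] {false, true}) {
                boolean original = !(poolEmpty && !toAddEmpty);
                boolean rewritten = !poolEmpty || toAddEmpty;
                if (original != rewritten) {
                    return false;
                }
            }
        }
        return true;
    }
}
```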


setCassandraServersTo(CASS_SERVER_1, CASS_SERVER_2, CASS_SERVER_3);
deterministicExecutor.tick(POOL_REFRESH_INTERVAL_SECONDS, TimeUnit.SECONDS);
assertThat(poolServers).containsExactlyInAnyOrder(CASS_SERVER_1, CASS_SERVER_2, CASS_SERVER_3);
assertThat(poolServers.keySet()).containsExactlyInAnyOrder(CASS_SERVER_1, CASS_SERVER_3);
assertThat(clientPool.getAbsentHostTracker().returnPool(CASS_SERVER_2)).isPresent();
Reviewer:

consider making this a separate test, or changing the name to show that invalid hosts are added to the absent host tracker

assertThat(poolServers.keySet()).containsExactlyInAnyOrder(CASS_SERVER_1, CASS_SERVER_2);

setCassandraServersTo(CASS_SERVER_1, CASS_SERVER_2, CASS_SERVER_3);
deterministicExecutor.tick(POOL_REFRESH_INTERVAL_SECONDS, TimeUnit.SECONDS);
Reviewer:

consider making a method called refreshPool in the test that just ticks by pool refresh intervals seconds and calling that everywhere


@Test
public void reuseContainerIfPreviouslyInvalid() {
setupThriftServers(ImmutableSet.of(CASS_SERVER_1.proxy(), CASS_SERVER_2.proxy()));
Reviewer:

This test and the above are a bit hard to read, and I'm not entirely sure I understand what property one is testing that the other isn't.

If we can't stick to arrange act assert given the complexity of the tests, either simplify or add comments s.t. the state of the world is clear.


@Sam-Kramer (author):

Let's remove reuseContainerIfPreviouslyInvalid, as it's effectively a subtest of reuseAbsentHostsContainerIfPresent.

reuseContainerIfPreviouslyInvalid primarily tested that we would use the container in the absent host tracker if it was present, whereas reuseAbsentHostsContainerIfPresent tested that we actually store the container we generated!

}

@Test
public void refreshHandlesNothingToAddOrRemove() {
Reviewer:

what is the expected state for "handling"? Perhaps noServersAddedOrRemovedWhenCassandraServerListStatic or something like that.

@@ -108,25 +112,26 @@ public void setup() {
when(runtimeConfig.unresponsiveHostBackoffTimeSeconds()).thenReturn(UNRESPONSIVE_HOST_BACKOFF_SECONDS);
Reviewer:

aside, I'm surprised there are no tests to remove for the token ring consistency validation that we're removing...

@Sam-Kramer Sam-Kramer changed the base branch from skramer/add-cluster-topology-check to develop November 1, 2022 17:25

changelog-app bot commented Nov 1, 2022

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Removes token ring validation checks and instead checks for user-led split brain by ensuring any host added to our connection pool has its topology validated.

Check the box to generate changelog(s)

  • Generate changelog entry

@Sam-Kramer Sam-Kramer force-pushed the skramer/remove-token-range-verify branch from d8bf1ff to 6829dbe Compare November 1, 2022 17:35
@mdaudali left a comment:

looking good! A few nits and things to cleanup (and some notes left for posterity since they vanished on the force push). Should be small, then I can take another look tomorrow with you or send it off for second pass :)

private DeterministicScheduler deterministicExecutor = new DeterministicScheduler();
private Set<CassandraServer> poolServers = new HashSet<>();
private Map<CassandraServer, CassandraClientPoolingContainer> poolServers = new HashMap<>();
Reviewer:

I'm pretty sure this can be final, along with Cassandra and CassandraTopologyValidator

when(config.socketTimeoutMillis()).thenReturn(1);
setupHostsWithInconsistentTopology();
Reviewer:

This is a bit trippy - to me, this indicates that we're adding hosts with inconsistent topology, but actually, since it's varargs, we're setting that there are no hosts with inconsistent topology.
a) I mentioned in a previous comment that if you're going to set empty, either overload the method or just have a method with a clearer name
b) This kinda messes with other mocks on this object (but since you're retrying and thus calling the method a second time, it works).

@Sam-Kramer (author):

(a) yup, makes sense -- overloaded!
(b) removed, although fwiw it shouldn't be retrying at all, since we're mocking the return of getNewHostsWithInconsistentTopologiesAndRetry -- IIRC it's totally valid to overwrite a call that's already been mocked?

Reviewer:

For (b): turns out I was wrong - I thought

when(config.autoRefreshNodes()).thenReturn(true);
when(config.autoRefreshNodes()).thenReturn(false);

was identical to when(config.autoRefreshNodes()).thenReturn(true).thenReturn(false);
but a test tells me I'm wrong :/

@@ -299,7 +308,7 @@ public void hostIsAutomaticallyRemovedOnStartup() {
setCassandraServersTo(CASS_SERVER_1);

createClientPool();
assertThat(poolServers).containsExactlyInAnyOrder(CASS_SERVER_1);
assertThatMap(poolServers).containsOnlyKeys(CASS_SERVER_1);
Reviewer:

Did not realise that there was a map specific method! you can just use assertThat though (since assertThat generates a MapAssert)

@Sam-Kramer (author):

oh yeah duh

setCassandraServersTo(CASS_SERVER_1, CASS_SERVER_2, CASS_SERVER_3);
refreshPool();
assertThatMap(poolServers).containsOnlyKeys(CASS_SERVER_1, CASS_SERVER_3);
assertThat(absentHostTracker.returnPool(CASS_SERVER_2)).isPresent();
Reviewer:

Since you ended up going for the separate test, I'd just remove this line (otherwise the next test is still redundant!)

setCassandraServersTo(CASS_SERVER_1, CASS_SERVER_2, CASS_SERVER_3);
setupHostsWithInconsistentTopology(CASS_SERVER_2);
createClientPool();
assertThat(absentHostTracker.returnPool(CASS_SERVER_2)).isPresent();
Reviewer:

Let's add some whitespace just above this so it's clearer what the setup is and what we're asserting.

createClientPool();
assertThatMap(poolServers).containsOnlyKeys(CASS_SERVER_1, CASS_SERVER_2);

setCassandraServersTo(CASS_SERVER_1, CASS_SERVER_2);
Reviewer:

nit: I don't personally think this line adds much (at least, it seems to me that if you don't call setCassandraServersTo, the list is static). This is a minor point if you feel it helps readability.

@Sam-Kramer (author):

let's remove, my aim here was for readability, but as you said yourself you don't think it adds much :)

}

@Test
public void throwsWhenTryingToAddServerPresentInCurerntServers() {
Reviewer:

Suggested change
public void throwsWhenTryingToAddServerPresentInCurerntServers() {
public void throwsWhenTryingToAddServerPresentInCurrentServers() {

Reviewer:

Although, I don't quite see why adding a server that's already present is a problem - shouldn't that be a no-op?

Reviewer:

If it can be, then we can also remove the @VisibleForTesting on the validate method (yay!)

@Sam-Kramer (author):

Yeah, sorry -- I should have left a comment explaining this in a bit more detail.

Basically, it should be impossible for us to have conflicting maps, due to how we determine serversToAdd (i.e. servers not presently in the pool). If this throws at some point in the future, we have strong evidence of a bug, and it suggests something sketchy is going on.

That said, your other comment has a valid point (which I initially went with): it shouldn't really matter, so arguably we shouldn't break in production on this. However, that would mean we are OK with potentially introducing this as a bug in the future, which is not ideal.

I don't feel strongly either way, but opted for this mode just because this really shouldn't happen, and it could introduce a nasty bug later on if we change the assumptions of topology validation.

versions.lock Outdated
@@ -89,11 +89,11 @@ com.palantir.dialogue:dialogue-core:3.69.0 (3 constraints: 883d5fe3)
com.palantir.dialogue:dialogue-futures:3.69.0 (3 constraints: dd33d7c0)
com.palantir.dialogue:dialogue-serde:3.69.0 (3 constraints: c62e7ca5)
com.palantir.dialogue:dialogue-target:3.69.0 (7 constraints: a776356a)
com.palantir.docker.compose:docker-compose-rule-core:1.8.0 (3 constraints: d92f0e11)
com.palantir.docker.compose:docker-compose-rule-core:1.8.0 (3 constraints: dc2f1711)
Reviewer:

noting to self that we should merge the other PR #6349 (if no one gets to it tomorrow, I'll do it) and then merge here so this isn't in the diff

server -> absentHostTracker.trackAbsentCassandraServer(server, serversToAddContainers.get(server)));
if (!newHostsWithDifferingTopology.isEmpty()) {
log.warn(
"Found hosts with differing topologies, or are potentially unreachable.",
Reviewer:

Suggested change
"Found hosts with differing topologies, or are potentially unreachable.",
"Some hosts are potentially unreachable or have topologies that are not in consensus with the majority (if startup) or the current servers in the pool (if refreshing).",

@@ -106,18 +101,24 @@ public void shutdown() {

private static final SafeLogger log = SafeLoggerFactory.get(CassandraClientPoolImpl.class);

private static final String EMPTY_CLIENT_POOL_MESSAGE =
Reviewer:

What I think you were trying to do is make this public and use it in the test where you do assertThatLoggableExceptionThrownBy. However, that's not the case (it's just used in the one precondition). As I mentioned in the MTD PR, we generally inline the message both here and in the test. You could feasibly get away with just testing that the message contains "No servers were successfully added to the pool", since the test is just to assert that it's our exception that we're throwing.

when(config.autoRefreshNodes()).thenReturn(true);
setCassandraServersTo(CASS_SERVER_1);
assertThatLoggableExceptionThrownBy(this::createClientPool)
.isInstanceOf(SafeIllegalStateException.class)
Reviewer:

nit: Let's also assert on the message :)

CassandraClientPoolImpl pool = createClientPool();
assertThat(poolServers).containsOnlyKeys(CASS_SERVER_1, CASS_SERVER_2);
assertThatThrownBy(() -> pool.validateNewHostsTopologiesAndMaybeAddToPool(poolServers, Set.of(CASS_SERVER_1)))
.isInstanceOf(IllegalArgumentException.class)
Reviewer:

consider: This is the exception thrown by the ImmutableMap (IIRC) - since it's not a safe-loggable exception, we won't see the message or the args.

Consider doing a precondition check for duplicates with explicit SafeArgs (since they're servers) so there's more debugging information.

@@ -556,4 +658,18 @@ private void setupThriftServers(Set<InetSocketAddress> servers) {
.map(CassandraServer::of)
.collect(ImmutableSet.toImmutableSet()));
}

private void setupHostsWithConsistentTopology() {
// Not hosts have inconsistent topology... therefore they're consistent!
Reviewer:

nit: "Not" should be "No"

@Sam-Kramer Sam-Kramer force-pushed the skramer/remove-token-range-verify branch from e4b0d8d to 8c43272 Compare November 3, 2022 14:58
@mdaudali left a comment:

lgtm, one nit but will not block, sending over to someone else to do a final pass

Preconditions.checkArgument(
Sets.intersection(currentContainers.keySet(), serversToAddContainers.keySet())
.isEmpty(),
"The current pool of servers should not have any server(s) that are being added.",
Reviewer:

minor nit: this declares what the correct argument should be, as opposed to describing the state when the exception is thrown (i.e., something closer to "Tried adding a server that already exists. This indicates a bug in...").
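A rough sketch of the guard under discussion, using only `java.util` (the real code uses Guava's `Sets.intersection` and `Preconditions.checkArgument`; the class name and message wording below are illustrative, following the reviewer's suggestion):

```java
import java.util.HashSet;
import java.util.Set;

public final class DuplicateServerGuard {
    private DuplicateServerGuard() {}

    /** Throws if any server being added is already present in the current pool. */
    public static void checkNoOverlap(Set<String> currentServers, Set<String> serversToAdd) {
        Set<String> overlap = new HashSet<>(currentServers);
        overlap.retainAll(serversToAdd); // overlap is now the intersection
        if (!overlap.isEmpty()) {
            throw new IllegalArgumentException(
                    "Tried adding a server that already exists in the pool. This indicates a bug"
                            + " in how serversToAdd was computed: " + overlap);
        }
    }
}
```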

@Jolyon-S left a comment:

Still need to review: tests, and the new token range check. Thus far I've just reviewed the changes to the current Cassandra code.

Comment on lines +350 to +354
Set<CassandraServer> validatedServersToAdd =
validateNewHostsTopologiesAndMaybeAddToPool(getCurrentPools(), serversToAdd);
Reviewer:

Why are we re-fetching the current pools? We could just get this once and then derive currentServers from that. I think this is mainly nice because it should protect us a little more from any sneaky race conditions.

@Sam-Kramer (author):

Yes, I agree, and there is a commit with this change, but it revealed a nasty bug: we must call getCurrentPools() again because we've modified the underlying pool (i.e. we removed the servers which are no longer used/present).

@Sam-Kramer (author):

Alternatively, we could just filter these hosts from our "snapshot"

Reviewer:

OK, I've just re-read the code, and the refresh codepath should be only ever run on a single thread at any one time. That means we should be resilient to any race conditions in this case.

Comment on lines 359 to 365
"No servers were successfully added to the pool. This means we could not come to a consensus on"
+ " cluster topology, and the client cannot start as there are no valid hosts. This state should"
+ " be transient (<5 minutes), and if it is not, indicates that the user may have accidentally"
+ " configured AltasDB to use two separate Cassandra clusters (i.e., user-led split brain).",
Reviewer:

Can this happen after startup? This refresh task will happen by default every 2 minutes, and due to runtime config, possibly(?) also after any runtime config changes. As such, this error message might want to change to reflect that (what happens if you fail to refresh your hosts but you've already started up?)

@Sam-Kramer (author):

yup, makes sense -- changed "start" to "connect", although I suspect the case where this happens live is extremely rare.


/**
* Validates new servers to add to the cassandra client container pool,
* by checking them with the @see {@link com.palantir.atlasdb.keyvalue.cassandra.CassandraTopologyValidator}.
Reviewer:

incorrect use of @see - I suggest you just remove it, the @link does the job.

+ " cluster topology, and the client cannot start as there are no valid hosts. This state should"
+ " be transient (<5 minutes), and if it is not, indicates that the user may have accidentally"
+ " configured AltasDB to use two separate Cassandra clusters (i.e., user-led split brain).",
SafeArg.of("serversToAdd", serversToAdd));
Reviewer:

Consider checking out CassandraLogHelper.collectionOfHosts

@Sam-Kramer (author):

yeah lets chuck that in there

Comment on lines 391 to 392
SafeArg.of("serversToAdd", serversToAddContainers.keySet()),
SafeArg.of("currentServers", currentContainers.keySet()));
Reviewer:

same here w.r.t. CassandraLogHelper

@Sam-Kramer (author):

done

@Jolyon-S left a comment:

LGTM - one nit, but don't let that block you for another review round.

@@ -110,22 +119,21 @@ public void setup() {
when(config.getKeyspaceOrThrow()).thenReturn("ks");
blacklist = new Blacklist(config, Refreshable.only(UNRESPONSIVE_HOST_BACKOFF_SECONDS));

doAnswer(invocation -> poolServers.add(getInvocationAddress(invocation)))
doAnswer(invocation -> poolServers.putIfAbsent(
getCassandraServerFromInvocation(invocation), mock(CassandraClientPoolingContainer.class)))
Reviewer:

you might want to extract out a single mocked CCPC for ease of use here (unless you're doing something very fancy mocking individual CCPCs)

@Sam-Kramer (author):

> (unless you do some very fancy thing mocking individual CCPCs)

We do need this, unfortunately, to tell whether we are re-using containers or not :(

@Sam-Kramer Sam-Kramer force-pushed the skramer/remove-token-range-verify branch from 7126d32 to bafb707 Compare November 3, 2022 20:11
@bulldozer-bot bulldozer-bot bot merged commit f404e45 into develop Nov 4, 2022
@bulldozer-bot bulldozer-bot bot deleted the skramer/remove-token-range-verify branch November 4, 2022 12:16
@svc-autorelease (Collaborator):

Released 0.750.0


5 participants