
[PDS-321180] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes #6415

Merged - 11 commits merged into develop from jkong/servers-and-the-pool on Jan 25, 2023

Conversation

jeremyk-91 (Contributor)

General

Before this PR:
If a Cassandra cluster [A, B, C] changes its IPs to [D, E, F] within the Cassandra client pool refresh interval (two minutes), the cluster can enter a stuck state from which it cannot recover without bouncing the service. The pool correctly re-discovers [D, E, F] from configuration, but our host validation check rejects the change because there is no quorum among the cluster: we have three topology proposals and three unreachable nodes.

After this PR:

==COMMIT_MSG==
In the event there is no quorum, we check the new nodes for their view of what the cluster should look like. If the new nodes agree among themselves, and their view matches the host IDs the old nodes previously agreed on, we consider them to be consistent.
==COMMIT_MSG==
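
A rough sketch of the resulting check, stitched together from the diff hunks quoted in the review threads below; the hostIds() accessor and the null-handling of pastConsistentTopology are assumptions for illustration, not necessarily the exact shape of the merged code:

// Sketch only: names taken from the quoted diff hunks; hostIds() is assumed.
Optional<ConsistentClusterTopology> maybeTopology =
        maybeGetConsistentClusterTopology(newServersWithoutSoftFailures).agreedTopology();
if (maybeTopology.isEmpty()) {
    // The new nodes do not agree even among themselves: reject all of them.
    return newServersWithoutSoftFailures.keySet();
}
ConsistentClusterTopology newNodesAgreedTopology = maybeTopology.get();
ConsistentClusterTopology pastTopology = pastConsistentTopology.get();
if (pastTopology == null
        || !newNodesAgreedTopology.hostIds().equals(pastTopology.hostIds())) {
    // The new nodes agree with each other, but not with the host IDs the old
    // nodes previously agreed on: reject all of them.
    return newServersWithoutSoftFailures.keySet();
}
// Consistent with the past topology: only servers outside the consensus are rejected.
return Sets.difference(
        newServersWithoutSoftFailures.keySet(), newNodesAgreedTopology.serversInConsensus());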

Priority: P2 - there is a temporary workaround for the immediate burning issue of the day (preventing fast rolls), but this is still fairly high priority.

Concerns / possible downsides (what feedback would you like?):

  • Does this actually fix the problem: can we recover from a cluster [A, B, C] changing to [D, E, F]? I think so: we will make the cluster [A, B, C, D, E, F] (and in practice we will only really talk to D, E, F because A, B, C are block-listed). In any case, on a subsequent refresh when we pick one of D, E, F from which to refresh the token ring we'll be able to remove A, B, C.

Is documentation needed?: No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?: No

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?: No

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.): Yes - Cassandra client pooling logic is in-memory on each node.

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?: Not really?

Does this PR need a schema migration? No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: Not much

What was existing testing like? What have you done to improve it?: Added unit tests for relevant cases.

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: No complex asynchronous code here. The topology validator is called from a single-threaded executor.

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: It doesn't?

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): We can recover from an ABC -> DEF change; also, there's plenty of logging.

Has the safety of all log arguments been decided correctly?: Yes, I think so.

Will this change significantly affect our spending on metrics or logs?: No

How would I tell that this PR does not work in production? (monitors, etc.): An [A, B, C] -> [D, E, F] switch still leaves the client pool stuck.

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?: Rollback

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC): -

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.: No

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?: No

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?: No

Development Process

Where should we start reviewing?: CassandraTopologyValidator

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: No

Please tag any other people who should be aware of this PR:
@jeremyk-91
@sverma30
@raiju

changelog-app bot commented Jan 11, 2023

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Cassandra client pool improvement: in the event there is no quorum, we check the new nodes for their view of what the cluster should look like. If the new nodes agree among themselves, and their view matches the host IDs the old nodes previously agreed on, we consider them to be consistent. This allows handling of cases where the Cassandra cluster changes the IPs of all nodes between bounces.


@jeremyk-91 jeremyk-91 changed the title [PDS-324509] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes [PDS-321180] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes Jan 11, 2023
Sam-Kramer (Contributor) left a comment:


PR looks good -- we can still get hit by this bug if new hosts join and we restart all nodes at exactly the same time the new hosts join. This is not a realistic edge case, so I think we can ignore it. Left some style nits/preferences, but obviously feel free to ignore!

Optional<ConsistentClusterTopology> maybeTopology = maybeGetConsistentClusterTopology(
                newServersWithoutSoftFailures)
        .agreedTopology();
if (maybeTopology.isPresent()) {
Contributor:

nit: You can still use Optional::map here if you'd like :)

Contributor:

or even:

maybeTopology.ifPresent(pastConsistentTopology::set);
return maybeTopology
        .map(topology -> Sets.difference(
                newServersWithoutSoftFailures.keySet(), topology.serversInConsensus()))
        .orElseGet(newServersWithoutSoftFailures::keySet);

Contributor Author:

Yeah, I think the second version is good. I find these can get very hard to read, especially when pipelines have side effects, but declaring that upfront is fine.

    return newServersWithoutSoftFailures.keySet();
}
ConsistentClusterTopology newNodesAgreedTopology = maybeTopology.get();
if (!newNodesAgreedTopology
Contributor:

gut check: will ImmutableSet compare each element regardless of order? If we really want to make this future-proof, I guess we could do Sets.difference instead? Totally up to you.

Contributor Author:

Yep - I assume your concern here is that two sets with the same elements in different orders might end up not being equal, but equals() is actually defined to avoid this kind of issue:

Compares the specified object with this set for equality. Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set). This definition ensures that the equals method works properly across different implementations of the set interface.
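
For instance (a toy example, not from this PR):

ImmutableSet.of("a", "b", "c").equals(ImmutableSet.of("c", "b", "a")); // true: order is irrelevant
ImmutableSet.of("a", "b").equals(new HashSet<>(List.of("b", "a"))); // also true, across Set implementations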

Contributor:

Nice, TIL!

}

@Value.Immutable
public interface ClusterTopologyResult {
Contributor:

We had something similar in a previous version of this PR, where we instead had ConsistentClusterTopology have the type, and provided the static constructors there.

Map<CassandraServer, CassandraClientPoolingContainer> newHosts = EntryStream.of(allHosts)
        .filterKeys(key -> NEW_HOSTS.contains(key.cassandraHostName()))
        .toMap();
Set<CassandraServer> oldCassandraServers = filterServers(allHosts, OLD_HOSTS::contains);
Contributor:

nit: you can just do oldHosts.keySet()
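
i.e., assuming the test builds an oldHosts map symmetrical to the newHosts map above:

Set<CassandraServer> oldCassandraServers = oldHosts.keySet();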

Contributor Author:

you're right! yeah I kind of rushed the tests a bit, admittedly

Set<CassandraServer> oldCassandraServers = filterServers(allHosts, OLD_HOSTS::contains);
Set<CassandraServer> newCassandraServers = filterServers(allHosts, NEW_HOSTS::contains);
Set<String> hostsOffline = ImmutableSet.of(OLD_HOST_ONE);
setHostIds(filterContainers(allHosts, hostsOffline::contains), HostIdResult.hardFailure());
Contributor:

To remove filterContainers you could just do oldHosts.get(OLD_HOST_ONE)

Contributor Author:

Annoyingly OLD_HOST_ONE is a String though and the map is keyed on CassandraServer, so not sure that'll work :(

Contributor:

ugh yes that's right, this makes sense then!

Set<String> hostsOffline = ImmutableSet.of(OLD_HOST_ONE);
setHostIds(filterContainers(allHosts, hostsOffline::contains), HostIdResult.hardFailure());

setHostIds(filterContainers(newHosts, server -> !hostsOffline.contains(server)), HostIdResult.success(UUIDS));
Contributor:

ah fair you need it here! you can also do Predicate.not(hostsOffline::contains)
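
Applied to the quoted line, that would read (Predicate.not here is java.util.function.Predicate.not, available since Java 11):

setHostIds(filterContainers(newHosts, Predicate.not(hostsOffline::contains)), HostIdResult.success(UUIDS));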

Contributor Author:

Nice! Yeah makes sense


public CassandraTopologyValidator(CassandraTopologyValidationMetrics metrics) {
    this.metrics = metrics;
    this.pastConsistentTopology = new AtomicReference<>();
Contributor:

We should update the comment on line 59 indicating the change in our algorithm.

jeremyk-91 (Contributor, Author) left a comment:

we can still get hit by this bug if new hosts join, and we restart all nodes at the same time exactly when the new hosts join.

Yeah that's unfortunate: don't have a good way around that :( thanks for the review!

Sam-Kramer (Contributor):

Took another look, no more comments, lgtm!

@bulldozer-bot bulldozer-bot bot merged commit 4c391f8 into develop Jan 25, 2023
@bulldozer-bot bulldozer-bot bot deleted the jkong/servers-and-the-pool branch January 25, 2023 14:02
svc-autorelease (Collaborator):

Failed to push a commit onto develop to move @unreleased changelogs

com.palantir.conjure.java.api.errors.UnknownRemoteException: Response status: 502
	at com.palantir.conjure.java.dialogue.serde.ErrorDecoder.decodeInternal(ErrorDecoder.java:123)
	at com.palantir.conjure.java.dialogue.serde.ErrorDecoder.decode(ErrorDecoder.java:68)
	at com.palantir.conjure.java.dialogue.serde.ConjureBodySerDe$EmptyBodyDeserializer.deserialize(ConjureBodySerDe.java:351)
	at com.palantir.conjure.java.dialogue.serde.ConjureBodySerDe$EmptyBodyDeserializer.deserialize(ConjureBodySerDe.java:338)
	at com.palantir.conjure.java.client.jaxrs.DialogueFeignClient$RemoteExceptionDecoder.decode(DialogueFeignClient.java:336)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:134)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:103)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.palantir.github.clients.PerTraceRequestCountingClientFactory.invoke(PerTraceRequestCountingClientFactory.java:35)
	at com.palantir.github.clients.PerTraceRequestCountingClientFactory.lambda$client$0(PerTraceRequestCountingClientFactory.java:29)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.palantir.github.clients.ErrorTrackingExternalClientFactory.invoke(ErrorTrackingExternalClientFactory.java:36)
	at com.palantir.github.clients.ErrorTrackingExternalClientFactory.lambda$client$0(ErrorTrackingExternalClientFactory.java:30)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at com.palantir.autorelease.CommitFactory.createCommit(CommitFactory.java:100)
	at com.palantir.autorelease.CommitFactory.createCommit(CommitFactory.java:70)
	at com.palantir.autorelease.ReleaserHelper.moveChangelogs(ReleaserHelper.java:44)
	at com.palantir.autorelease.Releaser.labelRelease(Releaser.java:186)
	at com.palantir.autorelease.DefaultRepositoryArchetype.automatedRelease(DefaultRepositoryArchetype.java:114)
	at com.palantir.autorelease.Repository.automatedRelease(Repository.java:63)
	at com.palantir.autorelease.label.DefaultWebhookHandler.handlePullRequestClosedEvent(DefaultWebhookHandler.java:218)
	at com.palantir.autorelease.label.DefaultWebhookHandler.handleWebhook(DefaultWebhookHandler.java:127)
	at com.palantir.github.GithubWebhookResource.lambda$receiveWebhook$0(GithubWebhookResource.java:58)
	at com.palantir.tracing.Tracers$TracingAwareRunnable.run(Tracers.java:584)
	at com.palantir.tritium.metrics.TaggedMetricsExecutorService$TaggedMetricsRunnable.run(TaggedMetricsExecutorService.java:140)
	at com.palantir.nylon.threads.RenamingExecutorService$RenamingRunnable.run(RenamingExecutorService.java:92)
	at org.jboss.threads.EnhancedViewExecutor$EnhancedViewExecutorRunnable.run(EnhancedViewExecutor.java:519)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
	at com.palantir.tritium.metrics.TaggedMetricsThreadFactory$InstrumentedTask.run(TaggedMetricsThreadFactory.java:69)
	at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: com.palantir.conjure.java.dialogue.serde.ErrorDecoder$ResponseDiagnostic: Response Diagnostic Information: {status=502, Server=GitHub.com, Content-Type=application/json; charset=utf-8, Content-Length=322, Date=Wed, 25 Jan 2023 14:02:23 GMT}
	Suppressed: com.palantir.logsafe.exceptions.SafeRuntimeException: unknown error: {serviceName=github, errorBody={"message":"Sorry, your request timed out.  It's likely that your input was too large to process.  Consider building the tree incrementally, or building the commits you need in a local clone of the repository and then pushing them to GitHub.","documentation_url":"https://docs.github.com/rest/reference/git#create-a-tree"}}
		at com.palantir.github.clients.ErrorLoggingErrorTracker.trackUnknownRemoteError(ErrorLoggingErrorTracker.java:23)
		at com.palantir.github.clients.ErrorTracker.lambda$andAlso$0(ErrorTracker.java:14)
		at com.palantir.github.clients.ErrorTrackingExternalClientFactory.invoke(ErrorTrackingExternalClientFactory.java:41)
		... 20 more
