
[PDS-321180] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes #6415

Merged - 11 commits merged into develop from jkong/servers-and-the-pool on Jan 25, 2023

Conversation

jeremyk-91 (Contributor)

General

Before this PR:
If a Cassandra cluster [A, B, C] changes its IPs to [D, E, F] within the Cassandra client pool refresh interval (two minutes), the cluster can enter a stuck state from which it cannot recover without bouncing the service. The pool correctly re-discovers [D, E, F] from configuration, but our host validation check rejects the change because there is no quorum among the cluster: we have three topology proposals and three unreachable nodes.

After this PR:

==COMMIT_MSG==
In the event there is no quorum, we check the new nodes for their view of what the cluster should look like. If the new nodes agree among themselves, and their view matches the host IDs the old nodes previously agreed on, we consider them to be consistent.
==COMMIT_MSG==
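
A rough sketch of the resulting check, stitched together from the diff hunks quoted in the review threads below; the hostIds() accessor and the null-handling of pastConsistentTopology are assumptions for illustration, not necessarily the exact shape of the merged code:

// Sketch only: names taken from the quoted diff hunks; hostIds() is assumed.
Optional<ConsistentClusterTopology> maybeTopology =
        maybeGetConsistentClusterTopology(newServersWithoutSoftFailures).agreedTopology();
if (maybeTopology.isEmpty()) {
    // The new nodes do not agree even among themselves: reject all of them.
    return newServersWithoutSoftFailures.keySet();
}
ConsistentClusterTopology newNodesAgreedTopology = maybeTopology.get();
ConsistentClusterTopology pastTopology = pastConsistentTopology.get();
if (pastTopology == null
        || !newNodesAgreedTopology.hostIds().equals(pastTopology.hostIds())) {
    // The new nodes agree with each other, but not with the host IDs the old
    // nodes previously agreed on: reject all of them.
    return newServersWithoutSoftFailures.keySet();
}
// Consistent with the past topology: only servers outside the consensus are rejected.
return Sets.difference(
        newServersWithoutSoftFailures.keySet(), newNodesAgreedTopology.serversInConsensus());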

Priority: P2 - there is a temporary workaround for the immediate burning issue of the day (preventing fast rolls), but this is still fairly high priority.

Concerns / possible downsides (what feedback would you like?):

  • Does this actually fix the problem: can we recover from a cluster [A, B, C] changing to [D, E, F]? I think so: we will make the cluster [A, B, C, D, E, F] (and in practice we will only really talk to D, E, F because A, B, C are block-listed). In any case, on a subsequent refresh when we pick one of D, E, F from which to refresh the token ring we'll be able to remove A, B, C.

Is documentation needed?: No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?: No

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?: No

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.): Yes - Cassandra client pooling logic is in-memory on each node.

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?: Not really?

Does this PR need a schema migration? No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: Not much

What was existing testing like? What have you done to improve it?: Added unit tests for relevant cases.

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: No complex asynchronous code here. The topology validator is called from a single-threaded executor.

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: It doesn't?

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): We can recover from an ABC -> DEF change; also, there's plenty of logging.

Has the safety of all log arguments been decided correctly?: Yes, I think so.

Will this change significantly affect our spending on metrics or logs?: No

How would I tell that this PR does not work in production? (monitors, etc.): An [A, B, C] -> [D, E, F] switch still leaves the client pool stuck.

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?: Rollback

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC): -

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.: No

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?: No

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?: No

Development Process

Where should we start reviewing?: CassandraTopologyValidator

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: No

Please tag any other people who should be aware of this PR:
@jeremyk-91
@sverma30
@raiju

changelog-app bot commented Jan 11, 2023

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Cassandra client pool improvement: in the event there is no quorum, we check the new nodes for their view of what the cluster should look like. If the new nodes agree among themselves, and their view matches the host IDs the old nodes previously agreed on, we consider them to be consistent. This allows handling of cases where the Cassandra cluster changes the IPs of all nodes between bounces.


@jeremyk-91 jeremyk-91 changed the title [PDS-324509] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes [PDS-321180] Permit Cluster Switches/Additions, Where New Nodes Agree With Old Nodes Jan 11, 2023
Sam-Kramer (Contributor) left a comment:


PR looks good -- we can still get hit by this bug if new hosts join and we restart all nodes at exactly the same time the new hosts join. This is not a realistic edge case, so I think we can ignore it. Left some style nits/preferences, but obviously feel free to ignore!

Optional<ConsistentClusterTopology> maybeTopology = maybeGetConsistentClusterTopology(
                newServersWithoutSoftFailures)
        .agreedTopology();
if (maybeTopology.isPresent()) {
Contributor:

nit: You can still use Optional::map here if you'd like :)

Contributor:

or even:

maybeTopology.ifPresent(pastConsistentTopology::set);
return maybeTopology
        .map(topology -> Sets.difference(
                newServersWithoutSoftFailures.keySet(), topology.serversInConsensus()))
        .orElseGet(newServersWithoutSoftFailures::keySet);

Contributor Author:

Yeah, I think the second version is good. I find these can get very hard to read, especially when pipelines have side effects, but declaring that upfront is fine.

    return newServersWithoutSoftFailures.keySet();
}
ConsistentClusterTopology newNodesAgreedTopology = maybeTopology.get();
if (!newNodesAgreedTopology
Contributor:

gut check: will ImmutableSet compare each element regardless of order? If we really want to make this future-proof, I guess we could do Sets.difference instead? Totally up to you.

Contributor Author:

Yep - I assume your concern here is that two sets with the same elements in different orders might end up not being equal, but equals() is actually defined to avoid this kind of issue:

Compares the specified object with this set for equality. Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set). This definition ensures that the equals method works properly across different implementations of the set interface.
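
For instance (a toy example, not from this PR):

ImmutableSet.of("a", "b", "c").equals(ImmutableSet.of("c", "b", "a")); // true: order is irrelevant
ImmutableSet.of("a", "b").equals(new HashSet<>(List.of("b", "a"))); // also true, across Set implementations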

Contributor:

Nice, TIL!

}

@Value.Immutable
public interface ClusterTopologyResult {
Contributor:

We had something similar in a previous version of this PR, where we instead had ConsistentClusterTopology have the type, and provided the static constructors there.

Map<CassandraServer, CassandraClientPoolingContainer> newHosts = EntryStream.of(allHosts)
        .filterKeys(key -> NEW_HOSTS.contains(key.cassandraHostName()))
        .toMap();
Set<CassandraServer> oldCassandraServers = filterServers(allHosts, OLD_HOSTS::contains);
Contributor:

nit: you can just do oldHosts.keySet()
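
i.e., assuming the test builds an oldHosts map symmetrical to the newHosts map above:

Set<CassandraServer> oldCassandraServers = oldHosts.keySet();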

Contributor Author:

you're right! yeah I kind of rushed the tests a bit, admittedly

Set<CassandraServer> oldCassandraServers = filterServers(allHosts, OLD_HOSTS::contains);
Set<CassandraServer> newCassandraServers = filterServers(allHosts, NEW_HOSTS::contains);
Set<String> hostsOffline = ImmutableSet.of(OLD_HOST_ONE);
setHostIds(filterContainers(allHosts, hostsOffline::contains), HostIdResult.hardFailure());
Contributor:

To remove filterContainers you could just do oldHosts.get(OLD_HOST_ONE)

Contributor Author:

Annoyingly OLD_HOST_ONE is a String though and the map is keyed on CassandraServer, so not sure that'll work :(

Contributor:

ugh yes that's right, this makes sense then!

Set<String> hostsOffline = ImmutableSet.of(OLD_HOST_ONE);
setHostIds(filterContainers(allHosts, hostsOffline::contains), HostIdResult.hardFailure());

setHostIds(filterContainers(newHosts, server -> !hostsOffline.contains(server)), HostIdResult.success(UUIDS));
Contributor:

ah fair you need it here! you can also do Predicate.not(hostsOffline::contains)
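
Applied to the quoted line, that would read (Predicate.not here is java.util.function.Predicate.not, available since Java 11):

setHostIds(filterContainers(newHosts, Predicate.not(hostsOffline::contains)), HostIdResult.success(UUIDS));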

Contributor Author:

Nice! Yeah makes sense


public CassandraTopologyValidator(CassandraTopologyValidationMetrics metrics) {
    this.metrics = metrics;
    this.pastConsistentTopology = new AtomicReference<>();
Contributor:

We should update the comment on line 59 indicating the change in our algorithm.

jeremyk-91 (Contributor, Author) left a comment:

we can still get hit by this bug if new hosts join, and we restart all nodes at the same time exactly when the new hosts join.

Yeah that's unfortunate: don't have a good way around that :( thanks for the review!

Sam-Kramer (Contributor):

Took another look, no more comments, lgtm!

@bulldozer-bot bulldozer-bot bot merged commit 4c391f8 into develop Jan 25, 2023
@bulldozer-bot bulldozer-bot bot deleted the jkong/servers-and-the-pool branch January 25, 2023 14:02
svc-autorelease (Collaborator):

Failed to push a commit onto develop to move @unreleased changelogs

com.palantir.conjure.java.api.errors.UnknownRemoteException: Response status: 502
	at com.palantir.conjure.java.dialogue.serde.ErrorDecoder.decodeInternal(ErrorDecoder.java:123)
	at com.palantir.conjure.java.dialogue.serde.ErrorDecoder.decode(ErrorDecoder.java:68)
	at com.palantir.conjure.java.dialogue.serde.ConjureBodySerDe$EmptyBodyDeserializer.deserialize(ConjureBodySerDe.java:351)
	at com.palantir.conjure.java.dialogue.serde.ConjureBodySerDe$EmptyBodyDeserializer.deserialize(ConjureBodySerDe.java:338)
	at com.palantir.conjure.java.client.jaxrs.DialogueFeignClient$RemoteExceptionDecoder.decode(DialogueFeignClient.java:336)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:134)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
	at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:103)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.palantir.github.clients.PerTraceRequestCountingClientFactory.invoke(PerTraceRequestCountingClientFactory.java:35)
	at com.palantir.github.clients.PerTraceRequestCountingClientFactory.lambda$client$0(PerTraceRequestCountingClientFactory.java:29)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.palantir.github.clients.ErrorTrackingExternalClientFactory.invoke(ErrorTrackingExternalClientFactory.java:36)
	at com.palantir.github.clients.ErrorTrackingExternalClientFactory.lambda$client$0(ErrorTrackingExternalClientFactory.java:30)
	at jdk.proxy2/jdk.proxy2.$Proxy81.createTree(Unknown Source)
	at com.palantir.autorelease.CommitFactory.createCommit(CommitFactory.java:100)
	at com.palantir.autorelease.CommitFactory.createCommit(CommitFactory.java:70)
	at com.palantir.autorelease.ReleaserHelper.moveChangelogs(ReleaserHelper.java:44)
	at com.palantir.autorelease.Releaser.labelRelease(Releaser.java:186)
	at com.palantir.autorelease.DefaultRepositoryArchetype.automatedRelease(DefaultRepositoryArchetype.java:114)
	at com.palantir.autorelease.Repository.automatedRelease(Repository.java:63)
	at com.palantir.autorelease.label.DefaultWebhookHandler.handlePullRequestClosedEvent(DefaultWebhookHandler.java:218)
	at com.palantir.autorelease.label.DefaultWebhookHandler.handleWebhook(DefaultWebhookHandler.java:127)
	at com.palantir.github.GithubWebhookResource.lambda$receiveWebhook$0(GithubWebhookResource.java:58)
	at com.palantir.tracing.Tracers$TracingAwareRunnable.run(Tracers.java:584)
	at com.palantir.tritium.metrics.TaggedMetricsExecutorService$TaggedMetricsRunnable.run(TaggedMetricsExecutorService.java:140)
	at com.palantir.nylon.threads.RenamingExecutorService$RenamingRunnable.run(RenamingExecutorService.java:92)
	at org.jboss.threads.EnhancedViewExecutor$EnhancedViewExecutorRunnable.run(EnhancedViewExecutor.java:519)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
	at com.palantir.tritium.metrics.TaggedMetricsThreadFactory$InstrumentedTask.run(TaggedMetricsThreadFactory.java:69)
	at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: com.palantir.conjure.java.dialogue.serde.ErrorDecoder$ResponseDiagnostic: Response Diagnostic Information: {status=502, Server=GitHub.com, Content-Type=application/json; charset=utf-8, Content-Length=322, Date=Wed, 25 Jan 2023 14:02:23 GMT}
	Suppressed: com.palantir.logsafe.exceptions.SafeRuntimeException: unknown error: {serviceName=github, errorBody={"message":"Sorry, your request timed out.  It's likely that your input was too large to process.  Consider building the tree incrementally, or building the commits you need in a local clone of the repository and then pushing them to GitHub.","documentation_url":"https://docs.github.com/rest/reference/git#create-a-tree"}}
		at com.palantir.github.clients.ErrorLoggingErrorTracker.trackUnknownRemoteError(ErrorLoggingErrorTracker.java:23)
		at com.palantir.github.clients.ErrorTracker.lambda$andAlso$0(ErrorTracker.java:14)
		at com.palantir.github.clients.ErrorTrackingExternalClientFactory.invoke(ErrorTrackingExternalClientFactory.java:41)
		... 20 more
