
[PDS-336637, PDS-335142 and friends] Verify a new server list against a previously accepted topology, if one exists, by tracking the origin of new cass servers #6458

Merged
merged 9 commits into develop from mdaudali/prototype/trackorigin on Feb 28, 2023

Conversation

mdaudali
Contributor

@mdaudali mdaudali commented Feb 28, 2023

This differs from #6455 by using the config server list, instead of just the available hosts, when redoing the quorum check.

General

Before this PR:
#6415 added a feature to accept a new set of servers even if there's no quorum, iff the new set of servers has quorum and matches the last accepted topology. This was added to handle cases where a bunch of nodes are down, but we want to add a new set.
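For context, a minimal sketch of that acceptance rule, assuming a host-ID-based topology check; the names and simplified types are illustrative, not the actual CassandraTopologyValidator signatures:

```java
import java.util.Optional;
import java.util.Set;

// Minimal sketch of the rule added in #6415 (illustrative names and simplified types;
// the real logic lives in CassandraTopologyValidator).
final class TopologyCheckSketch {
    private Optional<Set<String>> lastAgreedHostIds = Optional.empty();

    /** Returns true if the proposed new servers may be accepted into the client pool. */
    boolean acceptNewServers(
            Optional<Set<String>> hostIdsAgreedByAllHosts,    // quorum result across all hosts
            Optional<Set<String>> hostIdsAgreedByNewHosts) {  // quorum result across only the new hosts
        if (hostIdsAgreedByAllHosts.isPresent()) {
            lastAgreedHostIds = hostIdsAgreedByAllHosts;      // normal path: overall quorum reached
            return true;
        }
        // No overall quorum: accept iff the new servers agree among themselves
        // and match the previously accepted topology, if one exists.
        return hostIdsAgreedByNewHosts.isPresent()
                && hostIdsAgreedByNewHosts.equals(lastAgreedHostIds);
    }
}
```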

However, the above PR inadvertently missed wiring this in for the case when the Cassandra client pool is empty. This happens on startup, but can also happen when the servers in the client pool are marked absent because the set of desired servers does not contain the existing servers, which happens when QUORUM = 2, as noted in the P0s linked above. (Historically, we mark them not absent the next time we refresh successfully.) There's a question of whether we should even be marking so many servers absent if doing so leaves the client pool empty, but I'd like to discuss that with the team on Monday.

Currently, when dissent occurs and the client pool is emptied, we are completely unable to recover. This is because there are no nodes we can reach to get the token range (which is how Atlas does discovery), so we fall back to the last known set of IPs, unioned with the set of servers from config (i.e., skylab discovery).

Especially in cross-VPC setups like our dog-fooding stack, the last set of IPs is unreachable, whereas the config-provided URIs are discoverable.
Since the number of IPs equals the number of config-provided URIs, exactly half the nodes are unreachable. After we fixed the quorum calculation (#6427), this meant that we could no longer reach consensus, and, since we didn't check against a previous topology in this case, we would never refresh.
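To make the quorum arithmetic concrete, here is an illustrative worked example; the three-node cluster size is an assumption for illustration, not a figure from the incidents:

```java
// Illustrative arithmetic for the stuck state described above (assumes a 3-node cluster;
// the numbers are hypothetical, not taken from the incidents).
public final class QuorumArithmetic {
    public static void main(String[] args) {
        int staleIps = 3;                      // last known IPs, unreachable across VPCs
        int configUris = 3;                    // config-provided URIs, reachable
        int allHosts = staleIps + configUris;  // 6 candidates after taking the union
        int quorum = (allHosts / 2) + 1;       // 4, with the corrected calculation from #6427
        int reachable = configUris;            // only the config URIs can respond
        System.out.println("can reach quorum: " + (reachable >= quorum)); // false -> never refreshes
    }
}
```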

A test that I added that fails:
[Screenshot of the failing test run, taken 2023-02-24 15:09]

After this PR:

==COMMIT_MSG==
AtlasDB clients can now recover after receiving a dissenting response to a Cassandra Topology check.
==COMMIT_MSG==

Priority: P1; there are linked P0s, and the current way out (bouncing) is pretty painful in some cases.

Concerns / possible downsides (what feedback would you like?):
The whole point of this check is to avoid split-brain situations. Given the escape hatch of only checking the config servers, this does increase the risk of allowing split-brain. The principle of checking against the last valid topology should mitigate that (see the sketch below).
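For illustration, a hedged sketch of how that escape hatch could be expressed; the helper name and simplified types are hypothetical, not the actual implementation:

```java
import java.util.Set;

// Hedged sketch of the escape hatch discussed above (hypothetical helper, simplified types):
// when dissent has emptied the client pool, the config-provided servers become the candidate
// set for the re-check, but acceptance still requires agreement with the last accepted topology.
final class CandidateSelectionSketch {
    static Set<String> candidatesForTopologyCheck(Set<String> currentPoolHosts, Set<String> configServers) {
        return currentPoolHosts.isEmpty() ? configServers : currentPoolHosts;
    }
}
```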

Please let me know if there are other edge cases that I've missed!

Is documentation needed?:
No; this behaviour already matches the docs.

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
No
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
No
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Yes
Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
No
Does this PR need a schema migration?
No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
That we can safely proceed by just checking the available hosts, since once we have at least one reachable host we'll eventually refresh again, retrieve the token range, and get the full list of hosts.

What was existing testing like? What have you done to improve it?:
The original tests didn't actually test the no-quorum case! I've addressed that. Please do check with a debugger if you want to be convinced.

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
N/A
If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
N/A

Execution

How would I tell this PR works in production? (Metrics, logs, etc.):
Recovery after a Quorum = 2 incident
Has the safety of all log arguments been decided correctly?:
No logging changed
Will this change significantly affect our spending on metrics or logs?:
No
How would I tell that this PR does not work in production? (monitors, etc.):
Either we still don't recover, or worse, we allow a split-brain when we shouldn't
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
Rollback is not straightforward.
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
@jeremyk-91 as the next on-call (I'm the current)

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
No
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
No

Development Process

Where should we start reviewing?:
CassandraTopologyValidator
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
N/A
Please tag any other people who should be aware of this PR:

@mdaudali mdaudali changed the title [Proof of concept] Tracking the origin of new cass servers [PDS-336637, PDS-335142 and friends] Verify a new server list against a previously accepted topology, if one exists, by tracking the origin of new cass servers Feb 28, 2023
@mdaudali mdaudali marked this pull request as ready for review February 28, 2023 15:13
Contributor

@jeremyk-91 jeremyk-91 left a comment


👍 thanks!

@mdaudali
Contributor Author

👍

@bulldozer-bot bulldozer-bot bot merged commit 82898a8 into develop Feb 28, 2023
@bulldozer-bot bulldozer-bot bot deleted the mdaudali/prototype/trackorigin branch February 28, 2023 17:18
@svc-autorelease
Collaborator

Released 0.806.0

bulldozer-bot bot pushed a commit that referenced this pull request Sep 28, 2023
- Shift the determination of whether a node originated from the config to the validator: this deals with an edge case in #6458 where, if a node providing the token ring was removed in the same client pool ring update, it is possible to end up with no live servers and also no nodes appearing to come from configuration.
- Implement #6723: in the event no topology had previously been agreed, we allow a quorum in configuration to be the source of the first topology agreed upon by the cluster.
- Shift the acceptance criteria for adding new nodes to the cluster to be based on having at least one host ID in common with the past agreed topology - this allows for cluster expansions. I believe this is safe because a false positive here implies UUID collision, which we don't generally protect against. There is a caveat which I've noted in the Concerns.
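For illustration, a hedged sketch of the host-ID-overlap criterion from the last point; the names are hypothetical, and the real changes are in the follow-up commit's validator code:

```java
import java.util.Collections;
import java.util.Set;

// Illustrative sketch of the relaxed acceptance criterion described above: a candidate
// topology is compatible with the past agreed topology if they share at least one host ID.
final class HostIdOverlapSketch {
    static boolean compatibleWithPastTopology(Set<String> candidateHostIds, Set<String> pastAgreedHostIds) {
        return !Collections.disjoint(candidateHostIds, pastAgreedHostIds);
    }
}
```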