-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: support non-voting replicas #51943
Comments
Hi @aayushshah15, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
cc @tbg This is pretty rough but @andreimatei had mentioned that I should get your eyes on this in case there's potential for ill-defined interactions (or things I should double-check/be careful about) with your work on atomic replication changes. |
As we work towards adding support for persistent learner replicas, we'd like to decouple it from the ephemeral learner state. Conceptually, learners in their current state are an _internal_ intermediate state used for atomic replication changes, whereas persistent learners are meant to be a user-visible state explicitly chosen to be placed in specific localities. There are also some practical reasons that motivated this change: 1. The replicate queue, when it discovers a learner replica, infers that it must be a relic of a failed `ChangeReplicas` run, and clears it up. 2. The raft snapshot queue doesn't send it snapshots since these learners are supposed to receive it directly from the `ChangeReplicas` txn. 3. We disallow range merges when we detect that either of the concerned ranges have a learner replica, as in our current state, this is likely because there is an ongoing atomic configuration change. In all of these scenarios it seems safer & more efficient to be able to differentiate between persistent & ephemeral learners based on the ReplicaType (as opposed to some other solution like a `isPersistent` bit on the replica descriptor) This PR performs this change with a combination of Goland-rename and then grepping for string literals and comments that talk about learners. Note that the `SnapshotRequest_LEARNER` type has not been changed in this PR. Touches cockroachdb#51943. Release note: None
As we work towards adding support for persistent learner replicas, we'd like to decouple it from the ephemeral learner state. Conceptually, learners in their current state are an _internal_ intermediate state used for atomic replication changes, whereas persistent learners are meant to be a user-visible state explicitly chosen to be placed in specific localities. There are also some practical reasons that motivated this change: 1. The replicate queue, when it discovers a learner replica, infers that it must be a relic of a failed `ChangeReplicas` run, and clears it up. 2. The raft snapshot queue doesn't send it snapshots since these learners are supposed to receive it directly from the `ChangeReplicas` txn. 3. We disallow range merges when we detect that either of the concerned ranges have a learner replica, as in our current state, this is likely because there is an ongoing atomic configuration change. In all of these scenarios it seems safer & more efficient to be able to differentiate between persistent & ephemeral learners based on the ReplicaType (as opposed to some other solution like a `isPersistent` bit on the replica descriptor) This PR performs this change with a combination of Goland-rename and then grepping for string literals and comments that talk about learners. Note that the `SnapshotRequest_LEARNER` type has not been changed in this PR. Touches cockroachdb#51943. Release note: None
Before this patch, the DistSender would only route BatchRequests to replicas of types `VOTER_FULL` and `VOTER_INCOMING`. This patch changes that to let the DistSender route requests (for instance, follower read requests) to `NON_VOTER` replicas as well. Makes progress on cockroachdb#51943 Release note: None
Before this patch, the DistSender would only route BatchRequests to replicas of types `VOTER_FULL` and `VOTER_INCOMING`. This patch changes that to let the DistSender route requests (for instance, follower read requests) to `NON_VOTER` replicas as well. Makes progress on cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None
58627: kvserver: support atomic promotions and demotions of non-voting replicas r=aayushshah15 a=aayushshah15 This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. The approach followed by this PR tries to coalesce `ReplicationChanges` of types `ADD_VOTER` and `REMOVE_NON_VOTER` on a given target as promotions of non-voters into voters and likewise, `ADD_NON_VOTER` and `REMOVE_VOTER` changes for a given target as demotions of voters into non-voters. When all 4 of these operations are simultaneously passed into a `ChangeReplicasRequest`, the patch will try to execute them as one atomic swap of a voter with a non-voter. Fixes #58499 Informs #51943 Release note: None 60799: sql: include virtual table names in crash reports r=knz a=rafiss This uses the functionality that we already had in place for telemetry data. This might potentially make some crash reports more useful without compromising privacy. Release note (general change): Crash reports that are sent to Cockroach Labs now no longer redact the names of builtin virtual tables from the crdb_internal, information_schema, and pg_catalog schemas. 60938: sql: drop partitions and zone configs from RBR tables on region drop r=otan a=arulajmani Closes #58340 Release note (sql change): `ALTER DATABASE ... DROP REGION ...` repartitions regional by row tables to remove the partition for the removed region and removes the zone congfiguration for the partition as well. 60939: bazel: correct BUILD file generation order r=irfansharif a=irfansharif pkg/BUILD.bazel has the (auto-generated) directive that instructs bazel to strip out the /pkg prefix. If we start off without pkg/BUILD.bazel[1], the earlier gazelle invocations don't learn about the prefix strip. [1]: Maybe we shouldn't be able to, I was testing to see if there were no extraneous diffs by re-generating the file entirely. Release note: None Co-authored-by: Aayush Shah <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: arulajmani <[email protected]> Co-authored-by: irfan sharif <[email protected]>
Non-voting replicas exist. This issue is no longer useful. |
Currently, CRDB only supports voting replicas: replicas that participate in quorum. This means that a high replication factor comes with the cost of high write latencies, since more replicas need to achieve consensus on every write. This trade-off is undesirable if the use-case only cares about using these replicas to serve follower reads, and not for fault tolerance. It is especially undesirable when the inter-replica latency between these voting replicas is high (i.e. when they are spread out across multiple distant regions).
Non-voting replicas would follow the raft log (thus be able to serve follower reads), but would not participate in quorum. This means that they have, essentially, no impact on write latencies at the cost of not being able to be used for fault-tolerance. They unlock use-cases where, for instance, a quorum of voting replicas lives in North America, but we'd still like to serve low-latency follower reads in EU, Asia and Australia, without having this cripple write latencies.
This is a (rough) tracking issue for the major changes required to introduce non-voting replicas (also referred to as long-lived learners due to some implementation details). They are listed roughly in the order in which (I believe) they should be implemented. This issue will be updated as new considerations are discovered.
New Replica type for non-voting replicas (done)
The desire for this stems from the assumptions made about learners being short lived in a couple of different places (covered below). There are valid reasons for some of those assumptions to stay in place for the short lived learners, but they don’t apply to long lived learners.
Pull requests: #53715
Zone config syntax to enable creation of non-voters
Pull requests: #57184
Compatibility with range merges (done)
Currently, we don’t merge a range if it has a learner replica. In the
AdminMerge
transaction, if the LHS leaseholder detects learner replicas for either of the involved ranges, it just returns an error. It seems like we did it this way in order to avoid working out any potential edge cases since the learners were only supposed to be ephemeral. We’ll need to make merges work with (at least) long lived learners.Pull requests: #56197
Interactions with raft snapshot queue and replicate queue (done)
The atomic replication change logic “manually” sends a snapshot to the short lived learner that it instantiates in the first phase of the change. The raft snapshot queue, thus, entirely ignores the learners it comes across. Once the long lived learners are added, they'll need to be able to receive snapshots from raft snapshot queue just like
VOTER
replicas.Replicas are added by the atomic replication change logic in two phases, the first phase creates a short lived learner and the second promotes it to a voter. If the co-ordinating node dies before the second phase, we’re left with an orphaned learner replica. The replicate queue, thus, removes any learners it sees.
Allocator level changes
We need to teach the allocator to obey the specified constraints to place long lived learners in the correct localities. Here is a preliminary, non-exhaustive list of considerations we need to be aware of while making this change:
VOTER
replicas across a cluster is diversity. The primary motivation for this is fault tolerance, which is not a consideration for learner replicas. We probably want to place learners primarily based on the localities of the incoming traffic (“follow the workload”).VOTER
replicas on them, do we just silently omit these learners?Allow distsender to route follower read requests to learners
DistSender currently ignores learners when routing queries. We’ll want to start including long lived learners here if the request is such that it can be served via follower-reads.
Pull requests: #58772
Roachtest that exercises & asserts use of follower reads
Currently, we lack large scale integration tests that test for “proper” use of follower reads (“are we correctly serving follower reads for a given compatible workload?”). We’ll want to write a roachtest that asserts, for a pre-determined workload, a minimum threshold of requests that need to be served via follower reads.
Quota pool prevents raft leaders from getting too far ahead of followers
This means a slow learner can slow down regular raft traffic. We’ll want to consider whether we should allow long lived learners to potentially trail further behind than other types of replicas are allowed to.
The text was updated successfully, but these errors were encountered: