Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: support non-voting replicas #51943

Closed
aayushshah15 opened this issue Jul 27, 2020 · 4 comments
Closed

kvserver: support non-voting replicas #51943

aayushshah15 opened this issue Jul 27, 2020 · 4 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-replication-constraints A-multiregion Related to multi-region
Milestone

Comments

@aayushshah15
Copy link
Contributor

aayushshah15 commented Jul 27, 2020

Currently, CRDB only supports voting replicas: replicas that participate in quorum. This means that a high replication factor comes with the cost of high write latencies, since more replicas need to achieve consensus on every write. This trade-off is undesirable if the use-case only cares about using these replicas to serve follower reads, and not for fault tolerance. It is especially undesirable when the inter-replica latency between these voting replicas is high (i.e. when they are spread out across multiple distant regions).

Non-voting replicas would follow the raft log (thus be able to serve follower reads), but would not participate in quorum. This means that they have, essentially, no impact on write latencies at the cost of not being able to be used for fault-tolerance. They unlock use-cases where, for instance, a quorum of voting replicas lives in North America, but we'd still like to serve low-latency follower reads in EU, Asia and Australia, without having this cripple write latencies.

This is a (rough) tracking issue for the major changes required to introduce non-voting replicas (also referred to as long-lived learners due to some implementation details). They are listed roughly in the order in which (I believe) they should be implemented. This issue will be updated as new considerations are discovered.

New Replica type for non-voting replicas (done)

The desire for this stems from the assumptions made about learners being short lived in a couple of different places (covered below). There are valid reasons for some of those assumptions to stay in place for the short lived learners, but they don’t apply to long lived learners.
Pull requests: #53715

Zone config syntax to enable creation of non-voters

Pull requests: #57184

Compatibility with range merges (done)

Currently, we don’t merge a range if it has a learner replica. In the AdminMerge transaction, if the LHS leaseholder detects learner replicas for either of the involved ranges, it just returns an error. It seems like we did it this way in order to avoid working out any potential edge cases since the learners were only supposed to be ephemeral. We’ll need to make merges work with (at least) long lived learners.
Pull requests: #56197

Interactions with raft snapshot queue and replicate queue (done)

The atomic replication change logic “manually” sends a snapshot to the short lived learner that it instantiates in the first phase of the change. The raft snapshot queue, thus, entirely ignores the learners it comes across. Once the long lived learners are added, they'll need to be able to receive snapshots from raft snapshot queue just like VOTER replicas.

Replicas are added by the atomic replication change logic in two phases, the first phase creates a short lived learner and the second promotes it to a voter. If the co-ordinating node dies before the second phase, we’re left with an orphaned learner replica. The replicate queue, thus, removes any learners it sees.

Allocator level changes

We need to teach the allocator to obey the specified constraints to place long lived learners in the correct localities. Here is a preliminary, non-exhaustive list of considerations we need to be aware of while making this change:

  • One heuristic we care a lot about when placing VOTER replicas across a cluster is diversity. The primary motivation for this is fault tolerance, which is not a consideration for learner replicas. We probably want to place learners primarily based on the localities of the incoming traffic (“follow the workload”).
  • We currently do not include traffic served via follower reads in stats for load based splitting. We'll probably want to change that as we start seeing topologies that make heavy use of learner replicas.
  • If all the nodes that qualify under the constraints specified for learners are ones that already have VOTER replicas on them, do we just silently omit these learners?

Allow distsender to route follower read requests to learners

DistSender currently ignores learners when routing queries. We’ll want to start including long lived learners here if the request is such that it can be served via follower-reads.
Pull requests: #58772

Roachtest that exercises & asserts use of follower reads

Currently, we lack large scale integration tests that test for “proper” use of follower reads (“are we correctly serving follower reads for a given compatible workload?”). We’ll want to write a roachtest that asserts, for a pre-determined workload, a minimum threshold of requests that need to be served via follower reads.

Quota pool prevents raft leaders from getting too far ahead of followers

This means a slow learner can slow down regular raft traffic. We’ll want to consider whether we should allow long lived learners to potentially trail further behind than other types of replicas are allowed to.

@blathers-crl
Copy link

blathers-crl bot commented Jul 27, 2020

Hi @aayushshah15, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@aayushshah15 aayushshah15 added A-kv-distribution Relating to rebalancing and leasing. A-kv-replication-constraints labels Jul 27, 2020
@aayushshah15 aayushshah15 self-assigned this Jul 27, 2020
@aayushshah15 aayushshah15 added this to the 20.2 milestone Jul 27, 2020
@aayushshah15
Copy link
Contributor Author

cc @tbg

This is pretty rough but @andreimatei had mentioned that I should get your eyes on this in case there's potential for ill-defined interactions (or things I should double-check/be careful about) with your work on atomic replication changes.

@aayushshah15
Copy link
Contributor Author

cc @lunevalex @nvanbenschoten

@aayushshah15 aayushshah15 changed the title kvserver: make learners long lived kvserver: support long-lived learners Aug 7, 2020
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 19, 2020
As we work towards adding support for persistent learner replicas, we'd
like to decouple it from the ephemeral learner state. Conceptually,
learners in their current state are an _internal_ intermediate state
used for atomic replication changes, whereas persistent learners are
meant to be a user-visible state explicitly chosen to be placed in
specific localities.

There are also some practical reasons that motivated this change:
1. The replicate queue, when it discovers a learner replica, infers that
it must be a relic of a failed `ChangeReplicas` run, and clears it up.
2. The raft snapshot queue doesn't send it snapshots since these
learners are supposed to receive it directly from the `ChangeReplicas`
txn.
3. We disallow range merges when we detect that either of the concerned
ranges have a learner replica, as in our current state, this is likely
because there is an ongoing atomic configuration change.

In all of these scenarios it seems safer & more efficient to be able
to differentiate between persistent & ephemeral learners based on the
ReplicaType (as opposed to some other solution like a `isPersistent` bit
on the replica descriptor)

This PR performs this change with a combination of Goland-rename and
then grepping for string literals and comments that talk about learners.
Note that the `SnapshotRequest_LEARNER` type has not been changed in
this PR.

Touches cockroachdb#51943.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 31, 2020
As we work towards adding support for persistent learner replicas, we'd
like to decouple it from the ephemeral learner state. Conceptually,
learners in their current state are an _internal_ intermediate state
used for atomic replication changes, whereas persistent learners are
meant to be a user-visible state explicitly chosen to be placed in
specific localities.

There are also some practical reasons that motivated this change:
1. The replicate queue, when it discovers a learner replica, infers that
it must be a relic of a failed `ChangeReplicas` run, and clears it up.
2. The raft snapshot queue doesn't send it snapshots since these
learners are supposed to receive it directly from the `ChangeReplicas`
txn.
3. We disallow range merges when we detect that either of the concerned
ranges have a learner replica, as in our current state, this is likely
because there is an ongoing atomic configuration change.

In all of these scenarios it seems safer & more efficient to be able
to differentiate between persistent & ephemeral learners based on the
ReplicaType (as opposed to some other solution like a `isPersistent` bit
on the replica descriptor)

This PR performs this change with a combination of Goland-rename and
then grepping for string literals and comments that talk about learners.
Note that the `SnapshotRequest_LEARNER` type has not been changed in
this PR.

Touches cockroachdb#51943.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 28, 2021
Before this patch, the DistSender would only route BatchRequests to
replicas of types `VOTER_FULL` and `VOTER_INCOMING`. This patch changes
that to let the DistSender route requests (for instance, follower read
requests) to `NON_VOTER` replicas as well.

Makes progress on cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 28, 2021
Before this patch, the DistSender would only route BatchRequests to
replicas of types `VOTER_FULL` and `VOTER_INCOMING`. This patch changes
that to let the DistSender route requests (for instance, follower read
requests) to `NON_VOTER` replicas as well.

Makes progress on cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 15, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 16, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 18, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 22, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 22, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 22, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 22, 2021
This PR teaches `AdminChangeReplicas` to atomically promote voters to
non-voters, demote voters to non-voters or swap voters with non-voters
via joint consensus.

Fixes cockroachdb#58499
Informs cockroachdb#51943

Release note: None
craig bot pushed a commit that referenced this issue Feb 22, 2021
58627: kvserver: support atomic promotions and demotions of non-voting replicas r=aayushshah15 a=aayushshah15

This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus.

The approach followed by this PR tries to coalesce `ReplicationChanges` of types `ADD_VOTER` and `REMOVE_NON_VOTER` on a given target as promotions of non-voters into voters and likewise, `ADD_NON_VOTER` and `REMOVE_VOTER` changes for a given target as demotions of voters into non-voters. When all 4 of these operations are simultaneously passed into a `ChangeReplicasRequest`, the patch will try to execute them as one atomic swap of a voter with a non-voter. 

Fixes #58499
Informs #51943

Release note: None


60799: sql: include virtual table names in crash reports r=knz a=rafiss

This uses the functionality that we already had in place for telemetry
data. This might potentially make some crash reports more useful without
compromising privacy.

Release note (general change): Crash reports that are sent to Cockroach Labs
now no longer redact the names of builtin virtual tables from the
crdb_internal, information_schema, and pg_catalog schemas.

60938: sql: drop partitions and zone configs from RBR tables on region drop r=otan a=arulajmani

Closes #58340

Release note (sql change): `ALTER DATABASE ... DROP REGION ...`
repartitions regional by row tables to remove the partition for the
removed region and removes the zone congfiguration for the partition
as well.

60939: bazel: correct BUILD file generation order r=irfansharif a=irfansharif

pkg/BUILD.bazel has the (auto-generated) directive that instructs bazel
to strip out the /pkg prefix. If we start off without
pkg/BUILD.bazel[1], the earlier gazelle invocations don't learn about
the prefix strip.

[1]: Maybe we shouldn't be able to, I was testing to see if there were
     no extraneous diffs by re-generating the file entirely.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: arulajmani <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@andreimatei
Copy link
Contributor

Non-voting replicas exist. This issue is no longer useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-replication-constraints A-multiregion Related to multi-region
Projects
None yet
Development

No branches or pull requests

6 participants