
storage: unify replica addition and removal paths #39640

Merged
merged 3 commits into cockroachdb:master from atomic/addreplica on Aug 14, 2019

Conversation


@tbg tbg commented Aug 13, 2019

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
(*Replica).addReplica and (*Replica).removeReplica into a method that
can do both, (*Replica).addAndRemoveReplicas.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically

  • upgrades all learners to voters
  • removes any undesired replicas.

Note that no atomic membership changes are actually carried out yet. This
is because the callers of addAndRemoveReplicas pass in only a single
change (i.e. an addition or removal), which the method also verifies.
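
To illustrate the shape of that flow (and nothing more), here is a minimal sketch with simplified stand-in types; the real method is `(*Replica).addAndRemoveReplicas` in `pkg/storage/replica_command.go` and takes additional arguments:

```go
// Sketch of the flow described above, with simplified stand-in types.
// The real method is (*Replica).addAndRemoveReplicas in
// pkg/storage/replica_command.go and takes additional arguments.
package sketch

import "context"

type ChangeType int

const (
	AddReplica ChangeType = iota
	RemoveReplica
)

type ReplicationTarget struct{ NodeID, StoreID int32 }

type ReplicationChange struct {
	ChangeType ChangeType
	Target     ReplicationTarget
}

func addAndRemoveReplicas(ctx context.Context, chgs []ReplicationChange) error {
	var adds, removes []ReplicationTarget
	for _, chg := range chgs {
		if chg.ChangeType == AddReplica {
			adds = append(adds, chg.Target)
		} else {
			removes = append(removes, chg.Target)
		}
	}
	// 1. Add a learner replica for each desired new voter, one by one.
	for _, t := range adds {
		if err := addLearner(ctx, t); err != nil {
			return err
		}
	}
	// 2. Send each learner a snapshot, one by one.
	for _, t := range adds {
		if err := sendLearnerSnapshot(ctx, t); err != nil {
			return err
		}
	}
	// 3. Issue a single configuration change that atomically promotes the
	//    learners to voters and removes the undesired replicas.
	return execChangeReplicasTxn(ctx, adds, removes)
}

// Placeholders for the real implementation.
func addLearner(context.Context, ReplicationTarget) error          { return nil }
func sendLearnerSnapshot(context.Context, ReplicationTarget) error { return nil }
func execChangeReplicasTxn(context.Context, []ReplicationTarget, []ReplicationTarget) error {
	return nil
}
```

With this shape, a rebalance "swap" becomes a single call passing one addition and one removal; as noted above, the current callers still pass exactly one change.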

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

```go
added, removed := crt.Added(), crt.Removed()
if len(added)+len(removed) != 1 {
	log.Fatalf(context.TODO(), "atomic replication changes not supported yet")
}
```

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).
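
For context, a rough sketch of what an atomic (joint) membership change looks like at the raft level, assuming etcd/raft's `ConfChangeV2` API; the CockroachDB-side plumbing, including persisting the joint `ConfState`, is exactly what is still missing:

```go
// Rough sketch of a joint (atomic) membership change using etcd/raft's
// ConfChangeV2. This is the kind of change the check above currently
// rules out; wiring it up is the missing piece.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

func makeJointConfChange(added, removed []uint64) raftpb.ConfChangeV2 {
	var cc raftpb.ConfChangeV2
	for _, id := range added {
		cc.Changes = append(cc.Changes, raftpb.ConfChangeSingle{
			Type:   raftpb.ConfChangeAddNode,
			NodeID: id,
		})
	}
	for _, id := range removed {
		cc.Changes = append(cc.Changes, raftpb.ConfChangeSingle{
			Type:   raftpb.ConfChangeRemoveNode,
			NodeID: id,
		})
	}
	// Enter a joint configuration and leave it automatically once the
	// change has been applied.
	cc.Transition = raftpb.ConfChangeTransitionAuto
	return cc
}
```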

Second, we must pass the slice of changes handed to
AdminChangeReplicas through to addAndRemoveReplicas without unrolling
it first, see:

```go
func (r *Replica) ChangeReplicas(
	ctx context.Context,
	changeType roachpb.ReplicaChangeType,
	target roachpb.ReplicationTarget,
	desc *roachpb.RangeDescriptor,
	reason storagepb.RangeLogEventReason,
	details string,
) (updatedDesc *roachpb.RangeDescriptor, _ error) {
	if desc == nil {
		return nil, errors.Errorf("%s: the current RangeDescriptor must not be nil", r)
	}
	switch changeType {
	case roachpb.ADD_REPLICA:
		return r.addReplica(ctx, target, desc, SnapshotRequest_REBALANCE, reason, details)
	case roachpb.REMOVE_REPLICA:
		return r.removeReplica(ctx, target, desc, SnapshotRequest_REBALANCE, reason, details)
	default:
		return nil, errors.Errorf(`unknown change type: %s`, changeType)
	}
}
```

and

```go
case *roachpb.AdminChangeReplicasRequest:
	var err error
	expDesc := &tArgs.ExpDesc
	for _, chg := range tArgs.Changes() {
		// Update expDesc to the outcome of the previous run to enable detection
		// of concurrent updates while applying a series of changes.
		expDesc, err = r.ChangeReplicas(
			ctx, chg.ChangeType, chg.Target, expDesc, storagepb.ReasonAdminRequest, "")
		if err != nil {
			break
		}
	}
```

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

@cockroach-teamcity

This change is Reviewable

@tbg tbg requested a review from nvanbenschoten August 13, 2019 20:55
@tbg tbg force-pushed the atomic/addreplica branch 2 times, most recently from 887e4b5 to 06dde3c Compare August 13, 2019 21:32
@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 6 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_command.go, line 913 at r3 (raw file):

		chg, ok := byNodeID[rDesc.NodeID]
		delete(byNodeID, rDesc.NodeID)
		if !ok || chg.ChangeType == roachpb.REMOVE_REPLICA {

nit here and below: chg.ChangeType != roachpb.ADD_REPLICA would be a little easier to read because it would spell out the kinds of changes we care about in these loops.


pkg/storage/replica_command.go, line 947 at r3 (raw file):

}

func (r *Replica) addAndRemoveReplicas(

This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.


pkg/storage/replica_command.go, line 970 at r3 (raw file):

	if !useLearners {
		// NB: we will never use atomic replication changes while learners are not
		// also active.

Add a TODO to return an error here once we remove the one above.


pkg/storage/replica_command.go, line 985 at r3 (raw file):

	// Now move it to be a full voter (waiting on it to get a raft snapshot first,
	// so it's not immediately way behind).

This comment no longer lines up with the code.


pkg/storage/replica_command.go, line 1009 at r3 (raw file):

) (*roachpb.RangeDescriptor, error) {
	if len(chgs) == 0 {
		// If there's nothing to do, return early to avoid redundant work.

I'd either rename this function to maybeAddLearnerReplicas/addAnyLearnerReplicas or move this check to the caller. It would make the previous function's stages more clear.


pkg/storage/replica_command.go, line 1078 at r3 (raw file):

			// switching between StateSnapshot and StateProbe. Even if we worked through
			// these, it would be susceptible to future similar issues.
			if err := r.sendSnapshot(ctx, rDesc, SnapshotRequest_LEARNER, priority); err != nil {

Do you see a world in which we parallelize this across all learners added in a single step?


pkg/storage/replica_command.go, line 1096 at r3 (raw file):

	updatedDesc.SetReplicas(roachpb.MakeReplicaDescriptors(&newReplicas))
	for _, chg := range removes {
		if _, found := updatedDesc.RemoveReplica(chg.Target.NodeID, chg.Target.StoreID); !found {

Can we modify newReplicas instead using ReplicaDescriptors.RemoveReplica and then only call updatedDesc.SetReplicas once?


pkg/storage/replica_command.go, line 1221 at r3 (raw file):

}

func execChangeReplicasTxn(

Something about how we update the roachpb.RangeDescriptor above this function call but then also need to pass in the roachpb.ReplicationChanges slice so we can construct var added, removed []roachpb.ReplicaDescriptor from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took added, removed []roachpb.ReplicaDescriptor arguments and performed the descriptor modification internally?
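
A rough sketch of that suggested shape, using simplified stand-in types rather than the real roachpb ones:

```go
// Sketch of the suggestion: pass the added/removed replica descriptors and
// let the function perform the descriptor modification internally. Types are
// simplified stand-ins for the real roachpb ones.
package sketch

type ReplicaDescriptor struct{ NodeID, StoreID int32 }

type RangeDescriptor struct{ Replicas []ReplicaDescriptor }

func execChangeReplicasTxn(
	desc RangeDescriptor, added, removed []ReplicaDescriptor,
) RangeDescriptor {
	updated := RangeDescriptor{
		Replicas: append([]ReplicaDescriptor(nil), desc.Replicas...),
	}
	// Drop the removed replicas from the copy.
	for _, rm := range removed {
		for i, r := range updated.Replicas {
			if r.NodeID == rm.NodeID && r.StoreID == rm.StoreID {
				updated.Replicas = append(updated.Replicas[:i], updated.Replicas[i+1:]...)
				break
			}
		}
	}
	// Append the newly added replicas.
	updated.Replicas = append(updated.Replicas, added...)
	// ...and then write `updated` transactionally, as the real function does.
	return updated
}
```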

tbg added 2 commits August 14, 2019 11:41
It looked suspicious that we were sending a snapshot using a descriptor
that was marked as a voter. Ultimately, this info wasn't used, but for
clarity make sure we only mutate the descriptor for the change replicas
txn.

Release note: None
Declutter the main method.

Release note: None
@tbg tbg force-pushed the atomic/addreplica branch from 06dde3c to ff2e90b Compare August 14, 2019 10:43
This continues the reworking of the various replication change APIs with
the goal of allowing
a) testing of general atomic replication changes
b) issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

cockroachdb#39485
cockroachdb#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method
that can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet.
This is because the callers of `addAndRemoveReplicas` pass in only a
single change (i.e. an addition or removal), which the method also
verifies.

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without
unrolling it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic
swaps"; this is the reason we're introducing atomic membership changes
in the first place.

Release note: None
@tbg tbg force-pushed the atomic/addreplica branch from ff2e90b to 329d825 Compare August 14, 2019 11:09
@tbg tbg requested a review from nvanbenschoten August 14, 2019 11:20
@tbg tbg left a comment

Thanks for the fast review! Addressed your comments and made some more cleanups, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_command.go, line 947 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.

Done. I'm planning (not in this PR) to fold addAndRemoveReplicas into ChangeReplicas and will rework that larger comment there when I do, too.


pkg/storage/replica_command.go, line 970 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a TODO to return an error here once we remove the one above.

I just added the check now.


pkg/storage/replica_command.go, line 1009 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'd either rename this function to maybeAddLearnerReplicas/addAnyLearnerReplicas or move this check to the caller. It would make the previous function's stages more clear.

Pulled this to the caller, PTAL.


pkg/storage/replica_command.go, line 1078 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you see a world in which we parallelize this across all learners added in a single step?

Potentially at some point, not anytime soon though.
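
For reference, a hypothetical sketch of what parallelizing the snapshot sends might look like (not what this PR does):

```go
// Hypothetical sketch of parallelizing the learner snapshots with
// golang.org/x/sync/errgroup. Not what this PR does: snapshots are sent
// sequentially, one learner at a time.
package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type target struct{ nodeID, storeID int32 }

// sendLearnerSnapshot stands in for (*Replica).sendSnapshot with the
// LEARNER snapshot type.
func sendLearnerSnapshot(ctx context.Context, t target) error { return nil }

func sendSnapshotsInParallel(ctx context.Context, learners []target) error {
	g, gCtx := errgroup.WithContext(ctx)
	for _, t := range learners {
		t := t // capture the loop variable for the closure
		g.Go(func() error {
			return sendLearnerSnapshot(gCtx, t)
		})
	}
	// Wait returns the first non-nil error; gCtx is canceled on failure.
	return g.Wait()
}
```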


pkg/storage/replica_command.go, line 1096 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we modify newReplicas instead using ReplicaDescriptors.RemoveReplica and then only call updatedDesc.SetReplicas once?

I went the other way and completely removed newReplicas. Instead we just maintain updatedDesc from the beginning of the method. The result is simpler code.


pkg/storage/replica_command.go, line 1221 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Something about how we update the roachpb.RangeDescriptor above this function call but then also need to pass in the roachpb.ReplicationChanges slice so we can construct var added, removed []roachpb.ReplicaDescriptor from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took added, removed []roachpb.ReplicaDescriptor arguments and performed the descriptor modification internally?

Good idea, done. I still have an itch to provide a better abstraction around the mutation of a replica descriptor via a sequence of ReplicationChanges (to avoid manually keeping the two in sync) but in the interest of time I'll leave that to future me.
Not passing changes here also made it possible to streamline the learner addition (which now takes a slice of targets only).

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 6 of 6 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg tbg commented Aug 14, 2019

TFTR!
bors r=nvanbenschoten

@craig craig bot commented Aug 14, 2019

Build failed (retrying...)

@tbg tbg commented Aug 14, 2019

^- flaked on #39670

craig bot pushed a commit that referenced this pull request Aug 14, 2019
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method that
can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This
is because the callers of `addAndRemoveReplicas` pass in only a single
change (i.e. an addition or removal), which the method also verifies.

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling
it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg

At init() time, the txn proto has not been populated yet.
Found while investigating #39652.

This change strikes me as clunky, but I don't have the bandwidth to dig deeper
right now.

Release note: None

39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson

Under stress these error with "go build a: failed to cache compiled Go files".

Fixes #39616
Fixes #39541
Fixes #39479

Release note: None

39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner

Before this commit we'd experience the following annoying log message from gRPC
every time we create a new connection telling us that our setting is being
ignored.

```
Adjusting keepalive ping interval to minimum period of 10s
```

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot commented Aug 14, 2019

Build succeeded

@craig craig bot merged commit 329d825 into cockroachdb:master Aug 14, 2019
@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_command.go, line 972 at r6 (raw file):

	useLearners := useLearnerReplicas.Get(&settings.SV)
	useLearners = useLearners && settings.Version.IsActive(cluster.VersionLearnerReplicas)
	if !useLearners {

how does this work with replica removal when useLearners is false? if it's something i'm not seeing, this definitely deserves a comment

@tbg tbg deleted the atomic/addreplica branch August 15, 2019 22:01
tbg added a commit to tbg/cockroach that referenced this pull request Aug 15, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
@tbg tbg commented Aug 15, 2019

Thanks for spotting this, here's a fix: #39697

tbg added a commit to tbg/cockroach that referenced this pull request Aug 16, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Aug 16, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Aug 16, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
craig bot pushed a commit that referenced this pull request Aug 16, 2019
39697: storage: fix replica removal when learners not active r=danhhz a=tbg

I introduced a bug in #39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

#39640 (review)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>