kvserver: misc rac2 race conditions fixes #131108

kvoli · 2024-09-20T16:47:27Z

In ProcessPiggybackedAdmittedAtLeaderRaftMuLocked, we swap the
pending updates map with another one before releasing the mutex, in
order to minimize the time holding a mutex. Subsequently, the method
would then check if the updates map was empty by calling len(),
performing a read.

When the updates map was not empty, the read was against a map which was
now only referenced in this function, because a new map was swapped in
under the lock.

When the updates map was empty, the map isn't swapped out and is
referenced by the receiver struct instance, reading a ref without
locking.

Return whether the updates map was empty, to avoid racing when reading
the length of the shared ref empty map.

Goroutines can mutate the Raft node state without acquiring raftMu,
however must acquire replica mu.

Acquire a read lock on replica mu when reading the tracker progress
state from the raw node in FollowerStateRaftMuLocked to prevent races.

Part of: #130187
Release note: None

blathers-crl · 2024-09-20T16:47:31Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2024-09-20T16:47:39Z

This change is

pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go

sumeerbhola

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1034 at r1 (raw file):

		}
		return true
	}(); updatesEmpty {

The swapping logic which uses updates is needlessly subtle, in that the optimization to return early is intertwined with the synchronization correctness. My suggestion is to do a tweak like:

	// updates is left unset (so empty unallocated map) or refers to the map in
	// p.leader.scratch, so can be read and written without holding
	// pendingAdmittedMu.
	var updates map[roachpb.ReplicaID]rac2.AdmittedVector

	// Swap the pendingAdmittedMu.updates map with the empty scratch if non-empty. This is an optimization to
	// minimize the time we hold the pendingAdmittedMu lock.
	if updatesEmpty := func() bool {
		p.leader.pendingAdmittedMu.Lock()
		defer p.leader.pendingAdmittedMu.Unlock()
		if len(p.leader.pendingAdmittedMu.updates) > 0 {
			updates = p.leader.pendingAdmittedMu.updates
			p.leader.pendingAdmittedMu.updates = p.leader.scratch
			p.leader.scratch = updates
			return false
		}
		return true
	}(); updatesEmpty {
		return
	}
...

pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go line 51 at r2 (raw file):

For consistency, it would be better to move this lock to the caller. Eventually, it will move all the way up to RaftEvent construction time.

Given this call will move all the way into processorImpl eventually, I don't think we want to unnecessarily teach rangeController about Replica.mu (like processorImpl) -- we'll teach it and then undo that teaching.
So doing this is fine as a stop-gap. Another alternative (again a stop-gap) would be to implement this (RaftInterface, which only has this one method) in processorImpl and processorImpl would call Replica.MuRLock. Either way, the name of this method can't change since the caller (rangeController) is only holding raftMu. I am leaning towards the stop-gap in this PR.

kvoli

TYFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1025 at r1 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Good catch.

Thanks!

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1034 at r1 (raw file):

Previously, sumeerbhola wrote…

The swapping logic which uses updates is needlessly subtle, in that the optimization to return early is intertwined with the synchronization correctness. My suggestion is to do a tweak like:

	// updates is left unset (so empty unallocated map) or refers to the map in
	// p.leader.scratch, so can be read and written without holding
	// pendingAdmittedMu.
	var updates map[roachpb.ReplicaID]rac2.AdmittedVector

	// Swap the pendingAdmittedMu.updates map with the empty scratch if non-empty. This is an optimization to
	// minimize the time we hold the pendingAdmittedMu lock.
	if updatesEmpty := func() bool {
		p.leader.pendingAdmittedMu.Lock()
		defer p.leader.pendingAdmittedMu.Unlock()
		if len(p.leader.pendingAdmittedMu.updates) > 0 {
			updates = p.leader.pendingAdmittedMu.updates
			p.leader.pendingAdmittedMu.updates = p.leader.scratch
			p.leader.scratch = updates
			return false
		}
		return true
	}(); updatesEmpty {
		return
	}
...

Adopted.

pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go line 51 at r2 (raw file):

Previously, sumeerbhola wrote…

For consistency, it would be better to move this lock to the caller. Eventually, it will move all the way up to RaftEvent construction time.

Given this call will move all the way into processorImpl eventually, I don't think we want to unnecessarily teach rangeController about Replica.mu (like processorImpl) -- we'll teach it and then undo that teaching.
So doing this is fine as a stop-gap. Another alternative (again a stop-gap) would be to implement this (RaftInterface, which only has this one method) in processorImpl and processorImpl would call Replica.MuRLock. Either way, the name of this method can't change since the caller (rangeController) is only holding raftMu. I am leaning towards the stop-gap in this PR.

I've gone ahead and moved the follower infos into the raft event, reaching out for them under the replica lock in HandleReady. This probably warrants another look.

sumeerbhola

Reviewed 1 of 3 files at r3, 7 of 9 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 125 at r4 (raw file):

	// transitions both in rac2.HandleRaftEventRaftMuLocked and
	// rac2.SetReplicasRaftMuLocked.
	FollowersStateRLocked() raft.Status

nit: all the methods here are RLocked wrt Replica.mu, so to be consistent perhaps leave this is Locked().

We ask for this state for all replicas, including the leader itself. Should we rename this ReplicaStateInfo?

How about changing the interface to something like:

// infoMap is an in-out parameter. It is expected to be empty, and is populated with the ReplicaStateInfos for all replicas.
ReplicasStateLocked(infoMap map[ReplicaID]ReplicaStateInfo)

And maintain a scratch empty map in processorImpl.

Then the only thing we need to change in the future to optimize memory allocation is raftNodeForRACv2 and its use of RawNode.Status().

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1222 at r4 (raw file):

		} else {
			infos[replica.ReplicaID] = rac2.FollowerStateInfo{
				State: tracker.StateProbe,

Is Status.Progress incomplete, or is this just being defensive?

Since tracker.StateProbe is the zero value, we may not need this defensive code.

kvoli

TYFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 125 at r4 (raw file):

Previously, sumeerbhola wrote…

nit: all the methods here are RLocked wrt Replica.mu, so to be consistent perhaps leave this is Locked().

We ask for this state for all replicas, including the leader itself. Should we rename this ReplicaStateInfo?

How about changing the interface to something like:
// infoMap is an in-out parameter. It is expected to be empty, and is populated with the ReplicaStateInfos for all replicas.
ReplicasStateLocked(infoMap map[ReplicaID]ReplicaStateInfo)
And maintain a scratch empty map in processorImpl.

Then the only thing we need to change in the future to optimize memory allocation is raftNodeForRACv2 and its use of RawNode.Status().

Nice, updated to this.

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1222 at r4 (raw file):

Previously, sumeerbhola wrote…

Is Status.Progress incomplete, or is this just being defensive?

Since tracker.StateProbe is the zero value, we may not need this defensive code.

Just being defensive. I've removed it since it doesn't appear necessary.

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go

pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go

pav-kv · 2024-09-24T22:01:46Z

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go

@@ -114,6 +111,19 @@ type RaftNode interface {
 	// NB: NextUnstableIndex can regress when the node accepts appends or
 	// snapshots from a newer leader.
 	NextUnstableIndexLocked() uint64
+	// ReplicasStateLocked returns the current status state of all replicas. The
+	// value of Match, Next are populated iff in StateReplicate. All entries >=


These values are always populated. E.g. in StateProbe they reflect the position at which raft is probing the appends. Maybe we don't need this detail here - is it enough to know that we will only be using the values if state is StateReplicate?

I think that is fine judging by the prototype code and what we have so far.

I mean, this comment seems inaccurate? The implementation of this method does not condition Match/Next field population to StateReplicate, and neither does RawNode. I.e. these are not zero values in non-StateReplicate case. In fact, Match is always accurate, and Next is always > Match.

I think that is fine

Do you mean it's fine that these fields are populated - we don't use Match/Next if not in StateReplicate? Should the comment say this much? E.g.:

// ReplicasStateLocked returns the replication flow state for all replicas. // // RACv2 uses the Match and Next indices only for StateReplicate replicas. // ... elaboration on what these mean in StateReplicate ... // // ... elaboration on transitions to/from StateReplicate ...

sumeerbhola

Reviewed 5 of 7 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 115 at r5 (raw file):

we don't use Match/Next if not in StateReplicate

Correct

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 478 at r6 (raw file):

		// scratchInfoMap is used as a pre-allocated in-out parameter for calling
		// ReplicasStateLocked when constructing a rac2.RaftEvent.
		scratchInfoMap map[roachpb.ReplicaID]rac2.ReplicaStateInfo

This should not be nested inside leader and should be initialized in NewProcessor. We use this even when it is not a leader. Then we can remove the other initializations and simply clear the map before calling ReplicasStateLocked.

kvoli · 2024-09-25T00:06:54Z

I rushed out the last few fixups, before heading out. Will go through and address the feedback extensively tomorrow 😀.

kvoli

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 1027 at r3 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

nit: could you pad the comments?

Done.

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 115 at r5 (raw file):

Previously, sumeerbhola wrote…

we don't use Match/Next if not in StateReplicate

Correct

I've removed:

The value of Match, Next are populated iff in StateReplicate.

And added in:

RACv2 uses the Match and Next indices only for StateReplicate replicas.

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go line 478 at r6 (raw file):

Previously, sumeerbhola wrote…

This should not be nested inside leader and should be initialized in NewProcessor. We use this even when it is not a leader. Then we can remove the other initializations and simply clear the map before calling ReplicasStateLocked.

Moved up.

In `ProcessPiggybackedAdmittedAtLeaderRaftMuLocked`, we swap the pending updates map with another one before releasing the mutex, in order to minimize the time holding a mutex. Subsequently, the method would then check if the updates map was empty by calling `len()`, performing a read. When the updates map was not empty, the read was against a map which was now only referenced in this function, because a new map was swapped in under the lock. When the updates map was empty, the map isn't swapped out and is referenced by the receiver struct instance, reading a ref without locking. Return whether the updates map was empty, to avoid racing when reading the length of the shared ref empty map. Part of: cockroachdb#130187 Release note: None

Pass in `ReplicasStateInfo` via `RaftEvent` to the `RangeController`. This resolves an issue where the Raft status was raced on due to not holding a replica mu while reading. Part of: cockroachdb#130187 Release note: None

kvoli · 2024-09-25T15:04:48Z

Failures are unrelated:

/github-actions-runner/_work/_temp/df68ceb1-57fa-4638-bceb-a4f6e8e825bb.sh: line 1: ./build/github/cleanup-engflow-keys.sh: No such file or directory

sumeerbhola

Reviewed 1 of 8 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)

kvoli · 2024-09-25T16:25:23Z

TYFTR!

bors r=sumeerbhola

craig · 2024-09-25T16:52:29Z

Build succeeded:

130728: kvserver: add rac2 v1 integration tests r=sumeerbhola a=kvoli 1st commit from #130619. 2nd-3rd commits from #131106. 4th-5th commits from #131107. 6th-7th commits from #131108. 8th commit from #131109. --- Introduce several tests in `flow_control_integration_test.go`, mirroring the existing tests but applied to the replication flow control v2 machinery. The tests largely follow an identical pattern to the existing v1 tests, swapping in rac2 metrics and vtables. The following tests are added: ``` TestFlowControlBasicV2 TestFlowControlRangeSplitMergeV2 TestFlowControlBlockedAdmissionV2 TestFlowControlAdmissionPostSplitMergeV2 TestFlowControlCrashedNodeV2 TestFlowControlRaftSnapshotV2 TestFlowControlRaftMembershipV2 TestFlowControlRaftMembershipRemoveSelfV2 TestFlowControlClassPrioritizationV2 TestFlowControlQuiescedRangeV2 TestFlowControlUnquiescedRangeV2 TestFlowControlTransferLeaseV2 TestFlowControlLeaderNotLeaseholderV2 TestFlowControlGranterAdmitOneByOneV2 ``` These tests all have at least two variants: ``` V2EnabledWhenLeaderV1Encoding V2EnabledWhenLeaderV2Encoding ``` When `V2EnabledWhenLeaderV1Encoding` is run, the tests use a different testdata file, which has a `_v1_encoding` suffix. A separate file is necessary because when the protocol enablement level is `V2EnabledWhenLeaderV1Encoding`, all entries which are subject to admission control are encoded as `raftpb.LowPri`, regardless of their original priority, as we don't want to pay the cost to deserialize the raft admission meta. The v1 encoding variants retain the same comments as the v2 encoding, however any comments referring to regular tokens should be interpreted as elastic tokens instead, due to the above. Two v1 tests are not ported over to v2: ``` TestFlowControlRaftTransportBreak TestFlowControlRaftTransportCulled ``` These omitted tests behave identically to `TestFlowControlCrashedNodeV2` as rac2 is less tightly coupled to the raft transport, instead operating on replication states (e.g., `StateProbe`, `StateReplicate`). --- Add `TestFlowControlV1ToV2Transition`, which ratchets up the enabled version of replication flow control v2: ``` v1 protocol with v1 encoding => v2 protocol with v1 encoding => v2 protocol with v2 encoding ``` The test is structured to issue writes and wait for returned tokens whenever the protocol transitions from v1 to v2, or a leader changes. More specifically, the test takes the following steps: ``` (1) Start n1, n2, n3 with v1 protocol and v1 encoding. (2) Upgrade n1 to v2 protocol with v1 encoding. (3) Transfer the range lease to n2. (4) Upgrade n2 to v2 protocol with v1 encoding. (5) Upgrade n3 to v2 protocol with v1 encoding. (6) Upgrade n1 to v2 protocol with v2 encoding. (7) Transfer the range lease to n1. (8) Upgrade n2,n3 to v2 protocol with v2 encoding. (9) Transfer the range lease to n3. ``` Between each step, we issue writes, (un)block admission and observe the flow control metrics and vtables. Resolves: #130431 Resolves: #129276 Release note: None 131252: roachtest: port decommission/mixed-versions r=srosenberg,DarrylWong a=renatolabs This commit ports the `decommission/mixed-versions` roachtest to use the `mixedversion` framework (instead of the old `newUpgradeTest` API). It also updates `acceptance/decommission-self` since both tests used shared functionality that needed to be updated. Prior to this commit, the acceptance test used the old upgrade test API even though it was not an upgrade test. Fixes: #110531 Fixes: #110530 Release note: None 131364: upgrades: give test an additional core under remote exec r=rail a=rickystewart This has been timing out. Epic: none Release note: None Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Renato Costa <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

kvoli self-assigned this Sep 20, 2024

kvoli requested review from a team as code owners September 20, 2024 16:47

kvoli mentioned this pull request Sep 20, 2024

kvserver: add rac2 v1 integration tests #130728

Merged

kvoli requested review from pav-kv and sumeerbhola September 20, 2024 17:03

kvoli force-pushed the 240920.rac-race-fixes branch from 9228488 to 754a1de Compare September 20, 2024 17:54

pav-kv reviewed Sep 23, 2024

View reviewed changes

pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go Outdated Show resolved Hide resolved

pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go Show resolved Hide resolved

sumeerbhola requested a review from pav-kv September 24, 2024 13:53

sumeerbhola approved these changes Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch 2 times, most recently from 2a67402 to ead83d9 Compare September 24, 2024 16:20

kvoli commented Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch 3 times, most recently from 39ab814 to 46f64af Compare September 24, 2024 17:48

sumeerbhola requested changes Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch from 46f64af to 5d7cc84 Compare September 24, 2024 21:50

kvoli commented Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch from 5d7cc84 to 25658b8 Compare September 24, 2024 21:58

kvoli requested a review from sumeerbhola September 24, 2024 21:59

pav-kv reviewed Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch from 25658b8 to 9133f12 Compare September 24, 2024 22:35

kvoli requested a review from pav-kv September 24, 2024 22:36

kvoli force-pushed the 240920.rac-race-fixes branch from 9133f12 to ada1d52 Compare September 24, 2024 22:54

sumeerbhola requested changes Sep 24, 2024

View reviewed changes

kvoli force-pushed the 240920.rac-race-fixes branch from ada1d52 to 20c77ce Compare September 25, 2024 13:52

kvoli force-pushed the 240920.rac-race-fixes branch from 20c77ce to 7cfbfe8 Compare September 25, 2024 13:54

kvoli commented Sep 25, 2024

View reviewed changes

kvoli requested a review from sumeerbhola September 25, 2024 13:58

kvoli force-pushed the 240920.rac-race-fixes branch from 7cfbfe8 to 1431248 Compare September 25, 2024 14:18

kvflowcontrol: pass follower info via raft event

80e8b58

Pass in `ReplicasStateInfo` via `RaftEvent` to the `RangeController`. This resolves an issue where the Raft status was raced on due to not holding a replica mu while reading. Part of: cockroachdb#130187 Release note: None

kvoli force-pushed the 240920.rac-race-fixes branch from 1431248 to 80e8b58 Compare September 25, 2024 14:31

sumeerbhola approved these changes Sep 25, 2024

View reviewed changes

craig bot merged commit e523823 into cockroachdb:master Sep 25, 2024
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: misc rac2 race conditions fixes #131108

kvserver: misc rac2 race conditions fixes #131108

kvoli commented Sep 20, 2024

blathers-crl bot commented Sep 20, 2024

cockroach-teamcity commented Sep 20, 2024

sumeerbhola left a comment

kvoli left a comment

sumeerbhola left a comment

kvoli left a comment

pav-kv Sep 24, 2024

kvoli Sep 24, 2024

pav-kv Sep 24, 2024

sumeerbhola left a comment

kvoli commented Sep 25, 2024

kvoli left a comment

kvoli commented Sep 25, 2024

sumeerbhola left a comment

kvoli commented Sep 25, 2024

craig bot commented Sep 25, 2024

kvserver: misc rac2 race conditions fixes #131108

kvserver: misc rac2 race conditions fixes #131108

Conversation

kvoli commented Sep 20, 2024

blathers-crl bot commented Sep 20, 2024

cockroach-teamcity commented Sep 20, 2024

sumeerbhola left a comment

Choose a reason for hiding this comment

kvoli left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

kvoli left a comment

Choose a reason for hiding this comment

pav-kv Sep 24, 2024

Choose a reason for hiding this comment

kvoli Sep 24, 2024

Choose a reason for hiding this comment

pav-kv Sep 24, 2024

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

kvoli commented Sep 25, 2024

kvoli left a comment

Choose a reason for hiding this comment

kvoli commented Sep 25, 2024

sumeerbhola left a comment

Choose a reason for hiding this comment

kvoli commented Sep 25, 2024

craig bot commented Sep 25, 2024