
storage: Re-enable storeReplicaRaftReadyConcurrency #7846

Closed
bdarnell opened this issue Jul 14, 2016 · 9 comments

@bdarnell
Contributor

This was disabled in #7673 due to concerns raised in #7672. It was a big performance improvement, so we should figure out how to turn it back on.

It should always be possible to process initialized replicas concurrently - they are guaranteed to address disjoint parts of the keyspace. The only possible interference is between an initialized and uninitialized replica (the uninitialized replica writing a HardState as the initialized one splits), or between two uninitialized replicas (both applying snapshots, one pre-split and one post-split).

Therefore I believe we can fix the races by processing all uninitialized replicas one at a time, then processing the initialized replicas in parallel.
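
A minimal Go sketch of that scheme (the Replica type and names here are illustrative stand-ins, not the actual cockroach code): uninitialized replicas are drained serially, and only the initialized, provably disjoint ones are fanned out to goroutines.

```go
package storage

import "sync"

// Replica is a stand-in for the real type; only the initialized flag
// matters for this sketch.
type Replica struct {
	initialized bool
}

func (r *Replica) handleRaftReady() {
	// Apply the pending raft.Ready: HardState, log entries, snapshot.
}

// processReplicas handles uninitialized replicas one at a time (they may
// interfere with a split or another snapshot), then fans out the
// initialized ones, whose keyspaces are disjoint and therefore safe to
// process in parallel.
func processReplicas(replicas []*Replica) {
	var initialized []*Replica
	for _, r := range replicas {
		if !r.initialized {
			r.handleRaftReady() // serial: may touch contested keyspace
		} else {
			initialized = append(initialized, r)
		}
	}
	var wg sync.WaitGroup
	for _, r := range initialized {
		wg.Add(1)
		go func(r *Replica) {
			defer wg.Done()
			r.handleRaftReady() // concurrent: disjoint keyspace
		}(r)
	}
	wg.Wait()
}
```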

bdarnell self-assigned this Jul 14, 2016
@bdarnell
Contributor Author

The race between two uninitialized ranges applying snapshots exists regardless of concurrency settings; see #7830

@petermattis
Collaborator

Dividing the replicas to process into initialized and uninitialized sets would be straightforward, but leaving this important aspect of Replica synchronization outside of Replica feels off to me; then again, the alternative feels more dangerous. I hope the PR that addresses this issue comes with a healthy dose of commentary.

@tbg
Member

tbg commented Jul 15, 2016

but leaving this important aspect of Replica synchronization outside of Replica feels off to me.

What important aspect do you mean (i.e. what's the dangerous alternative)? I feel ok about this purely from the Raft perspective. The Store is in charge of making sure each Replica is always alone on its keyspace, so it needs to be the one which tells replicas when it's safe to write to disk.

OTOH (and this is somewhat out of scope for this issue), we have various moving parts holding on to replicas outside of the Raft processing loop (scanner queues, reads, etc). Even pre-Raft, operations pass through Store (but without holding on to any locks) and end up in Replica, which then uses the command queue to serialize those items which touch overlapping key ranges. All of that concurrency seems pretty tangled up and the Store doesn't really know about it.
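
For readers unfamiliar with the command queue mentioned above, here is a toy model of that serialization, assuming commands carry a half-open key span; the real implementation is interval-tree based and considerably more involved.

```go
package storage

import "bytes"

type span struct{ start, end []byte } // half-open interval [start, end)

type cmd struct {
	sp   span
	done chan struct{} // closed when the command finishes
}

func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// waitForOverlapping blocks a new command behind every pending command
// whose key span overlaps its own; commands on disjoint spans proceed
// concurrently.
func waitForOverlapping(pending []*cmd, newSpan span) {
	for _, p := range pending {
		if overlaps(p.sp, newSpan) {
			<-p.done
		}
	}
}
```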

@petermattis
Collaborator

Naively, I would expect Replica.handleRaftReady to be safe for concurrent use. Store has to ensure there is only a single Replica for a given range ID and that Replicas do not have overlapping key spans, but also having it enforce the synchronization of Replica.handleRaftReady is what feels off.

The more dangerous alternative is to try to replace Store.processRaftMu with a Replica.handleRaftReadyMu.
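
To make the trade-off concrete, here is a hedged sketch of both schemes; the field and method names mirror the comment, not the actual code.

```go
package storage

import "sync"

type Replica struct {
	handleRaftReadyMu sync.Mutex // the hypothetical per-replica lock
}

func (r *Replica) handleRaftReady() { /* apply the pending raft.Ready */ }

type Store struct {
	processRaftMu sync.Mutex // today: one lock serializes all Ready handling
	replicas      []*Replica
}

// Today: trivially safe, but zero concurrency across replicas.
func (s *Store) processRaftCurrent() {
	s.processRaftMu.Lock()
	defer s.processRaftMu.Unlock()
	for _, r := range s.replicas {
		r.handleRaftReady()
	}
}

// Alternative: each replica locks only itself. Anything that must freeze
// another replica (e.g. a split freezing the RHS) now has to find and lock
// that replica explicitly, which is the dangerous part.
func (s *Store) processRaftAlternative() {
	for _, r := range s.replicas {
		go func(r *Replica) {
			r.handleRaftReadyMu.Lock()
			defer r.handleRaftReadyMu.Unlock()
			r.handleRaftReady()
		}(r)
	}
}
```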

@tbg
Member

tbg commented Jul 15, 2016

Again, not the right issue, but I wanted to dump two other observations:

  • readOnlyCmdMu is really only ever used for protecting against timestamp cache skew during splits. As such, it's kind of redundant - it would be equivalent to making any EndTransaction with a split occupy the whole key range in the command queue instead (in fact, the RHS would suffice).
  • (*Store).SplitRange is one of the most awkward locking situations because the path is acquire processRaftMu -> handleRaftReady -> many stack frames in *Replica -> (*Replica).splitTrigger -> SplitRange. With #6286 (storage{,/engine}: separate Raft application side effects, remove (Batch).Defer()), I think we could instead return a side effect from handleRaftReady so that the processRaft goroutine could do the tricky work of shifting the replicas around. In particular, the commit trigger would essentially move there completely - the Store would make sure the LHS is without moving parts, copy the timestamp cache etc., create the RHS and finally let the updated and the new replica loose. I think that would make a lot more sense than the current state of affairs (see the sketch after this list).
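
A rough sketch of the side-effect idea from the second bullet, with a hypothetical SideEffects/SplitResult shape; nothing here reflects the eventual implementation.

```go
package storage

type RangeDescriptor struct{ StartKey, EndKey []byte }

// SplitResult is a hypothetical side effect describing a committed split.
type SplitResult struct {
	LeftDesc, RightDesc RangeDescriptor
}

type SideEffects struct {
	Split *SplitResult // non-nil if this Ready committed a split trigger
}

type Replica struct{}

// handleRaftReady applies the Ready but, instead of calling up into
// (*Store).SplitRange from deep inside the replica, returns the split as
// data for the caller to act on.
func (r *Replica) handleRaftReady() SideEffects {
	// ... apply entries; record a committed split trigger, if any ...
	return SideEffects{}
}

type Store struct{}

// processRaft is the goroutine that owns replica topology: it quiesces the
// LHS, copies the timestamp cache, installs the RHS, and only then lets the
// updated and the new replica loose.
func (s *Store) processRaft(r *Replica) {
	if se := r.handleRaftReady(); se.Split != nil {
		s.SplitRange(se.Split)
	}
}

func (s *Store) SplitRange(split *SplitResult) { /* shift replicas around */ }
```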

The second point brings me back towards the previous post: How does the store make sure the Replica is without moving parts? Currently, what's done implicitly is a) no Raft command is being applied (because this is called from within a Raft command as it commits), and b) there are no concurrent reads (at least during the critical section of copying the timestamp cache, via readOnlyCmdMu). That sounds mostly fine for a Split.

What about replicaGC? We

  • hold r.mu but then release it before wiping the on-disk state, which seems needlessly risky
  • don't even grab readOnlyCmdMu, which we really ought to
  • make a half-hearted attempt at removing the Replica from all scanner queues (since the queues are chan-based, I'm sure we can still see that Replica being used immediately after)
  • cancel all pending Raft commands (i.e. tell waiting clients "go away") (this part seems ok)

I'm fairly sure this means that any reads which are pending at the time a replica is removed might get blocked at some random point (perhaps on r.mu early on) but then proceed, perhaps hitting the wiped disk.
All of that stuff can be cleaned up (will take a look actually).

I think the two important ingredients here are a) making sure all engine writes on *Replica are orchestrated through handleRaftReady (for which it's fairly clear that it's single-threaded) and b) some way of making sure that no reads are in flight (or even scheduled). Perhaps we can trim the surface area of *Replica to expose this core better.
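
As an illustration of those two ingredients, a hypothetical removal path might take the locks in this order; the raftMu/readOnlyCmdMu fields are assumptions for the sketch, not the code as it stood.

```go
package storage

import "sync"

type Replica struct {
	raftMu        sync.Mutex   // illustrative: serializes handleRaftReady, hence engine writes
	readOnlyCmdMu sync.RWMutex // pending reads hold the read lock
	mu            sync.Mutex
}

// destroy removes a replica without letting in-flight reads see wiped
// state: block further Ready processing, drain the reads, and only then
// touch disk, holding r.mu across the wipe instead of releasing it first.
func (r *Replica) destroy(wipeDiskState func() error) error {
	r.raftMu.Lock() // no further engine writes via handleRaftReady
	defer r.raftMu.Unlock()
	r.readOnlyCmdMu.Lock() // waits for all pending reads to finish
	defer r.readOnlyCmdMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	return wipeDiskState()
}
```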

@tbg
Member

tbg commented Jul 15, 2016

Naively, I would expect Replica.handleRaftReady to be safe for concurrent use

Hmm, isn't the contract for handleRaftReady inherently non-concurrent? Raft gives you one ready after the next, and never two at the same time, and you're a state machine.
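
For reference, this is the shape of the standard etcd/raft processing loop (circa 2016, github.com/coreos/etcd/raft); it only ever hands the caller one Ready at a time.

```go
package storage

import "github.com/coreos/etcd/raft"

// raftLoop illustrates the serial contract: raft hands over one Ready at a
// time and produces the next one only after Advance is called.
func raftLoop(n raft.Node, done <-chan struct{}) {
	for {
		select {
		case rd := <-n.Ready():
			if !raft.IsEmptyHardState(rd.HardState) {
				// persist rd.HardState (plus rd.Entries and rd.Snapshot)
			}
			// then send rd.Messages and apply rd.CommittedEntries
			n.Advance() // no further Ready until this call
		case <-done:
			return
		}
	}
}
```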

Store has to ensure there is only a single Replica for a given range ID and that Replicas do not have overlapping key spans, but also having it enforce the synchronization of Replica.handleRaftReady is what feels off.

But *Store is a collection of state machines which share some underlying storage, so it makes a lot of sense that it would be *Store which has to run only those machines in parallel which don't overlap.

From my understanding of these moving parts, what Ben suggests makes a lot of sense.

@tbg
Member

tbg commented Jul 15, 2016

Hmm, maybe the snapshot could also return a side effect from handleRaftReady (one that says "my boundaries changed, please deal with it"). Or, to get rid of the double-unmarshalling we do right now (where we peek into the snapshot for the descriptor before passing it further down), we could unmarshal the snapshot in the store as we get a hold of it and, instead of feeding it into handleRaftReady, feed it into handleRaftSnapshot. Just brainstorming.

This moves some of the state machine stuff up to Store (i.e. the Replica only applies a Ready, it gets less involved with its own Raft group than it does now). The upshot is that the stuff that changes the range boundaries (split, snapshot) happens in Store itself.
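
A sketch of what such a handleRaftSnapshot entry point could look like, with placeholder Snapshot/RangeDescriptor types and a hypothetical boundsOverlapOtherReplica check.

```go
package storage

import "errors"

type RangeDescriptor struct{ StartKey, EndKey []byte }

type Snapshot struct{ Data []byte }

type Replica struct{}

func (r *Replica) applySnapshot(snap Snapshot, desc RangeDescriptor) error {
	return nil // placeholder: write the snapshot under the new bounds
}

type Store struct{}

// decodeDescriptor unmarshals the descriptor embedded in the snapshot once,
// here, instead of peeking in Store and unmarshalling again in Replica.
func decodeDescriptor(snap Snapshot) (RangeDescriptor, error) {
	return RangeDescriptor{}, nil // placeholder
}

func (s *Store) boundsOverlapOtherReplica(desc RangeDescriptor) bool {
	return false // placeholder: consult the store's replica map
}

// handleRaftSnapshot runs in Store: it learns the new boundaries first and
// can refuse or sequence the application before the replica touches disk.
func (s *Store) handleRaftSnapshot(r *Replica, snap Snapshot) error {
	desc, err := decodeDescriptor(snap)
	if err != nil {
		return err
	}
	if s.boundsOverlapOtherReplica(desc) {
		return errors.New("snapshot would overlap an existing replica")
	}
	return r.applySnapshot(snap, desc)
}
```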

@petermattis
Collaborator

Hmm, isn't the contract for handleRaftReady inherently non-concurrent? Raft gives you one ready after the next, and never two at the same time, and you're a state machine.

Currently yes, but a Replica.handleRaftReadyMu could be added to make it safe for concurrent callers (of which we won't have any) and to let other parts of the code lock it in order to block execution of Replica.handleRaftReady. With the current setup, there is no way for one replica to block the raft processing. That usually isn't a problem, except in the case of splits.

But *Store is a collection of state machines which share some underlying storage, so it makes a lot of sense that it would be *Store which has to run only those machines in parallel which don't overlap.

Yes, that makes sense. If only all access to replicas were coordinated through Store. But we have the various queues performing work on replicas without going through Store.

@bdarnell
Contributor Author

and for other parts of the code to lock in order to block execution of Replica.handleRaftReady

It makes more sense phrased that way, to block handleRaftReady rather than to allow concurrent calls to it. But it's still tricky: we have to go through the Store to get the right Replica object on which to acquire this lock. (Note that even though we have a Replica object for the right-hand side of the split during the split trigger, it's not the same as the preexisting uninitialized replica. And when we replace the uninitialized replica with our initialized version, we also replace this lock for anyone else that might be trying to access it, so we're still relying on subtle synchronization through the Store.)

But we have the various queues performing work on replicas without going through Store.

The queues should probably hold RangeIDs rather than Replicas, so they have to go back to the Store to get the Replica object (along with some other synchronization to make sure the Replica doesn't change out from under the execution, although several of the queues can work just as well even if the local replica is gone, because they do their work through a distributed transaction).
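
A minimal sketch of that RangeID-based queue, assuming a hypothetical Store.GetReplica lookup; locking around the replica map is elided.

```go
package storage

type RangeID int64

type Replica struct{}

type Store struct {
	replicas map[RangeID]*Replica // guarded by a store mutex in real code
}

func (s *Store) GetReplica(id RangeID) (*Replica, bool) {
	r, ok := s.replicas[id]
	return r, ok
}

// queue holds RangeIDs rather than *Replica, so each work item re-resolves
// the current Replica through the Store; a stale or removed replica is
// simply skipped.
type queue struct {
	pending []RangeID
}

func (q *queue) process(s *Store, work func(*Replica)) {
	for _, id := range q.pending {
		if r, ok := s.GetReplica(id); ok {
			work(r)
		}
		// If the replica is gone, queues that operate through a distributed
		// transaction could still proceed; this sketch just skips.
	}
}
```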

This moves some of the state machine stuff up to Store (i.e. the Replica only applies a Ready, it gets less involved with its own Raft group than it does now). The upshot is that the stuff that changes the range boundaries (split, snapshot) happens in Store itself.

I've had similar thoughts, but this probably belongs in #6144.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 16, 2016
Only initialized replicas are safe to parallelize, so process all
uninitialized replicas before starting goroutines for the initialized
ones.

Fixes cockroachdb#7846