
storage: Re-enable storeReplicaRaftReadyConcurrency #7846

Closed
bdarnell opened this issue Jul 14, 2016 · 9 comments

@bdarnell
Contributor

This was disabled in #7673 due to concerns raised in #7672. It was a big performance improvement, so we should figure out how to turn it back on.

It should always be possible to process initialized replicas concurrently - they are guaranteed to address disjoint parts of the keyspace. The only possible interference is between an initialized and uninitialized replica (the uninitialized replica writing a HardState as the initialized one splits), or between two uninitialized replicas (both applying snapshots, one pre-split and one post-split).

Therefore I believe we can fix the races by processing all uninitialized replicas one at a time, then processing the initialized replicas in parallel.
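
A minimal Go sketch of that scheme (the Replica type and names here are illustrative stand-ins, not the actual cockroach code): uninitialized replicas are drained serially, and only the initialized, provably disjoint ones are fanned out to goroutines.

```go
package storage

import "sync"

// Replica is a stand-in for the real type; only the initialized flag
// matters for this sketch.
type Replica struct {
	initialized bool
}

func (r *Replica) handleRaftReady() {
	// Apply the pending raft.Ready: HardState, log entries, snapshot.
}

// processReplicas handles uninitialized replicas one at a time (they may
// interfere with a split or another snapshot), then fans out the
// initialized ones, whose keyspaces are disjoint and therefore safe to
// process in parallel.
func processReplicas(replicas []*Replica) {
	var initialized []*Replica
	for _, r := range replicas {
		if !r.initialized {
			r.handleRaftReady() // serial: may touch contested keyspace
		} else {
			initialized = append(initialized, r)
		}
	}
	var wg sync.WaitGroup
	for _, r := range initialized {
		wg.Add(1)
		go func(r *Replica) {
			defer wg.Done()
			r.handleRaftReady() // concurrent: disjoint keyspace
		}(r)
	}
	wg.Wait()
}
```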

bdarnell self-assigned this Jul 14, 2016
@bdarnell
Contributor Author

The race between two uninitialized ranges applying snapshots exists regardless of concurrency settings; see #7830

@petermattis
Collaborator

Dividing the replicas to process into initialized and uninitialized sets would be straightforward, but leaving this important aspect of Replica synchronization outside of Replica feels off to me; then again, the alternative feels more dangerous. I hope the PR that addresses this issue comes with a healthy dose of commentary.

@tbg
Member

tbg commented Jul 15, 2016

but leaving this important aspect of Replica synchronization outside of Replica feels off to me.

What important aspect do you mean (i.e. what's the dangerous alternative)? I feel ok about this purely from the Raft perspective. The Store is in charge of making sure each Replica is always alone on its keyspace, so it needs to be the one which tells replicas when it's safe to write to disk.

OTOH (and this is somewhat out of scope for this issue), we have various moving parts holding on to replicas outside of the Raft processing loop (scanner queues, reads, etc). Even pre-Raft, operations pass through Store (but without holding on to any locks) and end up in Replica, which then uses the command queue to serialize those items which touch overlapping key ranges. All of that concurrency seems pretty tangled up and the Store doesn't really know about it.
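
For readers unfamiliar with the command queue mentioned above, here is a toy model of that serialization, assuming commands carry a half-open key span; the real implementation is interval-tree based and considerably more involved.

```go
package storage

import "bytes"

type span struct{ start, end []byte } // half-open interval [start, end)

type cmd struct {
	sp   span
	done chan struct{} // closed when the command finishes
}

func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// waitForOverlapping blocks a new command behind every pending command
// whose key span overlaps its own; commands on disjoint spans proceed
// concurrently.
func waitForOverlapping(pending []*cmd, newSpan span) {
	for _, p := range pending {
		if overlaps(p.sp, newSpan) {
			<-p.done
		}
	}
}
```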

@petermattis
Collaborator

Naively, I would expect Replica.handleRaftReady to be safe for concurrent use. Store has to ensure there is only a single Replica for a given range ID and that Replicas do not have overlapping key spans, but also having it enforce the synchronization of Replica.handleRaftReady is what feels off.

The more dangerous alternative is to try to replace Store.processRaftMu with a Replica.handleRaftReadyMu.
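
To make the trade-off concrete, here is a hedged sketch of both schemes; the field and method names mirror the comment, not the actual code.

```go
package storage

import "sync"

type Replica struct {
	handleRaftReadyMu sync.Mutex // the hypothetical per-replica lock
}

func (r *Replica) handleRaftReady() { /* apply the pending raft.Ready */ }

type Store struct {
	processRaftMu sync.Mutex // today: one lock serializes all Ready handling
	replicas      []*Replica
}

// Today: trivially safe, but zero concurrency across replicas.
func (s *Store) processRaftCurrent() {
	s.processRaftMu.Lock()
	defer s.processRaftMu.Unlock()
	for _, r := range s.replicas {
		r.handleRaftReady()
	}
}

// Alternative: each replica locks only itself. Anything that must freeze
// another replica (e.g. a split freezing the RHS) now has to find and lock
// that replica explicitly, which is the dangerous part.
func (s *Store) processRaftAlternative() {
	for _, r := range s.replicas {
		go func(r *Replica) {
			r.handleRaftReadyMu.Lock()
			defer r.handleRaftReadyMu.Unlock()
			r.handleRaftReady()
		}(r)
	}
}
```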

@tbg
Member

tbg commented Jul 15, 2016

Again, not the right issue, but I wanted to dump two other observations:

  • readOnlyCmdMu is really only ever used for protecting against timestamp cache skew during splits. As such, it's kind of redundant - it would be equivalent to making any EndTransaction with a split occupy the whole key range in the command queue instead (in fact, the RHS would suffice).
  • (*Store).SplitRange is one of the most awkward locking situations because the path is acquire processRaftMu -> handleRaftReady -> many stack frames in *Replica -> (*Replica).splitTrigger -> SplitRange. With #6286 (storage{,/engine}: separate Raft application side effects, remove (Batch).Defer()), I think we could instead return a side effect from handleRaftReady so that the processRaft goroutine could do the tricky work of shifting the replicas around. In particular, the commit trigger would essentially move there completely - the Store would make sure the LHS is without moving parts, copy the timestamp cache etc., create the RHS and finally let the updated and the new replica loose. I think that would make a lot more sense than the current state of affairs (see the sketch after this list).
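
A rough sketch of the side-effect idea from the second bullet, with a hypothetical SideEffects/SplitResult shape; nothing here reflects the eventual implementation.

```go
package storage

type RangeDescriptor struct{ StartKey, EndKey []byte }

// SplitResult is a hypothetical side effect describing a committed split.
type SplitResult struct {
	LeftDesc, RightDesc RangeDescriptor
}

type SideEffects struct {
	Split *SplitResult // non-nil if this Ready committed a split trigger
}

type Replica struct{}

// handleRaftReady applies the Ready but, instead of calling up into
// (*Store).SplitRange from deep inside the replica, returns the split as
// data for the caller to act on.
func (r *Replica) handleRaftReady() SideEffects {
	// ... apply entries; record a committed split trigger, if any ...
	return SideEffects{}
}

type Store struct{}

// processRaft is the goroutine that owns replica topology: it quiesces the
// LHS, copies the timestamp cache, installs the RHS, and only then lets the
// updated and the new replica loose.
func (s *Store) processRaft(r *Replica) {
	if se := r.handleRaftReady(); se.Split != nil {
		s.SplitRange(se.Split)
	}
}

func (s *Store) SplitRange(split *SplitResult) { /* shift replicas around */ }
```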

The second point brings me back towards the previous post: How does the store make sure the Replica is without moving parts? Currently, what's done implicitly is a) no Raft command is being applied (because this is called from within a Raft command as it commits), and b) there are no concurrent reads (at least during the critical section of copying the timestamp cache, via readOnlyCmdMu). That sounds mostly fine for a Split.

What about replicaGC? We

  • hold r.mu but then release it before wiping the on-disk state, which seems needlessly risky
  • don't even grab readOnlyCmdMu, which we really ought to
  • make a half-hearted attempt at removing the Replica from all scanner queues (since the queues are chan-based, I'm sure we can still see that Replica being used immediately after)
  • cancel all pending Raft commands (i.e. tell waiting clients "go away") (this part seems ok)

I'm fairly sure this means that any reads which are pending at the time a replica is removed might get blocked at some random point (perhaps on r.mu early on) but then proceed, perhaps hitting the wiped disk.
All of that stuff can be cleaned up (will take a look actually).

I think the two important ingredients here are a) making sure all engine writes on *Replica are orchestrated through handleRaftReady (for which it's fairly clear that it's single-threaded) and b) some way of making sure that no reads are in flight (or even scheduled). Perhaps we can trim the surface area of *Replica to expose this core better.
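
As an illustration of those two ingredients, a hypothetical removal path might take the locks in this order; the raftMu/readOnlyCmdMu fields are assumptions for the sketch, not the code as it stood.

```go
package storage

import "sync"

type Replica struct {
	raftMu        sync.Mutex   // illustrative: serializes handleRaftReady, hence engine writes
	readOnlyCmdMu sync.RWMutex // pending reads hold the read lock
	mu            sync.Mutex
}

// destroy removes a replica without letting in-flight reads see wiped
// state: block further Ready processing, drain the reads, and only then
// touch disk, holding r.mu across the wipe instead of releasing it first.
func (r *Replica) destroy(wipeDiskState func() error) error {
	r.raftMu.Lock() // no further engine writes via handleRaftReady
	defer r.raftMu.Unlock()
	r.readOnlyCmdMu.Lock() // waits for all pending reads to finish
	defer r.readOnlyCmdMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	return wipeDiskState()
}
```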

@tbg
Member

tbg commented Jul 15, 2016

Naively, I would expect Replica.handleRaftReady to be safe for concurrent use

Hmm, isn't the contract for handleRaftReady inherently non-concurrent? Raft gives you one ready after the next, and never two at the same time, and you're a state machine.
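
For reference, this is the shape of the standard etcd/raft processing loop (circa 2016, github.com/coreos/etcd/raft); it only ever hands the caller one Ready at a time.

```go
package storage

import "github.com/coreos/etcd/raft"

// raftLoop illustrates the serial contract: raft hands over one Ready at a
// time and produces the next one only after Advance is called.
func raftLoop(n raft.Node, done <-chan struct{}) {
	for {
		select {
		case rd := <-n.Ready():
			if !raft.IsEmptyHardState(rd.HardState) {
				// persist rd.HardState (plus rd.Entries and rd.Snapshot)
			}
			// then send rd.Messages and apply rd.CommittedEntries
			n.Advance() // no further Ready until this call
		case <-done:
			return
		}
	}
}
```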

Store has to ensure there is only a single Replica for a given range ID and that Replicas do not have overlapping key spans, but also having it enforce the synchronization of Replica.handleRaftReady is what feels off.

But *Store is a collection of state machines which share some underlying storage, so it makes a lot of sense that it would be *Store which has to run only those machines in parallel which don't overlap.

From my understanding of these moving parts, what Ben suggests makes a lot of sense.

@tbg
Member

tbg commented Jul 15, 2016

Hmm, maybe the snapshot could also return a side effect from handleRaftReady (one that says "my boundaries changed, please deal with it"). Or, to get rid of the double-unmarshalling we do right now (where we peek into the snapshot for the descriptor before passing it further down), we could unmarshal the snapshot in the store as we get a hold of it and, instead of feeding it into handleRaftReady, feed it into handleRaftSnapshot. Just brainstorming.

This moves some of the state machine stuff up to Store (i.e. the Replica only applies a Ready, it gets less involved with its own Raft group than it does now). The upshot is that the stuff that changes the range boundaries (split, snapshot) happens in Store itself.
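
A sketch of what such a handleRaftSnapshot entry point could look like, with placeholder Snapshot/RangeDescriptor types and a hypothetical boundsOverlapOtherReplica check.

```go
package storage

import "errors"

type RangeDescriptor struct{ StartKey, EndKey []byte }

type Snapshot struct{ Data []byte }

type Replica struct{}

func (r *Replica) applySnapshot(snap Snapshot, desc RangeDescriptor) error {
	return nil // placeholder: write the snapshot under the new bounds
}

type Store struct{}

// decodeDescriptor unmarshals the descriptor embedded in the snapshot once,
// here, instead of peeking in Store and unmarshalling again in Replica.
func decodeDescriptor(snap Snapshot) (RangeDescriptor, error) {
	return RangeDescriptor{}, nil // placeholder
}

func (s *Store) boundsOverlapOtherReplica(desc RangeDescriptor) bool {
	return false // placeholder: consult the store's replica map
}

// handleRaftSnapshot runs in Store: it learns the new boundaries first and
// can refuse or sequence the application before the replica touches disk.
func (s *Store) handleRaftSnapshot(r *Replica, snap Snapshot) error {
	desc, err := decodeDescriptor(snap)
	if err != nil {
		return err
	}
	if s.boundsOverlapOtherReplica(desc) {
		return errors.New("snapshot would overlap an existing replica")
	}
	return r.applySnapshot(snap, desc)
}
```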

@petermattis
Collaborator

Hmm, isn't the contract for handleRaftReady inherently non-concurrent? Raft gives you one ready after the next, and never two at the same time, and you're a state machine.

Currently yes, but a Replica.handleRaftReadyMu could be added to make it safe for concurrent callers (of which we won't have any) and to let other parts of the code lock it in order to block execution of Replica.handleRaftReady. With the current setup, there is no way for one replica to block the raft processing. That usually isn't a problem, except in the case of splits.

But *Store is a collection of state machines which share some underlying storage, so it makes a lot of sense that it would be *Store which has to run only those machines in parallel which don't overlap.

Yes, that makes sense. If only all access to replicas were coordinated through Store. But we have the various queues performing work on replicas without going through Store.

@bdarnell
Contributor Author

and for other parts of the code to lock in order to block execution of Replica.handleRaftReady

It makes more sense phrased that way, to block handleRaftReady rather than to allow concurrent calls to it. But it's still tricky: we have to go through the Store to get the right Replica object on which to acquire this lock. (Note that even though we have a Replica object for the right-hand side of the split during the split trigger, it's not the same as the preexisting uninitialized replica. And when we replace the uninitialized replica with our initialized version, we also replace this lock for anyone else that might be trying to access it, so we're still relying on subtle synchronization through the Store.)

But we have the various queues performing work on replicas without going through Store.

The queues should probably hold RangeIDs rather than Replicas, so they have to go back to the Store to get the Replica object (along with some other synchronization to make sure the Replica doesn't change out from under the execution, although several of the queues can work just as well even if the local replica is gone, because they do their work through a distributed transaction).
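
A minimal sketch of that RangeID-based queue, assuming a hypothetical Store.GetReplica lookup; locking around the replica map is elided.

```go
package storage

type RangeID int64

type Replica struct{}

type Store struct {
	replicas map[RangeID]*Replica // guarded by a store mutex in real code
}

func (s *Store) GetReplica(id RangeID) (*Replica, bool) {
	r, ok := s.replicas[id]
	return r, ok
}

// queue holds RangeIDs rather than *Replica, so each work item re-resolves
// the current Replica through the Store; a stale or removed replica is
// simply skipped.
type queue struct {
	pending []RangeID
}

func (q *queue) process(s *Store, work func(*Replica)) {
	for _, id := range q.pending {
		if r, ok := s.GetReplica(id); ok {
			work(r)
		}
		// If the replica is gone, queues that operate through a distributed
		// transaction could still proceed; this sketch just skips.
	}
}
```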

This moves some of the state machine stuff up to Store (i.e. the Replica only applies a Ready, it gets less involved with its own Raft group than it does now). The upshot is that the stuff that changes the range boundaries (split, snapshot) happens in Store itself.

I've had similar thoughts, but this probably belongs in #6144.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 16, 2016
Only initialized replicas are safe to parallelize, so process all
uninitialized replicas before starting goroutines for the initialized
ones.

Fixes cockroachdb#7846