
storage: ensure Replica ranges do not overlap #7830

Closed

rjnn opened this issue Jul 14, 2016 · 25 comments

Labels: S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

@rjnn (Contributor) commented Jul 14, 2016

Nothing prevents two replicas from both considering themselves responsible for overlapping sections of the key-space. This can happen, for example, if a wrongly configured pre-emptive snapshot is applied to a range when there is already a replica keeping track of a range, or some subset of it.

Currently store.mu.replicasByKey is a BTree keyed on the end key. This is useful for finding which replica is responsible for a given key, but it does nothing to prevent two (or more) Replicas from accidentally becoming responsible for ranges that overlap. The proposal is to add an interval tree, protected by the existing store mutex, that tracks which replica is in charge of which range. When a replica is added, it must check this interval tree to ensure that it is not clobbering an existing replica. We should also consider whether this interval tree can replace the current replicasByKey BTree outright, rather than maintaining two trees. Note that the interval tree's performance characteristics may differ from the BTree's, so this should be measured.
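
To make the invariant concrete, here is a minimal, self-contained sketch (hypothetical code, not the CockroachDB implementation; span and overlaps are made-up names) of the overlap check that either structure would have to enforce, assuming half-open [start, end) spans kept sorted by end key:

package main

import (
  "bytes"
  "fmt"
  "sort"
)

// span is a half-open key range [start, end).
type span struct{ start, end []byte }

// overlaps reports whether candidate intersects any span in existing.
// existing must be sorted by end key and already pairwise non-overlapping.
func overlaps(existing []span, candidate span) bool {
  // Find the first existing span whose end key is strictly greater than
  // candidate.start; if that span does not overlap the candidate, no later
  // span can, because later spans start at or after this span's end key.
  i := sort.Search(len(existing), func(i int) bool {
    return bytes.Compare(existing[i].end, candidate.start) > 0
  })
  // An overlap exists iff that span also starts before candidate.end.
  return i < len(existing) && bytes.Compare(existing[i].start, candidate.end) < 0
}

func main() {
  existing := []span{{[]byte("a"), []byte("c")}, {[]byte("c"), []byte("f")}}
  fmt.Println(overlaps(existing, span{[]byte("b"), []byte("d")})) // true
  fmt.Println(overlaps(existing, span{[]byte("f"), []byte("h")})) // false
}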

cc @tamird, @tschottdorf

rjnn self-assigned this Jul 14, 2016
@tbg (Member) commented Jul 14, 2016

See #7833 for a bit of applied context. In that PR, I add that missing check for this part:

if a wrongly configured pre-emptive snapshot is applied to a range when there is already a replica keeping track of a range, or some subset of it.

but I have a hard time trusting it given the abundance of locks (and scarcity of locks held throughout).

It seems that (*Store).canApplySnapshot already checks the condition for which the interval tree is suggested.

@bdarnell (Contributor)

I think I've found a bug that can result in the creation of overlapping replicas (maybe this is the cause of #5291?). The problem is that in handleRaftMessage, we check canApplySnapshot, but we don't update replicasByKey until later, when the (non-preemptive) snapshot is asynchronously applied. This allows two snapshots to be accepted simultaneously even if they would ultimately conflict. We need to claim the keyspace synchronously in handleRaftMessage (and then un-claim it if the snapshot is not applied in handleRaftReady?)

@petermattis (Collaborator)

replicasByKey doesn't currently ensure that there is no overlapping responsibility of the key space, but it can. I don't see any need for an IntervalTree to achieve this. We should add checks to Store.processRangeDescriptorUpdateLocked and any other manipulations of replicasByKey to enforce that invariant.

tbg added a commit to tbg/cockroach that referenced this issue Jul 15, 2016
This serves to demonstrate that this PR works; however, we have open issues
about the safety of preemptive snapshots, so we likely do not want to merge
this (see cockroachdb#7830 and the discussions in cockroachdb#6144).
@rjnn (Contributor, Author) commented Jul 15, 2016

It appears that hasOverlappingReplicaLocked does the necessary check. @bdarnell, can you confirm that this is correct? If so, we just need to ensure that this function is called in all the relevant places (such as Store.processRangeDescriptorUpdate).

@bdarnell (Contributor)

Yes, hasOverlappingReplicaLocked is the correct check. But processRangeDescriptorUpdate is too late - it is called after we have already committed our on-disk state, and if it fails we have no choice but to panic. We need to check for overlapping replicas early enough that we can still handle the error reasonably.

@rjnn (Contributor, Author) commented Jul 18, 2016

Summary of a discussion with @bdarnell:

Currently store.canApplySnapshot checks store.hasOverlappingRanges, which consults the store.mu.rangesByKey map. However, once this check succeeds, replica.applySnapshot is not called synchronously, so there is the possibility of a race if two MsgSnap requests for overlapping ranges are let through before either is applied. If this happens, the second replica.applySnapshot will currently panic, because its own overlap check happens too late (right at the end, when r.setDesc is called), after the data has already been committed to disk. Moving the overlap check before committing data to disk will help, but will currently still panic, because the failed replica.applySnapshot will bubble up through replica.handleRaftReady, which isn't allowed to fail.

One way to fix this is to add a new type for range reservations (in addition to the existing family of types suggested in #6144). When store.canApplySnapshot succeeds it should add a range reservation before passing on the MsgSnap to Raft. This way, the second store.canApplySnapshot will fail due to the pre-existing reservation.

The problem with range reservations is that if Raft ignores the snapshot (e.g. due to an outdated term), it fails silently: replica.applySnapshot is never called, so our reservation will persist forever. Thus, if we add range reservations, we need a way to evict stale ones. This is solvable through three paths:

  1. Add reservation timeouts. If we encounter a reservation older than a particular timeout, we overwrite it. This is unsatisfying: if a snapshot is legitimately delayed for any reason and its reservation gets overwritten, replica.applySnapshot will fail, and we will panic in replica.handleRaftReady.
  2. Add upstream changes to Raft so that snapshots that are rejected are documented in the next Ready struct, so that replica.handleRaftReady can use this information to remove the reservation. This requires upstream changes, and we need to be sure we cover all the code paths through which a snapshot could be rejected. This is delicate and brittle.
  3. Keep some additional side information (e.g. a boolean in store.mu.existsReservations) whenever we give out a reservation. Since we are guaranteed to either see the snapshot in the next Ready or know that it has been thrown away, if the flag is set at the start of replica.handleRaftReady we can clear out all reservations (and reset the flag) whenever there are no snapshots in the Ready.

The third option is the least unsatisfying to me. I think that's the only path forward, although maybe there is something simpler that I'm missing.
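
As a rough sketch of option 3 (hypothetical names: reserveForSnapshot, clearStaleReservations, snapshotPending; this is not the actual store API, just an illustration of the bookkeeping):

package sketch

import "sync"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

type Store struct {
  mu struct {
    sync.Mutex
    // reservations marks ranges whose snapshots have been stepped into Raft
    // but not yet applied; canApplySnapshot would refuse overlapping ones.
    reservations map[rangeID]struct{}
    // snapshotPending is set whenever a reservation is handed out, so the
    // cleanup pass below only does work when something might be stale.
    snapshotPending bool
  }
}

func newStore() *Store {
  s := &Store{}
  s.mu.reservations = map[rangeID]struct{}{}
  return s
}

// reserveForSnapshot is called once canApplySnapshot has succeeded, before the
// MsgSnap is handed to Raft, so that a second overlapping snapshot is refused.
func (s *Store) reserveForSnapshot(id rangeID) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.mu.reservations[id] = struct{}{}
  s.mu.snapshotPending = true
}

// clearStaleReservations runs after a full cycle of handleRaftReady calls.
// By then Raft has either surfaced the snapshot or silently dropped it, so
// any reservation not consumed by an applied snapshot is released here.
func (s *Store) clearStaleReservations(applied map[rangeID]bool) {
  s.mu.Lock()
  defer s.mu.Unlock()
  if !s.mu.snapshotPending {
    return
  }
  for id := range s.mu.reservations {
    if !applied[id] {
      delete(s.mu.reservations, id)
    }
  }
  s.mu.snapshotPending = false
}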

rjnn pushed a commit to rjnn/cockroach that referenced this issue Jul 18, 2016
processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in cockroachdb#7830.
@tbg (Member) commented Jul 19, 2016

You could implement the third option by hiding the bit in s.pendingRaftGroups, though that would be slightly less space efficient (snapshots are rare, and a struct{}{} value takes no space while a bool does).

To make the third option a little more palatable, you can actually translate it into a performance improvement. Namely, instead of just a marker, you actually store the unmarshaled snapshot (you unmarshal it early to get the descriptor, and thus the key range). Then, in handleRaftReady when you apply the snapshot, you can use the unmarshaled one from the map.
This of course means that we're going to leak memory at an alarming rate if the map isn't cleared out, but we know it is (unless the implementation has a bug).
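
For illustration, the cached entry could look something like this (a hypothetical sketch with self-contained stand-in types; the real code would use raftpb.Snapshot and roachpb.RangeDescriptor):

package sketch

// snapshot and rangeDescriptor stand in for raftpb.Snapshot and
// roachpb.RangeDescriptor in this sketch.
type snapshot struct{ data []byte }
type rangeDescriptor struct{ startKey, endKey []byte }

// pendingSnapshot caches a snapshot that has been stepped into Raft together
// with its eagerly decoded descriptor (and hence the key range it claims), so
// handleRaftReady can reuse it instead of unmarshaling a second time.
type pendingSnapshot struct {
  snap snapshot
  desc rangeDescriptor
}

// pendingSnapshots is keyed by range ID and cleared after every cycle of
// Ready processing, so entries for snapshots that Raft silently dropped
// cannot accumulate and leak memory.
var pendingSnapshots = map[int64]pendingSnapshot{}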

@tbg (Member) commented Jul 19, 2016

cc #7833 (preemptive snapshots through Raft) https://reviewable.io/reviews/cockroachdb/cockroach/7833#-KMjP00ZMUR9CftaOJxx

@petermattis

Just noticed that applying a snapshot here will prevent other raft processing on the store. When we eventually make normal raft processing concurrent in Store.processRaft, this will be a bottleneck.

@bdarnell

The solution to #7830 should let us fix this. Once we can reserve our spot in replicasByKey before applying the snapshot, we no longer need to hold processRaftMu throughout the snapshot application process, and we can use a Replica-level lock instead.

@petermattis (Collaborator)

Looks like there should be a bit of refactoring of handleRaftMessage. Currently, canApplySnapshot is called before we retrieve/create the replica. I think we should move it afterward, but while Store.mu is still held. canApplySnapshot should then be renamed to something like prepareForSnapshotLocked and it would add to the replicasByKey map if the replica is uninitialized. (If the replica is initialized, should we be checking that the snapshot key range matches the replica key range?)

Is there a reason to store the "range reservation" bit outside of Replica? If we're feeding a MsgSnap to rawnode.Step for an initialized replica that we've added to replicasByKey, we could add a bit (r.mu.reserved?) that is checked in Replica.handleRaftReady. I'm thinking of something like:

func (r *Replica) handleRaftReady() error {
  r.mu.Lock()
  reserved := r.mu.reserved
  ...
  r.mu.Unlock()

  if reserved {
    defer r.maybeClearReservation()
  }

  ...
}

func (r *Replica) maybeClearReservation() {
  r.mu.Lock()
  reserved := r.mu.reserved
  r.mu.reserved = false
  r.mu.Unlock()
  if !reserved {
    return
  }
  r.store.clearReservation(r)  // remove from Store.mu.replicasByKey
}

@tbg (Member) commented Jul 19, 2016

r.store.clearReservation(r) // remove from Store.mu.replicasByKey

I think we should avoid upcalls from Replica to Store where we can. If the store is holding a reservation, why hide its state somewhere else, especially when the lifecycle is that transparent (exactly like pendingRaftGroups)?

@petermattis (Collaborator)

There are plenty of upcalls from Replica to Store (e.g. r.store.processRangeDescriptorUpdate); I'm not sure why this would be problematic. Yes, the Store is holding the reservation in the form of a Replica in the replicasByKey map, but the Replica is the entity which knows when the reservation has been filled or should be cleared.

If we don't have Replica making an upcall to Store, how do you propose for Store to be told that the reservation should be cleared? A return value from Replica.handleRaftReady?

@tbg (Member) commented Jul 19, 2016

@arjunravinarayan's comment above, if I understand it correctly, points out that we can simply throw away all reservations after handling a cycle of Readys. This is because we make a reservation when we Step the RaftGroup, so after looking at the next Ready, no more snapshots for that Step can be emitted, and whatever reservation is left will never be claimed by the respective Replica.

@petermattis (Collaborator)

@tschottdorf That's my understanding as well and is used by my suggestion above. So where are you going to put that check if not in a defer from Replica.handleRaftReady? Store.processRaft is the only other location, but doing it there would require you to keep a list of all of the reservations that weren't fulfilled and then loop over them. Certainly possible to do that as well, but it doesn't seem any cleaner to me.

@tbg (Member) commented Jul 19, 2016

There are plenty of upcalls from Replica to Store (e.g. r.store.processRangeDescriptorUpdate), not sure why this would be problematic.

Because Store sits hierarchically above Replica, every call in the other direction complicates concurrency.

If we don't have Replica making an upcall to Store, how do you propose for Store to be told that the reservation should be cleared? A return value from Replica.handleRaftReady?
@tschottdorf That's my understanding as well and is used by my suggestion above. So where are you going to put that check if not in a defer from Replica.handleRaftReady? Store.processRaft is the only other location, but doing it there would require you to keep a list of all of the reservations that weren't fulfilled and then loop over them. Certainly possible to do that as well, but it doesn't seem any cleaner to me.

I am playing (in a WIP) with returning a value from handleRaftReady for the split trigger (as a starting point for #6286). The idea, again, is that the Store should perform those complex operations which exceed the scope a Replica is aware of. We will need to return side effects from command execution anyway (for proposer-eval'ed KV), and it's only a short step to also return to the *Store those side effects which are best handled there.
In that case, a return value to clean out the reservation fits right in.

But I don't even want to make it about that; let's make it about snapshot application, which can bring an uninitialized replica to an initialized state. This is also something the *Store needs to learn about, which currently again translates into awkward upcalls (r.store.processRangeDescriptorUpdate). That's another prime candidate to be a side effect returned from handleRaftReady, which makes it even more fitting that undoing the reservation should be done the same way.

And even in the case in which for some reason we don't return anything from handleRaftReady, I still feel strongly that it would be a mistake to entangle that Replica-Store upcall jungle more by separating reservation creation and removal further.

@petermattis (Collaborator)

How are you going to propagate a value from Replica.splitTrigger back up the call chain to be returned from Replica.handleRaftReady?

@tbg (Member) commented Jul 19, 2016

Just as you'd think. splitTrigger -> ... -> applyRaft* -> ... -> handleRaftReady (later, with leader-proposed Raft, the split trigger would have to take its inputs from the supplied WriteSet where possible, and the rest would need to be sent explicitly with the Raft command). Is there a problem staring me in the face that I'm failing to see?

@tbg (Member) commented Jul 19, 2016

Or rather, what I'm proposing is that the batch command would really only set up the state for the real split trigger (i.e. do the writes that prepare the LHS and RHS) and then return the relevant information up to the Store, which would then adjust its replicas accordingly (the part which is now in the batch.Defer in the split trigger). I'm also about to post a (very first) PR for this direction, one which doesn't lock us into going up to handleRaftReady; after all, we'll need this as a side effect anyway on some level.

@tbg (Member) commented Jul 19, 2016

The part moved to its own function in #7915 (just posted) is roughly what I would expect to be run on Store instead of being called from the bowels of *Replica.

@petermattis (Collaborator)

Replica.handleRaftReady loops over committed entries calling Replica.processRaftCommand which calls Replica.applyRaftCommand and on down to Replica.splitTrigger. But I don't see anything that prevents multiple committed entries from being applied in a single call to Replica.handleRaftReady. After an entry containing a split-trigger, don't the effects need to be applied before processing the next raft command?

I'll take a look at your PR.

@petermattis (Collaborator)

I took a look at #7915. My point still stands that I think splitTriggerPostCommit needs to run after the batch is committed but before the next raft command is processed. With the current structure of Replica.handleRaftReady, that implies an upcall to Store in some form, not a return value.

@tbg (Member) commented Jul 19, 2016

Good point about applying multiple entries in one Ready. That does taint it a little bit, but only means that the store needs to pass a handler down (the short distance) to handleRaftReady instead of getting a value returned from *Replica (yep, that's an upcall). You can probably see where I'm going: *Replica shouldn't be able to call random stuff on *Store. Instead, *Store should be in control and be at the top of the synchronization.
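
Concretely, the "pass a handler down" variant could look roughly like this (a hypothetical sketch; readyHandler, splitResult, and the method names are invented for illustration, not the actual code):

package sketch

// readyHandler is the narrow surface the Store hands to handleRaftReady, so
// the Replica reports side effects without calling arbitrary Store methods.
type readyHandler interface {
  // splitApplied is invoked after the batch for a split trigger has been
  // committed, before the next committed entry is processed.
  splitApplied(res splitResult)
  // replicaInitialized is invoked when applying a snapshot turns an
  // uninitialized replica into an initialized one.
  replicaInitialized(rangeID int64)
}

// splitResult carries whatever the Store needs in order to adjust its own
// bookkeeping (e.g. registering the right-hand-side replica).
type splitResult struct {
  leftRangeID, rightRangeID int64
}

type Replica struct {
  rangeID int64
}

// handleRaftReady processes one Ready worth of entries. The Store calls it
// while remaining in control of locking, and the Replica reports side effects
// only through the handler it was given.
func (r *Replica) handleRaftReady(h readyHandler) error {
  // ... apply committed entries; after a split trigger's batch commits:
  //   h.splitApplied(splitResult{leftRangeID: r.rangeID})
  // ... and after a snapshot initializes this replica:
  //   h.replicaInitialized(r.rangeID)
  return nil
}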

rjnn pushed a commit to rjnn/cockroach that referenced this issue Jul 19, 2016
processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in
Issue cockroachdb#7830.
@petermattis (Collaborator)

Yeah, I had the same thought of passing some interface to Replica.handleRaftReady that would be used to constrain the upcalls to Store, but even doing that, those are still upcalls. If Replica shouldn't be able to call random stuff on Store, should we replace Replica.store with a Replica.env of type replicaEnv:

type replicaEnv interface {
  splitRange(...) ...
  updateRangeDesc(...) ...
}

@tbg (Member) commented Jul 19, 2016

That's closer to passing the handler into handleRaftReady, but I don't see the downside to going all the way. You're right that this is still an upcall, but there's merit in restricting the scope in which it is available. When passed into handleRaftReady, *Store knows exactly what locks are being held, and the two are one stack frame apart. When available to the *Replica at any time, of course we're going to call it under the same locks, but considerably further down. The win in clarity and having a clean interface is desirable.

That said, I think in spirit we (meaning the two of us) agree on the general direction and it's probably best to see how it shakes out at this point. I've started (with #7915) to tackle the side-effect portion of proposer-eval'ed-kv, and once the split trigger is moved to proper side-effect location, it's only a few stack frames up to try out that handler without investing a lot extra (even if it gets shot down at that stage).

To untangle this thread from @arjunravinarayan's proposal #3 above, and setting aside all of the future ways the situation may be refactored, I think keeping the Replica out of it where possible is a good way to go, at least for an initial version. Assuming the store indeed holds some map of reservations (populated when we Step a raft group with a snapshot), we would not have to tell the Replica anything about reservations at all. Instead, after we've called all the handleRaftReadys we have for the moment, we loop through the reservations and remove the stale ones (i.e. removing those entries from the tree for which we don't have an initialized range, keeping the rest, and clearing out the map afterwards).
The map is usually empty, and even when it isn't, its processing overhead pales in comparison to how expensive applying a snapshot is. I also don't think we can come up with a more legible/graspable alternative: it's accessed in precisely the two moving parts of the Store's Raft processing.
Does the above check out? WDYT?

@bdarnell (Contributor)

should we replace Replica.store with Replica.env which has the type replicaEnv

We had that before with the RangeManager interface, which was removed in #2828 because we ended up adding too much cruft to it. Maybe it could work if we were more disciplined about what went into the interface.

rjnn pushed a commit that referenced this issue Jul 20, 2016
storage: ensure nonoverlapping ranges

processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in
Issue #7830.
petermattis added this to the Q3 milestone Jul 21, 2016
petermattis added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) Jul 21, 2016
rjnn pushed a commit to rjnn/cockroach that referenced this issue Aug 11, 2016
Add replica placeholders as a separate type. Replica placeholders are
added to the store.mu.replicasByKey BTree when a pre-emptive snapshot
is approved, and atomically swapped with replicas once the snapshot is
applied, preventing two overlapping snapshots from being approved
simultaneously. Closes cockroachdb#7830.

Also move some replica functions from store.go into replica.go.

WIP on test cases for this commit.
@rjnn (Contributor, Author) commented Aug 16, 2016

Closed via #8365.

rjnn closed this as completed Aug 16, 2016
pav-kv pushed a commit to pav-kv/cockroach that referenced this issue Mar 5, 2024
raft: Set the RecentActive flag for newly added nodes