
storage: ensure Replica ranges do not overlap #7830

Closed

rjnn opened this issue Jul 14, 2016 · 25 comments

Labels: S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

@rjnn (Contributor) commented Jul 14, 2016

Nothing prevents two replicas from both considering themselves responsible for overlapping sections of the key-space. This can happen, for example, if a wrongly configured pre-emptive snapshot is applied to a range when there is already a replica keeping track of a range, or some subset of it.

Currently store.mu.replicasByKey is a BTree keyed on the end key. This is useful for finding which replica is responsible for a given key, but it does nothing to prevent two (or more) Replicas from accidentally becoming responsible for ranges that overlap. The proposal is to add an interval tree, protected by the existing store mutex, that tracks which replica is in charge of which range. When a replica is added, it must check this interval tree to ensure that it is not clobbering an existing replica. We should also consider whether this interval tree can replace the current replicasByKey BTree outright, rather than maintaining two trees. Note that the interval tree's performance characteristics may differ from the BTree's, so this should be measured.
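
To make the invariant concrete, here is a minimal, self-contained sketch (hypothetical code, not the CockroachDB implementation; span and overlaps are made-up names) of the overlap check that either structure would have to enforce, assuming half-open [start, end) spans kept sorted by end key:

package main

import (
  "bytes"
  "fmt"
  "sort"
)

// span is a half-open key range [start, end).
type span struct{ start, end []byte }

// overlaps reports whether candidate intersects any span in existing.
// existing must be sorted by end key and already pairwise non-overlapping.
func overlaps(existing []span, candidate span) bool {
  // Find the first existing span whose end key is strictly greater than
  // candidate.start; if that span does not overlap the candidate, no later
  // span can, because later spans start at or after this span's end key.
  i := sort.Search(len(existing), func(i int) bool {
    return bytes.Compare(existing[i].end, candidate.start) > 0
  })
  // An overlap exists iff that span also starts before candidate.end.
  return i < len(existing) && bytes.Compare(existing[i].start, candidate.end) < 0
}

func main() {
  existing := []span{{[]byte("a"), []byte("c")}, {[]byte("c"), []byte("f")}}
  fmt.Println(overlaps(existing, span{[]byte("b"), []byte("d")})) // true
  fmt.Println(overlaps(existing, span{[]byte("f"), []byte("h")})) // false
}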

cc @tamird, @tschottdorf

rjnn self-assigned this Jul 14, 2016
@tbg (Member) commented Jul 14, 2016

See #7833 for a bit of applied context. In that PR, I add that missing check for this part:

if a wrongly configured pre-emptive snapshot is applied to a range when there is already a replica keeping track of a range, or some subset of it.

but I have a hard time trusting it given the abundance of locks (and scarcity of locks held throughout).

It seems that (*Store).canApplySnapshot already checks the condition for which the interval tree is suggested.

@bdarnell (Contributor)

I think I've found a bug that can result in the creation of overlapping replicas (maybe this is the cause of #5291?). The problem is that in handleRaftMessage, we check canApplySnapshot, but we don't update replicasByKey until later, when the (non-preemptive) snapshot is asynchronously applied. This allows two snapshots to be accepted simultaneously even if they would ultimately conflict. We need to claim the keyspace synchronously in handleRaftMessage (and then un-claim it if the snapshot is not applied in handleRaftReady?)

@petermattis (Collaborator)

replicasByKey doesn't currently ensure that there is no overlapping responsibility of the key space, but it can. I don't see any need for an IntervalTree to achieve this. We should add checks to Store.processRangeDescriptorUpdateLocked and any other manipulations of replicasByKey to enforce that invariant.

tbg added a commit to tbg/cockroach that referenced this issue Jul 15, 2016
This serves to demonstrate that this PR works; however, we have open issues
about the safety of preemptive snapshots, so we likely do not want to merge
this (see cockroachdb#7830 and the discussions in cockroachdb#6144).
@rjnn (Contributor, Author) commented Jul 15, 2016

It appears that hasOverlappingReplicaLocked does the necessary check. @bdarnell, can you confirm that this is correct? If so, we just need to ensure that this function is called in all the relevant places (such as Store.processRangeDescriptorUpdate).

@bdarnell (Contributor)

Yes, hasOverlappingReplicaLocked is the correct check. But processRangeDescriptorUpdate is too late - it is called after we have already committed our on-disk state, and if it fails we have no choice but to panic. We need to check for overlapping replicas early enough that we can still handle the error reasonably.

@rjnn (Contributor, Author) commented Jul 18, 2016

Summary of a discussion with @bdarnell:

Currently store.canApplySnapshot checks store.hasOverlappingRanges, which consults the store.mu.rangesByKey map. However, once this check succeeds, replica.applySnapshot is not called synchronously, so there is the possibility of a race if two MsgSnap requests for overlapping ranges are let through before either is applied. If this happens, the second replica.applySnapshot will currently panic, because its own overlap check happens too late (right at the end, when r.setDesc is called), after the data has already been committed to disk. Moving the overlap check before committing data to disk will help, but will currently still panic, because the failed replica.applySnapshot will bubble up through replica.handleRaftReady, which isn't allowed to fail.

One way to fix this is to add a new type for range reservations (in addition to the existing family of types suggested in #6144). When store.canApplySnapshot succeeds it should add a range reservation before passing on the MsgSnap to Raft. This way, the second store.canApplySnapshot will fail due to the pre-existing reservation.

The problem with range reservations is that if Raft ignores the snapshot (e.g. due to an outdated term), it fails silently: replica.applySnapshot is never called, so our reservation will persist forever. Thus, if we add range reservations, we need a way to evict stale ones. This is solvable through three paths:

  1. Add reservation timeouts. If we encounter a reservation older than a particular timeout, we overwrite it. This is unsatisfying: if a snapshot is legitimately delayed for any reason and its reservation gets overwritten, replica.applySnapshot will fail, and we will panic in replica.handleRaftReady.
  2. Add upstream changes to Raft so that snapshots that are rejected are documented in the next Ready struct, so that replica.handleRaftReady can use this information to remove the reservation. This requires upstream changes, and we need to be sure we cover all the code paths through which a snapshot could be rejected. This is delicate and brittle.
  3. Keep some additional side information (e.g. a boolean in store.mu.existsReservations) whenever we give out a reservation. Since we are guaranteed to either see the snapshot in the next Ready or know that it has been thrown away, if the flag is set at the start of replica.handleRaftReady we can clear out all reservations (and reset the flag) whenever there are no snapshots in the Ready.

The third option is the least unsatisfying to me. I think that's the only path forward, although maybe there is something simpler that I'm missing.
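
As a rough sketch of option 3 (hypothetical names: reserveForSnapshot, clearStaleReservations, snapshotPending; this is not the actual store API, just an illustration of the bookkeeping):

package sketch

import "sync"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

type Store struct {
  mu struct {
    sync.Mutex
    // reservations marks ranges whose snapshots have been stepped into Raft
    // but not yet applied; canApplySnapshot would refuse overlapping ones.
    reservations map[rangeID]struct{}
    // snapshotPending is set whenever a reservation is handed out, so the
    // cleanup pass below only does work when something might be stale.
    snapshotPending bool
  }
}

func newStore() *Store {
  s := &Store{}
  s.mu.reservations = map[rangeID]struct{}{}
  return s
}

// reserveForSnapshot is called once canApplySnapshot has succeeded, before the
// MsgSnap is handed to Raft, so that a second overlapping snapshot is refused.
func (s *Store) reserveForSnapshot(id rangeID) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.mu.reservations[id] = struct{}{}
  s.mu.snapshotPending = true
}

// clearStaleReservations runs after a full cycle of handleRaftReady calls.
// By then Raft has either surfaced the snapshot or silently dropped it, so
// any reservation not consumed by an applied snapshot is released here.
func (s *Store) clearStaleReservations(applied map[rangeID]bool) {
  s.mu.Lock()
  defer s.mu.Unlock()
  if !s.mu.snapshotPending {
    return
  }
  for id := range s.mu.reservations {
    if !applied[id] {
      delete(s.mu.reservations, id)
    }
  }
  s.mu.snapshotPending = false
}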

rjnn pushed a commit to rjnn/cockroach that referenced this issue Jul 18, 2016
processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in cockroachdb#7830.
@tbg (Member) commented Jul 19, 2016

You could implement the third option by hiding the bit in s.pendingRaftGroups, though that would be slightly less space efficient (snapshots are rare, and a struct{}{} value takes no space while a bool does).

To make the third option a little more palatable, you can actually translate it into a performance improvement. Namely, instead of just a marker, you actually store the unmarshaled snapshot (you unmarshal it early to get the descriptor, and thus the key range). Then, in handleRaftReady when you apply the snapshot, you can use the unmarshaled one from the map.
This of course means that we're going to leak memory at an alarming rate if the map isn't cleared out, but we know it is (unless the implementation has a bug).
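
For illustration, the cached entry could look something like this (a hypothetical sketch with self-contained stand-in types; the real code would use raftpb.Snapshot and roachpb.RangeDescriptor):

package sketch

// snapshot and rangeDescriptor stand in for raftpb.Snapshot and
// roachpb.RangeDescriptor in this sketch.
type snapshot struct{ data []byte }
type rangeDescriptor struct{ startKey, endKey []byte }

// pendingSnapshot caches a snapshot that has been stepped into Raft together
// with its eagerly decoded descriptor (and hence the key range it claims), so
// handleRaftReady can reuse it instead of unmarshaling a second time.
type pendingSnapshot struct {
  snap snapshot
  desc rangeDescriptor
}

// pendingSnapshots is keyed by range ID and cleared after every cycle of
// Ready processing, so entries for snapshots that Raft silently dropped
// cannot accumulate and leak memory.
var pendingSnapshots = map[int64]pendingSnapshot{}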

@tbg (Member) commented Jul 19, 2016

cc #7833 (preemptive snapshots through Raft) https://reviewable.io/reviews/cockroachdb/cockroach/7833#-KMjP00ZMUR9CftaOJxx

@petermattis

Just noticed that applying a snapshot here will prevent other raft processing on the store. When we eventually make normal raft processing concurrent in Store.processRaft, this will be a bottleneck.

@bdarnell

The solution to #7830 should let us fix this. Once we can reserve our spot in replicasByKey before applying the snapshot, we no longer need to hold processRaftMu throughout the snapshot application process, and we can use a Replica-level lock instead.

@petermattis (Collaborator)

Looks like there should be a bit of refactoring of handleRaftMessage. Currently, canApplySnapshot is called before we retrieve/create the replica. I think we should move it afterward, but while Store.mu is still held. canApplySnapshot should then be renamed to something like prepareForSnapshotLocked and it would add to the replicasByKey map if the replica is uninitialized. (If the replica is initialized, should we be checking that the snapshot key range matches the replica key range?)

Is there a reason to store the "range reservation" bit outside of Replica? If we're feeding a MsgSnap to rawnode.Step for an initialized replica that we've added to replicasByKey, we could add a bit (r.mu.reserved?) that is checked in Replica.handleRaftReady. I'm thinking of something like:

func (r *Replica) handleRaftReady() error {
  r.mu.Lock()
  reserved := r.mu.reserved
  ...
  r.mu.Unlock()

  if reserved {
    defer r.maybeClearReservation()
  }

  ...
}

func (r *Replica) maybeClearReservation() {
  r.mu.Lock()
  reserved := r.mu.reserved
  r.mu.reserved = false
  r.mu.Unlock()
  if !reserved {
    return
  }
  r.store.clearReservation(r)  // remove from Store.mu.replicasByKey
}

@tbg (Member) commented Jul 19, 2016

r.store.clearReservation(r) // remove from Store.mu.replicasByKey

I think we should avoid upcalls from Replica to Store where we can. If the store is holding a reservation, why hide its state somewhere else, especially when the lifecycle is that transparent (exactly like pendingRaftGroups)?

@petermattis (Collaborator)

There are plenty of upcalls from Replica to Store (e.g. r.store.processRangeDescriptorUpdate); I'm not sure why this would be problematic. Yes, the Store is holding the reservation in the form of a Replica in the replicasByKey map, but the Replica is the entity which knows when the reservation has been filled or should be cleared.

If we don't have Replica making an upcall to Store, how do you propose for Store to be told that the reservation should be cleared? A return value from Replica.handleRaftReady?

@tbg (Member) commented Jul 19, 2016

@arjunravinarayan's comment above, if I understand it correctly, points out that we can simply throw away all reservations after handling a cycle of Readys. This is because we make a reservation when we Step the RaftGroup, so after looking at the next Ready, no more snapshots for that Step can be emitted, and whatever reservation is left will never be claimed by the respective Replica.

@petermattis (Collaborator)

@tschottdorf That's my understanding as well and is used by my suggestion above. So where are you going to put that check if not in a defer from Replica.handleRaftReady? Store.processRaft is the only other location, but doing it there would require you to keep a list of all of the reservations that weren't fulfilled and then loop over them. Certainly possible to do that as well, but it doesn't seem any cleaner to me.

@tbg (Member) commented Jul 19, 2016

There are plenty of upcalls from Replica to Store (e.g. r.store.processRangeDescriptorUpdate), not sure why this would be problematic.

Because Store sits hierarchically above Replica, every call in the other direction complicates concurrency.

If we don't have Replica making an upcall to Store, how do you propose for Store to be told that the reservation should be cleared? A return value from Replica.handleRaftReady?
@tschottdorf That's my understanding as well and is used by my suggestion above. So where are you going to put that check if not in a defer from Replica.handleRaftReady? Store.processRaft is the only other location, but doing it there would require you to keep a list of all of the reservations that weren't fulfilled and then loop over them. Certainly possible to do that as well, but it doesn't seem any cleaner to me.

I am playing (in a WIP) with returning a value from handleRaftReady for the split trigger (as a starting point for #6286). The idea, again, is that the Store should perform those complex operations which exceed the scope a Replica is aware of. We will need to return side effects from command execution anyway (for proposer-eval'ed KV), and it's only a short step to also return to the *Store those side effects which are best handled there.
In that case, a return value to clean out the reservation fits right in.

But I don't even want to make it about that; let's make it about snapshot application, which can bring an uninitialized replica to an initialized state. This is also something the *Store needs to learn about, which currently again translates into awkward upcalls (r.store.processRangeDescriptorUpdate). That's another prime candidate to be a side effect returned from handleRaftReady, which makes it even more fitting that undoing the reservation should be done the same way.

And even in the case in which for some reason we don't return anything from handleRaftReady, I still feel strongly that it would be a mistake to entangle that Replica-Store upcall jungle more by separating reservation creation and removal further.

@petermattis (Collaborator)

How are you going to propagate a value from Replica.splitTrigger back up the call chain to be returned from Replica.handleRaftReady?

@tbg (Member) commented Jul 19, 2016

Just as you'd think. splitTrigger -> ... -> applyRaft* -> ... -> handleRaftReady (later, with leader-proposed Raft, the split trigger would have to take its inputs from the supplied WriteSet where possible, and the rest would need to be sent explicitly with the Raft command). Is there a problem staring me in the face that I'm failing to see?

@tbg (Member) commented Jul 19, 2016

Or rather, what I'm proposing is that the batch command would really only set up the state for the real split trigger (i.e. do the writes that prepare the LHS and RHS) and then return the relevant information up to the Store, which would then adjust its replicas accordingly (the part which is now in the batch.Defer in the split trigger). I'm also about to post a (very first) PR for this direction, one which doesn't lock us into going up to handleRaftReady; after all, we'll need this as a side effect anyway on some level.

@tbg (Member) commented Jul 19, 2016

The part moved to its own function in #7915 (just posted) is roughly what I would expect to be run on Store instead of being called from the bowels of *Replica.

@petermattis (Collaborator)

Replica.handleRaftReady loops over committed entries calling Replica.processRaftCommand which calls Replica.applyRaftCommand and on down to Replica.splitTrigger. But I don't see anything that prevents multiple committed entries from being applied in a single call to Replica.handleRaftReady. After an entry containing a split-trigger, don't the effects need to be applied before processing the next raft command?

I'll take a look at your PR.

@petermattis (Collaborator)

I took a look at #7915. My point still stands that I think splitTriggerPostCommit needs to run after the batch is committed but before the next raft command is processed. With the current structure of Replica.handleRaftReady, that implies an upcall to Store in some form, not a return value.

@tbg (Member) commented Jul 19, 2016

Good point about applying multiple entries in one Ready. That does taint it a little bit, but only means that the store needs to pass a handler down (the short distance) to handleRaftReady instead of getting a value returned from *Replica (yep, that's an upcall). You can probably see where I'm going: *Replica shouldn't be able to call random stuff on *Store. Instead, *Store should be in control and be at the top of the synchronization.
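
Concretely, the "pass a handler down" variant could look roughly like this (a hypothetical sketch; readyHandler, splitResult, and the method names are invented for illustration, not the actual code):

package sketch

// readyHandler is the narrow surface the Store hands to handleRaftReady, so
// the Replica reports side effects without calling arbitrary Store methods.
type readyHandler interface {
  // splitApplied is invoked after the batch for a split trigger has been
  // committed, before the next committed entry is processed.
  splitApplied(res splitResult)
  // replicaInitialized is invoked when applying a snapshot turns an
  // uninitialized replica into an initialized one.
  replicaInitialized(rangeID int64)
}

// splitResult carries whatever the Store needs in order to adjust its own
// bookkeeping (e.g. registering the right-hand-side replica).
type splitResult struct {
  leftRangeID, rightRangeID int64
}

type Replica struct {
  rangeID int64
}

// handleRaftReady processes one Ready worth of entries. The Store calls it
// while remaining in control of locking, and the Replica reports side effects
// only through the handler it was given.
func (r *Replica) handleRaftReady(h readyHandler) error {
  // ... apply committed entries; after a split trigger's batch commits:
  //   h.splitApplied(splitResult{leftRangeID: r.rangeID})
  // ... and after a snapshot initializes this replica:
  //   h.replicaInitialized(r.rangeID)
  return nil
}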

rjnn pushed a commit to rjnn/cockroach that referenced this issue Jul 19, 2016
processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in
Issue cockroachdb#7830.
@petermattis (Collaborator)

Yeah, I had the same thought of passing some interface to Replica.handleRaftReady that would be used to constrain the upcalls to Store, but even doing that, those are still upcalls. If Replica shouldn't be able to call random stuff on Store, should we replace Replica.store with a Replica.env of type replicaEnv:

type replicaEnv interface {
  splitRange(...) ...
  updateRangeDesc(...) ...
}

@tbg (Member) commented Jul 19, 2016

That's closer to passing the handler into handleRaftReady, but I don't see the downside to going all the way. You're right that this is still an upcall, but there's merit in restricting the scope in which it is available. When passed into handleRaftReady, *Store knows exactly what locks are being held, and the two are one stack frame apart. When available to the *Replica at any time, of course we're going to call it under the same locks, but considerably further down. The win in clarity and having a clean interface is desirable.

That said, I think in spirit we (meaning the two of us) agree on the general direction and it's probably best to see how it shakes out at this point. I've started (with #7915) to tackle the side-effect portion of proposer-eval'ed-kv, and once the split trigger is moved to proper side-effect location, it's only a few stack frames up to try out that handler without investing a lot extra (even if it gets shot down at that stage).

To untangle this thread from @arjunravinarayan's proposal #3 above, and setting aside all of the future ways the situation may be refactored, I think keeping the Replica out of it where possible is a good way to go, at least for an initial version. Assuming the store indeed holds some map of reservations (populated when we Step a raft group with a snapshot), we would not have to tell the Replica anything about reservations at all. Instead, after we've called all the handleRaftReadys we have for the moment, we loop through the reservations and remove the stale ones (i.e. removing those entries from the tree for which we don't have an initialized range, keeping the rest, and clearing out the map afterwards).
The map is usually empty, and even when it isn't, its processing overhead pales in comparison to how expensive applying a snapshot is. I also don't think we can come up with a more legible/graspable alternative: it's accessed in precisely the two moving parts of the Store's Raft processing.
Does the above check out? WDYT?

@bdarnell (Contributor)

should we replace Replica.store with Replica.env which has the type replicaEnv

We had that before with the RangeManager interface, which was removed in #2828 because we ended up adding too much cruft to it. Maybe it could work if we were more disciplined about what went into the interface.

rjnn pushed a commit that referenced this issue Jul 20, 2016
storage: ensure nonoverlapping ranges

processRangeDescriptorUpdateLocked checks to see if there is an
existing range before adding a Replica. We should widen the check to
look for any overlapping ranges. This addresses one of the cases in
Issue #7830.
petermattis added this to the Q3 milestone Jul 21, 2016
petermattis added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) Jul 21, 2016
rjnn pushed a commit to rjnn/cockroach that referenced this issue Aug 11, 2016
Add replica placeholders as a separate type. Replica placeholders are
added to the store.mu.replicasByKey BTree when a pre-emptive snapshot
is approved, and atomically swapped with replicas once the snapshot is
applied, preventing two overlapping snapshots from being approved
simultaneously. Closes cockroachdb#7830.

Also move some replica functions from store.go into replica.go.

WIP on test cases for this commit.
@rjnn (Contributor, Author) commented Aug 16, 2016

Closed via #8365.

rjnn closed this as completed Aug 16, 2016
pav-kv pushed a commit to pav-kv/cockroach that referenced this issue Mar 5, 2024
raft: Set the RecentActive flag for newly added nodes