storage: ensure Replica ranges do not overlap #7830
Comments
See #7833 for a bit of applied context. In that PR, I add that missing check for this part, but I have a hard time trusting it given the abundance of locks (and scarcity of locks held throughout).
I think I've found a bug that can result in the creation of overlapping replicas (maybe this is the cause of #5291?).
This serves to demonstrate that this PR works; however, we have open issues about the safety of preemptive snapshots, so we likely do not want to merge this (see cockroachdb#7830 and the discussions in cockroachdb#6144).
Summary of a discussion with @bdarnell: one way to fix this is to add a new type for range reservations (in addition to the existing family of types suggested in #6144). The problem with range reservations is that if Raft ignores the snapshot (e.g. due to an outdated term), the failure is silent.
The third option is the least unsatisfying to me. I think that's the only path forward, although maybe there is something simpler that I'm missing.
processRangeDescriptorUpdateLocked checks to see if there is an existing range before adding a Replica. We should widen the check to look for any overlapping ranges. This addresses one of the cases in cockroachdb#7830.
To make the third option a little more palatable, you can translate it into a performance improvement: instead of just a marker, you actually store the unmarshaled snapshot (you unmarshal it early to get the descriptor, and thus the key range).
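To make the decode-once idea concrete, here is a minimal sketch in Go; every name in it (rangeDescriptor, snapshotReservation, reserve, fakeDecode) is invented for illustration and is not a CockroachDB type. The snapshot is unmarshaled once when the reservation is made, the decoded descriptor supplies the key span to reserve, and the decoded payload is kept around so application does not have to unmarshal it again.

```go
// Illustrative sketch only; these names do not correspond to actual CockroachDB types.
package main

import "fmt"

type rangeDescriptor struct {
	startKey, endKey string
}

// snapshotReservation is more than a marker bit: it carries the snapshot decoded up
// front, so the key span is known at reservation time and the payload does not need
// to be unmarshaled a second time when the snapshot is applied.
type snapshotReservation struct {
	desc    rangeDescriptor
	decoded []byte // stand-in for the already-unmarshaled snapshot data
}

// reserve unmarshals the raw snapshot early (via the supplied decode function) to
// learn its key span, and keeps the decoded form for later application.
func reserve(raw []byte, decode func([]byte) (rangeDescriptor, []byte)) snapshotReservation {
	desc, decoded := decode(raw)
	return snapshotReservation{desc: desc, decoded: decoded}
}

func main() {
	fakeDecode := func(raw []byte) (rangeDescriptor, []byte) {
		return rangeDescriptor{startKey: "a", endKey: "c"}, raw
	}
	res := reserve([]byte("snapshot-bytes"), fakeDecode)
	fmt.Printf("reserved [%s, %s), cached payload: %d bytes\n",
		res.desc.startKey, res.desc.endKey, len(res.decoded))
}
```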
cc #7833 (preemptive snapshots through Raft) https://reviewable.io/reviews/cockroachdb/cockroach/7833#-KMjP00ZMUR9CftaOJxx
Looks like there should be a bit of refactoring here. Is there a reason to store the "range reservation" bit outside of the store's existing replica bookkeeping?
I think we should avoid upcalls from Replica into Store.
There are plenty of upcalls from Replica into Store already.
@arjunravinarayan's comment above, if I understand it correctly, points out that we can simply throw away all registrations after handling a full Raft processing cycle.
@tschottdorf That's my understanding as well, and my suggestion above relies on it. So where are you going to put that check, if not in a defer from the Raft processing loop?
Because Store hierarchically sits above Replica, every call in the other direction complicates concurrency.
I am playing with returning a value from there (in a WIP). But I don't even want to make it about that; let's make it about snapshot application, which can bring an uninitialized replica to an initialized state.
How are you going to propagate a value from there?
Just as you'd think.
Or rather, what I'm proposing is that the batch command would really only set the state for the real split trigger (i.e. do the writes that prepare the LHS and RHS) and then return the relevant information up to the Store, which would then adjust its replicas accordingly.
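A minimal sketch of that shape, with invented names (splitResult, applyCommand, processRaftReady are illustrative only, not the real functions): command application performs its writes and returns a description of the side effect, and the store-level loop that drove the application updates its replica bookkeeping afterwards, so nothing below the Store has to call back up into it.

```go
// Illustrative sketch only; names are invented for this example.
package main

import "fmt"

type span struct{ start, end string }

// splitResult describes the side effect of a split command: the right-hand side
// that the store should start tracking once the command has been applied.
type splitResult struct {
	rhs span
}

// applyCommand stands in for the downstream-of-Raft application of a batch command.
// It performs the writes (elided here) and returns the side effect instead of
// calling back up into the store.
func applyCommand() *splitResult {
	// ... write the state that prepares the LHS and RHS ...
	return &splitResult{rhs: span{start: "m", end: "z"}}
}

// processRaftReady stands in for the store-level loop that applies committed
// entries. Side effects are handled here, above the replica, after application.
func processRaftReady(replicas map[string]span) {
	if res := applyCommand(); res != nil {
		replicas[res.rhs.start] = res.rhs // adjust the store's bookkeeping
	}
}

func main() {
	replicas := map[string]span{"a": {start: "a", end: "m"}}
	processRaftReady(replicas)
	fmt.Println(len(replicas)) // 2: the RHS is now tracked at the store level
}
```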
The part moved to its own function in #7915 (just posted) is roughly what I would expect to be run at the Store level.
I'll take a look at your PR.
I took a look at #7915. My point still stands.
Good point about applying multiple entries in one Raft cycle.
processRangeDescriptorUpdateLocked checks to see if there is an existing range before adding a Replica. We should widen the check to look for any overlapping ranges. This addresses one of the cases in Issue cockroachdb#7830.
Yeah, I had the same thought of passing some interface down.
That's closer to passing the handler in. That said, I think in spirit we (meaning the two of us) agree on the general direction, and it's probably best to see how it shakes out at this point. I've started (with #7915) to tackle the side-effect portion of proposer-evaluated KV, and once the split trigger is moved to its proper side-effect location, it's only a few stack frames up to try out that handler without investing a lot extra (even if it gets shot down at that stage). To untangle this thread from @arjunravinarayan's proposal #3 above: that proposal can be evaluated on its own, with all of the future ways the situation may be refactored taken out of consideration.
We had that before.
storage: ensure nonoverlapping ranges

processRangeDescriptorUpdateLocked checks to see if there is an existing range before adding a Replica. We should widen the check to look for any overlapping ranges. This addresses one of the cases in Issue #7830.
Add replica placeholders as a separate type. Replica placeholders are added to the store.mu.replicasByKey BTree when a pre-emptive snapshot is approved, and atomically swapped with replicas once the snapshot is applied, preventing two overlapping snapshots from being approved simultaneously. Closes cockroachdb#7830. Also move some replica functions from store.go into replica.go. WIP on test cases for this commit.
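A condensed sketch of that mechanism in Go; the types and methods below (replicaPlaceholder, store, reserveSnapshot, applySnapshot) are invented for illustration and only mirror the idea in the commit message, not the actual implementation. A placeholder claims the key span under the store mutex when a pre-emptive snapshot is accepted, and is swapped for the real replica once the snapshot has been applied, so a second overlapping snapshot cannot be accepted in between.

```go
// Illustrative sketch only; not the CockroachDB implementation.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type span struct{ start, end string } // half-open [start, end)

func (a span) overlaps(b span) bool { return a.start < b.end && b.start < a.end }

// replicaPlaceholder occupies a key span before the replica exists, so no second
// overlapping snapshot can be accepted while this one is in flight.
type replicaPlaceholder struct {
	rangeID int64
	sp      span
}

type replica struct {
	rangeID int64
	sp      span
}

type store struct {
	mu           sync.Mutex
	replicas     map[int64]*replica
	placeholders map[int64]*replicaPlaceholder
}

// reserveSnapshot registers a placeholder for the snapshot's span, rejecting the
// snapshot if the span overlaps an existing replica or an outstanding placeholder.
// (A real implementation would consult a tree keyed on end keys instead of scanning.)
func (s *store) reserveSnapshot(rangeID int64, sp span) (*replicaPlaceholder, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, r := range s.replicas {
		if r.sp.overlaps(sp) {
			return nil, errors.New("snapshot overlaps existing replica")
		}
	}
	for _, p := range s.placeholders {
		if p.sp.overlaps(sp) {
			return nil, errors.New("snapshot overlaps outstanding placeholder")
		}
	}
	ph := &replicaPlaceholder{rangeID: rangeID, sp: sp}
	s.placeholders[rangeID] = ph
	return ph, nil
}

// applySnapshot atomically swaps the placeholder for the initialized replica.
func (s *store) applySnapshot(ph *replicaPlaceholder) *replica {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.placeholders, ph.rangeID)
	r := &replica{rangeID: ph.rangeID, sp: ph.sp}
	s.replicas[r.rangeID] = r
	return r
}

func main() {
	s := &store{replicas: map[int64]*replica{}, placeholders: map[int64]*replicaPlaceholder{}}

	ph, err := s.reserveSnapshot(1, span{"a", "c"})
	fmt.Println(err) // <nil>: first reservation succeeds

	_, err = s.reserveSnapshot(2, span{"b", "d"})
	fmt.Println(err) // rejected: overlaps the outstanding placeholder

	s.applySnapshot(ph)
	fmt.Println(len(s.replicas), len(s.placeholders)) // 1 0
}
```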
Closed via #8365.
raft: Set the RecentActive flag for newly added nodes
Nothing prevents two replicas from both considering themselves responsible for overlapping sections of the keyspace. This can happen, for example, if a wrongly configured pre-emptive snapshot is applied to a range when there is already a replica keeping track of that range, or some subset of it.
Currently store.mu.replicasByKey is a BTree keyed on the end key. This is useful for finding which replicas are responsible for a given range, but it does nothing to prevent two (or more) Replicas from becoming accidentally responsible for ranges that overlap. This proposal is to add an interval tree that keeps track of which replica is in charge of which range, locked by the existing store mutex. When a replica is added, it has to check this interval tree to ensure that it is not clobbering an existing replica. We should also consider whether this interval tree can be used in place of the current replicasByKey BTree, in which case it can replace it instead of having two trees. Note that the performance of the interval tree might differ from that of the BTree, so this should be checked.

cc @tamird, @tschottdorf
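As a sketch of the overlap check itself, here is a small Go example using the github.com/google/btree package; keyRange and overlaps are illustrative stand-ins rather than store code. Because items are ordered by end key and existing entries are assumed non-overlapping, the first entry whose end key is greater than a candidate's start key is the only one that can intersect it, so a single-step ascend suffices and no second tree is strictly required for the check.

```go
// Illustrative sketch only; keyRange and overlaps are invented for this example.
package main

import (
	"bytes"
	"fmt"

	"github.com/google/btree"
)

// keyRange is a stand-in for a replica's key span, half-open [start, end).
type keyRange struct {
	start, end []byte
}

// Less orders items by end key, mirroring a tree keyed on the end key.
func (r *keyRange) Less(than btree.Item) bool {
	return bytes.Compare(r.end, than.(*keyRange).end) < 0
}

// overlaps reports whether the candidate range overlaps any range already in the
// tree. With items ordered by end key, and the existing entries themselves
// non-overlapping, only the first item whose end key is strictly greater than the
// candidate's start key can overlap; it does so iff its start key precedes the
// candidate's end key.
func overlaps(tree *btree.BTree, cand *keyRange) bool {
	var conflict bool
	// Pivot ends at cand.start + 0x00, so iteration starts at the first existing
	// range whose end key is strictly greater than cand.start.
	pivot := &keyRange{end: append(append([]byte{}, cand.start...), 0)}
	tree.AscendGreaterOrEqual(pivot, func(i btree.Item) bool {
		existing := i.(*keyRange)
		if bytes.Compare(existing.start, cand.end) < 0 {
			conflict = true
		}
		return false // only the first successor matters
	})
	return conflict
}

func main() {
	tree := btree.New(32)
	tree.ReplaceOrInsert(&keyRange{start: []byte("b"), end: []byte("d")})

	fmt.Println(overlaps(tree, &keyRange{start: []byte("a"), end: []byte("b")})) // false: adjacent
	fmt.Println(overlaps(tree, &keyRange{start: []byte("c"), end: []byte("e")})) // true: intersects [b, d)
	fmt.Println(overlaps(tree, &keyRange{start: []byte("d"), end: []byte("f")})) // false
}
```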