storage: teach scatter to use the allocator and zone config #16249
Conversation
Reviewed 4 of 4 files at r1. pkg/storage/replica_command.go, line 3882 at r1 (raw file):
can you help me understand why this uses a backoff even in the success case? if it's for throttling, why does it need to be exponential? why does it need to be randomized? exponential backoff makes sense to me in the other loops. pkg/storage/replica_command.go, line 3884 at r1 (raw file):
hm, and we don't need to get the system config again? Comments from Reviewable |
Awesome, thanks for working on this! I like the idea of letting the normal mechanism downreplicate. I have a question though - how will downreplication choose the replicas to remove? I assume it can remove replicas we just added, but perhaps we don't expect the allocator to make that decision (at least not frequently)? I am trying to understand how much bias there is toward existing replicas when we scatter. |
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
It's doing double-duty as a "retry until no more work" loop and as a "backoff and retry on error" loop. You're right that it should look more like this:
var err error
for i := 0; i < numTries; i++ {
	target := getTarget()
	if target == nil {
		break
	}
	if err = tryAddingReplica(target); err != nil {
		// back off exponentially on error only
		time.Sleep(time.Duration(1<<uint(i)) * time.Second)
	}
}
pkg/storage/replica_command.go, line 3884 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Er, you tell me. We only use the system config to get the zone constraints. If those have changed out from under us, our Comments from Reviewable |
The replication queue does not explicitly avoid undoing recent moves, but it doesn't happen too often because the adds and removals are both based on the same heuristics. (If it did, we'd want to put some sort of bias against removing recently-added replicas because otherwise we'd just be wasting a lot of work). Reviewed 4 of 4 files at r1. pkg/storage/replica_command.go, line 3863 at r1 (raw file):
I think we tend to prefer the name pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
You could also use the If there's not going to be a sleep on success, I don't see why the backoff on failure should use a non-default retry configuration. Side note: This pkg/storage/replica_command.go, line 3884 at r1 (raw file):
You're using the past tense there and it's true that the add_replicas we made in the past iterations may have targeted the wrong stores, but if it's worth refreshing the range descriptor it's probably worth getting the zone config again for future iterations. pkg/storage/replica_command.go, line 3899 at r1 (raw file):
This is the only successful exit condition of the loop (other than hitting maxAttempts, which appears to surprisingly return without error). I'm concerned that this will just move things around up to pkg/storage/replica_command.go, line 3910 at r1 (raw file):
Only adding new replicas here and relying on the replication queue to downreplicate puts us in an unusual state with too many replicas (up to 8, if we started from 3 and use up our maxAttempts of 5). That's better than ending up with too few replicas but I still think it would be better to alternate adds and removes to make sure we stay within 1 of the desired replication factor. Comments from Reviewable |
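To make the alternation idea above concrete, here is a minimal editorial sketch (not code from this PR) of adding one replica and then waiting for the replicate queue to remove one before continuing, so the range stays within one replica of the configured factor. Every identifier below is a hypothetical stand-in:

package scatterutil // hypothetical package, for illustration only

import "context"

// AlternatingScatter adds a replica wherever the allocator suggests, then
// blocks until the replicate queue has brought the range back down to the
// configured replication factor before adding the next one.
func AlternatingScatter(
	ctx context.Context,
	maxMoves int,
	pickTarget func(context.Context) (target string, ok bool), // allocator's rebalance suggestion
	addReplica func(context.Context, string) error, // issue the replica addition
	waitForDownreplication func(context.Context) error, // poll until back at the target factor
) error {
	for i := 0; i < maxMoves; i++ {
		target, ok := pickTarget(ctx)
		if !ok {
			return nil // the allocator has no further suggestions; we're done
		}
		if err := addReplica(ctx, target); err != nil {
			return err
		}
		if err := waitForDownreplication(ctx); err != nil {
			return err
		}
	}
	return nil
}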
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3882 at r1 (raw file):
instead of a sentinel error, you could use …
pkg/storage/replica_command.go, line 3991 at r1 (raw file):
As discussed offline, it sounds from your testing that scatter is reliable enough now that we can actually fail if the retries aren't enough to resolve an error. So then we can actually fail the request instead of logging the error and ignoring it (aka no longer best-effort) Comments from Reviewable |
pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Unless I'm mistaken, removing replicas outside of the replicate queue is how things get bungled up in the first place. Comments from Reviewable |
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, BramGruneir (Bram Gruneir) wrote…
Yeah. This code is currently relying on the replicate queue to do the removals, but it's doing all the adds before it starts waiting on the downreplication. I think what i'm suggesting is to alternate adding a replica and waiting for one replica to be removed. Could we also use the replication queue to do the adds? If we do Comments from Reviewable |
pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Do you mean As for waiting for it to down-replicate first before adding another replica, I think it will be slower since the down-replicates can quickly cycle through down-replications (as long as the replica being removed is not the leader). If we keep the number of replicas lower, the chance of picking the current replica is greatly increased. Also, scatter would have to wait between down-replicates and this waiting on the other queue will take time. Comments from Reviewable |
pkg/storage/allocator.go
@@ -525,7 +529,9 @@ func (a *Allocator) ShouldTransferLease(
 		log.Infof(ctx, "ShouldTransferLease (lease-holder=%d):\n%s", leaseStoreID, sl)
 	}
-	transferDec, _ := a.shouldTransferLeaseUsingStats(ctx, sl, source, existing, stats)
+	transferDec, _ := a.shouldTransferLeaseUsingStats(
+		ctx, sl, source, existing, stats, false, /* alwaysAllowDecisionWithoutStats */
/* !alwaysAllowDecisionWithoutStats */
pkg/storage/replicate_queue.go
@@ -438,6 +440,7 @@ func (rq *replicateQueue) transferLease(
 		repl.stats,
 		checkTransferLeaseSource,
 		checkCandidateFullness,
+		false, /* alwaysAllowDecisionWithoutStats */
!alwaysAllowDecisionWithoutStats
pkg/storage/replica_command.go
	return err
}

// XXX: These retry settings are pulled from nowhere.
TODO(benesch)?
pkg/storage/replica_command.go
func doScatter(ctx context.Context, repl *Replica) error {
	desc := repl.Desc()

	sysCfg, _ := repl.store.cfg.Gossip.GetSystemConfig()
What's the omitted return value here? Not an error, right?
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, BramGruneir (Bram Gruneir) wrote…
No, I mean Yeah, it'll be slower to down-replicate before up-replicating. The question is whether that's a reasonable price to pay for keeping things more "normal". We haven't tested high replication factors like that much. I would lean towards the more conservative approach here unless it ends up being a substantial part of the overall restore time. Comments from Reviewable |
Reviewed 4 of 4 files at r1. pkg/storage/allocator.go, line 431 at r1 (raw file):
It's not your fault, but this method is getting really ugly and could use some refactoring. Three boolean parameters in a row is quite the code smell. pkg/storage/allocator.go, line 559 at r1 (raw file):
Does this need to be a parameter here or can callers just interpret an empty replica descriptor as needing to move on to not using stats? pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It indicates whether or not the config is set, so if we move forward when it's false we could very well be ignoring the real system config (using an empty one in its place). pkg/storage/replica_command.go, line 3882 at r1 (raw file):
I think the error message needs a little more work. It has two verbs and it's not clear what it means. Also, while it's not super important, if you make this a package level var it'll save an allocation each time this method is called pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
For what it's worth, I'm also worried about adding large numbers of replicas. It seems risky, particularly in this sort of high-flux situation where new members might not even have started participating in raft consensus before others are being removed. It might make for a good stress test, though :) pkg/storage/replica_command.go, line 3973 at r1 (raw file):
What happens if this particular replica gets removed? Will the descriptor keep getting updated here or will we just run into Comments from Reviewable |
I'll do another round of addressing feedback shortly! Wanted to get my initial thoughts out. Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3884 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Oh, sorry. I thought this was in the lease transfer loop, at which point all pkg/storage/replica_command.go, line 3899 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Won't hitting (Heads up: I'm going to refactor this to use a for loop and pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I initially prototyped an implementation that did exactly that: it would just repeatedly queue the replica. It didn't work very well. When you issue the scatter command after I thought of a couple workarounds for this, but none that were satisfying. The first was to have the client repeatedly issue scatter requests. In that world, scatter was essentially a "please force-add this to your replicate queue, then return and let me know if there's work to be done." This more or less worked, but it was conceptually odd (at the very least, it was certainly not a "scatter" command but more of a "is range balanced?" command) and required a ton of client-side logic to e.g. retry only the failing ranges. It also required setting a The other workaround, which I never actually tried to implement, was a range local key to indicate that a range wanted to be aggressively scattered. The scatter command would set this key on the original leaseholder and it would follow the range around until... well, that was exactly the problem. It wasn't clear when that flag should be unset. Plus, the replicate queue and allocator go to great lengths to be completely stateless, and a FWIW, my reading of the replicate queue indicates it will happily upreplicate a range without waiting for it to downreplicate, so I think the replicate queue is already susceptible to the problem you're describing, @bdarnell. I really haven't verified this claim, though. Comments from Reviewable |
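As a purely illustrative sketch of the prototype described above (the actual branch's code is not shown in this thread), "repeatedly queue the replica" might have looked roughly like this, with made-up callback names standing in for the replicate-queue API:

package scatterutil // hypothetical, for illustration only

import (
	"context"
	"time"
)

// QueueUntilBalanced keeps forcing the range through the replicate queue
// until the allocator reports no more work for it, pausing briefly between
// attempts so gossiped store stats and prior moves have a chance to settle.
func QueueUntilBalanced(
	ctx context.Context,
	hasPendingWork func(context.Context) bool, // does the allocator still want to move something?
	forceQueue func(context.Context) error, // force-process the replica in the replicate queue
) error {
	for hasPendingWork(ctx) {
		if err := forceQueue(ctx); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
	return nil
}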
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3899 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Yeah. I thought I saw a path by which it could return pkg/storage/replica_command.go, line 3910 at r1 (raw file):
Yes, if the allocator tells it to. But if we're just queuing the replica and letting the allocator do its thing, it would never do that. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
But isn't that exactly what we do here? We'll only add a replica if the allocator tells us to. The only difference is that we ask the allocator to include throttled stores when making its decision. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Eep! I definitely copied this line from elsewhere (and a quick pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Hah, thanks! I'm going to rewrite this bit with a for loop and pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Oooh, good question. I very quickly got lost while trying to find the answer, but it definitely seems like I'd need to manually query the range descriptor here. FWIW, @danhhz suggested that we don't bother waiting for the downreplication to happen. Thoughts? Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
You're asking the allocator for a rebalance target, which the replication queue would only do if ComputeAction had decided that was the next thing to do. It wouldn't do that if it was above the target replication factor and needed to be downreplicated. The similarity here is exactly why I'm trying to figure out if we can use the replicate queue instead of duplicating part of its logic. But the throttled replica thing makes it tricky; I don't see a clean way to plumb that through, so maybe it's better to just make the Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
If we don't wait for downreplication then we may start the restore while the range still has too many replicas, leading to wasted work as we copy the restored data to 6 replicas instead of 3. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
👍 I'll switch this to alternate between upreplicating and waiting for downreplication. The one downside I forsee is that we'll never remove our own replica, as we'll be the leaseholder when downreplication occurs. I think we can just transfer the lease first, though—there's no downside in issuing an Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
You can't call Comments from Reviewable |
Review status: all files reviewed at latest revision, 14 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
I just did a quick scan and it looks like all the non-test call sites that ignore the second return value are calling pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
To answer my own question here, the result of Comments from Reviewable |
Ran into some hiccups while attempting to address review feedback; Peter suggested I post them here before going any further.

First, @a-robinson rightly pointed out that r.Desc() won't be updated if the replica is transferred away from the node executing the scatter. Unfortunately, it doesn't seem like there's any way to get the updated range descriptor from the storage package, save for issuing a RangeLookup request. This isn't done anywhere else in the storage package—is it a reasonable thing to do?

Second, @bdarnell and @a-robinson both agreed that adding up to REPL-FACTOR replicas before removing any could be destabilizing, but I don't see any easy alternative. If we attempt to alternate adds and removes, for at least some ranges, the replicate queue is almost certain to transfer our lease away before we've added all the replicas we wanted to. This is another incarnation of scatter really wanting some range-local state.

So I guess I'm asking: should I continue with this attempt or try another approach? It seems to me this approach is only viable if we're a) comfortable making RangeLookup requests from the implementation of scatter, and b) comfortable running with 2*REPL-FACTOR replicas for a short period of time.
I don't see a problem with issuing RangeLookups from the scatter implementation, FWIW.
We do I'm OK with temporarily going up to 2x the target replication as long as we're waiting for the replication factor to get back down to the target before the restore process starts its heavy writes. |
I don't think it's awful to start the Import part of RESTORE while this is all down-replicating. We limit how many requests are in flight at a time, so once everything has finished down-replicating (a few minutes at most?), every Import sent after that will be replicated normally. (I remember Nikhil and I deciding at some point that not waiting for the down-replication was considerably simpler for some reason, but now I can't remember why. @benesch?) |
It's simpler not to wait for downreplication when scatter isn't best-effort—i.e., in a world where one range failing to scatter causes DistSender to give up on the entire request (a minute after observing the first failure, I think?). The downreplication is O(n) in the number of ranges scattered and so the 2tb restore will quickly run out of retries while waiting for downreplication.

We could solve this by either interspersing splits and scatters in restore or drastically increasing the retry backoff.
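For concreteness, "interspersing splits and scatters in restore" could look roughly like the sketch below; splitAt, scatter, and the span representation are hypothetical stand-ins, not the actual sqlccl restore code:

package restoreutil // hypothetical, for illustration only

import "context"

// PresplitAndScatter splits and scatters one span at a time instead of
// splitting everything up front and scattering afterwards, so a slow scatter
// only consumes retries for its own span rather than for the whole restore.
func PresplitAndScatter(
	ctx context.Context,
	spanKeys [][]byte, // split boundaries for the data being restored
	splitAt func(context.Context, []byte) error,
	scatter func(context.Context, []byte) error,
) error {
	for _, key := range spanKeys {
		if err := splitAt(ctx, key); err != nil {
			return err
		}
		if err := scatter(ctx, key); err != nil {
			return err
		}
	}
	return nil
}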
Ha! The problem is 1000% deletion tombstones. Behold the most successful scatter yet courtesy of this diff:
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -617,15 +617,11 @@ func clearRangeData(
defer iter.Close()
const metadataRanges = 2
- for i, keyRange := range makeAllKeyRanges(desc) {
+ for _, keyRange := range makeAllKeyRanges(desc) {
// The metadata ranges have a relatively small number of keys making usage
// of range tombstones (as created by ClearRange) a pessimization.
var err error
- if i < metadataRanges {
- err = batch.ClearIterRange(iter, keyRange.start, keyRange.end)
- } else {
- err = batch.ClearRange(keyRange.start, keyRange.end)
- }
+ err = batch.ClearIterRange(iter, keyRange.start, keyRange.end)
if err != nil {
return err
}
(force-pushed from f293fac to 24ae61e)
Bravo, this is really tremendous work! You stuck with it long after I would have just committed the hacks, but I'd wait for someone on core to give a final ok before merging. Review status: 0 of 12 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/ccl/sqlccl/restore.go, line 775 at r14 (raw file):
I'm happy to leave this as a followup, but are there any outstanding concerns with making a failed scatter fail the restore? (or did these last two insights fix them?) Comments from Reviewable |
I haven't been following all of the iterations of this PR closely, but the clear-range hack . Review status: 0 of 12 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. pkg/storage/replica_raftstorage.go, line 625 at r14 (raw file):
Let's add some more words here:
Comments from Reviewable |
Reviewed 1 of 3 files at r4, 10 of 10 files at r12, 2 of 2 files at r13, 9 of 9 files at r14. Comments from Reviewable |
Reviewed 10 of 10 files at r12, 2 of 2 files at r13. Comments from Reviewable |
LGTM Reviewed 1 of 10 files at r12, 2 of 2 files at r13, 9 of 9 files at r14. Comments from Reviewable |
Scattering is *much* more reliable when empty snapshots are not limited, and this doesn't seem to have any other adverse effects in my testing. It's also early in the release cycle, so we'll have time to fix any bugs this may introduce.
Having upwards of 4000 RocksDB range tombstones in one SST renders a node useless, as operations that used to take microseconds take dozens of milliseconds. Under most workloads, this situation is rare: reads don't create tombstones, and inserting data will eventually cause a compaction that cleans up these tombstones. When presplitting for a 2TB restore on an otherwise idle cluster, however, up to 12k snapshots may be applied before any data is ingested. This quickly generates more range deletion tombstones than RocksDB can handle. As a quick fix, this commit avoids deletion tombstones when clearing ranges with less than 64 keys. Down the road, we may want to investigate teaching RocksDB to automatically compact any file that exceeds a certain number of range deletion tombstones, but this should do the trick for now.
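A small editorial sketch of the heuristic this commit message describes — point deletes for spans below a threshold, a single range tombstone otherwise. The interface and the helper are invented for illustration; the real change lives in clearRangeData in pkg/storage/replica_raftstorage.go:

package clearutil // hypothetical, for illustration only

// spanClearer abstracts the two deletion strategies in play here:
// individual point deletions or one range deletion tombstone.
type spanClearer interface {
	ClearKey(key []byte) error // point delete
	ClearKeyRange(start, end []byte) error // range deletion tombstone
}

// clearRangeMinKeys mirrors the 64-key threshold from the commit message.
const clearRangeMinKeys = 64

// clearSpan deletes the given keys individually when there are few of them,
// avoiding the buildup of RocksDB range tombstones; larger spans still use a
// single range tombstone, since point-deleting them would be too slow.
func clearSpan(c spanClearer, keys [][]byte, start, end []byte) error {
	if len(keys) < clearRangeMinKeys {
		for _, k := range keys {
			if err := c.ClearKey(k); err != nil {
				return err
			}
		}
		return nil
	}
	return c.ClearKeyRange(start, end)
}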
Replace the existing "toy" implementation of scatter with a real implementation that uses the zone configuration and the allocator's recommendations.
Thanks, all, for some serious advice and reviews! Review status: 0 of 14 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful. pkg/ccl/sqlccl/restore.go, line 775 at r14 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I'm not quite convinced yet that scatter is reliable enough; often it's just one lease that fails to transfer or just one range that fails to upreplicate, and it'd be a shame to fail the whole restore over that. Tossing it in a retry loop should fix it, but since you're happy leaving that to a follow up PR, I'm definitely going to leave that to a followup PR. pkg/storage/replica_raftstorage.go, line 625 at r14 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Comments from Reviewable |
`ALTER TABLE... SCATTER` expects to receive a list of ranges that were scattered. This information was accidentally dropped in the new scatter implementation (dbd90cf, cockroachdb#16249). This commit restores the old behavior, and adds a test to boot. Fixes cockroachdb#17153.
Replace the existing "toy" implementation of scatter with a real implementation that uses the zone configuration and the allocator's recommendations.
This is a ~90% complete PR, but wanted to get it out sooner rather than later now that the necessary changes to the allocator and restore have landed. In short, the approach is to issue add replica and lease transfer commands, then wait for the replicate queue to downreplicate before returning. (This avoids causing underreplication; the old implementation would often race with the replicate queue and remove too many replicas.)
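Roughly, the approach described here — upreplicate to the allocator's suggested targets, transfer the lease, then wait for the replicate queue to downreplicate — could be pictured as the sketch below. The helpers are hypothetical; the real implementation (doScatter in pkg/storage/replica_command.go, excerpted in the review above) differs in its details:

package scatterutil // hypothetical, for illustration only

import "context"

// Scatter adds replicas everywhere the allocator suggests, moves the lease so
// this node's own replica is eligible for removal, and only returns once the
// replicate queue has removed the surplus replicas, avoiding the
// underreplication race of the old implementation.
func Scatter(
	ctx context.Context,
	rebalanceTarget func(context.Context) (store string, ok bool),
	addReplica func(context.Context, string) error,
	transferLease func(context.Context) error,
	waitForDownreplication func(context.Context) error,
) error {
	for {
		store, ok := rebalanceTarget(ctx)
		if !ok {
			break // no more suggested targets
		}
		if err := addReplica(ctx, store); err != nil {
			return err
		}
	}
	if err := transferLease(ctx); err != nil {
		return err
	}
	return waitForDownreplication(ctx)
}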
Here's an example of scattering on a 2TB restore:
I'll try to get an example where one of the nodes has too few replicas per store tomorrow; the example above really only shows leaseholder balancing and has a tooltip in the way of the interesting bit. 🤦♂️