storage: teach scatter to use the allocator and zone config #16249
Conversation
Reviewed 4 of 4 files at r1. pkg/storage/replica_command.go, line 3882 at r1 (raw file):
can you help me understand why this uses a backoff even in the success case? if it's for throttling, why does it need to be exponential? why does it need to be randomized? exponential backoff makes sense to me in the other loops. pkg/storage/replica_command.go, line 3884 at r1 (raw file):
hm, and we don't need to get the system config again? Comments from Reviewable |
Awesome, thanks for working on this! I like the idea of letting the normal mechanism downreplicate. I have a question though - how will downreplication choose the replicas to remove? I assume it can remove replicas we just added, but perhaps we don't expect the allocator to make that decision (at least not frequently)? I am trying to understand how much bias there is toward existing replicas when we scatter. |
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
It's doing double-duty as a "retry until no more work" loop and as a "backoff and retry on error" loop. You're right that it should look more like this:
var err error
for i := 0; i < numTries; i++ {
	target := getTarget()
	if target == nil {
		break
	}
	if err = tryAddingReplica(target); err != nil {
		// back off exponentially on error only
		time.Sleep(time.Duration(1<<uint(i)) * time.Second)
	}
}
pkg/storage/replica_command.go, line 3884 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Er, you tell me. We only use the system config to get the zone constraints. If those have changed out from under us, our Comments from Reviewable |
The replication queue does not explicitly avoid undoing recent moves, but it doesn't happen too often because the adds and removals are both based on the same heuristics. (If it did, we'd want to put some sort of bias against removing recently-added replicas because otherwise we'd just be wasting a lot of work). Reviewed 4 of 4 files at r1. pkg/storage/replica_command.go, line 3863 at r1 (raw file):
I think we tend to prefer the name pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
You could also use the If there's not going to be a sleep on success, I don't see why the backoff on failure should use a non-default retry configuration. Side note: This pkg/storage/replica_command.go, line 3884 at r1 (raw file):
You're using the past tense there and it's true that the add_replicas we made in the past iterations may have targeted the wrong stores, but if it's worth refreshing the range descriptor it's probably worth getting the zone config again for future iterations. pkg/storage/replica_command.go, line 3899 at r1 (raw file):
This is the only successful exit condition of the loop (other than hitting maxAttempts, which appears to surprisingly return without error). I'm concerned that this will just move things around up to pkg/storage/replica_command.go, line 3910 at r1 (raw file):
Only adding new replicas here and relying on the replication queue to downreplicate puts us in an unusual state with too many replicas (up to 8, if we started from 3 and use up our maxAttempts of 5). That's better than ending up with too few replicas but I still think it would be better to alternate adds and removes to make sure we stay within 1 of the desired replication factor. Comments from Reviewable |
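To make the alternation idea above concrete, here is a minimal editorial sketch (not code from this PR) of adding one replica and then waiting for the replicate queue to remove one before continuing, so the range stays within one replica of the configured factor. Every identifier below is a hypothetical stand-in:

package scatterutil // hypothetical package, for illustration only

import "context"

// AlternatingScatter adds a replica wherever the allocator suggests, then
// blocks until the replicate queue has brought the range back down to the
// configured replication factor before adding the next one.
func AlternatingScatter(
	ctx context.Context,
	maxMoves int,
	pickTarget func(context.Context) (target string, ok bool), // allocator's rebalance suggestion
	addReplica func(context.Context, string) error, // issue the replica addition
	waitForDownreplication func(context.Context) error, // poll until back at the target factor
) error {
	for i := 0; i < maxMoves; i++ {
		target, ok := pickTarget(ctx)
		if !ok {
			return nil // the allocator has no further suggestions; we're done
		}
		if err := addReplica(ctx, target); err != nil {
			return err
		}
		if err := waitForDownreplication(ctx); err != nil {
			return err
		}
	}
	return nil
}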
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3882 at r1 (raw file):
instead of a sentinel error, you could use …
pkg/storage/replica_command.go, line 3991 at r1 (raw file):
As discussed offline, it sounds from your testing that scatter is reliable enough now that we can actually fail if the retries aren't enough to resolve an error. So then we can actually fail the request instead of logging the error and ignoring it (aka no longer best-effort) Comments from Reviewable |
pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Unless I'm mistaken, removing replicas outside of the replicate queue is how things get bungled up in the first place. Comments from Reviewable |
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, BramGruneir (Bram Gruneir) wrote…
Yeah. This code is currently relying on the replicate queue to do the removals, but it's doing all the adds before it starts waiting on the downreplication. I think what i'm suggesting is to alternate adding a replica and waiting for one replica to be removed. Could we also use the replication queue to do the adds? If we do Comments from Reviewable |
pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Do you mean As for waiting for it to down-replicate first before adding another replica, I think it will be slower since the down-replicates can quickly cycle through down-replications (as long as the replica being removed is not the leader). If we keep the number of replicas lower, the chance of picking the current replica is greatly increased. Also, scatter would have to wait between down-replicates and this waiting on the other queue will take time. Comments from Reviewable |
pkg/storage/allocator.go
@@ -525,7 +529,9 @@ func (a *Allocator) ShouldTransferLease(
 		log.Infof(ctx, "ShouldTransferLease (lease-holder=%d):\n%s", leaseStoreID, sl)
 	}
-	transferDec, _ := a.shouldTransferLeaseUsingStats(ctx, sl, source, existing, stats)
+	transferDec, _ := a.shouldTransferLeaseUsingStats(
+		ctx, sl, source, existing, stats, false, /* alwaysAllowDecisionWithoutStats */
/* !alwaysAllowDecisionWithoutStats */
pkg/storage/replicate_queue.go
@@ -438,6 +440,7 @@ func (rq *replicateQueue) transferLease(
 		repl.stats,
 		checkTransferLeaseSource,
 		checkCandidateFullness,
+		false, /* alwaysAllowDecisionWithoutStats */
!alwaysAllowDecisionWithoutStats
pkg/storage/replica_command.go
	return err
}

// XXX: These retry settings are pulled from nowhere.
TODO(benesch)?
pkg/storage/replica_command.go
func doScatter(ctx context.Context, repl *Replica) error {
	desc := repl.Desc()

	sysCfg, _ := repl.store.cfg.Gossip.GetSystemConfig()
What's the omitted return value here? Not an error, right?
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, BramGruneir (Bram Gruneir) wrote…
No, I mean Yeah, it'll be slower to down-replicate before up-replicating. The question is whether that's a reasonable price to pay for keeping things more "normal". We haven't tested high replication factors like that much. I would lean towards the more conservative approach here unless it ends up being a substantial part of the overall restore time. Comments from Reviewable |
Reviewed 4 of 4 files at r1. pkg/storage/allocator.go, line 431 at r1 (raw file):
It's not your fault, but this method is getting really ugly and could use some refactoring. Three boolean parameters in a row is quite the code smell. pkg/storage/allocator.go, line 559 at r1 (raw file):
Does this need to be a parameter here or can callers just interpret an empty replica descriptor as needing to move on to not using stats? pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It indicates whether or not the config is set, so if we move forward when it's false we could very well be ignoring the real system config (using an empty one in its place). pkg/storage/replica_command.go, line 3882 at r1 (raw file):
I think the error message needs a little more work. It has two verbs and it's not clear what it means. Also, while it's not super important, if you make this a package level var it'll save an allocation each time this method is called pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
For what it's worth, I'm also worried about adding large numbers of replicas. It seems risky, particularly in this sort of high-flux situation where new members might not even have started participating in raft consensus before others are being removed. It might make for a good stress test, though :) pkg/storage/replica_command.go, line 3973 at r1 (raw file):
What happens if this particular replica gets removed? Will the descriptor keep getting updated here or will we just run into Comments from Reviewable |
I'll do another round of addressing feedback shortly! Wanted to get my initial thoughts out. Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3884 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Oh, sorry. I thought this was in the lease transfer loop, at which point all pkg/storage/replica_command.go, line 3899 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Won't hitting (Heads up: I'm going to refactor this to use a for loop and pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I initially prototyped an implementation that did exactly that: it would just repeatedly queue the replica. It didn't work very well. When you issue the scatter command after I thought of a couple workarounds for this, but none that were satisfying. The first was to have the client repeatedly issue scatter requests. In that world, scatter was essentially a "please force-add this to your replicate queue, then return and let me know if there's work to be done." This more or less worked, but it was conceptually odd (at the very least, it was certainly not a "scatter" command but more of a "is range balanced?" command) and required a ton of client-side logic to e.g. retry only the failing ranges. It also required setting a The other workaround, which I never actually tried to implement, was a range local key to indicate that a range wanted to be aggressively scattered. The scatter command would set this key on the original leaseholder and it would follow the range around until... well, that was exactly the problem. It wasn't clear when that flag should be unset. Plus, the replicate queue and allocator go to great lengths to be completely stateless, and a FWIW, my reading of the replicate queue indicates it will happily upreplicate a range without waiting for it to downreplicate, so I think the replicate queue is already susceptible to the problem you're describing, @bdarnell. I really haven't verified this claim, though. Comments from Reviewable |
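As a purely illustrative sketch of the prototype described above (the actual branch's code is not shown in this thread), "repeatedly queue the replica" might have looked roughly like this, with made-up callback names standing in for the replicate-queue API:

package scatterutil // hypothetical, for illustration only

import (
	"context"
	"time"
)

// QueueUntilBalanced keeps forcing the range through the replicate queue
// until the allocator reports no more work for it, pausing briefly between
// attempts so gossiped store stats and prior moves have a chance to settle.
func QueueUntilBalanced(
	ctx context.Context,
	hasPendingWork func(context.Context) bool, // does the allocator still want to move something?
	forceQueue func(context.Context) error, // force-process the replica in the replicate queue
) error {
	for hasPendingWork(ctx) {
		if err := forceQueue(ctx); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
	return nil
}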
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3899 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Yeah. I thought I saw a path by which it could return pkg/storage/replica_command.go, line 3910 at r1 (raw file):
Yes, if the allocator tells it to. But if we're just queuing the replica and letting the allocator do its thing, it would never do that. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
But isn't that exactly what we do here? We'll only add a replica if the allocator tells us to. The only difference is that we ask the allocator to include throttled stores when making its decision. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Eep! I definitely copied this line from elsewhere (and a quick pkg/storage/replica_command.go, line 3882 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Hah, thanks! I'm going to rewrite this bit with a for loop and pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Oooh, good question. I very quickly got lost while trying to find the answer, but it definitely seems like I'd need to manually query the range descriptor here. FWIW, @danhhz suggested that we don't bother waiting for the downreplication to happen. Thoughts? Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
You're asking the allocator for a rebalance target, which the replication queue would only do if ComputeAction had decided that was the next thing to do. It wouldn't do that if it was above the target replication factor and needed to be downreplicated. The similarity here is exactly why I'm trying to figure out if we can use the replicate queue instead of duplicating part of its logic. But the throttled replica thing makes it tricky; I don't see a clean way to plumb that through, so maybe it's better to just make the Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
If we don't wait for downreplication then we may start the restore while the range still has too many replicas, leading to wasted work as we copy the restored data to 6 replicas instead of 3. Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
👍 I'll switch this to alternate between upreplicating and waiting for downreplication. The one downside I forsee is that we'll never remove our own replica, as we'll be the leaseholder when downreplication occurs. I think we can just transfer the lease first, though—there's no downside in issuing an Comments from Reviewable |
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3910 at r1 (raw file):
You can't call Comments from Reviewable |
Review status: all files reviewed at latest revision, 14 unresolved discussions, some commit checks failed. pkg/storage/replica_command.go, line 3866 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
I just did a quick scan and it looks like all the non-test call sites that ignore the second return value are calling pkg/storage/replica_command.go, line 3973 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
To answer my own question here, the result of Comments from Reviewable |
Ran into some hiccups while attempting to address review feedback; Peter suggested I post them here before going any further.

First, @a-robinson rightly pointed out that r.Desc() won't be updated if the replica is transferred away from the node executing the scatter. Unfortunately, it doesn't seem like there's any way to get the updated range descriptor from the storage package, save for issuing a RangeLookup request. This isn't done anywhere else in the storage package—is it a reasonable thing to do?

Second, @bdarnell and @a-robinson both agreed that adding up to REPL-FACTOR replicas before removing any could be destabilizing, but I don't see any easy alternative. If we attempt to alternate adds and removes, for at least some ranges, the replicate queue is almost certain to transfer our lease away before we've added all the replicas we wanted to. This is another incarnation of scatter really wanting some range-local state.

So I guess I'm asking: should I continue with this attempt or try another approach? It seems to me this approach is only viable if we're a) comfortable making RangeLookup requests from the implementation of scatter, and b) comfortable running with 2*REPL-FACTOR replicas for a short period of time.
I don't see a problem with issuing RangeLookups from the scatter implementation, FWIW.
We do I'm OK with temporarily going up to 2x the target replication as long as we're waiting for the replication factor to get back down to the target before the restore process starts its heavy writes. |
I don't think it's awful to start the Import part of RESTORE while this is all down-replicating. We limit how many requests are in flight at a time, so once everything has finished down-replicating (a few minutes at most?), every Import sent after that will be replicated normally. (I remember Nikhil and I deciding at some point that not waiting for the down-replication was considerably simpler for some reason, but now I can't remember why. @benesch?) |
It's simpler not to wait for downreplication when scatter isn't best-effort—i.e., in a world where one range failing to scatter causes DistSender to give up on the entire request (a minute after observing the first failure, I think?). The downreplication is O(n) in the number of ranges scattered and so the 2tb restore will quickly run out of retries while waiting for downreplication.

We could solve this by either interspersing splits and scatters in restore or drastically increasing the retry backoff.
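For concreteness, "interspersing splits and scatters in restore" could look roughly like the sketch below; splitAt, scatter, and the span representation are hypothetical stand-ins, not the actual sqlccl restore code:

package restoreutil // hypothetical, for illustration only

import "context"

// PresplitAndScatter splits and scatters one span at a time instead of
// splitting everything up front and scattering afterwards, so a slow scatter
// only consumes retries for its own span rather than for the whole restore.
func PresplitAndScatter(
	ctx context.Context,
	spanKeys [][]byte, // split boundaries for the data being restored
	splitAt func(context.Context, []byte) error,
	scatter func(context.Context, []byte) error,
) error {
	for _, key := range spanKeys {
		if err := splitAt(ctx, key); err != nil {
			return err
		}
		if err := scatter(ctx, key); err != nil {
			return err
		}
	}
	return nil
}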
Ha! The problem is 1000% deletion tombstones. Behold the most successful scatter yet courtesy of this diff:
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -617,15 +617,11 @@ func clearRangeData(
defer iter.Close()
const metadataRanges = 2
- for i, keyRange := range makeAllKeyRanges(desc) {
+ for _, keyRange := range makeAllKeyRanges(desc) {
// The metadata ranges have a relatively small number of keys making usage
// of range tombstones (as created by ClearRange) a pessimization.
var err error
- if i < metadataRanges {
- err = batch.ClearIterRange(iter, keyRange.start, keyRange.end)
- } else {
- err = batch.ClearRange(keyRange.start, keyRange.end)
- }
+ err = batch.ClearIterRange(iter, keyRange.start, keyRange.end)
if err != nil {
return err
}
(force-pushed from f293fac to 24ae61e)
Bravo, this is really tremendous work! You stuck with it long after I would have just committed the hacks, but I'd wait for someone on core to give a final ok before merging. Review status: 0 of 12 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/ccl/sqlccl/restore.go, line 775 at r14 (raw file):
I'm happy to leave this as a followup, but are there any outstanding concerns with making a failed scatter fail the restore? (or did these last two insights fix them?) Comments from Reviewable |
I haven't been following all of the iterations of this PR closely, but the clear-range hack . Review status: 0 of 12 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed. pkg/storage/replica_raftstorage.go, line 625 at r14 (raw file):
Let's add some more words here:
Comments from Reviewable |
Reviewed 1 of 3 files at r4, 10 of 10 files at r12, 2 of 2 files at r13, 9 of 9 files at r14. Comments from Reviewable |
Reviewed 10 of 10 files at r12, 2 of 2 files at r13. Comments from Reviewable |
LGTM Reviewed 1 of 10 files at r12, 2 of 2 files at r13, 9 of 9 files at r14. Comments from Reviewable |
Scattering is *much* more reliable when empty snapshots are not limited, and this doesn't seem to have any other adverse effects in my testing. It's also early in the release cycle, so we'll have time to fix any bugs this may introduce.
Having upwards of 4000 RocksDB range tombstones in one SST renders a node useless, as operations that used to take microseconds take dozens of milliseconds. Under most workloads, this situation is rare: reads don't create tombstones, and inserting data will eventually cause a compaction that cleans up these tombstones. When presplitting for a 2TB restore on an otherwise idle cluster, however, up to 12k snapshots may be applied before any data is ingested. This quickly generates more range deletion tombstones than RocksDB can handle. As a quick fix, this commit avoids deletion tombstones when clearing ranges with less than 64 keys. Down the road, we may want to investigate teaching RocksDB to automatically compact any file that exceeds a certain number of range deletion tombstones, but this should do the trick for now.
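A small editorial sketch of the heuristic this commit message describes — point deletes for spans below a threshold, a single range tombstone otherwise. The interface and the helper are invented for illustration; the real change lives in clearRangeData in pkg/storage/replica_raftstorage.go:

package clearutil // hypothetical, for illustration only

// spanClearer abstracts the two deletion strategies in play here:
// individual point deletions or one range deletion tombstone.
type spanClearer interface {
	ClearKey(key []byte) error // point delete
	ClearKeyRange(start, end []byte) error // range deletion tombstone
}

// clearRangeMinKeys mirrors the 64-key threshold from the commit message.
const clearRangeMinKeys = 64

// clearSpan deletes the given keys individually when there are few of them,
// avoiding the buildup of RocksDB range tombstones; larger spans still use a
// single range tombstone, since point-deleting them would be too slow.
func clearSpan(c spanClearer, keys [][]byte, start, end []byte) error {
	if len(keys) < clearRangeMinKeys {
		for _, k := range keys {
			if err := c.ClearKey(k); err != nil {
				return err
			}
		}
		return nil
	}
	return c.ClearKeyRange(start, end)
}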
Replace the existing "toy" implementation of scatter with a real implementation that uses the zone configuration and the allocator's recommendations.
Thanks, all, for some serious advice and reviews! Review status: 0 of 14 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful. pkg/ccl/sqlccl/restore.go, line 775 at r14 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I'm not quite convinced yet that scatter is reliable enough; often it's just one lease that fails to transfer or just one range that fails to upreplicate, and it'd be a shame to fail the whole restore over that. Tossing it in a retry loop should fix it, but since you're happy leaving that to a follow up PR, I'm definitely going to leave that to a followup PR. pkg/storage/replica_raftstorage.go, line 625 at r14 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Comments from Reviewable |
`ALTER TABLE... SCATTER` expects to receive a list of ranges that were scattered. This information was accidentally dropped in the new scatter implementation (dbd90cf, cockroachdb#16249). This commit restores the old behavior, and adds a test to boot. Fixes cockroachdb#17153.
Replace the existing "toy" implementation of scatter with a real implementation that uses the zone configuration and the allocator's recommendations.
This is a ~90% complete PR, but wanted to get it out sooner rather than later now that the necessary changes to the allocator and restore have landed. In short, the approach is to issue add replica and lease transfer commands, then wait for the replicate queue to downreplicate before returning. (This avoids causing underreplication; the old implementation would often race with the replicate queue and remove too many replicas.)
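Roughly, the approach described here — upreplicate to the allocator's suggested targets, transfer the lease, then wait for the replicate queue to downreplicate — could be pictured as the sketch below. The helpers are hypothetical; the real implementation (doScatter in pkg/storage/replica_command.go, excerpted in the review above) differs in its details:

package scatterutil // hypothetical, for illustration only

import "context"

// Scatter adds replicas everywhere the allocator suggests, moves the lease so
// this node's own replica is eligible for removal, and only returns once the
// replicate queue has removed the surplus replicas, avoiding the
// underreplication race of the old implementation.
func Scatter(
	ctx context.Context,
	rebalanceTarget func(context.Context) (store string, ok bool),
	addReplica func(context.Context, string) error,
	transferLease func(context.Context) error,
	waitForDownreplication func(context.Context) error,
) error {
	for {
		store, ok := rebalanceTarget(ctx)
		if !ok {
			break // no more suggested targets
		}
		if err := addReplica(ctx, store); err != nil {
			return err
		}
	}
	if err := transferLease(ctx); err != nil {
		return err
	}
	return waitForDownreplication(ctx)
}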
Here's an example of scattering on a 2TB restore:
I'll try to get an example where one of the nodes has too few replicas per store tomorrow; the example above really only shows leaseholder balancing and has a tooltip in the way of the interesting bit. 🤦♂️