
storage: transfer leases on overfull stores #9465

Conversation

@petermattis (Collaborator) commented Sep 20, 2016

If the remove-target is the lease holder, transfer the lease to another
store to allow removal of the replica from the overfull store.

Fixes #9462.
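In rough outline, the control flow this adds to replicateQueue.process looks like the sketch below. This is a paraphrase of the diff rather than the literal code; pickTransferTarget is a hypothetical placeholder for the target-selection logic discussed in review.

    removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, 0)
    if err != nil {
        return err
    }
    if removeReplica.StoreID == repl.store.StoreID() {
        // We can't remove our own replica while we hold the lease, so hand
        // the lease to another member of the range and let a later pass
        // perform the removal on the new lease holder.
        targetID := pickTransferTarget(desc.Replicas, repl.store.StoreID()) // hypothetical helper
        if targetID == 0 {
            return nil // no suitable target; leave the range alone for now
        }
        log.VEventf(1, ctx, "transferring lease to s%d", targetID)
        if err := repl.AdminTransferLease(targetID); err != nil {
            return errors.Wrapf(err, "%s: unable to transfer lease to s%d", repl, targetID)
        }
        return nil
    }
    // Otherwise, proceed with removing removeReplica as before.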



@petermattis (Collaborator, Author)

@bdarnell I recall you had concerns about using TransferLease in "production", but I don't recall what they were. Gamma is experiencing pretty dramatic replica imbalance right now due to being unable to rebalance away from the lease holder. Local testing shows this PR completely fixes that problem.

Cc @cockroachdb/stability

@tbg (Member) commented Sep 20, 2016

At least #7996 needs fixing.

@tbg (Member) commented Sep 20, 2016

From #6929 (comment):

SGTM for tests and orderly shutdown; I think we'll probably want a full RFC once we're thinking about introducing an RPC for use in live clusters.

@@ -162,17 +165,31 @@ func (rq *replicateQueue) process(
log.Event(ctx, "removing a replica")
// We require the lease in order to process replicas, so
// repl.store.StoreID() corresponds to the lease-holder's store ID.
- removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, repl.store.StoreID())
+ removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, 0)
Contributor:

This is the only non-test user of the second argument to RemoveTarget, so if we do this, we could remove that argument completely. However, as I said in #9462, we need to have a strong preference for rebalancing ranges for which we don't have the lease, or we reintroduce #5737 and the associated availability problems.

petermattis (Collaborator, Author):

I don't think we'll see the problem from #5737 as we'll never remove the lease holder. Instead, if the allocator wants to remove the lease holder we'll first transfer the lease.

Let me think about whether we could make the preference for removing a non-lease holder stronger.

  if removeReplica.StoreID == repl.store.StoreID() {
-     return nil
+     var targetID roachpb.StoreID
+     for _, replica := range desc.Replicas {
Contributor:

Choosing the first non-self replica as the target seems problematic. That store might be overloaded too, and could lead to ping-ponging between the first two stores in the list. We should at least randomize the choice, and ideally choose the most under-loaded one.

petermattis (Collaborator, Author):

Yeah, this was too naive. Finding the least loaded store from the existing replicas (excluding the lease holder) is straightforward, though I'm unable to upload the change right now.
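For illustration, a minimal sketch of "pick the least loaded existing replica", using simplified inputs; in the real code the per-store range counts would come from the allocator's store pool, so the rangeCount lookup function here is an assumption:

    func leastLoadedTarget(
        existing []roachpb.ReplicaDescriptor,
        leaseStoreID roachpb.StoreID,
        rangeCount func(roachpb.StoreID) (int32, bool), // assumed store-pool lookup
    ) roachpb.StoreID {
        var best roachpb.StoreID
        bestRangeCount := int32(math.MaxInt32)
        for _, repl := range existing {
            if repl.StoreID == leaseStoreID {
                continue // never pick the current lease holder
            }
            count, ok := rangeCount(repl.StoreID)
            if !ok {
                continue // no descriptor available for this store; skip it
            }
            if count < bestRangeCount {
                bestRangeCount = count
                best = repl.StoreID
            }
        }
        return best // zero if no suitable target was found
    }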

@tamird (Contributor) commented Sep 20, 2016

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/replicate_queue.go, line 181 at r1 (raw file):

          }
          if targetID != 0 {
              log.VEventf(1, ctx, "transferring lease to s%d", removeReplica, targetID)

incorrect number of arguments


storage/replicate_queue.go, line 183 at r1 (raw file):

              log.VEventf(1, ctx, "transferring lease to s%d", removeReplica, targetID)
              if err := rq.db.AdminTransferLease(ctx, desc.StartKey.AsRawKey(), targetID); err != nil {
                  return errors.Wrapf(err, "%s: unable to transfer lease")

incorrect number of arguments



@tamird (Contributor) commented Sep 20, 2016

You should rebase this to pick up #9442.

@petermattis force-pushed the pmattis/rebalance-transfer-lease branch 2 times, most recently from 08b852e to 5973a12 on September 20, 2016 17:53
@petermattis (Collaborator, Author)

Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't think we'll see the problem from #5737 as we'll never remove the lease holder. Instead, if the allocator wants to remove the lease holder we'll first transfer the lease.

Let me think about whether we could make the preference for removing a non-lease holder stronger.

I've changed the heuristic so that we only transfer leases from overfull nodes. This might still be too aggressive. Let me know if you have a better idea.

storage/replicate_queue.go, line 181 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

incorrect number of arguments

Done.

storage/replicate_queue.go, line 183 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

incorrect number of arguments

Done.


@tamird (Contributor) commented Sep 20, 2016

Reviewed 5 of 5 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


storage/allocator.go, line 327 at r2 (raw file):

  leaseStoreID roachpb.StoreID,
) roachpb.ReplicaDescriptor {
  bestIdx := -1

what's the point of tracking bestIdx rather than bestReplica?


storage/client_split_test.go, line 700 at r2 (raw file):

  descID := uint32(keys.MaxReservedDescID + 1)
  config.TestingSetZoneConfig(descID, config.ZoneConfig{
      ReplicaAttrs:  []roachpb.Attributes{{}},

how come these were needed?


storage/replicate_queue.go, line 177 at r2 (raw file):

          if target.StoreID != 0 {
              log.VEventf(1, ctx, "transferring lease to s%d", target.StoreID)
              if err := repl.AdminTransferLease(target.StoreID); err != nil {

extremely minor nit: you can blindly return errors.Wrapf here.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 5973a12 to 0431f13 on September 20, 2016 20:26
@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


storage/allocator.go, line 327 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what's the point of tracking bestIdx rather than bestReplica?

Just the way I coded it. Changed.

storage/client_split_test.go, line 700 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how come these were needed?

The tests were creating zone configs without any replicas, which was causing the replicas to be removed. That now fails when we try to index `zone.ReplicaAttrs[0]`. Such zone configs are not possible in production due to checks in `config`.

storage/replicate_queue.go, line 177 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extremely minor nit: you can blindly return errors.Wrapf here.

True, though that relies on the subtle behavior that `errors.Wrapf(nil,...)` returns `nil`. I'd prefer to leave as-is.
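For reference, the behavior in question, assuming the errors package here is github.com/pkg/errors:

    package main

    import (
        "fmt"

        "github.com/pkg/errors"
    )

    func main() {
        var err error // nil
        wrapped := errors.Wrapf(err, "unable to transfer lease")
        fmt.Println(wrapped == nil) // prints true: Wrapf(nil, ...) returns nil
    }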


@tamird (Contributor) commented Sep 20, 2016

Reviewed 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@bdarnell (Contributor)

Between the production use of TransferLease and the possibility of new unavailability and thrashing, this should probably be post-code-yellow.


Reviewed 3 of 5 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/allocator.go, line 330 at r3 (raw file):

  bestRangeCount := int32(math.MaxInt32)
  for _, repl := range existing {
      if leaseStoreID == repl.StoreID {

A store could be overfull compared to the list used in TransferLeaseSource and still be underfull compared to the replicas of this particular range. If leaseStoreID turns out to be the least loaded of the replicas, we should abort the transfer instead of transferring the lease to a more-loaded replica.


storage/allocator_test.go, line 751 at r3 (raw file):

      expectedTarget roachpb.StoreID
  }{
      // No existing lease holder, prefer lease loaded replica.

s/lease loaded/least loaded/ (and below)


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I've changed the heuristic so that we only transfer leases from overfull nodes. This might still be too aggressive. Let me know if you have a better idea.

It won't be as bad as #5737 because we don't have to wait for the lease expiration, but transferring a lease is still an availability blip. If we get to this point (`case AllocatorRemove`), aren't we pretty much guaranteed to be overfull? The new check doesn't seem to add much.

I think we need to consider transferring leases only after we have exhausted our non-lease-holder replicas, although that's difficult with the way the replicate queue iterates over replicas.

We could also consider "number of leases" as a store capacity metric that we also try to balance, so a node with more leases than the mean would transfer its leases away independently of transferring its ranges. This would also be a proxy for load-based balancing. I'm not sure how easy it would be to give the allocator two metrics to optimize (without thrashing), though.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 0431f13 to 64e17b7 on September 21, 2016 18:27
@petermattis (Collaborator, Author)

I guess I'm feeling more aggressive about getting this in. Note that the recent restart of gamma more than doubled throughput once we were using more than just a single node to process traffic. @tschottdorf and @andreimatei have a bit of work to do to productionize lease transfer. I'm not going to merge this PR until that is done.


Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/allocator.go, line 330 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A store could be overfull compared to the list used in TransferLeaseSource and still be underfull compared to the replicas of this particular range. If leaseStoreID turns out to be the least loaded of the replicas, we should abort the transfer instead of transferring the lease to a more-loaded replica.

If `leaseStoreID` is the least loaded of the replicas then `RemoveTarget` wouldn't have returned it. Am I missing something?

storage/allocator_test.go, line 751 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/lease loaded/least loaded/ (and below)

Done.

storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It won't be as bad as #5737 because we don't have to wait for the lease expiration, but transferring a lease is still an availability blip. If we get to this point (case AllocatorRemove), aren't we pretty much guaranteed to be overfull? The new check doesn't seem to add much.

I think we need to consider transferring leases only after we have exhausted our non-lease-holder replicas, although that's difficult with the way the replicate queue iterates over replicas.

We could also consider "number of leases" as a store capacity metric that we also try to balance, so a node with more leases than the mean would transfer its leases away independently of transferring its ranges. This would also be a proxy for load-based balancing. I'm not sure how easy it would be to give the allocator two metrics to optimize (without thrashing), though.

We can get to `AllocatorRemove` without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in `AllocatorNoop`) and then in the next iteration we'll hit `AllocatorRemove`.

I suppose I could look at the lease-holder vs. total replica counts and only transfer leases if the number of leases we hold puts us "overfull". Let me investigate what would be required to do that.
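A rough sketch of that check, under the assumption that per-store lease counts are available (e.g. via gossiped store stats); the slack factor is made up and would need tuning:

    // shouldTransferLease returns true if the local store holds noticeably
    // more leases than the mean across the given stores.
    func shouldTransferLease(localLeases int, leasesPerStore []int, slack float64) bool {
        if len(leasesPerStore) == 0 {
            return false
        }
        total := 0
        for _, n := range leasesPerStore {
            total += n
        }
        mean := float64(total) / float64(len(leasesPerStore))
        // Require a comfortable margin above the mean to avoid thrashing
        // between gossip updates.
        return float64(localLeases) > mean*(1+slack)
    }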



@tamird (Contributor) commented Sep 21, 2016

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.



@bdarnell (Contributor)

I think I had an incomplete understanding of the goal here - I thought it was to unblock the rebalance operations to get replicas more evenly distributed, but we didn't really care about balancing leadership except as necessary to unblock the rebalancing. If we want to balance leadership (and see it improving performance), then it does seem more important, although I no longer think the replicate queue is the right place to do this (unless we go all the way to giving the replicate queue multiple target metrics to optimize).

Instead, we could make balancer-like decisions in redirectOnOrAcquireLeaderLease: if there is no lease holder, we are above the mean (in terms of leases held), and one of our peers is below it, we could send a NotLeaseHolderError with a "hint" pointing to the store we want to acquire the lease, instead of acquiring it ourselves. That wouldn't even need to rely on the transfer mechanism. We'll need to gossip the number of leases held, and I'm a little worried about thrashing (though that can probably be mitigated by allowing a wide margin on either side of the mean), but I think it's better to balance leases more directly than to tie the work to the replication queue, which is primarily concerned with equalizing storage.
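A hypothetical sketch of that decision; none of these names exist in the code, and the lease-count inputs are assumed to come from gossip as described:

    // maybeRedirectLeaseAcquisition decides whether to decline a lease
    // acquisition and instead hint at an underfull peer. The caller would
    // return a NotLeaseHolderError carrying the returned replica as a hint.
    func maybeRedirectLeaseAcquisition(
        localLeases, meanLeases, margin int,
        underfullPeer *roachpb.ReplicaDescriptor, // nil if no peer is below the mean
    ) *roachpb.ReplicaDescriptor {
        if underfullPeer != nil && localLeases > meanLeases+margin {
            return underfullPeer // let this peer acquire the lease instead
        }
        return nil // acquire the lease ourselves
    }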


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/allocator.go, line 330 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

If leaseStoreID is the least loaded of the replicas then RemoveTarget wouldn't have returned it. Am I missing something?

No, I am. I keep forgetting that all rebalancing decisions are made in the context of a range; the mental model I keep reverting to is that we first decide whether this store is overfull compared to the cluster as a whole and then choose a range to move, instead of initially restricting our considerations to members of a range.

storage/replicate_queue.go, line 168 at r1 (raw file):

We can get to AllocatorRemove without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in AllocatorNoop) and then in the next iteration we'll hit AllocatorRemove.

OK. Please add this as a comment.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 64e17b7 to 511c106 on September 22, 2016 15:50
@petermattis (Collaborator, Author)

Hmm, let me take a deeper look at your suggestion. This PR started with the goal of unblocking rebalancing. The effect on performance was not considered (though it could have been predicted).


Review status: 4 of 6 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We can get to AllocatorRemove without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in AllocatorNoop) and then in the next iteration we'll hit AllocatorRemove.

OK. Please add this as a comment.

Done.


@petermattis (Collaborator, Author)

Bouncing lease renewals in redirectOnOrAcquireLeaderLease to balance leases across the cluster seems to require a mechanism similar to TransferLease if we want to avoid the lease stasis period. I'm not that familiar with the intricacies of this code, though, so perhaps I'm missing some way this could be done simply. Looking at it right now, it seems quite a bit more complex than transferring leases during rebalancing. Let me chat with @tschottdorf.


Review status: 4 of 6 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 511c106 to 008d728 on September 22, 2016 21:18
@bdarnell (Contributor)

If we're acquiring a fresh lease in redirectOnOrAcquireLeaderLease, the lease stasis period has already passed (remember that the stasis period is before the expiration). If we're renewing the lease, then we hold the current lease and can propose the TransferLease command before returning our NotLeaseHolderError.

@petermattis (Collaborator, Author)

Ok, the actual mechanism of transferring leases in redirectOnOrAcquireLeaderLease looks reasonable, but the policy of when to do so appears to be challenging. Gossiping the number of leases held by a store in StoreCapacity is straightforward, but how do we make use of that info in redirectOnOrAcquireLeaderLease? If we're trying to balance lease holders throughout the cluster, we have a feedback problem: redirectOnOrAcquireLeaderLease can rebalance lease holders much faster than the periodic gossip of StoreCapacity can throttle the rebalancing. Note that this same problem occurs if we try to do lease holder rebalancing in replicateQueue.

A smaller practical problem with adding lease holder rebalancing to redirectOnOrAcquireLeaderLease is that doing so seems to require the zone config, which we currently don't have access to at that point in the code.

Lastly, we seem to have some sort of performance hiccup when transferring leases. With the current state of this PR I sometimes see 5-6 second query times that correspond with a lease being transferred. These are very reliably reproduced in a 2-3 minute block_writer run. I haven't had a chance to dig into what is happening yet.

@andreimatei (Contributor)

I'd like to add my 2c here: it seems to me that, if the replicate_queue decides that it wants to rebalance a range to optimize whatever metric it's trying to optimize, it should always have the ability to do so. Relying on redirectOnOrAcquire to have a side effect that improves the replicate_queue's ability to keep the cluster balanced seems awkward to me.

Also, dist-sql is hoping to be able to control who becomes the lease holder for ranges that don't have an active lease by just sending requests to them. If redirectOnOrAcquire on such nodes starts refusing to acquire a lease, dist-sql would not get what it wants. Although, if we go forward with @spencerkimball's proposal for non-expiring leases, then I guess dist-sql loses this ability to affect leases and we need another mechanism for making leases follow traffic anyway...

@bdarnell (Contributor)

@petermattis I think for balancing lease ownership (whether we do it when acquiring the lease or in the rebalance queue) we just have to allow a large enough deviation from the mean that we won't thrash too much in the interval between store capacity gossips. We could also use some very crude heuristics to avoid the "unable to rebalance anything at all" case: refuse to acquire or renew any more leases when we hold the lease on 80% of our replicas. This doesn't require any cross-node communication.
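The second heuristic is purely local; as a sketch (with the 0.8 threshold being the figure mentioned above, and the inputs assumed to be tracked per store):

    // tooManyLocalLeases reports whether this store already holds the lease
    // for more than ~80% of its replicas, in which case it would refuse to
    // acquire or renew additional leases.
    func tooManyLocalLeases(leasesHeld, replicaCount int) bool {
        if replicaCount == 0 {
            return false
        }
        return float64(leasesHeld) > 0.8*float64(replicaCount)
    }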

@andreimatei I don't see how distsql would ever be able to completely control lease placement - what if two distsql transactions have different plans? Please leave final decisions about replica and lease placement to the lower levels and limit distsql's involvement to providing hints when the lease is idle.

@spencerkimball (Member)

Using redirectOnOrAcquireLeaderLease for this purpose right now would complicate things and then need to be rewritten anyway, as the changes necessary for moving to an epoch-based range leader lease are nearly complete and are showing up as a more pressing performance problem with large clusters.

@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 008d728 to 0356ff3 on September 26, 2016 13:31
@tbg mentioned this pull request Sep 26, 2016
@andreimatei (Contributor)

@petermattis you said you sometimes see a performance hiccup after a LeaseTransfer. Maybe it's #8816

@petermattis (Collaborator, Author)

Yes, perhaps. My hiccups occur right after transferring a lease. We seem to be reproposing Raft commands but not making progress because there is no Raft leader. It's not clear how we got into that state. The one instance I've looked at so far shows that we transferred the lease for the RHS of a split immediately after the split. I'm wondering if the campaigning after a split is interacting with the Raft leadership transfer and a lease transfer. I'm adding some more debug logging and should know more soon.

@andreimatei (Contributor)

Note that if #8816 is indeed the problem, I have a PR out to improve the state of things. It was waiting for the code to decolor.

@petermattis (Collaborator, Author)

@andreimatei where is this PR you speak of?

@andreimatei (Contributor)

#8837, linked from the bug.

petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 26, 2016
Remove the hack to campaign the RHS on one and only one of the
nodes. The hack is no longer necessary now that we're campaigning idle
ranges when the first proposal is received. By removing the hack, we're
effectively leaving the RHS of a split as a lazily loaded Raft
group. Tweaked multiTestContext so that it allows eager campaigning of
idle replicas immediately upon startup.

Discovered while investigating performance hiccups in cockroachdb#9465.
petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 27, 2016
Remove the hack to campaign the RHS on one and only one of the
nodes. The hack is no longer necessary now that we're campaigning idle
ranges when the first proposal is received. By removing the hack, we're
effectively leaving the RHS of a split as a lazily loaded Raft
group. Tweaked Store.Bootstrap so that it allows eager campaigning of
idle replicas immediately upon startup.

Discovered while investigating performance hiccups in cockroachdb#9465.
@tbg (Member) commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

@tbg added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting) on Oct 4, 2016
If the remove-target is the lease holder, transfer the lease to another
store to allow removal of the replica from the overfull store.

Fixes cockroachdb#9462.
@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 9b0203d to 813bace on October 25, 2016 00:32
@petermattis (Collaborator, Author)

Closing this PR. I'll be pulling out pieces into separate PRs and more closely following the approach outlined in #10262.

@petermattis closed this Nov 3, 2016
@petermattis deleted the pmattis/rebalance-transfer-lease branch on November 28, 2016 17:04