
storage: transfer leases on overfull stores #9465

Conversation

@petermattis (Collaborator) commented Sep 20, 2016

If the remove-target is the lease holder, transfer the lease to another
store to allow removal of the replica from the overfull store.

Fixes #9462.
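In rough outline, the control flow this adds to replicateQueue.process looks like the sketch below. This is a paraphrase of the diff rather than the literal code; pickTransferTarget is a hypothetical placeholder for the target-selection logic discussed in review.

    removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, 0)
    if err != nil {
        return err
    }
    if removeReplica.StoreID == repl.store.StoreID() {
        // We can't remove our own replica while we hold the lease, so hand
        // the lease to another member of the range and let a later pass
        // perform the removal on the new lease holder.
        targetID := pickTransferTarget(desc.Replicas, repl.store.StoreID()) // hypothetical helper
        if targetID == 0 {
            return nil // no suitable target; leave the range alone for now
        }
        log.VEventf(1, ctx, "transferring lease to s%d", targetID)
        if err := repl.AdminTransferLease(targetID); err != nil {
            return errors.Wrapf(err, "%s: unable to transfer lease to s%d", repl, targetID)
        }
        return nil
    }
    // Otherwise, proceed with removing removeReplica as before.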



@petermattis (Collaborator, Author)

@bdarnell I recall you had concerns about using TransferLease in "production", but I don't recall what they were. Gamma is experiencing pretty dramatic replica imbalance right now due to being unable to rebalance away from the lease holder. Local testing shows this PR completely fixes that problem.

Cc @cockroachdb/stability

@tbg (Member) commented Sep 20, 2016

At least #7996 needs fixing.

@tbg (Member) commented Sep 20, 2016

From #6929 (comment):

SGTM for tests and orderly shutdown; I think we'll probably want a full RFC once we're thinking about introducing an RPC for use in live clusters.

@@ -162,17 +165,31 @@ func (rq *replicateQueue) process(
log.Event(ctx, "removing a replica")
// We require the lease in order to process replicas, so
// repl.store.StoreID() corresponds to the lease-holder's store ID.
- removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, repl.store.StoreID())
+ removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, 0)
Contributor:

This is the only non-test user of the second argument to RemoveTarget, so if we do this, we could remove that argument completely. However, as I said in #9462, we need to have a strong preference for rebalancing ranges for which we don't have the lease, or we reintroduce #5737 and the associated availability problems.

petermattis (Collaborator, Author):

I don't think we'll see the problem from #5737 as we'll never remove the lease holder. Instead, if the allocator wants to remove the lease holder we'll first transfer the lease.

Let me think about whether we could make the preference for removing a non-lease holder stronger.

  if removeReplica.StoreID == repl.store.StoreID() {
-     return nil
+     var targetID roachpb.StoreID
+     for _, replica := range desc.Replicas {
Contributor:

Choosing the first non-self replica as the target seems problematic. That store might be overloaded too, and could lead to ping-ponging between the first two stores in the list. We should at least randomize the choice, and ideally choose the most under-loaded one.

petermattis (Collaborator, Author):

Yeah, this was too naive. Finding the least loaded store from the existing replicas (excluding the lease holder) is straightforward, though I'm unable to upload the change right now.
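For illustration, a minimal sketch of "pick the least loaded existing replica", using simplified inputs; in the real code the per-store range counts would come from the allocator's store pool, so the rangeCount lookup function here is an assumption:

    func leastLoadedTarget(
        existing []roachpb.ReplicaDescriptor,
        leaseStoreID roachpb.StoreID,
        rangeCount func(roachpb.StoreID) (int32, bool), // assumed store-pool lookup
    ) roachpb.StoreID {
        var best roachpb.StoreID
        bestRangeCount := int32(math.MaxInt32)
        for _, repl := range existing {
            if repl.StoreID == leaseStoreID {
                continue // never pick the current lease holder
            }
            count, ok := rangeCount(repl.StoreID)
            if !ok {
                continue // no descriptor available for this store; skip it
            }
            if count < bestRangeCount {
                bestRangeCount = count
                best = repl.StoreID
            }
        }
        return best // zero if no suitable target was found
    }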

@tamird (Contributor) commented Sep 20, 2016

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/replicate_queue.go, line 181 at r1 (raw file):

          }
          if targetID != 0 {
              log.VEventf(1, ctx, "transferring lease to s%d", removeReplica, targetID)

incorrect number of arguments


storage/replicate_queue.go, line 183 at r1 (raw file):

              log.VEventf(1, ctx, "transferring lease to s%d", removeReplica, targetID)
              if err := rq.db.AdminTransferLease(ctx, desc.StartKey.AsRawKey(), targetID); err != nil {
                  return errors.Wrapf(err, "%s: unable to transfer lease")

incorrect number of arguments



@tamird (Contributor) commented Sep 20, 2016

You should rebase this to pick up #9442.

@petermattis force-pushed the pmattis/rebalance-transfer-lease branch 2 times, most recently from 08b852e to 5973a12 on September 20, 2016 17:53
@petermattis (Collaborator, Author)

Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't think we'll see the problem from #5737 as we'll never remove the lease holder. Instead, if the allocator wants to remove the lease holder we'll first transfer the lease.

Let me think about whether we could make the preference for removing a non-lease holder stronger.

I've changed the heuristic so that we only transfer leases from overfull nodes. This might still be too aggressive. Let me know if you have a better idea.

storage/replicate_queue.go, line 181 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

incorrect number of arguments

Done.

storage/replicate_queue.go, line 183 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

incorrect number of arguments

Done.


@tamird (Contributor) commented Sep 20, 2016

Reviewed 5 of 5 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


storage/allocator.go, line 327 at r2 (raw file):

  leaseStoreID roachpb.StoreID,
) roachpb.ReplicaDescriptor {
  bestIdx := -1

what's the point of tracking bestIdx rather than bestReplica?


storage/client_split_test.go, line 700 at r2 (raw file):

  descID := uint32(keys.MaxReservedDescID + 1)
  config.TestingSetZoneConfig(descID, config.ZoneConfig{
      ReplicaAttrs:  []roachpb.Attributes{{}},

how come these were needed?


storage/replicate_queue.go, line 177 at r2 (raw file):

          if target.StoreID != 0 {
              log.VEventf(1, ctx, "transferring lease to s%d", target.StoreID)
              if err := repl.AdminTransferLease(target.StoreID); err != nil {

extremely minor nit: you can blindly return errors.Wrapf here.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 5973a12 to 0431f13 on September 20, 2016 20:26
@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


storage/allocator.go, line 327 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what's the point of tracking bestIdx rather than bestReplica?

Just the way I coded it. Changed.

storage/client_split_test.go, line 700 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how come these were needed?

The tests were creating zone configs without any replicas, which was causing the replicas to be removed. That now fails when we try to index `zone.ReplicaAttrs[0]`. Such zone configs are not possible in production due to checks in `config`.

storage/replicate_queue.go, line 177 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extremely minor nit: you can blindly return errors.Wrapf here.

True, though that relies on the subtle behavior that `errors.Wrapf(nil,...)` returns `nil`. I'd prefer to leave as-is.
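For reference, the behavior in question, assuming the errors package here is github.com/pkg/errors:

    package main

    import (
        "fmt"

        "github.com/pkg/errors"
    )

    func main() {
        var err error // nil
        wrapped := errors.Wrapf(err, "unable to transfer lease")
        fmt.Println(wrapped == nil) // prints true: Wrapf(nil, ...) returns nil
    }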


@tamird (Contributor) commented Sep 20, 2016

Reviewed 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@bdarnell (Contributor)

Between the production use of TransferLease and the possibility of new unavailability and thrashing, this should probably be post-code-yellow.


Reviewed 3 of 5 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/allocator.go, line 330 at r3 (raw file):

  bestRangeCount := int32(math.MaxInt32)
  for _, repl := range existing {
      if leaseStoreID == repl.StoreID {

A store could be overfull compared to the list used in TransferLeaseSource and still be underfull compared to the replicas of this particular range. If leaseStoreID turns out to be the least loaded of the replicas, we should abort the transfer instead of transferring the lease to a more-loaded replica.


storage/allocator_test.go, line 751 at r3 (raw file):

      expectedTarget roachpb.StoreID
  }{
      // No existing lease holder, prefer lease loaded replica.

s/lease loaded/least loaded/ (and below)


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I've changed the heuristic so that we only transfer leases from overfull nodes. This might still be too aggressive. Let me know if you have a better idea.

It won't be as bad as #5737 because we don't have to wait for the lease expiration, but transferring a lease is still an availability blip. If we get to this point (`case AllocatorRemove`), aren't we pretty much guaranteed to be overfull? The new check doesn't seem to add much.

I think we need to consider transferring leases only after we have exhausted our non-lease-holder replicas, although that's difficult with the way the replicate queue iterates over replicas.

We could also consider "number of leases" as a store capacity metric that we also try to balance, so a node with more leases than the mean would transfer its leases away independently of transferring its ranges. This would also be a proxy for load-based balancing. I'm not sure how easy it would be to give the allocator two metrics to optimize (without thrashing), though.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 0431f13 to 64e17b7 on September 21, 2016 18:27
@petermattis (Collaborator, Author)

I guess I'm feeling more aggressive about getting this in. Note that the recent restart of gamma more than doubled throughput once we were using more than just a single node to process traffic. @tschottdorf and @andreimatei have a bit of work to do to productionize lease transfer. I'm not going to merge this PR until that is done.


Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/allocator.go, line 330 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A store could be overfull compared to the list used in TransferLeaseSource and still be underfull compared to the replicas of this particular range. If leaseStoreID turns out to be the least loaded of the replicas, we should abort the transfer instead of transferring the lease to a more-loaded replica.

If `leaseStoreID` is the least loaded of the replicas then `RemoveTarget` wouldn't have returned it. Am I missing something?

storage/allocator_test.go, line 751 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/lease loaded/least loaded/ (and below)

Done.

storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It won't be as bad as #5737 because we don't have to wait for the lease expiration, but transferring a lease is still an availability blip. If we get to this point (case AllocatorRemove), aren't we pretty much guaranteed to be overfull? The new check doesn't seem to add much.

I think we need to consider transferring leases only after we have exhausted our non-lease-holder replicas, although that's difficult with the way the replicate queue iterates over replicas.

We could also consider "number of leases" as a store capacity metric that we also try to balance, so a node with more leases than the mean would transfer its leases away independently of transferring its ranges. This would also be a proxy for load-based balancing. I'm not sure how easy it would be to give the allocator two metrics to optimize (without thrashing), though.

We can get to `AllocatorRemove` without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in `AllocatorNoop`) and then in the next iteration we'll hit `AllocatorRemove`.

I suppose I could look at the lease-holder vs. total replica counts and only transfer leases if the number of leases we hold puts us "overfull". Let me investigate what would be required to do that.
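A rough sketch of that check, under the assumption that per-store lease counts are available (e.g. via gossiped store stats); the slack factor is made up and would need tuning:

    // shouldTransferLease returns true if the local store holds noticeably
    // more leases than the mean across the given stores.
    func shouldTransferLease(localLeases int, leasesPerStore []int, slack float64) bool {
        if len(leasesPerStore) == 0 {
            return false
        }
        total := 0
        for _, n := range leasesPerStore {
            total += n
        }
        mean := float64(total) / float64(len(leasesPerStore))
        // Require a comfortable margin above the mean to avoid thrashing
        // between gossip updates.
        return float64(localLeases) > mean*(1+slack)
    }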



@tamird (Contributor) commented Sep 21, 2016

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.



@bdarnell (Contributor)

I think I had an incomplete understanding of the goal here - I thought it was to unblock the rebalance operations to get replicas more evenly distributed, but we didn't really care about balancing leadership except as necessary to unblock the rebalancing. If we want to balance leadership (and see it improving performance), then it does seem more important, although I no longer think the replicate queue is the right place to do this (unless we go all the way to giving the replicate queue multiple target metrics to optimize).

Instead, we could make balancer-like decisions in redirectOnOrAcquireLeaderLease: if there is no lease holder, we are above the mean (in terms of leases held), and one of our peers is below it, we could send a NotLeaseHolderError with a "hint" pointing to the store we want to acquire the lease, instead of acquiring it ourselves. That wouldn't even need to rely on the transfer mechanism. We'll need to gossip the number of leases held, and I'm a little worried about thrashing (though that can probably be mitigated by allowing a wide margin on either side of the mean), but I think it's better to balance leases more directly than to tie the work to the replication queue, which is primarily concerned with equalizing storage.
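A hypothetical sketch of that decision; none of these names exist in the code, and the lease-count inputs are assumed to come from gossip as described:

    // maybeRedirectLeaseAcquisition decides whether to decline a lease
    // acquisition and instead hint at an underfull peer. The caller would
    // return a NotLeaseHolderError carrying the returned replica as a hint.
    func maybeRedirectLeaseAcquisition(
        localLeases, meanLeases, margin int,
        underfullPeer *roachpb.ReplicaDescriptor, // nil if no peer is below the mean
    ) *roachpb.ReplicaDescriptor {
        if underfullPeer != nil && localLeases > meanLeases+margin {
            return underfullPeer // let this peer acquire the lease instead
        }
        return nil // acquire the lease ourselves
    }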


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/allocator.go, line 330 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

If leaseStoreID is the least loaded of the replicas then RemoveTarget wouldn't have returned it. Am I missing something?

No, I am. I keep forgetting that all rebalancing decisions are made in the context of a range; the mental model I keep reverting to is that we first decide whether this store is overfull compared to the cluster as a whole and then choose a range to move, instead of initially restricting our considerations to members of a range.

storage/replicate_queue.go, line 168 at r1 (raw file):

We can get to AllocatorRemove without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in AllocatorNoop) and then in the next iteration we'll hit AllocatorRemove.

OK. Please add this as a comment.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 64e17b7 to 511c106 on September 22, 2016 15:50
@petermattis (Collaborator, Author)

Hmm, let me take a deeper look at your suggestion. This PR started with the goal of unblocking rebalancing. The effect on performance was not considered (though it could have been predicted).


Review status: 4 of 6 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/replicate_queue.go, line 168 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We can get to AllocatorRemove without being overfull. If one of the other replicas is on an overfull store we'll add a new replica (in AllocatorNoop) and then in the next iteration we'll hit AllocatorRemove.

OK. Please add this as a comment.

Done.


@petermattis (Collaborator, Author)

Bouncing lease renewals in redirectOnOrAcquireLeaderLease to balance leases across the cluster seems to require a mechanism similar to TransferLease if we want to avoid the lease stasis period. I'm not that familiar with the intricacies of this code, though, so perhaps I'm missing some way this could be done simply. Looking at it right now, it seems quite a bit more complex than transferring leases during rebalancing. Let me chat with @tschottdorf.


Review status: 4 of 6 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.



@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 511c106 to 008d728 on September 22, 2016 21:18
@bdarnell (Contributor)

If we're acquiring a fresh lease in redirectOnOrAcquireLeaderLease, the lease stasis period has already passed (remember that the stasis period is before the expiration). If we're renewing the lease, then we hold the current lease and can propose the TransferLease command before returning our NotLeaseHolderError.

@petermattis (Collaborator, Author)

Ok, the actual mechanism of transferring leases in redirectOnOrAcquireLeaderLease looks reasonable, but the policy of when to do so appears to be challenging. Gossiping the number of leases held by a store in StoreCapacity is straightforward, but how do we make use of that info in redirectOnOrAcquireLeaderLease? If we're trying to balance lease holders throughout the cluster, we have a feedback problem: redirectOnOrAcquireLeaderLease can rebalance lease holders much faster than the periodic gossip of StoreCapacity can throttle the rebalancing. Note that this same problem occurs if we try to do lease holder rebalancing in replicateQueue.

A smaller practical problem with adding lease holder rebalancing to redirectOnOrAcquireLeaderLease is that doing so seems to require the zone config, which we currently don't have access to at that point in the code.

Lastly, we seem to have some sort of performance hiccup when transferring leases. With the current state of this PR I sometimes see 5-6 second query times that correspond with a lease being transferred. These are very reliably reproduced in a 2-3 minute block_writer run. I haven't had a chance to dig into what is happening yet.

@andreimatei (Contributor)

I'd like to add my 2c here: it seems to me that, if the replicate_queue decides that it wants to rebalance a range to optimize whatever metric it's trying to optimize, it should always have the ability to do so. Relying on redirectOnOrAcquire to have a side effect that improves the replicate_queue's ability to keep the cluster balanced seems awkward to me.

Also, dist-sql is hoping to be able to control who becomes the lease holder for ranges that don't have an active lease by just sending requests to them. If redirectOnOrAcquire on such nodes starts refusing to acquire a lease, dist-sql would not get what it wants. Although, if we go forward with @spencerkimball's proposal for non-expiring leases, then I guess dist-sql loses this ability to affect leases and we need another mechanism for making leases follow traffic anyway...

@bdarnell (Contributor)

@petermattis I think for balancing lease ownership (whether we do it when acquiring the lease or in the rebalance queue) we just have to allow a large enough deviation from the mean that we won't thrash too much in the interval between store capacity gossips. We could also use some very crude heuristics to avoid the "unable to rebalance anything at all" case: refuse to acquire or renew any more leases when we hold the lease on 80% of our replicas. This doesn't require any cross-node communication.
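The second heuristic is purely local; as a sketch (with the 0.8 threshold being the figure mentioned above, and the inputs assumed to be tracked per store):

    // tooManyLocalLeases reports whether this store already holds the lease
    // for more than ~80% of its replicas, in which case it would refuse to
    // acquire or renew additional leases.
    func tooManyLocalLeases(leasesHeld, replicaCount int) bool {
        if replicaCount == 0 {
            return false
        }
        return float64(leasesHeld) > 0.8*float64(replicaCount)
    }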

@andreimatei I don't see how distsql would ever be able to completely control lease placement - what if two distsql transactions have different plans? Please leave final decisions about replica and lease placement to the lower levels and limit distsql's involvement to providing hints when the lease is idle.

@spencerkimball (Member)

Using redirectOnOrAcquireLeaderLease for this purpose right now would complicate things and then need to be rewritten anyway, as the changes necessary for moving to an epoch-based range leader lease are nearly complete and are showing up as a more pressing performance problem with large clusters.

@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 008d728 to 0356ff3 on September 26, 2016 13:31
@tbg mentioned this pull request Sep 26, 2016
@andreimatei (Contributor)

@petermattis you said you sometimes see a performance hiccup after a LeaseTransfer. Maybe it's #8816

@petermattis (Collaborator, Author)

Yes, perhaps. My hiccups occur right after transferring a lease. We seem to be reproposing Raft commands but not making progress because there is no Raft leader. It's not clear how we got into that state. The one instance I've looked at so far shows that we transferred the lease for the RHS of a split immediately after the split. I'm wondering if the campaigning after a split is interacting with the Raft leadership transfer and a lease transfer. I'm adding some more debug logging and should know more soon.

@andreimatei (Contributor)

Note that if #8816 is indeed the problem, I have a PR out to improve the state of things. It was waiting for the code to decolor.

@petermattis (Collaborator, Author)

@andreimatei where is this PR you speak of?

@andreimatei (Contributor)

#8837, linked from the bug.

petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 26, 2016
Remove the hack to campaign the RHS on one and only one of the
nodes. The hack is no longer necessary now that we're campaigning idle
ranges when the first proposal is received. By removing the hack, we're
effectively leaving the RHS of a split as a lazily loaded Raft
group. Tweaked multiTestContext so that it allows eager campaigning of
idle replicas immediately upon startup.

Discovered while investigating performance hiccups in cockroachdb#9465.
petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 27, 2016
Remove the hack to campaign the RHS on one and only one of the
nodes. The hack is no longer necessary now that we're campaigning idle
ranges when the first proposal is received. By removing the hack, we're
effectively leaving the RHS of a split as a lazily loaded Raft
group. Tweaked Store.Bootstrap so that it allows eager campaigning of
idle replicas immediately upon startup.

Discovered while investigating performance hiccups in cockroachdb#9465.
@tbg (Member) commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

@tbg added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting) on Oct 4, 2016
If the remove-target is the lease holder, transfer the lease to another
store to allow removal of the replica from the overfull store.

Fixes cockroachdb#9462.
@petermattis force-pushed the pmattis/rebalance-transfer-lease branch from 9b0203d to 813bace on October 25, 2016 00:32
@petermattis (Collaborator, Author)

Closing this PR. I'll be pulling out pieces into separate PRs and more closely following the approach outlined in #10262.

@petermattis closed this Nov 3, 2016
@petermattis deleted the pmattis/rebalance-transfer-lease branch on November 28, 2016 17:04