
kv: unexpected preemptive transfer of leases upon adding new nodes #51867

Closed
bobvawter opened this issue Jul 23, 2020 · 6 comments
Labels: A-kv-distribution (Relating to rebalancing and leasing) · C-investigation (Further steps needed to qualify; C-label will change) · S-3 (Medium-low impact: incurs increased costs for some users, incl. lower availability, recoverable bad data) · T-kv (KV Team)

Comments

@bobvawter (Contributor) commented Jul 23, 2020

Impact: causes performance disruption during cluster scale-up

Background

An enterprise customer running a "hot" cluster saw unexpected lease transfers while adding additional nodes, which exacerbated CPU use on already heavily loaded nodes. There was no data loss or downtime, but the concern is that this behavior makes a "galloping stampede" failure more likely.

The customer can reproduce this behavior reliably, and I have reproduced it in my own testing as well.

Observation

Adding a node in one locality appears to cause all of the other nodes in that locality to drain away their leases. In the graph below, I added a new node to DC1; the leases-per-store values for the existing DC1 stores can be seen going down as their leases transfer to DC2.

[Screenshot: leases-per-store graph, 2020-07-23 5:12 PM]

There is also a DC3, but the localities and lease preference are configured as follows. Notice that the single +continent=na lease preference ends up grouping DC1 and DC2 together.

DC1: cloud=gce,continent=na,zone=us-east1-b
DC2: cloud=gce,continent=na,zone=us-west1-b
DC3: cloud=gce,continent=eu,zone=europe-west2-b

ALTER DATABASE tpcc CONFIGURE ZONE USING
constraints = '[]',
lease_preferences = '[[+continent=na]]';

The leases eventually rebalance.

[Screenshot: leases-per-store graph, 2020-07-23 5:27 PM, after the leases have rebalanced]

This behavior occurs with no load, under light load, and in overloaded cases.

Tested with 20.1.1.

gz#5876

bobvawter added the C-investigation, A-kv-distribution, and S-3 labels on Jul 23, 2020
@lunevalex (Collaborator) commented

@bobvawter thanks for raising this issue. Can you add your reproduction steps so we can test this exact scenario?

@bobvawter (Contributor, Author) commented

export CLUSTER=$USER-leases
./roachprod create $CLUSTER -n 10
./roachprod stage $CLUSTER release v20.1.1
./roachprod start $CLUSTER:1-3 --secure -a "--locality=cloud=gce,continent=na,zone=us-east1-b"
./roachprod start $CLUSTER:4-6 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
./roachprod start $CLUSTER:7-9 --secure -a "--locality=cloud=gce,continent=eu,zone=europe-west1-b"

./roachprod sql $CLUSTER:1
create user me with password 'HelloWorld!';
grant admin to me;

# Load some data into the cluster
./roachprod ssh $CLUSTER:10
./cockroach workload init tpcc --drop --data-loader import --warehouses 500 'postgres://[email protected]:26257?sslcert=certs%2Fnode.crt&sslkey=certs%2Fnode.key&sslrootcert=certs%2Fca.crt&sslmode=verify-full'

# Move the leases and wait for DB to settle
ALTER DATABASE tpcc CONFIGURE ZONE USING
constraints = '[]',
lease_preferences = '[[+continent=na]]';


# Add a new node and watch the leases move
./roachprod start $CLUSTER:10 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
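
Not part of the original repro, but if it is easier to watch the shift from a terminal than from the Admin UI, a small Go watcher along these lines can poll lease counts while the new node joins. It is a sketch only: it assumes crdb_internal.ranges exposes database_name and lease_holder columns (worth double-checking on your version) and uses the lib/pq driver; pass it the same connection URL used for the workload command above.

package main

import (
    "database/sql"
    "fmt"
    "log"
    "os"
    "time"

    _ "github.com/lib/pq" // assumed pgwire driver; any Postgres driver works
)

func main() {
    // Usage: leasewatch <connection-url>, e.g. the URL used with `cockroach workload` above.
    if len(os.Args) < 2 {
        log.Fatal("usage: leasewatch <connection-url>")
    }
    db, err := sql.Open("postgres", os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    for {
        // Assumes crdb_internal.ranges has database_name and lease_holder columns.
        rows, err := db.Query(`
            SELECT lease_holder, count(*)
            FROM crdb_internal.ranges
            WHERE database_name = 'tpcc'
            GROUP BY lease_holder
            ORDER BY lease_holder`)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(time.Now().Format("15:04:05"))
        for rows.Next() {
            var holder, leases int
            if err := rows.Scan(&holder, &leases); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("  leaseholder %d: %d leases\n", holder, leases)
        }
        if err := rows.Close(); err != nil {
            log.Fatal(err)
        }
        time.Sleep(10 * time.Second)
    }
}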


@ajwerner (Contributor) commented

I've been digging into this one a bit. My main observation is that we move these leases as part of the rebalance simulation triggered by the new node: all of the existing nodes decide they want to remove their own replicas for various reasons, some of which seem random and some of which come from hard-to-follow convergence calculations. We end up transferring the leases very quickly, but we can't move the data nearly that fast.

Even in less dramatic cases than this, such as a single region, we see wild oscillations from this very code path. My sense is that the fix is to make the considerRebalance path respect the usual lease-transfer rate limiting when it would choose to move a replica; a sketch of the idea is below. I'm going to experiment with such a change.
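
Purely as an illustration (this is not the actual kvserver code, and leaseTransferPacer and its fields are made-up names), the idea looks roughly like this: before a rebalance-motivated lease transfer, consult the same minimum interval that ordinary lease-balancing transfers already respect, and skip the transfer if one fired too recently.

package allocatorsketch

import (
    "sync"
    "time"
)

// leaseTransferPacer is a hypothetical helper, not a CockroachDB type. It
// remembers when this store last shed a lease so that rebalance-driven
// transfers cannot outrun the configured lease-transfer rate.
type leaseTransferPacer struct {
    mu           sync.Mutex
    lastTransfer time.Time
    minInterval  time.Duration // e.g. the usual minimum lease-transfer interval
}

// shouldTransferForRebalance reports whether enough time has passed since the
// previous transfer. If so, it records now as the latest transfer time; if
// not, the caller keeps the lease and the rebalance is retried on a later pass.
func (p *leaseTransferPacer) shouldTransferForRebalance(now time.Time) bool {
    p.mu.Lock()
    defer p.mu.Unlock()
    if now.Sub(p.lastTransfer) < p.minInterval {
        return false
    }
    p.lastTransfer = now
    return true
}

The actual change discussed below plumbs the existing lease-transfer rate limit into this decision rather than introducing a new mechanism.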

@ajwerner (Contributor) commented Sep 12, 2020

tl;dr: In the short term, I think we should improve this with an inbound PR that just slows the rate at which we transfer leases away for the purpose of rebalancing; in the long term, we should centralize rebalancing planning in a whole new allocator scheme.

Alright, I'm becoming increasingly convinced that the shedding of leases inside the locality of the node being added is going to be quite hard, or at least subtle, to fix. We generally prioritize rebalancing over lease transfers for lease balancing, and in that world the ranges that shed their leases when simulating rebalancing to the new node are always the ones in the same locality.

As noted above, I think the more severe problem in some ways is the rate at which these nodes can shed their leases. This is actually worse when there is just one locality: the leaseholder sheds its lease far faster than the new leaseholder rebalances the range away. We can see some stores shedding several leases per second.

Below is a repro with the transfer rate capped at our usual maximum lease-transfer rate. In this scenario we still see leases drain away before coming back to the new node's locality, but it is less severe.

[Screenshot: leases-per-store graph from the rate-capped repro]


To me, the more compelling part of this prototype is that it remedies the severe lease fluctuations in the single-locality case. Customers with active workloads have observed much more dramatic oscillations.

[Screenshot: Replication Dashboard, Cockroach Console, 2020-09-11]

ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 14, 2020
This commit plumbs the lease transfer rate limit into the lease transfer which
occurs as a part of rebalancing decision making. The decision about whether a
node should remove itself for the purpose of rebalancing happens at a much
higher rate than the actual rebalancing. This can lead to massive
oscillations of leaseholders as the calculation on which replica should be
removed in the face of node addition changes dramatically. We utilize
a rate limit in lease transfer decisions to smooth this decision making
elsewhere. This limit however is not utilized currently when deciding to
transfer a lease for the purpose of rebalancing data. This change has
shown very positive impact mitigating these lease transfer storms. See
the commentary on cockroachdb#51867.

Release note (bug fix): Made lease transfers during rebalancing adhere
to the rate limit utilized in other lease transfer cases, which eliminates
unexpected lease oscillations when adding a new node.
craig bot pushed a commit that referenced this issue Sep 15, 2020
54322: kvserver: observe lease transfer interval in considerRebalance r=ajwerner a=ajwerner

ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 24, 2021
@ajwerner (Contributor) commented

#61038 does not completely solve #51867, but it does dramatically reduce the impact. It was sort of a two-level problem: ideally, when adding a new node, leases and data would flow to that node, and at no point would the leases or data on other nodes dip below the steady state to which the cluster will eventually converge. My change doesn't make that the case.
The root of the bug is that when we decide to move data to the new node and to replace the current leaseholder, we first have to transfer the lease away. The problem is that by the time we transfer the lease, the new leaseholder may, for various reasons (usually snapshot concurrency or bandwidth), not be willing to send the snapshot immediately. The sketch below illustrates that ordering.
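
A rough sketch of the timeline, in Go but purely illustrative (rebalanceAway, snapshotSem, and the callbacks are made-up names, not kvserver APIs, and in reality the two steps run on different nodes): the lease transfer is cheap and immediate, while the data movement waits for snapshot capacity, so lease placement runs far ahead of replica placement unless the first step is paced.

package rebalancesketch

import "context"

// snapshotSem caps concurrent outbound snapshots, standing in for the real
// snapshot concurrency/bandwidth limits mentioned above.
var snapshotSem = make(chan struct{}, 2)

// rebalanceAway collapses the sequence into one place: shed the lease off the
// outgoing leaseholder first (fast), then wait for a snapshot slot before the
// range data can actually move to the new node (slow).
func rebalanceAway(ctx context.Context, transferLease, sendSnapshot func() error) error {
    // Step 1: the outgoing leaseholder sheds its lease immediately; before the
    // fix discussed in this issue, this step was not rate limited.
    if err := transferLease(); err != nil {
        return err
    }
    // Step 2: the snapshot to the incoming node (in reality sent later by the
    // new leaseholder) must wait for capacity, so the lease can sit elsewhere
    // long before the data follows it.
    select {
    case snapshotSem <- struct{}{}:
        defer func() { <-snapshotSem }()
    case <-ctx.Done():
        return ctx.Err()
    }
    return sendSnapshot()
}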

However, the bigger problem here, on some level, was that a large storm of lease transfers is quite bad for latency and availability. The dynamic above, where the current leaseholder transfers leases away because of a placement decision, was not governed by our lease-transfer rate limiting, so stores were shedding leases far faster than the lease-rebalancing logic could react. At that point the lease counts would get way out of whack and the lease-rebalancing logic would get aggressive. When lease transfers are slower than the information propagation, things stay pretty smooth and latency remains largely unaffected.

All that being said, if you're moving a bunch of data around reasonably slowly, then the leases are going to be unbalanced for a time.

This is one of those deeper problems with a decentralized, heuristic-based placement driver.

@nvanbenschoten (Member) commented

Merging this issue into #67740, as both issues have identified the same behavior.
