Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[prototype] storage: Make rebalance decisions at store-level #26608

Conversation

a-robinson
Copy link
Contributor

@a-robinson a-robinson commented Jun 11, 2018

As outlined in recent comments on #26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

  1. First, determine which stores have the most load on them (or overfull
    -- but for the prototype I only considered the one dimension that
    affects TPC-C the most)
  2. Decide whether the most loaded stores are so overloaded that action
    needs to be taken.
  3. Examine the hottest replicas on the store (maybe not the absolute
    hottest in practice, since moving that one could disrupt user traffic,
    but in the prototype this seems to work fine) and attempt to move them
    to under-utilized stores. If this can be done simply by transferring
    leases to under-utilized stores, then do so. If moving leases isn't
    enough, then also rebalance replicas from the hottest store to
    under-utilized stores.
  4. Repeat periodically to handle changes in load or cluster membership.

In a real version of this code, the plan is roughly:

  1. Each store will independently run their own control loop like this
    that is only responsible for moving leases/replicas off itself, not off
    other stores. This avoids needing a centralized coordinator, and will
    avoid the need to use the raft debug endpoint as long as we start
    gossiping QPS per store info, since the store already has details about
    the replicas on itself.
  2. The existing replicate queue will stop making decisions motivated by
    balance. It will switch to only making decisions based on
    constraints/diversity/lease preferences, which is still needed since
    the new store-level logic will only check for store-level balance,
    not that all replicas' constraints are properly met.
  3. The new code will have to avoid violating constraints/diversity/lease
    preferences.
  4. The new code should consider range count, disk fullness, and maybe
    writes per second as well.
  5. In order to avoid making decisions based on bad data, I'd like to
    extend lease transfers to pass along QPS data to the new leaseholder
    and preemptive snapshots to pass along WPS data to the new replica.
    This may not be strictly necessary, as shown by the success of this
    prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing #26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches #26059, #17979

Release note: None


per-node exec.success metric over a few runs with this change, showing the per-node KV QPS converging over time:

screen shot 2018-06-11 at 3 32 17 pm

per-node goroutine counts over the same few runs, showing the overloaded outlier node getting healthier and converging to the other nodes:

screen shot 2018-06-11 at 3 32 46 pm

I can write up a more formal RFC if folks think the write-up above (plus the success of this prototype) is insufficient as a proposal, but it's a good summary of my current plans.

@BramGruneir @nvanbenschoten @petermattis

@a-robinson a-robinson requested a review from a team June 11, 2018 21:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@a-robinson
Copy link
Contributor Author

Underlying the reasoning to change from the per-replica replicate queue to a new store-level construct is that the replicate queue doesn't have a great sense of how the replica it's looking at compares to the other replicas on the store. We have the 10th/25th/50th/75th/90th percentiles, but in practice those don't appear to be nearly as useful as being able to grab the top N replicas on the store and pick which of them should be moved.

And even if the replicate queue could determine that the replica it was looking at was the hottest replica, or the Nth hottest replica, it's going to be important for it to also know whether the other hot replicas are able to be moved. If the top 17 replicas are movable, maybe we don't want to move the 18th hottest replicas. But if the top 17 replicas can't be moved due to zone constraints and leaseholder preferences, then we probably do want to move the 18th hottest replica. It's going to be a lot easier (and will require breaking way fewer abstractions) to do that from something operating at the store-level than at the replica-level.

@nvanbenschoten
Copy link
Member

Those results look really nice! Did you try performing the experiment on a 30 node cluster as well? I'm sure this will be even more helpful there.

One detail that pops out to me is that we're basing everything on qps without making a distinction between read and write qps. This is interesting because I'd expect transferring a lease for a range with high read-qps to be more impactful than transferring the lease for a range with high write-qps, which probably requires a full replica rebalance to alleviate an overloaded node. Do you think it makes sense to track these separately?

I can write up a more formal RFC if folks think the write-up above (plus the success of this prototype) is insufficient as a proposal, but it's a good summary of my current plans.

I don't necessarily think a full RFC is necessary for any changes here (especially as you're prototyping them), but it would probably be a good idea to eventually go back and update https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161026_leaseholder_rebalancing.md#future-directions with your new findings and plan.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/prototype_allocator.go, line 29 at r1 (raw file):

)

var testQPSThreshold = settings.RegisterNonNegativeFloatSetting(

The prototype doesn't actually use this, right?


Comments from Reviewable

As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
With this update, TPC-C 10k on 30 went from overloaded to running at
peak efficiency over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison). This is without having to run the replica
scatter from cockroachdb#26438.

Doing a 5 minute run to get a result that doesn't include all the
rebalancing time shows:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  290.9s   124799.1  97.0%    548.6    486.5    872.4   1140.9   2281.7  10200.5

I think it may have a small bug in it still, since at one point early on
one of the replicas from the warehouse table on the node doing the
relocating thought that it had 16-17k QPS, which wasn't true by any
other metric in the system. Restarting the node fixed it though.
I'm not too concerned about the bug, since I assume I just made a code
mistake, not that anything about the approach fundamentally leads to a
random SQL table replica gets 10s of thousands of QPS.

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Release note: None
@a-robinson a-robinson force-pushed the store-level-allocator-prototype branch from 61805d0 to 853ac61 Compare June 14, 2018 16:00
@a-robinson
Copy link
Contributor Author

This appears to do the trick on tpc-c 10k, as explained in my latest commit message:

screen shot 2018-06-14 at 10 24 45 am

@nvanbenschoten
Copy link
Member

Nice!

over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison)

Was the rate of rebalancing steady throughout this entire 4 hour period?

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Did this eventually settle down? If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios. There might also be other issues lurking as well.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


Comments from Reviewable

@a-robinson
Copy link
Contributor Author

a-robinson commented Jun 14, 2018

Was the rate of rebalancing steady throughout this entire 4 hour period?

Not the entire time. About 2 hours were actually spent primarily doing lease transfers, which makes sense given the code but probably made the rebalancing period take longer than needed. Judging by the logs, that was because of the bug I mentioned where a replica on n1 had a ridiculously high reported QPS. Once that resolved itself, though, the rest of the time was a pretty steady stream of replica adds/removes.

Did this eventually settle down? If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios. There might also be other issues lurking as well.

Not that I ever saw. Agreed that it's worth getting to the bottom of.

@a-robinson
Copy link
Contributor Author

If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios.

That's certainly part of it. But what isn't clear is why the non-meta descriptors are being evicted. For example, check out this partial excerpt of events on an arbitrarily chosen range on one of the nodes. We keep evicting it and re-adding it to the range cache even though it isn't changing at all:

I180615 19:25:44.241605 903330514 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/10/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:25:49.341302 903397366 kv/range_cache.go:578  [s1,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:26:07.868229 903462027 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2431/9/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:26:13.973296 903495353 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:29:21.724366 904226264 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2439/8/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:29:21.882479 904246100 kv/range_cache.go:578  [client=10.142.0.54:60202,user=root,txn=4f910e13,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:04.364989 904812433 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:04.624679 904920683 kv/range_cache.go:578  [client=10.142.0.54:39730,user=root,txn=a79b8f61,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:51.906363 905107678 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2434/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:52.032818 905120871 kv/range_cache.go:578  [client=10.142.0.54:58400,user=root,txn=7f61f484,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:14.223086 905210365 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2435/2/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:14.299220 905210708 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:46.655785 905316841 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2436/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:46.662854 905349469 kv/range_cache.go:578  [client=10.142.0.54:37896,user=root,txn=6366a404,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:34:50.580809 905608042 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2436/10/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:34:50.622198 905608720 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:38:55.529085 906576259 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:38:58.620422 906589469 kv/range_cache.go:578  [client=10.142.0.54:38534,user=root,txn=978a2b88,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]

@a-robinson
Copy link
Contributor Author

Almost all of the evictions in my current tpc-c 10k cluster are due to context cancellations for QueryTxn and HeartbeatTxn requests. We evict range cache entries whenever we get a SendError, which includes errors due to context cancellation. A cancelled context seems like a poor reason to invalidate a range cache entry, and a spectacularly bad reason to invalidate a meta2 range cache entry.

@a-robinson
Copy link
Contributor Author

Sorry, I forgot to include the relevant code in my last comment:

case *roachpb.SendError, *roachpb.RangeNotFoundError:
// We've tried all the replicas without success. Either
// they're all down, or we're using an out-of-date range
// descriptor. Invalidate the cache and try again with the new
// metadata.
log.Event(ctx, "evicting range descriptor on send error and backoff for re-lookup")
if err := evictToken.Evict(ctx); err != nil {

@a-robinson
Copy link
Contributor Author

And now that I've added in a patch for that, I'm seeing some other questionable behavior causing a lot of range cache evictions. It appears as though whenever we get a RangeNotFound error from one of the replicas, we don't even bother trying the other known replicas. Is there a reason for that?

Although it looks like the biggest issue is still our sending of HeartbeatTxn requests to r1, which we were supposed to have fixed in #26238, but is still happening even though my build includes that change. cc @spencerkimball

Example logs from logspy grepping for r1:

I180615 23:19:44.457027 3577906 kv/leaseholder_cache.go:71  [client=10.142.0.54:37694,user=root,txn=7226f5ad,n26] r1: lookup leaseholder: 7
I180615 23:19:44.457041 3577906 kv/dist_sender.go:1229  [client=10.142.0.54:37694,user=root,txn=7226f5ad,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.465616 3405758 kv/range_iter.go:174  [client=10.142.0.54:39656,user=root,txn=52767306,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.465639 3405758 kv/range_iter.go:207  [client=10.142.0.54:39656,user=root,txn=52767306,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.465661 3405758 kv/leaseholder_cache.go:71  [client=10.142.0.54:39656,user=root,txn=52767306,n26] r1: lookup leaseholder: 7
I180615 23:19:44.465691 3405758 kv/dist_sender.go:1229  [client=10.142.0.54:39656,user=root,txn=52767306,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.475011 3578003 kv/range_iter.go:174  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.475038 3578003 kv/range_iter.go:207  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.475061 3578003 kv/leaseholder_cache.go:71  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] r1: lookup leaseholder: 7
I180615 23:19:44.475096 3578003 kv/dist_sender.go:1229  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.477202 3526183 kv/range_iter.go:174  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.477226 3526183 kv/range_iter.go:207  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.477249 3526183 kv/leaseholder_cache.go:71  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] r1: lookup leaseholder: 7
I180615 23:19:44.477274 3526183 kv/dist_sender.go:1229  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.484863 3577985 kv/range_iter.go:174  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.484875 3577985 kv/range_iter.go:207  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.484892 3577985 kv/leaseholder_cache.go:71  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] r1: lookup leaseholder: 7
I180615 23:19:44.484913 3577985 kv/dist_sender.go:1229  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.485149 3578189 kv/range_iter.go:174  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.485166 3578277 kv/range_iter.go:174  [client=10.142.0.54:60726,user=root,txn=e2717e42,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.485173 3578189 kv/range_iter.go:207  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.485195 3578189 kv/leaseholder_cache.go:71  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] r1: lookup leaseholder: 7
I180615 23:19:44.485205 3578277 kv/range_iter.go:207  [client=10.142.0.54:60726,user=root,txn=e2717e42,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 16, 2018
When running TPC-C 10k on a 30 node cluster without partitioning, range
1 was receiving thousands of qps while all other ranges were receiving
no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was
context cancellations causing range descriptors to be evicted from the
range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns
being sent for transactions with no anchor key, accounting for thousands
of QPS even after cockroachdb#26764 was fixed.

This causes the same outcome as the old code without the load, because
without this change we'd just send the request and get back a
REASON_TXN_NOT_FOUND error, which would cause the function to return
true.

It's possible that we should instead avoid the heartbeat loop at all for
transactions without a key, or that we should put in more effort to
prevent such requests from even counting as transactions (a la cockroachdb#26741,
which perhaps makes this change unnecessary?). Advice would be great.

Release note: None
@a-robinson
Copy link
Contributor Author

Although it looks like the biggest issue is still our sending of HeartbeatTxn requests to r1, which we were supposed to have fixed in #26238, but is still happening even though my build includes that change. cc @spencerkimball

Sorry, I was mixing up HeartbeatTxns and QueryTxns. This was a separate problem.

I've sent out #26764 for the context cancellation problem and #26765 for all the HeartbeatTxn requests to r1. I'm not doing anything about the cache evictions caused by RangeNotFound errors on a single replica, since I can see some sense in them, even though doing so is now the primary cause of requests to r1, which is still seeing a few dozen qps (#18988 isn't helping).

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 16, 2018
SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on cockroachdb#26608.

Release note: None
@bdarnell
Copy link
Contributor

It appears as though whenever we get a RangeNotFound error from one of the replicas, we don't even bother trying the other known replicas. Is there a reason for that?

That sounds like a mistake. As the comment at dist_sender.go:1034 says, these errors are only supposed to reach this level if we've exhausted all the replicas.

The whole notion of range descriptor invalidation could use some reworking. Let's discuss that on #18988 instead of adding to this issue.

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 18, 2018
SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on cockroachdb#26608.

Release note: None
craig bot pushed a commit that referenced this pull request Jun 18, 2018
26764: kv: Don't treat context cancellation as a SendError r=a-robinson a=a-robinson

SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on #26608.

Release note: None

---------------------------

`kv` is probably the package I'm least familiar with all the details of, so this may be a bad way of accomplishing this. If so, please let me know. If it's reasonable, I'll add a test that context cancellations don't affect the contents of the range cache.

Co-authored-by: Alex Robinson <[email protected]>
@a-robinson
Copy link
Contributor Author

No need to keep this open any longer. #28340 and #28852 implement this for real.

@a-robinson a-robinson closed this Aug 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants