storage: diagnose performance issues with tpc-c when not using partitioning #26059

Closed
a-robinson opened this issue May 24, 2018 · 19 comments

@a-robinson
Contributor

No description provided.

@a-robinson a-robinson added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-distribution Relating to rebalancing and leasing. labels May 24, 2018
@a-robinson a-robinson added this to the 2.1 milestone May 24, 2018
@a-robinson a-robinson self-assigned this May 24, 2018
@a-robinson
Contributor Author

a-robinson commented May 30, 2018

After some initial investigation today, there's one problem so big that it pretty much obscures whatever problems may exist beneath it -- bad replica balance. The replicas for any given table mostly fall on the same 3 nodes. The worst offender here is the tpcc.warehouse table, which gets the most load of any table as far as I'm aware (please correct that if I'm wrong). Almost all of the replicas for its 1000 ranges ended up on the same 3 nodes, meaning that the best leaseholder rebalancing in the world couldn't make for good performance. Those 3 nodes are indeed the 3 most loaded along dimensions such as number of BatchRequests processed, CPU usage, and keys written per second, with 1 of them in particular really standing out in goroutine count (20k vs 2k for the next highest).

This could be fixed in a number of ways, such as:

  • Load-based rebalancing
  • Trying to diversify the nodes a table's ranges are on when rebalancing
  • Implementing a version of ALTER TABLE ... SCATTER that scatters replicas in addition to leases

The last one should get the quickest results and we've talked about wanting it anyway, so I'll work on that tomorrow and see where it leaves things.
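As a concrete illustration of that workaround, the scatter could be driven by a tiny client program. This is only a sketch, assuming a build where ALTER TABLE ... SCATTER moves replicas as well as leases; the connection string and table list are placeholders:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Placeholder connection string; point it at any node in the cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scatter the most heavily loaded tpcc tables. Today this statement
	// randomizes lease placement; the proposal above is for it to also
	// scatter the replicas themselves.
	for _, tbl := range []string{"tpcc.warehouse", "tpcc.new_order", "tpcc.item"} {
		if _, err := db.Exec("ALTER TABLE " + tbl + " SCATTER"); err != nil {
			log.Fatalf("scatter %s: %v", tbl, err)
		}
	}
}
```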


I do have a couple questions about oddities I've noticed, though:

  1. r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?
  2. The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?

Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:

  • Breakdown of which ranges are on each node, i.e. a breakdown of how many are from each table
  • Breakdown of where the replicas for each table are located, i.e. their distribution across the nodes
  • Breakdown of where the leaseholders for each table are located, i.e. their distribution across the nodes
  • List of the ranges receiving the most traffic (QPS and/or keys written per second), where their leaseholders are, where their replicas are, which tables they're from

I ended up manually constructing most of this data with the raft debug endpoint + a few python scripts as well as SHOW TESTING_RANGES + some bash.
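For reference, the SHOW TESTING_RANGES portion of that can be reduced to a small program. This is just a sketch assuming the column layout shown later in this thread (Start Key, End Key, Range ID, Replicas, Lease Holder) and a placeholder connection string:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SHOW TESTING_RANGES FROM TABLE tpcc.warehouse")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	replicasPerNode := map[string]int{}
	leasesPerNode := map[string]int{}
	for rows.Next() {
		var startKey, endKey sql.NullString
		var rangeID int64
		var replicas, leaseholder string
		if err := rows.Scan(&startKey, &endKey, &rangeID, &replicas, &leaseholder); err != nil {
			log.Fatal(err)
		}
		// The replicas column comes back looking like "{3,4,12}".
		for _, n := range strings.Split(strings.Trim(replicas, "{}"), ",") {
			replicasPerNode[strings.TrimSpace(n)]++
		}
		leasesPerNode[leaseholder]++
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("replicas per node:    ", replicasPerNode)
	fmt.Println("leaseholders per node:", leasesPerNode)
}
```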

@a-robinson
Contributor Author

And to be clear, the point of the first paragraph above is that leaseholder rebalancing can't fix the problem by itself if all the warehouse range replicas are on the same 3 nodes.

@petermattis
Collaborator

r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?

I'd expect r1 to be mostly idle in the steady state. @nvanbenschoten, @tschottdorf, any theories?

The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?

This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.

Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:

Heh, I came up with a similar list when debugging tpcc performance issues. I believe there are issues filed for many of these. @piyush-singh please double check.

@a-robinson
Contributor Author

This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.

I can't confirm it right now since I tore the cluster down when I was done last night, but I think it could conceivably be due to errors from that node. I recall there being a fairly large (albeit spiky) amount of "Replica Errors" in the UI's graph of RPC errors. It's possible that those were mostly coming from n11's replicas, and that retries were then bumping up the count of batches sent to n11's replicas.

@bdarnell
Contributor

Implementing a version of ALTER TABLE ... SCATTER that scatters replicas in addition to ranges

What does "scatters replicas in addition to ranges" mean?

r1 has a lot of qps.

That's surprising. This is probably due to overly-broad cache invalidation when the DistSender gets a NotLeaseHolder error (or a few other things). It sounds like we might be invalidating both meta1 and meta2 when we should only be invalidating meta2.

@piyush-singh

We will be showing the location of replicas in a debug page shortly (#24855), and @vilterp and I have discussed using this layout to similarly show leaseholder counts.

@petermattis mentioned similar requests for information about ranges in #23379 (nodes to which they were replicated, size, load). I'll make a note to think through how best to show ranges in the UI for our last milestone since we'll be making some layout changes anyways.

Thanks for the write-up and detailed requests, @a-robinson.

@nvanbenschoten
Member

r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?

The trace is pretty suspicious for this issue. As you mentioned, we see a large number of QueryTxn requests in the trace. A lot of these look something like:

event:[n11,s11,r55510/2:/Table/55/1/9{00-10}] no pusher found; backing off
 30575.736ms    390.389ms                                operation:dist sender
 30575.767ms      0.030ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] querying next range at /Min
 30575.806ms      0.040ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] r1: sending batch 1 QueryTxn to (n20,s19):8
 30575.827ms      0.021ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] sending request to alex-tpcc-0002:26257
 30575.848ms      0.021ms                                    operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:client txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
 30576.756ms      0.908ms                                        operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:server txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
 30576.776ms      0.019ms                                        event:[n20] 1 QueryTxn
 30576.781ms      0.005ms                                        event:[n20,s19] executing 1 requests
 30626.903ms     50.122ms                                        event:[n20,s19,r1/8:/{Min-System/}] read-only path
 30626.914ms      0.011ms                                        event:[n20,s19,r1/8:/{Min-System/}] command queue
 30626.918ms      0.004ms                                        event:[n20,s19,r1/8:/{Min-System/}] waiting for read lock
 30626.970ms      0.052ms                                        event:[n20,s19,r1/8:/{Min-System/}] read completed

Note that they're addressed to /Min. This trace is coming from

updatedPusher, waitingTxns, pErr = q.queryTxnStatus(
	ctx, pusher.TxnMeta, true, waitingTxns, q.store.Clock().Now(),
)
and I suspect that we're querying a transaction that does not have a record yet (i.e. pusher.TxnMeta.Key == nil). For what it's worth, @spencerkimball has been dealing with issues like this lately with some of his contentionQueue changes.

things that would have been nice to have for the sake of debugging the performance issue:

@vilterp's replica matrix (#24855) would address at least the first two of these needs. Might be worth trying it out here.

@a-robinson
Contributor Author

What does "scatters replicas in addition to ranges" mean?

Sorry, I guess I can't type late at night. I meant "scatters replicas in addition to leases".

@spencerkimball
Member

Yes, we would be seeing spurious QueryTxn requests to r1 in the current code because the txn wait queue isn't checking whether there's a key available for the pusher before querying it for dependencies. The lack of a key causes the QueryTxn request to be addressed to r1. This is a one-line fix as part of my current PR, but I'll break it out and send it for review.
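In sketch form, the guard amounts to the following (simplified stand-in types, not the actual txn wait queue code):

```go
package main

import "fmt"

// txnMeta is a simplified stand-in for roachpb.TxnMeta; Key is the
// transaction's anchor key, which stays nil until a txn record is written.
type txnMeta struct {
	ID  string
	Key []byte
}

// shouldQueryPusher mirrors the fix described above: only send a QueryTxn
// for a pusher that can actually have a transaction record. With a nil
// anchor key the request would be addressed to /Min and land on r1 for
// no benefit.
func shouldQueryPusher(pusher txnMeta) bool {
	return pusher.Key != nil
}

func main() {
	pushers := []txnMeta{
		{ID: "a"},                                 // no record yet: skip the query
		{ID: "b", Key: []byte("/Table/55/1/900")}, // has a record: query it
	}
	for _, p := range pushers {
		fmt.Printf("pusher %s: query=%t\n", p.ID, shouldQueryPusher(p))
	}
}
```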

craig bot pushed a commit that referenced this issue May 30, 2018
26238: storage: avoid querying pusher where transaction record can't exist r=spencerkimball a=spencerkimball

This change checks whether a txn has a non-nil key before querying it
if it's waiting on an extant transaction which owns a conflicting intent.
Previously, the code would query the pusher's txn, even if the key was
nil, which would just send a spurious `QueryTxn` request to the first
range in the keyspace.

See #26059

Release note: None

Co-authored-by: Spencer Kimball <[email protected]>
craig bot pushed a commit that referenced this issue May 31, 2018
26245: backport-2.0: storage: avoid querying pusher where transaction record can't exist r=tschottdorf a=tschottdorf

Backport 1/1 commits from #26238.

/cc @cockroachdb/release

---

This change checks whether a txn has a non-nil key before querying it
if it's waiting on an extant transaction which owns a conflicting intent.
Previously, the code would query the pusher's txn, even if the key was
nil, which would just send a spurious `QueryTxn` request to the first
range in the keyspace.

See #26059

Release note: None


Co-authored-by: Spencer Kimball <[email protected]>
@a-robinson
Contributor Author

a-robinson commented May 31, 2018

Before scattering replicas, running tpc-c 5k on 15 nodes:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.3s    29544.7  45.9%  26271.4  12348.0  85899.3 103079.2 103079.2 103079.2

After scattering replicas (and leases), running tpc-c 5k on 15 nodes:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    61047.7  94.9%   1106.1    117.4    369.1   9663.7  19327.4  28991.0

So if our goal is to have a usable workaround, scattering replicas basically gets us there. And if we want to remove the manual aspect of it, it looks like it'll be more impactful to work on stats-based replica rebalancing than lease rebalancing.

@petermattis
Collaborator

@a-robinson Nice improvement. The p99 and pMax latencies look higher than what I remember for tpcc-5k. Would be interesting to see how this compares to tpcc-5k with partitioning.

@a-robinson
Contributor Author

Running with --ramp=30s to ramp up the traffic more smoothly seems to help a lot with the tail latencies:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60544.4  94.2%    143.7    134.2    218.1    260.0    453.0   4831.8

Of course, the sample sizes here aren't exactly large, so maybe the difference is just noise.

And to follow up on my questions from the other day:

  1. r1 is no longer the hottest range by far, though its qps hasn't dropped to 0 either. The slow query log hasn't been nearly as illuminating this time, just showing a handful of normal range-lookup events. It's weird that it hasn't dropped closer to 0, but it's not causing any problems anymore.
  2. All the warehouse ranges are now down around 2 qps, with no outliers. So the outliers the other day probably were caused by error retries.

The hottest ranges in the cluster at this point are all from the tpcc.new_order table, followed by the tpcc.item table. Top 50 ranges in case you're curious, with the rangeIDs in the mid-hundreds being from new_order and the 30-thousands from item:

idx rangeID    QPS                nodeIDs     leaseholder node
  1:   496      217.292003       [5, 6, 4]      6
  2:   793      216.760419      [5, 10, 6]      5
  3:   653      216.523500      [5, 10, 6]      10
  4:   448      216.106647      [5, 10, 6]      10
  5:   557      215.943809      [15, 3, 1]      15
  6:   458      215.132572      [15, 12, 1]     15
  7:   533      193.794032      [5, 15, 4]      5
  8:   730      188.153795       [5, 6, 3]      6
  9:   293      159.987734      [10, 6, 9]      9
 10:   413      154.168710      [5, 15, 4]      5
 11:   613      152.213182      [15, 6, 1]      15
 12:   267      151.527005       [5, 6, 4]      6
 13:   797      151.262274       [5, 6, 3]      3
 14:   611      150.982270       [6, 4, 3]      6
 15:   457      150.963159       [5, 6, 3]      5
 16:   724      150.941294       [5, 3, 1]      3
 17:   615      150.589744      [5, 12, 4]      12
 18: 33644      123.715932       [8, 6, 7]      8
 19: 33197      112.783797       [8, 6, 7]      6
 20:   388      112.059562      [5, 10, 4]      4
 21:   540       97.112390       [5, 6, 4]      4
 22: 34061       94.627812       [8, 6, 7]      7
 23: 32384       92.206044       [8, 6, 7]      8
 24: 32110       86.193180       [8, 6, 7]      6
 25: 32675       82.701674       [8, 6, 7]      6
 26: 34390       73.451197       [8, 6, 7]      6
 27: 34878       70.419347       [8, 6, 7]      7
 28:   266       69.683754      [5, 15, 4]      15
 29: 31457       67.308302       [8, 6, 7]      6
 30: 33275       67.023898       [8, 6, 7]      7
 31:   976       65.729066      [5, 10, 6]      5
 32: 36672       63.460209      [5, 15, 4]      15
 33: 34217       60.512212       [8, 6, 7]      6
 34:   493       59.469083      [5, 6, 12]      6
 35: 34104       54.001622       [8, 6, 7]      7
 36: 32033       53.894249       [8, 6, 7]      6
 37: 31546       52.105440       [8, 6, 7]      6
 38: 32199       52.043978       [8, 6, 7]      7
 39: 34701       50.376507       [8, 6, 7]      8
 40: 32436       49.500585       [8, 6, 7]      8
 41: 32335       47.352396       [8, 6, 7]      6
 42: 32210       46.446430       [8, 6, 7]      8
 43: 34580       45.592824       [8, 6, 7]      8
 44: 32663       44.654265       [8, 6, 7]      8
 45: 33395       44.197098       [8, 6, 7]      8
 46: 33678       44.139657       [8, 6, 7]      7
 47: 31786       43.393267       [8, 6, 7]      8
 48:     1       42.440460      [2, 14, 12]     2
 49: 34153       41.976303       [8, 6, 7]      7
 50: 33196       40.899344       [8, 6, 7]      8

@vilterp
Contributor

vilterp commented May 31, 2018

Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.

@petermattis
Collaborator

The new_order table has warehouse ID as the first component of its primary key. I'm surprised to see such a delta in QPS across those new_order ranges. And I'm guessing this is only a fraction of the new_order ranges.

@a-robinson
Contributor Author

Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.

hottest_ranges.py from https://gist.github.com/a-robinson/54fbaa6628ae9f1ad9c6185ecd28edb9 run on the output of the /_status/raft endpoint.

Disclaimer: the qps numbers are averages over the last 30 minutes, not instantaneous measurements. Also, that raft debug endpoint is very slow (>1 minute) on a large cluster.

And I'm guessing this is only a fraction of the new_order ranges.

No, they're all represented there except for one outlier at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of #24896):

root@localhost:26257/defaultdb> show testing_ranges from table tpcc.new_order;
+-----------------+-----------------+----------+-------------+--------------+
|    Start Key    |     End Key     | Range ID |  Replicas   | Lease Holder |
+-----------------+-----------------+----------+-------------+--------------+
| NULL            | /0              |      192 | {3,4,12}    |           12 |
| /0              | /0/1/2101/0     |    36721 | {3,4,12}    |           12 |
| /0/1/2101/0     | /138/9/2506     |      266 | {4,5,15}    |           15 |
| /138/9/2506     | /272/10/2564    |    36672 | {4,5,15}    |           15 |
| /272/10/2564    | /272/10/2564/0  |      337 | {4,5,15}    |           15 |
| /272/10/2564/0  | /521/6/2127     |      413 | {4,5,15}    |           15 |
| /521/6/2127     | /521/6/2127/0   |      460 | {5,10,15}   |           15 |
| /521/6/2127/0   | /770/1/2590     |      533 | {4,5,15}    |           15 |
| /770/1/2590     | /770/1/2590/0   |      198 | {3,6,15}    |           15 |
| /770/1/2590/0   | /1000           |      293 | {6,9,10}    |            6 |
| /1000           | /1018/7/2153    |    36711 | {6,9,10}    |            6 |
| /1018/7/2153    | /1018/7/2153/0  |      412 | {4,9,14}    |            9 |
| /1018/7/2153/0  | /1267/2/2616    |      458 | {1,12,15}   |            1 |
| /1267/2/2616    | /1267/2/2616/0  |      552 | {1,4,15}    |            1 |
| /1267/2/2616/0  | /1515/8/2179    |      557 | {1,3,15}    |            1 |
| /1515/8/2179    | /1515/8/2179/0  |      215 | {4,5,12}    |           12 |
| /1515/8/2179/0  | /1764/3/2642    |      267 | {2,4,6}     |            6 |
| /1764/3/2642    | /1764/3/2642/0  |      339 | {4,5,11}    |           11 |
| /1764/3/2642/0  | /1948/8/2638    |      388 | {4,5,10}    |            4 |
| /1948/8/2638    | /1948/8/2638/0  |      492 | {3,6,14}    |           14 |
| /1948/8/2638/0  | /2000           |      611 | {3,4,6}     |            3 |
| /2000           | /2197/4/2201    |    36697 | {3,4,6}     |            3 |
| /2197/4/2201    | /2197/4/2201/0  |      262 | {2,4,6,15}  |            2 |
| /2197/4/2201/0  | /2445/9/2664    |      457 | {5,6,13}    |            6 |
| /2445/9/2664    | /2445/9/2664/0  |      536 | {3,6,12}    |           12 |
| /2445/9/2664/0  | /2605/5/2267    |      540 | {2,6,8,15}  |            8 |
| /2605/5/2267    | /2605/5/2267/0  |      721 | {3,4,6}     |            3 |
| /2605/5/2267/0  | /2853/10/2730   |      730 | {3,5,6}     |            3 |
| /2853/10/2730   | /2853/10/2730/0 |      331 | {4,6,15}    |           15 |
| /2853/10/2730/0 | /2921/10/2590   |      493 | {5,6,12}    |           12 |
| /2921/10/2590   | /2921/10/2590/0 |      612 | {1,2,3}     |            1 |
| /2921/10/2590/0 | /3000           |      613 | {1,6,15}    |            1 |
| /3000           | /3170/6/2153    |    36685 | {1,6,15}    |            1 |
| /3170/6/2153    | /3170/6/2153/0  |      614 | {1,6,15}    |            1 |
| /3170/6/2153/0  | /3419/1/2616    |      615 | {4,5,12}    |           12 |
| /3419/1/2616    | /3419/1/2616/0  |      384 | {4,5,12}    |            4 |
| /3419/1/2616/0  | /3667/7/2179    |      496 | {4,5,6}     |            6 |
| /3667/7/2179    | /3667/7/2179/0  |      551 | {3,6,7}     |            3 |
| /3667/7/2179/0  | /3916/2/2642    |      724 | {1,3,5}     |            5 |
| /3916/2/2642    | /3916/2/2642/0  |      794 | {3,5,15}    |           15 |
| /3916/2/2642/0  | /4000           |      797 | {3,5,6}     |            3 |
| /4000           | /4164/8/2205    |    36692 | {3,5,6}     |            3 |
| /4164/8/2205    | /4164/8/2205/0  |      417 | {4,6,11,13} |            6 |
| /4164/8/2205/0  | /4413/3/2668    |      448 | {5,6,10}    |           10 |
| /4413/3/2668    | /4413/3/2668/0  |      535 | {3,5,6}     |            3 |
| /4413/3/2668/0  | /4661/9/2231    |      653 | {5,6,10}    |           10 |
| /4661/9/2231    | /4661/9/2231/0  |      726 | {4,5,13}    |           13 |
| /4661/9/2231/0  | /4910/4/2694    |      793 | {5,9,10}    |            5 |
| /4910/4/2694    | /4910/4/2694/0  |      845 | {5,10,13}   |           10 |
| /4910/4/2694/0  | /4999/10/3001   |      976 | {4,5,10}    |           10 |
| /4999/10/3001   | /5000           |     1091 | {4,6,11,13} |            6 |
| /5000           | NULL            |    36731 | {4,6,13}    |            4 |
+-----------------+-----------------+----------+-------------+--------------+
(52 rows)

@petermattis
Collaborator

No, they're all represented there except for one outlier at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of #24896):

We don't manually split the new_order table. I'm not sure why not. Perhaps there are enough ranges already to not require splitting. The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.

@a-robinson
Contributor Author

Using partitioning gets basically the same results as with replica scattering:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60837.6  94.6%     84.7     71.3    130.0    201.3    352.3   1006.6

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60587.1  94.2%     88.9     79.7    130.0    184.5    369.1    906.0

The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.

It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before #24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.

@petermattis
Collaborator

It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before #24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.

If this is reproducible, can you file a separate issue about this? Seems like something for the bulkio folks to investigate.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 11, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real version of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
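To make the store-level proposal above more concrete, here is a rough sketch of a single pass of such a control loop. The types, thresholds, and stubbed-out actions are all hypothetical; this is not the prototype's code:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical stand-ins for per-store and per-replica load; the real
// loop would work from gossiped store stats and per-replica QPS tracking.
type replicaLoad struct {
	rangeID int64
	qps     float64
}

type storeLoad struct {
	storeID  int32
	qps      float64
	replicas []replicaLoad
}

// rebalanceOnce sketches one pass of the loop: if this store is meaningfully
// hotter than the cluster mean, shed its hottest replicas' load onto the
// coldest store, preferring lease transfers and falling back to moving
// replicas.
func rebalanceOnce(self storeLoad, others []storeLoad,
	transferLease, moveReplica func(rangeID int64, target int32) bool) {

	const overfullFactor = 1.1 // illustrative threshold

	mean := self.qps
	for _, s := range others {
		mean += s.qps
	}
	mean /= float64(len(others) + 1)
	if self.qps <= overfullFactor*mean {
		return // not overfull enough to act
	}

	// Consider this store's hottest replicas first.
	sort.Slice(self.replicas, func(i, j int) bool {
		return self.replicas[i].qps > self.replicas[j].qps
	})

	remaining := self.qps
	for _, r := range self.replicas {
		// Pick the currently coldest store as the target.
		target := others[0]
		for _, s := range others[1:] {
			if s.qps < target.qps {
				target = s
			}
		}
		if !transferLease(r.rangeID, target.storeID) {
			moveReplica(r.rangeID, target.storeID)
		}
		remaining -= r.qps
		if remaining <= mean {
			break // close enough to the mean; stop shedding load
		}
	}
}

func main() {
	self := storeLoad{storeID: 11, qps: 900, replicas: []replicaLoad{
		{rangeID: 496, qps: 217}, {rangeID: 793, qps: 216},
	}}
	others := []storeLoad{{storeID: 5, qps: 300}, {storeID: 6, qps: 250}}
	transfer := func(rangeID int64, target int32) bool {
		fmt.Printf("transfer lease for r%d to s%d\n", rangeID, target)
		return true
	}
	move := func(rangeID int64, target int32) bool {
		fmt.Printf("move replica of r%d to s%d\n", rangeID, target)
		return true
	}
	// A real store-level loop would repeat this periodically to react to
	// changes in load or cluster membership.
	rebalanceOnce(self, others, transfer, move)
}
```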
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 11, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 14, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 3, 2018

[prototype] storage: Extend new allocator to also move range replicas

With this update, TPC-C 10k on 30 nodes went from overloaded to running at
peak efficiency over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison). This is without having to run the replica
scatter from cockroachdb#26438.

Doing a 5 minute run to get a result that doesn't include all the
rebalancing time shows:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  290.9s   124799.1  97.0%    548.6    486.5    872.4   1140.9   2281.7  10200.5

I think it may have a small bug in it still, since at one point early on
one of the replicas from the warehouse table on the node doing the
relocating thought that it had 16-17k QPS, which wasn't true by any
other metric in the system. Restarting the node fixed it though.
I'm not too concerned about the bug, since I assume I just made a coding
mistake, not that anything about the approach fundamentally leads to a
random SQL table replica getting tens of thousands of QPS.

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Release note: None

@a-robinson
Contributor Author

Consider this investigated. Resulted in #28340 and #28340 after the prototype in #26608.
