TPC-C 50k #30284

Closed
a-robinson opened this issue Sep 17, 2018 · 16 comments

@a-robinson
Contributor

Last week I tried running TPC-C 50k on 120 GCE VMs. The cluster was able to consistently hit 80% efficiency, or just over 500k tpmC. This isn’t terrible, but I certainly wouldn’t call it a success, especially given the large tail latencies. Below are notes on what I did, problems I ran into, and areas for additional investigation/improvement. I'll follow up with additional testing soon.

  • I set up a 120 node cluster using roachprod with the n1-highcpu-16 machine type. I initially created 128 nodes by accident, then had to decommission the latter 8. It’s unlikely this had an effect on the experiments. I used partitioning with --racks=40 to avoid the influence of rebalancing on the experiment (a rough sketch of the setup commands follows this list).
  • Gossip connectivity logging was overwhelming, both in the amount and frequency of output (gossip: connectivity logging is extremely spammy in 128 node cluster #30088). It also was consuming a lot of bandwidth, which @petermattis has done much to address since then.
  • ulimit -n 50000 was needed before being able to run the workload generator from the 121st VM (instead of my normal ulimit 10000). I’m not familiar enough with our tpc-c implementation to know whether this is expected, but it certainly isn’t unreasonable given the number of warehouses.
  • The workload generator repeatedly OOM’ed after a few minutes of running on the n1-highcpu-16 machine (which has 14.4 GB of memory). I had to create a separate, larger VM for it to run on, where it ended up in a steady state of using around 20 CPUs and 15 GB of memory. I was pretty surprised that this much memory was needed, but haven’t looked into whether it’s a bug. It didn’t continue to grow after getting to around 15GB, so there doesn’t appear to be an issue with it continuously leaking memory.
  • The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.
  • The tail latencies were not good. They’d be ok for the first 10-20 seconds of full load, then things would bog down: QPS would take a hit, p50 latencies would spike to 1-2 seconds, and tail latencies would spike above 10 seconds (example numbers below [1]).
  • There were a LOT of “not leaseholder” errors. Low thousands per second, IIRC (I don’t have a screenshot saved). I followed up on this separately on a smaller cluster and tracked it down to kv: Don't evict from leaseholder cache on context cancellations #30163. I haven’t retested with 50k warehouses since the fix, only 10k. While I think it will help, I don’t believe it was the real limiting factor. I also think it’s unfortunate that we don’t populate the leaseholder cache after a successful request if we don’t already have an entry for the range in the cache (we only do so after a NotLeaseHolderError), but the code isn’t structured very well for it and I doubt it’d really help performance.
  • What appeared to be a bigger problem was contention. I turned on slow query tracing for a little while, and the main offenders for pretty much all the slow queries were waiting on overlapping commands in the command queue, waiting in the contention queue / pushing other transactions, and transaction restarts (i.e. RETRY_SERIALIZABLE errors). Transaction restarts are incredibly costly when the first two problems mean that each attempt at a transaction is already taking upwards of 10 seconds, and they also make contention worse for other transactions.
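
For reference, here is a rough sketch of the setup commands described in the list above. The cluster name, node counts, and exact flag spellings are reconstructions/assumptions rather than a copy of the actual invocations:

```bash
# Create the cluster (name and machine-type flag spelling are assumptions).
roachprod create $USER-tpcc50k -n 121 --gce-machine-type=n1-highcpu-16

# On the load generator VM, raise the file descriptor limit first; the
# default of 10000 isn't enough for 50k warehouses.
ulimit -n 50000

# Run the workload with rack-based partitioning so replicas stay pinned
# and rebalancing doesn't influence the experiment (pgurls elided).
./workload run tpcc --warehouses=50000 --racks=40 <pgurls...>
```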

I need to do the testing again with more emphasis on whether there’s a resource bottleneck (are the CPUs pegged, how much I/O is being done, etc), and if so why. But before doing that I’d like to understand the contention a little better. I was under the impression that tpc-c was a mostly uncontended workload, so I clearly need to get a little more familiar with it.

I’ve included a couple log files with the slow query traces below [2] [3]. Do we have a visualization tool for these? It might make a nice free Friday project given how much the different spans get interleaved together in the log output.

[1]

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  303.2s   515552.3  80.2%   3669.2   1946.2   7247.8  11811.2  28991.0 103079.2

[2] cockroach5.log
[3] cockroach6.log

a-robinson added the C-performance and A-coreperf labels Sep 17, 2018
a-robinson self-assigned this Sep 17, 2018
@awoods187
Contributor

awoods187 commented Sep 17, 2018

My understanding is that TPC-C actually has a fair amount of contention due to the five transaction types operating across the warehouses. cc @jordanlewis and @arjunravinarayan, who may know more.

@jordanlewis
Member

There's definitely a lot of expected contention in TPC-C. See for example the warehouse.w_ytd field, and the district fields d_ytd and d_next_o_id - these need to be read and written by lots of concurrent transactions.

@rjnn
Contributor

rjnn commented Sep 17, 2018

This contention clogging up the queues and slowing everything down seems very reminiscent of the issues we experienced when running TPC-C 1k and 10k in the spring. We really haven't made much progress on graceful degradation under high levels of contention, and that seems to be showing up here. I'd wager that a more easily reproducible test setup that surfaces the same problems we see here is TPC-C 10k with one node fewer than whatever we need for ~100% efficiency.

@awoods187
Contributor

CC @nvanbenschoten on @arjunravinarayan's comment above because he's run those tests semi-recently.

@petermattis
Collaborator

petermattis commented Sep 17, 2018

@a-robinson This is a better number than I was expecting on a first attempt.

@nvanbenschoten
Member

ulimit -n 50000 was needed before being able to run the workload generator from the 121st VM (instead of my normal ulimit 10000). I’m not familiar enough with our tpc-c implementation to know whether this is expected, but it certainly isn’t unreasonable given the number of warehouses.

That isn't too surprising. Each warehouse will have 10 worker goroutines in the workload generator.

The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.

I observed this as well when running a 128 node KV cluster. This is worth exploring in 2.2. cc. @piyush-singh.

There were a LOT of “not leaseholder” errors. Low thousands per second, IIRC (I don’t have a screenshot saved). I followed up on this separately on a smaller cluster and tracked it down to #30163. I haven’t retested with 50k warehouses since the fix, only 10k. While I think it will help, I don’t believe it was the real limiting factor. I also think it’s unfortunate that we don’t populate the leaseholder cache after a successful request if we don’t already have an entry for the range in the cache (we only do so after a NotLeaseHolderError), but the code isn’t structured very well for it and I doubt it’d really help performance.

Was this before or after you bumped the kv.range_descriptor_cache.size (which also controls the leaseholder cache size)?
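
For context, bumping that cache size is a cluster setting change along these lines (the value shown is purely illustrative):

```bash
# kv.range_descriptor_cache.size also bounds the leaseholder cache;
# the value here is only an illustration.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING kv.range_descriptor_cache.size = 10000000;"
```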

What appeared to be a bigger problem was contention. I turned on slow query tracing for a little while, and the main offenders for pretty much all the slow queries were waiting on overlapping commands in the command queue, waiting in the contention queue / pushing other transactions, and transaction restarts (i.e. RETRY_SERIALIZABLE errors). Transaction restarts are incredibly costly when the first two problems mean that each attempt at a transaction is already taking upwards of 10 seconds, and they also make contention worse for other transactions.

TPC-C does have a fair bit of contention, as @jordanlewis mentioned, but it shouldn't scale with the size of the data set. Outside of the occasional cross-warehouse transaction (1% of newOrder transactions), all operations should be scoped to a single warehouse, so this is surprising to me. You could disable these cross-warehouse transactions entirely and see if that makes much of a difference.

Did you try out any request tracing to see where long running transactions were spending most of their time? With a large enough VM, you should be able to point this size cluster at Jaeger and get a lot of valuable insight.
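
As a minimal sketch (image name, ports, and flag spellings vary by Jaeger version and are assumptions here), pointing the cluster at Jaeger via its zipkin-compatible endpoint could look like:

```bash
# Run Jaeger all-in-one on a dedicated VM with the zipkin-compatible
# collector enabled (flag spelling depends on the Jaeger version).
docker run -d --name jaeger -p 16686:16686 -p 9411:9411 \
  jaegertracing/all-in-one --collector.zipkin.http-port=9411

# Ship CockroachDB trace spans to that endpoint.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING trace.zipkin.collector = '<jaeger-vm>:9411';"
```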

One thing I'd be interested in looking at is the intentresolver.async.throttled metric when this cluster is running. If we're having trouble cleaning up intents then we'd expect to see more contention and a slowdown across the board.
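
A quick way to spot-check that metric is to scrape any node's Prometheus endpoint (node address is a placeholder; the exported name has the dots replaced with underscores):

```bash
# Each node serves metrics in Prometheus format at /_status/vars.
curl -s http://<node>:8080/_status/vars | grep intentresolver_async_throttled
```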

We really haven't made much progress on graceful degradation under high levels of contention

I don't think that's a fair thing to say. 2.1 will be significantly better at degrading gracefully under high levels of contention due to #25014.

@a-robinson
Contributor Author

The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.

I observed this as well when running a 128 node KV cluster. This is worth exploring in 2.2. cc. @piyush-singh.

I opened #30287 for this. It sounds like it'll be worked on for 2.2.

Was this before or after you bumped the kv.range_descriptor_cache.size (which also controls the leaseholder cache size)?

Changing the cache size had no effect. I'm fairly confident that #30163 will have resolved it.

Did you try out any request tracing to see where long running transactions were spending most of their time? With a large enough VM, you should be able to point this size cluster at Jaeger and get a lot of valuable insight.

Just the slow query trace logging (sql.trace.txn.enable_threshold), which gives the same sort of info but in a much less easily consumable format. Seeing the spans in a nicer UI might reveal something that the raw traces themselves were obfuscating.
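
For reference, the threshold is enabled with something like the following (the 1s value is just an example):

```bash
# Log a trace for any transaction that runs longer than the threshold;
# the threshold value here is only an example.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING sql.trace.txn.enable_threshold = '1s';"
```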

One thing I'd be interested in looking at is the intentresolver.async.throttled metric when this cluster is running. If we're having trouble cleaning up intents then we'd expect to see more contention and a slowdown across the board.

Will do, thanks.

@a-robinson
Contributor Author

I'm testing today on a bigger cluster based on feedback from yesterday, and while the pMax is still up there, I'm getting some otherwise very successful runs. The longest I've done so far was this 10 minute run:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.5s   631851.3  98.3%    510.8    402.7    906.0   1140.9   2281.7 103079.2

The first two 5-minute runs I did got 90% and 91% efficiencies with around a 10s p99, but since then everything has been above 94%. This does back up the thought that the contention problems on the 120 node cluster were just from general overload, suggesting that while we may not scale perfectly linearly (since we can do TPC-C 10k with 24 nodes or fewer), we weren't hitting some sort of scalability cliff.

This is with 135 nodes. I was going to try 150 nodes as discussed with @nvanbenschoten, but we didn't have enough GCE CPU quota for that, and apparently we didn't need it.

There are still a good number of transaction restarts, and it's not clear that #30163 was as successful as expected. The NotLeaseholderErrors are decreasing over time, but fairly slowly. I'll keep an eye on them.

@awoods187
Contributor

This is awesome news!!

@a-robinson
Contributor Author

The intentresolver.async.throttled metric is indeed elevated, at a rate of around 1000 throttled async intent resolutions per second during steady-state 50k load, even on the 135 node cluster that is doing reasonably well. Its ramp-up correlates nicely with the general pMax increase during the ramp-up period. Good call, @nvanbenschoten.

@a-robinson
Contributor Author

a-robinson commented Sep 19, 2018

Remaining issues to dig into after yesterday:

nvanbenschoten added this to the 2.1 milestone Sep 25, 2018
craig bot pushed a commit that referenced this issue Sep 26, 2018
30678: tracing: Rate limit logs from zipkin collector library r=a-robinson a=a-robinson

This involves some slightly awkward refactoring of the log.EveryN code
into a different package because the tracing package can't use the log
package.

Resolves the massive log spam I saw when my tpc-c 50k cluster in #30284
caused my jaeger node to OOM, making for a lot of complaints from the
zipkin library about its backlog being too long.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
craig bot pushed a commit that referenced this issue Oct 1, 2018
30807: workload: Optimize ticking of histograms for large numbers of workers r=a-robinson a=a-robinson

**workload: Avoid defer in NamedHistogram.tick**

Minor, but this was showing up on CPU profiles taking a little over 1%
of CPU. Given that the histogram ticking is CPU-bound for large tpc-c
invocations, saving 1% for basically zero effort seems worth it.

**workload: Use sync.Pool for hdr histograms**

The allocation of these histograms was taking about 20% of CPU on
locally simulated tpc-c 50k runs (not even including all the extra work
it imposed on the garbage collector). This drops newHistogram down to
7%, 6% of which is from the memory clearing done by
hdrhistogram.Reset().

This noticeably cuts down on the number of output ticks that are skipped
by tpc-c 50k on my laptop, but doesn't totally eliminate them yet.

**workload: Parallelize ticking of different histograms**

This avoided skipped ticks when running on my laptop (with all actual
sql requests mocked out with a time.Sleep(25ms)). Before this, ticks
were only being printed every 3 or 4 seconds instead of every second.

This will also only scale so far and will probably need to be further
optimized in the future, but it appears to be good enough for 50k
warehouses, so it's good enough for now.

There's still some funniness going on at the start of runs, though, with
the first second or two being skipped and with some elevated latencies
about 10 seconds in (even though no real requests are being sent to a
database in this local testing).

**workload: Call callback without mutex held in NamedHistogram.tick**

**workload: Optimize locking in Histograms.Get**

Also avoid including locking in the timing. This has a very noticeable
effect on the tail latencies in my local 50k worker testing with the SQL
queries stubbed out.

---

Touches #30284 (comment)

30833: bump cockroach-go and pgx dependencies r=RaduBerinde a=RaduBerinde

This change is useful for some experiments with pgx. These packages
only affect testing code.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
petermattis removed this from the 2.1 milestone Oct 5, 2018
@a-robinson
Contributor Author

Yesterday I tried to get tpc-c 50k working on 135 VMs without partitioning. It was not successful. As in the original experiments without partitioning, the problem was that one node would always get overloaded and backed up before the others, ending up with tens of thousands of goroutines blocked in the command queue and contention queue while all the other nodes had no more than a few thousand goroutines. The backed-up node was often not the one with the most qps as recorded by our average qps per store metric, so there's likely more to understand there.

However, I suspect such further understanding can be done on a much smaller cluster. For what it's worth, I also suspect that this is a scenario where, if we really wanted to, we could reach the desired throughput by just adding more nodes to avoid any node getting so overloaded.

This run allowed me to finish off the checkboxes from my previous comment, and it turned up a few other issues worth resolving: #31135, #31147, #31134.

@awoods187 I plan on tabling this for a while (or completely closing it) as we work on other things like load-based splitting. Let me know if it needs any specific action.

@a-robinson
Contributor Author

And for what it's worth, load-based splitting would have helped in my test yesterday. At one point one of the ranges in tpcc.district was very hot due to a lot of contention on it, and I actually chose to manually split it to better distribute the load/contention.
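
The manual split was along these lines; the split key values below are illustrative (district's primary key is (d_w_id, d_id)), not the ones actually used:

```bash
# Manually split the hot district range at an arbitrary warehouse id.
cockroach sql --insecure -e \
  "ALTER TABLE tpcc.district SPLIT AT VALUES (25000, 1);"
```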

@a-robinson
Contributor Author

I realized today that I made a big mistake in the test setup a couple days ago that likely hurt the results. I never ran workload run tpcc with the --split flag to ensure all the ranges were properly split. I passed over that flag because I mentally associated it with partitioning, when in reality it also does a bunch of important pre-splitting even when the --partitions flag isn't specified. When running the benchmark, I did notice many ranges being much hotter than normal, especially in the district table.

In other words, it's possible that 50k does work without partitioning on 135 VMs and that I just screwed up the test. I don't plan on revisiting it soon, though, as I still think that focusing on smaller (5k to 20k warehouses) clusters is a more effective use of machine resources.

@awoods187
Contributor

That's exciting to hear that perf may be even better, and I concur with your comment about not needing to re-test it and instead focusing on smaller workloads.

@nvanbenschoten
Member

It might be worth running two small 5k clusters, one with and one without the --split flag, to get a ballpark estimate of how much missing that flag might have cost us in the 50k run.
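
Something along these lines would give that ballpark; the fixture-loading step, node addresses, and durations are assumptions about the setup rather than an exact recipe:

```bash
# Load a 5k warehouse fixture, then run the same workload twice and
# compare efficiency: once with pre-splitting, once without.
./workload fixtures load tpcc --warehouses=5000 \
  "postgres://root@<node>:26257?sslmode=disable"

# Run 1: with pre-splitting.
./workload run tpcc --warehouses=5000 --split --duration=10m <pgurls...>

# Run 2: without pre-splitting.
./workload run tpcc --warehouses=5000 --duration=10m <pgurls...>
```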
