TPC-C 50k #30284

Closed
a-robinson opened this issue Sep 17, 2018 · 16 comments

@a-robinson
Contributor

Last week I tried running TPC-C 50k on 120 GCE VMs. The cluster was able to consistently hit 80% efficiency, or just over 500k tpmC. This isn’t terrible, but I certainly wouldn’t call it a success, especially given the large tail latencies. Below are notes on what I did, problems I ran into, and areas for additional investigation/improvement. I'll follow up with additional testing soon.

  • I set up a 120 node cluster using roachprod with the n1-highcpu-16 machine type. I initially created 128 nodes by accident, then had to decommission the latter 8. It’s unlikely this had an effect on the experiments. I used partitioning with --racks=40 to avoid the influence of rebalancing on the experiment (a rough sketch of the setup commands follows this list).
  • Gossip connectivity logging was overwhelming, both in the amount and frequency of output (gossip: connectivity logging is extremely spammy in 128 node cluster #30088). It also was consuming a lot of bandwidth, which @petermattis has done much to address since then.
  • ulimit -n 50000 was needed before being able to run the workload generator from the 121st VM (instead of my normal ulimit 10000). I’m not familiar enough with our tpc-c implementation to know whether this is expected, but it certainly isn’t unreasonable given the number of warehouses.
  • The workload generator repeatedly OOM’ed after a few minutes of running on the n1-highcpu-16 machine (which has 14.4 GB of memory). I had to create a separate, larger VM for it to run on, where it ended up in a steady state of using around 20 CPUs and 15 GB of memory. I was pretty surprised that this much memory was needed, but haven’t looked into whether it’s a bug. It didn’t continue to grow after getting to around 15GB, so there doesn’t appear to be an issue with it continuously leaking memory.
  • The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.
  • The tail latencies were not good. They’d be ok for the first 10-20 seconds of full load, then things would bog down: QPS would take a hit, p50 latencies would spike to 1-2 seconds, and tail latencies would spike above 10 seconds (example numbers below [1]).
  • There were a LOT of “not leaseholder” errors. Low thousands per second, IIRC (I don’t have a screenshot saved). I followed up on this separately on a smaller cluster and tracked it down to kv: Don't evict from leaseholder cache on context cancellations #30163. I haven’t retested with 50k warehouses since the fix, only 10k. While I think it will help, I don’t believe it was the real limiting factor. I also think it’s unfortunate that we don’t populate the leaseholder cache after a successful request if we don’t already have an entry for the range in the cache (we only do so after a NotLeaseHolderError), but the code isn’t structured very well for it and I doubt it’d really help performance.
  • What appeared to be a bigger problem was contention. I turned on slow query tracing for a little while, and the main offenders for pretty much all the slow queries were waiting on overlapping commands in the command queue, waiting in the contention queue / pushing other transactions, and transaction restarts (i.e. RETRY_SERIALIZABLE errors). Transaction restarts are incredibly costly when the first two problems mean that each attempt at a transaction is already taking upwards of 10 seconds, and they also make contention worse for other transactions.
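
For reference, here is a rough sketch of the setup commands described in the list above. The cluster name, node counts, and exact flag spellings are reconstructions/assumptions rather than a copy of the actual invocations:

```bash
# Create the cluster (name and machine-type flag spelling are assumptions).
roachprod create $USER-tpcc50k -n 121 --gce-machine-type=n1-highcpu-16

# On the load generator VM, raise the file descriptor limit first; the
# default of 10000 isn't enough for 50k warehouses.
ulimit -n 50000

# Run the workload with rack-based partitioning so replicas stay pinned
# and rebalancing doesn't influence the experiment (pgurls elided).
./workload run tpcc --warehouses=50000 --racks=40 <pgurls...>
```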

I need to do the testing again with more emphasis on whether there’s a resource bottleneck (are the CPUs pegged, how much I/O is being done, etc), and if so why. But before doing that I’d like to understand the contention a little better. I was under the impression that tpc-c was a mostly uncontended workload, so I clearly need to get a little more familiar with it.

I’ve included a couple log files with the slow query traces below [2] [3]. Do we have a visualization tool for these? It might make a nice free Friday project given how much the different spans get interleaved together in the log output.

[1]

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  303.2s   515552.3  80.2%   3669.2   1946.2   7247.8  11811.2  28991.0 103079.2

[2] cockroach5.log
[3] cockroach6.log

a-robinson added the C-performance and A-coreperf labels Sep 17, 2018
a-robinson self-assigned this Sep 17, 2018
@awoods187
Contributor

awoods187 commented Sep 17, 2018

My understanding is that TPC-C actually has a fair amount of contention due to the five transaction types operating across the warehouses. cc @jordanlewis and @arjunravinarayan, who may know more.

@jordanlewis
Member

There's definitely a lot of expected contention in TPC-C. See for example the warehouse.w_ytd field, and the district fields d_ytd and d_next_o_id - these need to be read and written by lots of concurrent transactions.

@rjnn
Contributor

rjnn commented Sep 17, 2018

This contention clogging up the queues and slowing everything down seems very reminiscent of the issues we experienced when running TPC-C 1k and 10k in the spring. We really haven't made much progress on graceful degradation under high levels of contention, and that seems to be showing up here. I'd wager that a more easily reproducible test setup that surfaces the same problems we see here is TPC-C 10k with one node fewer than whatever we need for ~100% efficiency.

@awoods187
Contributor

CC @nvanbenschoten on @arjunravinarayan's comment above because he's run those tests semi-recently.

@petermattis
Collaborator

petermattis commented Sep 17, 2018

@a-robinson This is a better number than I was expecting on a first attempt.

@nvanbenschoten
Member

ulimit -n 50000 was needed before being able to run the workload generator from the 121st VM (instead of my normal ulimit 10000). I’m not familiar enough with our tpc-c implementation to know whether this is expected, but it certainly isn’t unreasonable given the number of warehouses.

That isn't too surprising. Each warehouse will have 10 worker goroutines in the workload generator.

The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.

I observed this as well when running a 128 node KV cluster. This is worth exploring in 2.2. cc. @piyush-singh.

There were a LOT of “not leaseholder” errors. Low thousands per second, IIRC (I don’t have a screenshot saved). I followed up on this separately on a smaller cluster and tracked it down to #30163. I haven’t retested with 50k warehouses since the fix, only 10k. While I think it will help, I don’t believe it was the real limiting factor. I also think it’s unfortunate that we don’t populate the leaseholder cache after a successful request if we don’t already have an entry for the range in the cache (we only do so after a NotLeaseHolderError), but the code isn’t structured very well for it and I doubt it’d really help performance.

Was this before or after you bumped the kv.range_descriptor_cache.size (which also controls the leaseholder cache size)?
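
For context, bumping that cache size is a cluster setting change along these lines (the value shown is purely illustrative):

```bash
# kv.range_descriptor_cache.size also bounds the leaseholder cache;
# the value here is only an illustration.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING kv.range_descriptor_cache.size = 10000000;"
```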

What appeared to be a bigger problem was contention. I turned on slow query tracing for a little while, and the main offenders for pretty much all the slow queries were waiting on overlapping commands in the command queue, waiting in the contention queue / pushing other transactions, and transaction restarts (i.e. RETRY_SERIALIZABLE errors). Transaction restarts are incredibly costly when the first two problems mean that each attempt at a transaction is already taking upwards of 10 seconds, and they also make contention worse for other transactions.

TPC-C does have a fair bit of contention, as @jordanlewis mentioned, but it shouldn't scale with the size of the data set. Outside of the occasional cross-warehouse transaction (1% of newOrder transactions), all operations should be scoped to a single warehouse, so this is surprising to me. You could disable these cross-warehouse transactions entirely and see if that makes much of a difference.

Did you try out any request tracing to see where long running transactions were spending most of their time? With a large enough VM, you should be able to point this size cluster at Jaeger and get a lot of valuable insight.
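
As a minimal sketch (image name, ports, and flag spellings vary by Jaeger version and are assumptions here), pointing the cluster at Jaeger via its zipkin-compatible endpoint could look like:

```bash
# Run Jaeger all-in-one on a dedicated VM with the zipkin-compatible
# collector enabled (flag spelling depends on the Jaeger version).
docker run -d --name jaeger -p 16686:16686 -p 9411:9411 \
  jaegertracing/all-in-one --collector.zipkin.http-port=9411

# Ship CockroachDB trace spans to that endpoint.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING trace.zipkin.collector = '<jaeger-vm>:9411';"
```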

One thing I'd be interested in looking at is the intentresolver.async.throttled metric when this cluster is running. If we're having trouble cleaning up intents then we'd expect to see more contention and a slowdown across the board.
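
A quick way to spot-check that metric is to scrape any node's Prometheus endpoint (node address is a placeholder; the exported name has the dots replaced with underscores):

```bash
# Each node serves metrics in Prometheus format at /_status/vars.
curl -s http://<node>:8080/_status/vars | grep intentresolver_async_throttled
```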

We really haven't made much progress on graceful degradation under high levels of contention

I don't think that's a fair thing to say. 2.1 will be significantly better at degrading gracefully under high levels of contention due to #25014.

@a-robinson
Contributor Author

The UI was quite unwieldy with this many nodes, even when load wasn’t running against the cluster. It ran really, really slowly. Chrome was not happy.

I observed this as well when running a 128 node KV cluster. This is worth exploring in 2.2. cc. @piyush-singh.

I opened #30287 for this. It sounds like it'll be worked on for 2.2.

Was this before or after you bumped the kv.range_descriptor_cache.size (which also controls the leaseholder cache size)?

Changing the cache size had no effect. I'm fairly confident that #30163 will have resolved it.

Did you try out any request tracing to see where long running transactions were spending most of their time? With a large enough VM, you should be able to point this size cluster at Jaeger and get a lot of valuable insight.

Just the slow query trace logging (sql.trace.txn.enable_threshold), which gives the same sort of info but in a much less easily consumable format. Seeing the spans in a nicer UI might reveal something that the raw traces themselves were obfuscating.
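
For reference, the threshold is enabled with something like the following (the 1s value is just an example):

```bash
# Log a trace for any transaction that runs longer than the threshold;
# the threshold value here is only an example.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING sql.trace.txn.enable_threshold = '1s';"
```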

One thing I'd be interested in looking at is the intentresolver.async.throttled metric when this cluster is running. If we're having trouble cleaning up intents then we'd expect to see more contention and a slowdown across the board.

Will do, thanks.

@a-robinson
Contributor Author

I'm testing today on a bigger cluster based on feedback from yesterday, and while the pMax is still up there, I'm getting some otherwise very successful runs. The longest I've done so far was this 10 minute run:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.5s   631851.3  98.3%    510.8    402.7    906.0   1140.9   2281.7 103079.2

The first two 5-minute runs I did got 90% and 91% efficiencies with around a 10s p99, but since then everything has been above 94%. This does back up the thought that the contention problems on the 120 node cluster were just from general overload, suggesting that while we may not scale perfectly linearly (since we can do TPC-C 10k with 24 nodes or fewer), we weren't hitting some sort of scalability cliff.

This is with 135 nodes. I was going to try 150 nodes as discussed with @nvanbenschoten, but we didn't have enough GCE CPU quota for that, and apparently we didn't need it.

There are still a good number of transaction restarts, and it's not clear that #30163 was as successful as expected. The NotLeaseholderErrors are decreasing over time, but fairly slowly. I'll keep an eye on them.

@awoods187
Contributor

This is awesome news!!

@a-robinson
Contributor Author

The intentresolver.async.throttled metric is indeed elevated, at a rate of around 1000 throttled async intent resolutions per second during steady-state 50k load, even on the 135 node cluster that is doing reasonably well. Its ramp-up correlates nicely with the general pMax increase during the ramp-up period. Good call, @nvanbenschoten.

@a-robinson
Contributor Author

a-robinson commented Sep 19, 2018

Remaining issues to dig into after yesterday:

nvanbenschoten added this to the 2.1 milestone Sep 25, 2018
craig bot pushed a commit that referenced this issue Sep 26, 2018
30678: tracing: Rate limit logs from zipkin collector library r=a-robinson a=a-robinson

This involves some slightly awkward refactoring of the log.EveryN code
into a different package because the tracing package can't use the log
package.

Resolves the massive log spam I saw when my tpc-c 50k cluster in #30284
caused my jaeger node to OOM, making for a lot of complaints from the
zipkin library about its backlog being too long.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
craig bot pushed a commit that referenced this issue Oct 1, 2018
30807: workload: Optimize ticking of histograms for large numbers of workers r=a-robinson a=a-robinson

**workload: Avoid defer in NamedHistogram.tick**

Minor, but this was showing up on CPU profiles taking a little over 1%
of CPU. Given that the histogram ticking is CPU-bound for large tpc-c
invocations, saving 1% for basically zero effort seems worth it.

**workload: Use sync.Pool for hdr histograms**

The allocation of these histograms was taking about 20% of CPU on
locally simulated tpc-c 50k runs (not even including all the extra work
it imposed on the garbage collector). This drops newHistogram down to
7%, 6% of which is from the memory clearing done by
hdrhistogram.Reset().

This noticeably cuts down on the number of output ticks that are skipped
by tpc-c 50k on my laptop, but doesn't totally eliminate them yet.

**workload: Parallelize ticking of different histograms**

This avoided skipped ticks when running on my laptop (with all actual
sql requests mocked out with a time.Sleep(25ms)). Before this, ticks
were only being printed every 3 or 4 seconds instead of every second.

This will also only scale so far and will probably need to be further
optimized in the future, but it appears to be good enough for 50k
warehouses, so it's good enough for now.

There's still some funniness going on at the start of runs, though, with
the first second or two being skipped and with some elevated latencies
about 10 seconds in (even though no real requests are being sent to a
database in this local testing).

**workload: Call callback without mutex held in NamedHistogram.tick**

**workload: Optimize locking in Histograms.Get**

Also avoid including locking in the timing. This has a very noticeable
effect on the tail latencies in my local 50k worker testing with the SQL
queries stubbed out.

---

Touches #30284 (comment)

30833: bump cockroach-go and pgx dependencies r=RaduBerinde a=RaduBerinde

This change is useful for some experiments with pgx. These packages
only affect testing code.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
petermattis removed this from the 2.1 milestone Oct 5, 2018
@a-robinson
Contributor Author

Yesterday I tried to get tpc-c 50k working on 135 VMs without partitioning. It was not successful. As in the original experiments without partitioning, the problem was that one node would always get overloaded and backed up before the others, ending up with tens of thousands of goroutines blocked in the command queue and contention queue while all the other nodes had no more than a few thousand goroutines. The backed-up node was often not the one with the most qps as recorded by our average qps per store metric, so there's likely more to understand there.

However, I suspect such further understanding can be done on a much smaller cluster. For what it's worth, I also suspect that this is a scenario where, if we really wanted to, we could reach the desired throughput by just adding more nodes to avoid any node getting so overloaded.

This run allowed me to finish off the checkboxes from my previous comment, and it turned up a few other issues worth resolving: #31135, #31147, #31134.

@awoods187 I plan on tabling this for a while (or completely closing it) as we work on other things like load-based splitting. Let me know if it needs any specific action.

@a-robinson
Contributor Author

And for what it's worth, load-based splitting would have helped in my test yesterday. At one point one of the ranges in tpcc.district was very hot due to a lot of contention on it, and I actually chose to manually split it to better distribute the load/contention.
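
The manual split was along these lines; the split key values below are illustrative (district's primary key is (d_w_id, d_id)), not the ones actually used:

```bash
# Manually split the hot district range at an arbitrary warehouse id.
cockroach sql --insecure -e \
  "ALTER TABLE tpcc.district SPLIT AT VALUES (25000, 1);"
```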

@a-robinson
Contributor Author

I realized today that I made a big mistake in the test setup a couple days ago that likely hurt the results. I never ran workload run tpcc with the --split flag to ensure all the ranges were properly split. I passed over that flag because I mentally associated it with partitioning, when in reality it also does a bunch of important pre-splitting even when the --partitions flag isn't specified. When running the benchmark, I did notice many ranges being much hotter than normal, especially in the district table.

In other words, it's possible that 50k does work without partitioning on 135 VMs and that I just screwed up the test. I don't plan on revisiting it soon, though, as I still think that focusing on smaller (5k to 20k warehouses) clusters is a more effective use of machine resources.

@awoods187
Contributor

That's exciting to hear that perf may be even better, and I concur with your comment about not needing to re-test it and instead focusing on smaller workloads.

@nvanbenschoten
Member

It might be worth running two small 5k clusters, one with and one without the --split flag, to get a ballpark estimate of how much missing that flag might have cost us in the 50k run.
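
Something along these lines would give that ballpark; the fixture-loading step, node addresses, and durations are assumptions about the setup rather than an exact recipe:

```bash
# Load a 5k warehouse fixture, then run the same workload twice and
# compare efficiency: once with pre-splitting, once without.
./workload fixtures load tpcc --warehouses=5000 \
  "postgres://root@<node>:26257?sslmode=disable"

# Run 1: with pre-splitting.
./workload run tpcc --warehouses=5000 --split --duration=10m <pgurls...>

# Run 2: without pre-splitting.
./workload run tpcc --warehouses=5000 --duration=10m <pgurls...>
```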
