TPC-C 50k #30284
My understanding is that TPC-C is actually a fairly contended workload due to the five transaction types operating across the warehouses. cc @jordanlewis and @arjunravinarayan, who may know more.
There's definitely a lot of expected contention in TPC-C. See for example the
This contention clogging up the queue and slowing everything down seems very reminiscent of the issues we experienced when running TPC-C 1k and 10k in the spring. We really haven't made much progress on graceful degradation under high levels of contention, and that seems to be showing up here. I'd wager that a more easily reproducible test setup that gets at the same problems we see here is to test TPC-C 10k with one node fewer than whatever it is we need for ~100% efficiency.
CC @nvanbenschoten on @arjunravinarayan's comment above, because he's run those tests semi-recently.
@a-robinson This is a better number than I was expecting on a first attempt. |
That isn't too surprising. Each warehouse will have 10 worker goroutines in the workload generator.
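For a concrete sense of the scale, here's a minimal sketch (assumed structure, not the actual `workload` code) of 10 worker goroutines per warehouse; at 50k warehouses that's on the order of 500k goroutines in the generator:

```go
// Sketch of per-warehouse workers; names and structure are illustrative only.
package main

import (
	"fmt"
	"sync"
)

const workersPerWarehouse = 10 // TPC-C prescribes 10 terminals per warehouse

func startWorkers(warehouses int, run func(warehouse, terminal int)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for w := 0; w < warehouses; w++ {
		for t := 0; t < workersPerWarehouse; t++ {
			wg.Add(1)
			go func(w, t int) {
				defer wg.Done()
				run(w, t) // a real worker would issue that terminal's TPC-C transactions
			}(w, t)
		}
	}
	return &wg // 50,000 warehouses => 500,000 worker goroutines
}

func main() {
	startWorkers(2, func(w, t int) {
		fmt.Printf("warehouse %d, terminal %d\n", w, t)
	}).Wait()
}
```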
I observed this as well when running a 128-node KV cluster. This is worth exploring in 2.2. cc @piyush-singh.
Was this before or after you bumped the
TPC-C does have a fair bit of contention, as @jordanlewis mentioned, but it shouldn't scale with the size of the data set. Outside of the occasional cross-warehouse transaction (1% of newOrders), all operations should be scoped to a single warehouse, so this is surprising to me. You could disable these cross-warehouse transactions entirely and see if that makes much of a difference. Did you try out any request tracing to see where long-running transactions were spending most of their time? With a large enough VM, you should be able to point a cluster of this size at Jaeger and get a lot of valuable insight. One thing I'd be interested in looking at is the
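For readers unfamiliar with where that ~1% comes from, here's a hedged sketch of the TPC-C rule being discussed (names and structure are illustrative, not the actual workload code): per the spec, each new-order item is supplied by the home warehouse 99% of the time and by a randomly chosen remote warehouse about 1% of the time.

```go
// Illustrative sketch of TPC-C's remote-warehouse rule for new-order items.
package main

import (
	"fmt"
	"math/rand"
)

// supplyWarehouse picks the supplying warehouse for one order line:
// 99% of the time the home warehouse, 1% of the time a random remote one.
func supplyWarehouse(rng *rand.Rand, home, numWarehouses int) int {
	if numWarehouses == 1 || rng.Intn(100) != 0 {
		return home // the common, fully warehouse-local case
	}
	for {
		if w := rng.Intn(numWarehouses); w != home {
			return w // the ~1% cross-warehouse case that introduces contention
		}
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	remote := 0
	for i := 0; i < 1000000; i++ {
		if supplyWarehouse(rng, 0, 50000) != 0 {
			remote++
		}
	}
	fmt.Printf("remote order lines: %d of 1,000,000 (~1%%)\n", remote)
}
```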
I don't think that's a fair thing to say. 2.1 will be significantly better at degrading gracefully under high levels of contention due to #25014.
I opened #30287 for this. It sounds like it'll be worked on for 2.2.
Changing the cache size had no effect. I'm fairly confident that #30163 will have resolved it.
Just the slow query trace logging (
Will do, thanks.
I'm testing today on a bigger cluster based on feedback from yesterday, and while the pMax is still up there, I'm getting some otherwise very successful runs. The longest I've done so far was this 10 minute run:
The first two 5-minute runs I did got 90% and 91% efficiency with around a 10s p99, but since then everything has been above 94%. This does back up the thought that the contention problems on the 120-node cluster were just from general overload, suggesting that while we may not scale perfectly linearly (since we can do TPC-C 10k with 24 nodes or fewer), we weren't hitting some sort of scalability cliff. This is with 135 nodes. I was going to try 150 nodes as discussed with @nvanbenschoten, but we didn't have enough GCE CPU quota for that, and apparently we didn't need it. There are still a good number of transaction restarts, and it's not clear that #30163 was as successful as expected. The NotLeaseholderErrors are decreasing over time, but fairly slowly. I'll keep an eye on them.
This is awesome news!!
The
Remaining issues to dig into after yesterday:
This involves some slightly awkward refactoring of the log.EveryN code into a different package because the tracing package can't use the log package. Resolves the massive log spam I saw when my tpc-c 50k cluster in cockroachdb#30284 caused my jaeger node to OOM, making for a lot of complaints from the zipkin library about its backlog being too long. Release note: None
30678: tracing: Rate limit logs from zipkin collector library r=a-robinson a=a-robinson

This involves some slightly awkward refactoring of the log.EveryN code into a different package because the tracing package can't use the log package. Resolves the massive log spam I saw when my tpc-c 50k cluster in #30284 caused my jaeger node to OOM, making for a lot of complaints from the zipkin library about its backlog being too long.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
30807: workload: Optimize ticking of histograms for large numbers of workers r=a-robinson a=a-robinson

**workload: Avoid defer in NamedHistogram.tick**

Minor, but this was showing up on CPU profiles taking a little over 1% of CPU. Given that the histogram ticking is CPU-bound for large tpc-c invocations, saving 1% for basically zero effort seems worth it.

**workload: Use sync.Pool for hdr histograms**

The allocation of these histograms was taking about 20% of CPU on locally simulated tpc-c 50k runs (not even including all the extra work it imposed on the garbage collector). This drops newHistogram down to 7%, 6% of which is from the memory clearing done by hdrhistogram.Reset(). This noticeably cuts down on the number of output ticks that are skipped by tpc-c 50k on my laptop, but doesn't totally eliminate them yet.

**workload: Parallelize ticking of different histograms**

This avoided skipped ticks when running on my laptop (with all actual SQL requests mocked out with a time.Sleep(25ms)). Before this, ticks were only being printed every 3 or 4 seconds instead of every second. This will also only scale so far and will probably need to be further optimized in the future, but it appears to be good enough for 50k warehouses, so it's good enough for now. There's still some funniness going on at the start of runs, though, with the first second or two being skipped and with some elevated latencies about 10 seconds in (even though no real requests are being sent to a database in this local testing).

**workload: Call callback without mutex held in NamedHistogram.tick**

**workload: Optimize locking in Histograms.Get**

Also avoid including locking in the timing. This has a very noticeable effect on the tail latencies in my local 50k worker testing with the SQL queries stubbed out.

---

Touches #30284 (comment)

30833: bump cockroach-go and pgx dependencies r=RaduBerinde a=RaduBerinde

This change is useful for some experiments with pgx. These packages only affect testing code.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
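To make the sync.Pool change above concrete, here's a rough sketch of the pooling approach, assuming the codahale/hdrhistogram package suggested by the hdrhistogram.Reset() reference; the bounds and helper names are made up, and the real workload code is structured differently:

```go
// Sketch: reuse hdrhistogram allocations across ticks instead of allocating
// (and zeroing) a fresh histogram every time.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/codahale/hdrhistogram"
)

var histPool = sync.Pool{
	New: func() interface{} {
		// Bounds are illustrative; the real generator picks its own.
		return hdrhistogram.New(1, int64(10*time.Second), 1)
	},
}

// newHistogram hands out a pooled histogram. Reset's memory clearing is the
// remaining cost mentioned in the PR description above.
func newHistogram() *hdrhistogram.Histogram {
	h := histPool.Get().(*hdrhistogram.Histogram)
	h.Reset()
	return h
}

func putHistogram(h *hdrhistogram.Histogram) { histPool.Put(h) }

func main() {
	h := newHistogram()
	_ = h.RecordValue(int64(5 * time.Millisecond))
	fmt.Println("p99:", time.Duration(h.ValueAtQuantile(99)))
	putHistogram(h)
}
```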
Yesterday I tried to get tpc-c 50k working on 135 VMs without partitioning. It was not successful. As in the original experiments without partitioning, the problem was that one node would always get overloaded and backed up before the others, ending up with tens of thousands of goroutines blocked in the command queue and contention queue, while all the other nodes had no more than a few thousand goroutines. The backed-up node was often not the one with the most qps as recorded by our average-qps-per-store metric, so there's likely more to understand there. However, I suspect such further understanding can be done on a much smaller cluster. For what it's worth, I also suspect that this is a scenario where, if we really wanted to, we could reach the desired throughput by just adding more nodes to avoid any node getting so overloaded.

This run allowed me to finish off the checkboxes from my previous comment, and it surfaced a few other issues worth resolving: #31135, #31147, #31134.

@awoods187 I plan on tabling this for a while (or completely closing it) as we work on other things like load-based splitting. Let me know if it needs any specific action.
And for what it's worth, load-based splitting would have helped in my test yesterday. At one point one of the ranges in tpcc.district was very hot due to a lot of contention on it, and I actually chose to manually split it to better distribute the load/contention.
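For reference, a manual split like that could look roughly like the following sketch; the connection string and split point are hypothetical, and the actual commands run may have differed:

```go
// Sketch of manually splitting a hot range in tpcc.district via
// ALTER TABLE ... SPLIT AT over the Postgres wire protocol.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Hypothetical connection string for an insecure local node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// district's primary key begins with the warehouse ID, so splitting at a
	// warehouse ID carves the hot range in two (25000 is a made-up split point).
	if _, err := db.Exec(`ALTER TABLE tpcc.district SPLIT AT VALUES (25000)`); err != nil {
		log.Fatal(err)
	}
}
```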
I realized today that I made a big mistake in the test setup a couple of days ago that likely hurt the results: I never ran

In other words, it's possible that 50k does work without partitioning on 135 VMs and that I just screwed up the test. I don't plan on revisiting it soon, though, as I still think that focusing on smaller clusters (5k to 20k warehouses) is a more effective use of machine resources.
That's exciting to hear that perf may be even better, and I concur with your comment about not needing to re-test it and focusing on smaller workloads instead.
It might be worth running two small 5k clusters, one with and one without the
Last week I tried running TPC-C 50k on 120 GCE VMs. The cluster was able to consistently hit 80% efficiency, or just over 500k tpmC. This isn’t terrible, but I certainly wouldn’t call it a success, especially given the large tail latencies. Below are notes on what I did, problems I ran into, and areas for additional investigation/improvement. I'll follow up with additional testing soon.
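As a back-of-the-envelope check on that number (assuming the usual TPC-C ceiling of 12.86 tpmC per warehouse):

```go
// Quick arithmetic behind "80% efficiency, or just over 500k tpmC".
package main

import "fmt"

func main() {
	const warehouses = 50000
	const maxTpmCPerWarehouse = 12.86 // theoretical TPC-C maximum per warehouse
	maxTpmC := warehouses * maxTpmCPerWarehouse
	fmt.Printf("theoretical max: %.0f tpmC\n", maxTpmC)       // ≈ 643,000
	fmt.Printf("80%% of max:      %.0f tpmC\n", 0.80*maxTpmC) // ≈ 514,400, i.e. just over 500k
}
```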
… `--racks=40` to avoid the influence of rebalancing on the experiment.

`ulimit -n 50000` was needed before being able to run the workload generator from the 121st VM (instead of my normal `ulimit 10000`). I'm not familiar enough with our tpc-c implementation to know whether this is expected, but it certainly isn't unreasonable given the number of warehouses.

… `NotLeaseHolderError`), but the code isn't structured very well for it and I doubt it'd really help performance.

I need to do the testing again with more emphasis on whether there's a resource bottleneck (are the CPUs pegged, how much I/O is being done, etc.), and if so why. But before doing that I'd like to understand the contention a little better. I was under the impression that tpc-c was a mostly uncontended workload, so I clearly need to get a little more familiar with it.
I’ve included a couple of log files with the slow query traces below [2] [3]. Do we have a visualization tool for these? It might make a nice Free Friday project given how much the different spans get interleaved together in the log output.
[1]
[2] cockroach5.log
[3] cockroach6.log