roachtest: tpccbench/nodes=3/cpu=16 failed #61973
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bdff5338ca725bf1cfddf7e3f648bbf02ab42999:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
Here's a refined view of this test's TeamCity history on master. I'm filtering for only the GCE suite, which is where it fails consistently. What happened between Feb 15th-18th?
Time for a bisection?
Here are all the changes that went in. Notably, #59992 is in there.
280ead9 Merge #60729

Looking at the more recent failure, I think there's still an OOM in there.
Here are the last logs from each node. n2's logs ended about 2m before the rest. We see periods of 0 QPS in the workload:
One thing that's a bit more reassuring, however, is that with #61777 there's basically no evidence of tracing memory usage in the heap profiles.
The fact that we're no longer efficient at 2200 warehouses is concerning. We used to be, as of #55721, and roachperf suggests we were still that efficient before tpccbench started failing around Feb 15th (see here).

This test stops and restarts the cluster for each TPC-C run; it runs the entire workload multiple times to find the point at which we can sustain a given warehouse count at 85% efficiency. We've done something to make ourselves a lot less efficient at 2200 warehouses, and because that's still our estimated max (#55721), that's the point we start our search from. Setting aside for a second that we want to pin down exactly what caused this inefficiency: because the point where we start our search (2200 warehouses) is so high, the test right away puts the cluster into overload territory. As seen above, the roachtest infrastructure itself does not play well with this kind of overload. I've filed #62010 to track that thread.
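For context, the 85% bar here is TPC-C efficiency: measured new-order throughput (tpmC) relative to the theoretical ceiling of 12.86 tpmC per warehouse. A minimal sketch of that check, assuming illustrative function names and example numbers (this is not the actual roachtest code):

```go
package main

import "fmt"

// maxTpmCPerWarehouse is the theoretical TPC-C ceiling of 12.86
// new-order transactions per minute per warehouse.
const maxTpmCPerWarehouse = 12.86

// passesEfficiency reports whether a run at the given warehouse count
// sustained at least 85% of the theoretical maximum throughput.
func passesEfficiency(warehouses int, tpmC float64) bool {
	efficiency := 100 * tpmC / (maxTpmCPerWarehouse * float64(warehouses))
	return efficiency >= 85.0
}

func main() {
	// Hypothetical numbers: ~23,500 tpmC at 2200 warehouses is ~83%
	// efficient, which would miss the 85% bar.
	fmt.Println(passesEfficiency(2200, 23500)) // false
}
```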
What I'm going to do now is to try running tpccbench at 2100 and step down as needed, to find the max warehouse count we can actually sustain at 85% efficiency. That should at least sidestep our lack of #62010. It'll also tell us exactly how much we've regressed. After that I'll try to isolate where the regression is coming from. I'd be a bit surprised if it was still due to #59992, especially given these tests are running with #61777 and now have minimal tracing memory overhead (as shown in the profiles above).
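A rough sketch of that step-down probe, assuming a hypothetical runTPCC helper standing in for a full tpccbench run (the step size of 50 is arbitrary; none of this is the real roachtest API):

```go
package main

import "fmt"

// runTPCC is a hypothetical stand-in for one full tpccbench run at a given
// warehouse count (restart cluster, import data, run the workload); it
// returns the measured efficiency as a percentage.
func runTPCC(warehouses int) (efficiencyPct float64) {
	// ... drive the cluster and workload here ...
	return 0
}

// probeDown starts at a candidate count (e.g. 2100) and steps down until a
// run sustains >= 85% efficiency, returning the highest count that passed.
func probeDown(start, step int) int {
	for w := start; w > 0; w -= step {
		if eff := runTPCC(w); eff >= 85.0 {
			return w
		}
	}
	return 0
}

func main() {
	fmt.Println("max sustainable warehouses:", probeDown(2100, 50))
}
```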
Probably also worth doing multiple runs of the experiment with different values of
2100 warehouses passes with flying colors:
Trying 2150 next. We should just drop our max warehouse count.
So 2150 warehouses is one too many.
Heh, I'm seeing the same failure mode as above, where the VM is just inoperable.
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e9387a6e5dfdad71c74ccd0a07c907632613fa3e:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
[x-post] From #62039 (comment):
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments always said this tool was meant to be used with the supervision of a CRL engineer, but didn't otherwise make the risks and downsides clear. Add some more explicit warnings which can also serve as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973. With tracing, our top-of-line TPC-C performance took a hit. Given that the TPC-C line searcher starts off at the estimated max, we're now starting off in "overloaded" territory; this makes for a very unhappy roachtest. Ideally we'd have something like #62010, or even admission control, to make this test less noisy. Until then we can start off at a lower max warehouse count. This "fix" is still not a panacea; the entire tpccbench suite as written tries to nudge the warehouse count until the efficiency is sub-85%. Unfortunately, with our current infrastructure that's a stand-in for "the point where nodes are overloaded and VMs no longer reachable". See #61974.

---

A longer-term approach to these tests could instead be as follows. We could start our search at whatever the max warehouse count is (making sure we've re-configured the max warehouses accordingly). These tests could then PASS/FAIL for that given warehouse count, and only if they FAIL, capture the extent of the regression by probing lower warehouse counts. This is in contrast to what we're doing today, where we capture how high we can go (by design risking going into overload territory, with no protections for it). Doing so lets us use this test suite to capture regressions from a given baseline, rather than hoping our roachperf dashboards capture unexpected perf improvements (if they're expected, we should update max warehouses accordingly). In the steady state, we should want the roachperf dashboards to be mostly flatlined, with step increases when we're re-upping the max warehouse count to incorporate various system-wide performance increases.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
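A minimal sketch of the longer-term approach described in #62039 above: treat the configured max warehouse count as a PASS/FAIL baseline, and only probe lower counts when the baseline fails, to quantify the regression. The runTPCC helper, checkBaseline, and the step size are illustrative assumptions, not the actual roachtest code.

```go
package main

import "fmt"

// runTPCC is a hypothetical stand-in for one full tpccbench run; it reports
// whether the run sustained >= 85% efficiency at the given warehouse count.
func runTPCC(warehouses int) (passed bool) {
	// ... restart cluster, import data, run workload, check efficiency ...
	return false
}

// checkBaseline runs once at the configured max. Only on failure does it
// probe downward to report the extent of the regression, instead of
// searching upward into overload territory.
func checkBaseline(baselineWarehouses, step int) {
	if runTPCC(baselineWarehouses) {
		fmt.Printf("PASS at baseline of %d warehouses\n", baselineWarehouses)
		return
	}
	for w := baselineWarehouses - step; w > 0; w -= step {
		if runTPCC(w) {
			fmt.Printf("FAIL: regressed from %d to %d warehouses\n", baselineWarehouses, w)
			return
		}
	}
	fmt.Println("FAIL: no warehouse count passed")
}

func main() {
	checkBaseline(2100, 50)
}
```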
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4d44ddf24153d8ef8e0a996fdbe75ac5607f9574:
Artifacts: /tpccbench/nodes=3/cpu=16
Related:
- roachtest: tpccbench/nodes=3/cpu=16 failed #61696 [C-test-failure O-roachtest O-robot branch-release-21.1 release-blocker]
- roachtest: tpccbench/nodes=3/cpu=16 failed #55802 [C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1]
See this test on roachdash
powered by pkg/cmd/internal/issues