roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #55817
This looks a lot like what we've been seeing recently in #55599 on the release-20.1 branch. We see that the node was OOM killed:
We see from the logs that the system has 14GiB of memory available, and so this anon-rss of 14043276kB is an issue:
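To put that anon-rss figure in context, converting it to GiB (a quick sketch; the 14 GiB and 14043276 kB numbers are taken from the log lines referenced above) shows it is essentially the entire machine:

```go
package main

import "fmt"

func main() {
	const anonRSSKB = 14043276            // anon-rss reported by the OOM killer, in kB
	const availableGiB = 14.0             // memory reported available on the system
	gib := float64(anonRSSKB) / (1 << 20) // kB -> GiB (1 GiB = 1024 * 1024 kB)
	fmt.Printf("anon-rss = %.2f GiB of %.1f GiB available\n", gib, availableGiB)
	// Output: anon-rss = 13.39 GiB of 14.0 GiB available
}
```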
But strangely, we don't see this much memory in use in the memory profiles (taken 2 minutes before the OOM), the memory stats (last captured 1 minute before the OOM), or the runtime stats log (last reported about 10 seconds before the OOM). In the memory stats, we see about 1.6 GB of memory allocated (
So within those 10 seconds, a large amount of heap memory must have been allocated to push us over the 14 GiB limit. One thing that's pretty apparent from the logs is that the cluster is not particularly happy during this run. We see many requests take minutes to complete, and it's not clear why. Most of the related log lines come from within the DistSender and from contention handling, so one possibility is that transactions deadlocked somewhere. But the workload logs do not agree with this, as throughput never stalls during the lifetime of the workload. So it's not clear why the cluster was so unhappy. One unsubstantiated theory for why this may lead to an OOM is that we seem to be very loud in the logs once these slow requests complete. We see that the node has 5413 active goroutines a few seconds before the OOM. Perhaps the requests all completed within some short interval and all performed verbose logging in response.
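For context on why the last runtime stats report could miss the spike: those stats are point-in-time samples taken on a fixed interval. A minimal sketch of a sampler in that style (not CockroachDB's actual stats logger) illustrates the gap: anything allocated between two ticks is invisible until the next tick, and the OOM killer can fire first.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// Periodically sample Go heap statistics, roughly the way a runtime
// stats logger would. An allocation burst that starts and ends between
// two ticks never shows up in the last reported sample, which is
// consistent with seeing only ~1.6 GB allocated ten seconds before the
// kill while anon-rss had already grown to ~13.4 GiB.
func main() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("heap alloc: %.2f GiB, sys: %.2f GiB, goroutines: %d",
			float64(ms.HeapAlloc)/(1<<30), float64(ms.Sys)/(1<<30), runtime.NumGoroutine())
	}
}
```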
The next thing to look at is the goroutine dumps, to determine where goroutines were when everything was unhappy.
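For anyone following along, a plain-text goroutine dump can be captured with the standard library's pprof package. This is a minimal sketch of the underlying mechanism, not necessarily the exact code path that produced the dumps in the test artifacts:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// Write a text goroutine dump: one stack per goroutine, which makes it
// easy to see where thousands of goroutines were parked when things
// got slow.
func main() {
	f, err := os.Create("goroutine.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// debug=2 prints the full stack for every goroutine.
	if err := pprof.Lookup("goroutine").WriteTo(f, 2); err != nil {
		panic(err)
	}
}
```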
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@93b4f0405144660ac8dd4f24f1a588c09dcc3814:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@7f10df809db5076075b6ec63bc744b62109ee459:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@0884565b5005096ad7ac713a89f4f6d70a4e2406:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@36822fee607976556f265c683da0eaffcfdfec8b:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@36822fee607976556f265c683da0eaffcfdfec8b:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@4aabe9625bcaf74d6c25bfc3c88cfe7662cb2d80:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@abd8a5e802a06822843870e4a358aa3cd789a3c9:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@8729c06ed8e3baba67ab5651588b00e015e696ee:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@17c55e166b24fb7a6e27c81fcd68b05c7a9b9849:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@a3063889b89770d9ecca6d7b83f19998de5bfc7a:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@d9cf6cc8d0bc86468df889755426bafeadea7e36:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@745f1a1f5a603fb03357790e416e2560aabcc4c3:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
No, I have not yet been able to look into it. But it seems both #55688 and this issue are now failing because of some other problem that causes a crash.
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@a40588bf7114e16bf7e54f3b8dc64788a04225bc:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
Looking at the latest failure in #55817 (comment), logs/3.unredacted/cockroach.log shows this happening thousands of times.
Then there are a couple hundred of these.
There are also a few.
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@4fced82f7e6dad8e3ed67ea2154582f3c4188340:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@c15afb605a18772689caf46c3dd74c4aca33badd:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@89dd2d14c29787378c434ed54937757ef5d9877c:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@0de15384d560d1b4f09315d88204b6b2d7dfb32c:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@82e9f59f73098e3f6d5a0684526216df456df690:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@2308ed0493a3f0b77f1e0412a6e18a4ebe0fb7d2:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Fixes #48255. Fixes #53443. Fixes #54258. Fixes #54570. Fixes #55599. Fixes #55688. Fixes #55817. Fixes #55939. Fixes #56996. Fixes #57062. Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`.

In #55688 (comment), we saw that the failures to create load generators in tpccbench were due to long-running SCATTER operations. These operations weren't stuck, but were very slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In hindsight, this should have been expected, as scatter has the potential to rebalance data and was being run on datasets on the order of 100s of GBs or even TBs in size. But this alone did not explain why we used to see this issue infrequently and only recently began seeing it regularly.

We determined that the most likely reason this has recently gotten worse is #56942. That PR fixed a race condition in tpcc's `scatterRanges` function which often resulted in 9 scatters of the `warehouse` table instead of 1 scatter of each table in the database. So before this PR, we were often (but not always, due to the racy nature of the bug) avoiding the scatter on all but the dataset's smallest table. After this PR, we were always scattering all 9 tables in the dataset, leading to much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in tpccbench. Scattering on each search iteration was a misguided decision. It wasn't needed because we already scatter once during dataset import (with a higher `kv.snapshot_rebalance.max_rate`). It was also disruptive, as it had the potential to slow down the test significantly and cause issues like the one we are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing 6 out of 10 times to succeeding 10 out of 10 times. This change appears to have no impact on performance.

Co-authored-by: Nathan VanBenschoten <[email protected]>
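For readers unfamiliar with the kind of race fixed by #56942 (described above as scattering one table repeatedly instead of each table once), a common shape for such a bug in Go is capturing the loop variable in concurrently launched goroutines. This is an illustrative sketch, not the actual tpcc code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// The nine TPC-C tables referenced in the PR description.
	tables := []string{
		"warehouse", "district", "customer", "history", "order",
		"new_order", "item", "stock", "order_line",
	}

	var wg sync.WaitGroup
	for _, table := range tables {
		wg.Add(1)
		// Buggy (with Go versions before 1.22): every goroutine reads the
		// shared loop variable `table`, whose value depends on when the
		// goroutine happens to run, so the same table can be "scattered"
		// several times while other tables are skipped entirely.
		go func() {
			defer wg.Done()
			fmt.Printf("ALTER TABLE %s SCATTER\n", table)
		}()
		// Fix: shadow the variable (`table := table`) or pass it as an
		// argument so each goroutine scatters its own table exactly once.
	}
	wg.Wait()
}
```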
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@d67a35edddabcdd18954196a5e20bfd2a55a27e4:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:
#55688 roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [Attempt to create load generator failed] (C-test-failure O-roachtest O-robot branch-master release-blocker)
#55599 roachtest: tpccbench/nodes=6/cpu=16/multi-az failed (C-test-failure O-roachtest O-robot branch-release-20.1 release-blocker)
#55544 roachtest: tpccbench/nodes=6/cpu=16/multi-az failed (C-test-failure O-roachtest O-robot branch-release-19.2 release-blocker)
See this test on roachdash
powered by pkg/cmd/internal/issues