roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58298
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@365c5504b75c9a9260365a628a5110c48312178b:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@6d49a323b52966becfe8a2c38a1a8ccdf8ee58a1:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@b93fd531b93cb010729cb73fe679cdff9388cf27:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d7bbe0060531063b9bee29f69bc4d23d41b84e3d:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
Another overload-to-death. cc @nvanbenschoten
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@cee475331ca3629b503cd2e7c7919b72c98a5ca5:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
The stacks show >400 goroutines with stacks containing
This may not be the root cause. I'm just pointing them out because I chanced upon them. The logs look just about as unhappy as they do for the other open tpccbench problems.
I'm able to reproduce this without too much difficulty, which lines up with the frequency at which we've been seeing this fail recently. I still don't have a firm grasp on what's going wrong, but I see from each failure that the cluster gets very unhappy shortly after completing its import, scattering, and then running what should be a low degree of load. One thing that did jump out from some of the goroutine dumps is that we see a rapid growth of goroutines in
I'm going to explore this more tomorrow, but I'm wondering whether there's an easy fix here - can we just run
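For anyone repeating this kind of triage, here is a minimal sketch of tallying goroutines by stack from a node's pprof endpoint, assuming the node serves the standard net/http/pprof handlers on its HTTP port. The URL, port, and the `kvserver` substring are placeholders, not details from this investigation:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Placeholder endpoint; point this at a real node's HTTP port.
	const url = "http://localhost:8080/debug/pprof/goroutine?debug=1"
	needle := "kvserver" // placeholder substring to look for in stack frames

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// With debug=1, identical stacks are grouped; each group starts with a
	// header line like "412 @ 0x... 0x..." followed by one line per frame.
	for _, group := range strings.Split(string(body), "\n\n") {
		if strings.Contains(group, needle) {
			header := strings.SplitN(group, "\n", 2)[0]
			fmt.Println(header) // e.g. "412 @ ..." means 412 goroutines share this stack
		}
	}
}
```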
Interesting. Can you tell whether the failures are new (or was this always a failure mode of tpccbench that we just hadn't caught on to because of the SCATTER failure mode)?
that rings a bell
It's not yet clear. This is easy enough to reproduce that I'm just going to try to bisect it and see.
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@339275585b7d30b9ee2d49b0c696b9ddb8d51ad4:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@dbc7245c5d8c9f009072353fec261419e573032c:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@03797b17417ae34451537f8f76d66ac69dba2d07:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@a786c51627fe66e47b4a4445c67b2a9077ae2a93:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@7b0ccdda99b81613e70f421c9374483c3feddff3:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
One thing that's now clear to me is that we see these OOMs much more regularly in the
I also now think I have a better understanding of why TPC-C in particular is good at triggering these OOMs. In these TPC-C tests and especially in the ramp-up period, we seem to use a high concurrency and then a low-ish
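As an illustration of that shape (many workers sharing a low-ish rate), here is a minimal sketch; the worker count, rate, and the `doTxn` stand-in are made up and are not the tpccbench implementation. The point is that memory scales with the worker count, since each worker keeps its own connections and buffers resident, even though the offered rate stays small:

```go
package main

import (
	"context"
	"sync"

	"golang.org/x/time/rate"
)

func main() {
	const workers = 1000                          // "high concurrency" (placeholder)
	limiter := rate.NewLimiter(rate.Limit(50), 1) // "low-ish" shared rate (placeholder)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				// Every worker blocks here most of the time, but its
				// per-worker state stays resident for the whole run.
				if err := limiter.Wait(context.Background()); err != nil {
					return
				}
				doTxn()
			}
		}()
	}
	wg.Wait()
}

// doTxn stands in for one TPC-C transaction against the cluster.
func doTxn() {}
```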
I had 10 more pass last night on the same config as #58298 (comment). So there's definitely progress being made here. Hopefully
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d86781c07065421f4a4d8bf5d988900ab07fdce5:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
My computer suspended before the runs finished, but the clusters were still up and looked fine. Unfortunately, that failure just above has 64d6d87.
We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az which need to be investigated. Turn off background tracing while we do. Touches #58298. We're also reverting an earlier commit as part of this one (d252400). This revert is needed given we've not yet addressed an underlying bug (#59203). Release note: None
59431: tracing: revert trace.mode default to legacy r=irfansharif a=tbg

We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az which need to be investigated. Turn off background tracing while we do. Touches #58298.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
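For reference, a minimal sketch of flipping that setting back by hand from Go, assuming trace.mode is an ordinary cluster setting that accepts 'legacy' (as the PR title suggests); the connection string is a placeholder for a local insecure node:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Placeholder connection string for a local insecure node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumes trace.mode is a settable cluster setting accepting 'legacy',
	// per the PR title above; the exact setting surface may differ.
	if _, err := db.Exec("SET CLUSTER SETTING trace.mode = 'legacy'"); err != nil {
		log.Fatal(err)
	}
}
```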
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@f7c5898f3d552f7ab0751cdd9ffa95cdfd6b8a76:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@1a459a81dba35b6a091f0a2954aa33d50f1e5d24:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
This run had always-on tracing turned off. Unfortunately, roachtest also did not fully collect the artifacts, so we don't have heap profiles for any of the nodes. I think (@nvanbenschoten had the same suspicion in another instance of this) that the parallelization of
For the failure before that has |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@0e6727832d58faf0f900601cd6fa6807e0a2ba75:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@bf9744bad5a416a4b06907f0f3dd42896f7342f3:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
60836: opt: support UPDATE with partial UNIQUE WITHOUT INDEX constraints r=mgartner a=mgartner

This commit adds uniqueness checks for partial `UNIQUE WITHOUT INDEX` constraints during `UPDATE` statements. As part of this change, I discovered that #60535 introduced a regression where columns not required by uniqueness checks are not pruned. I've left TODOs in the column pruning tests and plan on fixing this in a follow-up PR. There is no release note because these constraints are gated behind the experimental_enable_unique_without_index_constraints session variable.

Release note: None

60992: kv: make RaftCommand.ClosedTimestamp nullable r=nvanbenschoten a=nvanbenschoten

Fixes #60852. Fixes #60833. Fixes #58298. Fixes #59428. Fixes #60756. Fixes #60848. Fixes #60849.

In #60852 and related issues, we saw that the introduction of a non-nullable `RaftCommand.ClosedTimestamp`, coupled with the `ClosedTimestampFooter` encoding strategy we use, led to encoded `RaftCommand` protos with their ClosedTimestamp field set twice. This is ok from a correctness perspective, at least as far as protobuf is concerned, but it led to a subtle interaction where passing through sideloading (`maybeInlineSideloadedRaftCommand(maybeSideloadEntriesImpl(e))`) would reduce the size of an encoded RaftCommand by 3 bytes (the encoded size of an empty `hlc.Timestamp`). This resulted in an `uncommittedSize` leak in Raft, which eventually stalled on its `MaxUncommittedEntriesSize` limit.

This commit fixes the issue by making `RaftCommand.ClosedTimestamp` nullable. With the field marked as nullable, it will no longer be encoded as an empty timestamp when unset, ensuring that when the encoded `ClosedTimestampFooter` is appended, it contains the only instance of the `ClosedTimestamp` field.

cc. @cockroachdb/bulk-io

Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
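To make the 3-byte figure concrete: an embedded message field that is present but empty still costs its tag plus a zero-length prefix on the wire. Below is a small sketch using Go's protowire package; the field number 17 is an assumption for illustration (any field number above 15 takes a two-byte tag), not necessarily the real field number of `RaftCommand.ClosedTimestamp`:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	// Hypothetical field number for the ClosedTimestamp field; the exact
	// number in RaftCommand may differ. Any number above 15 needs a
	// two-byte tag on the wire.
	const closedTimestampField = 17

	// An unset but non-nullable message field still encodes as an empty
	// embedded message: tag bytes plus a zero length prefix.
	var empty []byte
	empty = protowire.AppendTag(empty, closedTimestampField, protowire.BytesType)
	empty = protowire.AppendBytes(empty, nil)
	fmt.Println(len(empty)) // 3 bytes: 2-byte tag + 1-byte length
}
```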
Relates to cockroachdb#58298. One thing I've noticed when looking into cockroachdb#58298 is that we were often badly overloading the cluster during the rebalance wait period. During this time, we just want to apply a small amount of load to help instruct load-based splitting and rebalancing. But in some cases, we were completely overloading the cluster. We also weren't ramping up the load, as we had intended to. This commit fixes both of these issues. It adds a ramp period for the first quarter of the rebalance time and it scales the txn rate based on the expected max warehouse count instead of the loaded warehouse count.
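A rough sketch of the load shape that commit describes (ramp over the first quarter of the rebalance wait, rate derived from the estimated max warehouse count), not the actual tpccbench code; the per-warehouse rate constant below is a placeholder:

```go
package main

import (
	"fmt"
	"time"
)

// rebalanceLoadParams sketches the adjusted load during the rebalance wait:
// ramp over the first quarter of the wait, and derive the transaction rate
// from the estimated max warehouse count rather than the loaded count.
func rebalanceLoadParams(rebalanceWait time.Duration, estimatedMaxWarehouses int) (ramp time.Duration, txnsPerSec int) {
	ramp = rebalanceWait / 4
	const txnsPerWarehousePerSec = 1 // placeholder, not tpccbench's constant
	txnsPerSec = estimatedMaxWarehouses * txnsPerWarehousePerSec
	return ramp, txnsPerSec
}

func main() {
	ramp, rate := rebalanceLoadParams(60*time.Minute, 2500)
	fmt.Println(ramp, rate) // 15m0s 2500
}
```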
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@08c89a597a06520c30faf01965f9c74fe9b9854f:
More
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:
See this test on roachdash
powered by pkg/cmd/internal/issues