roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #48255
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@61f18db7dd9a054d9a4648f67546202f760b5000:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
@nvanbenschoten this looks like we genuinely ran into the timeout. We spent ~4h to get data into the cluster:
Then we die in the 8th benchmark pass. It's interesting that we have quite a bit of variance here:
but maybe that's expected. I feel like you've done something about these tests timing out recently. Is that already reflected here?
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@1520ad2ba7c926f8043de8b6e044ab35c2f67b13:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@a51c5e3d03497b84a74ef61c7658326c31615a93:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
"pq: setting updated but timed out waiting to read new value". Looks like #48273 in disguise. Is there anything else to do here?
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@eb69237e5ee14728f32bbb667695ebb7472d9535:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@17c8048e80935f8a01477416980d18bf39cba1bb:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@3a03f3843a8cdf04f82c52753c61cf01b0d2ddcd:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@456a07cfc1e53b87abc7709052e54efb1450e758:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@3e0de239121813ea4d47873388a2828a66d9edf7:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@9304ecd70e9f3ba4cb16b5443a10b4e17d7baee0:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@e3fb5aa18d0f5064f7ba5d4df3864e94b3abb96d:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@7b8288b26c3aad649ed6e2d89f679d46b5f3988d:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@d251e175421d9303492a4923fb933515987163b6:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@fe5da5c2df069dc7f820f5c3e2f3e03f1cb7b661:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@361215163c597bd1460bba65ca3298f37e29aacc:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@54cdc5fec0e7dc835af7d2fc4231b52d49a71bf8:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@76935fd62c5a76f88b754cd3f9a5bfb3ccf1d8c2:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@22906f72e795f9f2c69828e65194f4177833ffbb:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@5af065eeb4f520ee93901e75fbef5d877f06585c:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@677f6f89f97d492f5eb443fcdb326d5695ddc5d7:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@ce3f9b29fee565a2994ca84d2ecd20db7fe59d0b:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@1ed669de1c3b77798fade7ad9f056edc0bf27c36:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@9b2aeea6ca553f79dd737052ac23c91763a0b713:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@258ef5765d0205f487b1481a9c26059db2a70362:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@c41a2ed503cd31d95bc3ee4365663811772c8bd3:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@960b4cfc54c8b78df56e62f07d7a07b986ceacff:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@53b08837cf5e76504e437bad80c97e05989d2c60:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@3018f66208f68fb90c77e3fad01f395a8f10ca8b:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@de91557c0634ab3797356d445ba39e37d45d8205:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Fixes #48255. Fixes #53443. Fixes #54258. Fixes #54570. Fixes #55599. Fixes #55688. Fixes #55817. Fixes #55939. Fixes #56996. Fixes #57062. Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`.

In #55688 (comment), we saw that the failures to create load generators in tpccbench were due to long-running SCATTER operations. These operations weren't stuck, but were very slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In hindsight, this should have been expected, as scatter has the potential to rebalance data and was being run on datasets on the order of 100s of GBs or even TBs in size.

But this alone did not explain why we used to see this issue infrequently and only recently began seeing it regularly. We determined that the most likely reason why this has recently gotten worse is #56942. That PR fixed a race condition in tpcc's `scatterRanges` function which often resulted in 9 scatters of the `warehouse` table instead of 1 scatter of each table in the database. So before #56942, we were often (but not always, due to the racy nature of the bug) avoiding the scatter on all but the dataset's smallest table. After #56942, we were always scattering all 9 tables in the dataset, leading to much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in tpccbench. Scattering on each search iteration was a misguided decision. It wasn't needed because we already scatter once during dataset import (with a higher `kv.snapshot_rebalance.max_rate`). It was also disruptive, as it had the potential to slow down the test significantly and cause issues like the one we are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing 6 out of 10 times to succeeding 10 out of 10 times. This change appears to have no impact on performance.
Co-authored-by: Nathan VanBenschoten <[email protected]>
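For a rough sense of why a SCATTER over this much data can run for hours under the 2MiB/s snapshot limit, here is a back-of-the-envelope sketch. The function name `rebalance_hours`, the `parallel_streams` parameter, and the dataset sizes are all hypothetical; the model assumes the rate limit is the only bottleneck and that every byte in the dataset is moved once, which real rebalancing will not match exactly.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def rebalance_hours(bytes_to_move, rate_bytes_per_sec, parallel_streams=1):
    """Estimate hours needed to move data under a per-stream rate limit.

    Assumption: throughput scales linearly with the number of streams and
    nothing else (disk, network, CPU) limits the transfer.
    """
    if rate_bytes_per_sec <= 0 or parallel_streams <= 0:
        raise ValueError("rate and stream count must be positive")
    seconds = bytes_to_move / (rate_bytes_per_sec * parallel_streams)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical dataset sizes in the "100s of GBs or even TBs" range.
    for size_gib in (100, 500, 1024):
        hours = rebalance_hours(size_gib * GIB, 2 * MIB)
        print(f"{size_gib} GiB at 2 MiB/s, single stream: {hours:.1f} h")
```

Even with several streams running in parallel, estimates like these are on the order of hours for datasets of this size, which is consistent with the slow-but-not-stuck behavior described above.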
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@eda9189cecbbc279f1857f6e6b992bdfd363397e:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash
Fixes cockroachdb#48255. Fixes cockroachdb#53443. Fixes cockroachdb#54258. Fixes cockroachdb#54570. Fixes cockroachdb#55599. Fixes cockroachdb#55688. Fixes cockroachdb#55817. Fixes cockroachdb#55939. Fixes cockroachdb#56996. Fixes cockroachdb#57062. Fixes cockroachdb#57864.

This needs to be backported to `release-20.1` and `release-20.2`.

In cockroachdb#55688 (comment), we saw that the failures to create load generators in tpccbench were due to long-running SCATTER operations. These operations weren't stuck, but were very slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In hindsight, this should have been expected, as scatter has the potential to rebalance data and was being run on datasets on the order of 100s of GBs or even TBs in size.

But this alone did not explain why we used to see this issue infrequently and only recently began seeing it regularly. We determined that the most likely reason why this has recently gotten worse is cockroachdb#56942. That PR fixed a race condition in tpcc's `scatterRanges` function which often resulted in 9 scatters of the `warehouse` table instead of 1 scatter of each table in the database. So before cockroachdb#56942, we were often (but not always, due to the racy nature of the bug) avoiding the scatter on all but the dataset's smallest table. After cockroachdb#56942, we were always scattering all 9 tables in the dataset, leading to much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in tpccbench. Scattering on each search iteration was a misguided decision. It wasn't needed because we already scatter once during dataset import (with a higher `kv.snapshot_rebalance.max_rate`). It was also disruptive, as it had the potential to slow down the test significantly and cause issues like the one we are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing 6 out of 10 times to succeeding 10 out of 10 times. This change appears to have no impact on performance.
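To illustrate how a bug of the kind described for `scatterRanges` can produce 9 scatters of `warehouse` instead of one per table, here is a small deterministic analogue. It is not the actual tpcc workload code: the function names and the table ordering (with `warehouse` last) are assumptions, and the real bug was a race between goroutines, whereas this sketch uses Python closures that all observe the loop variable's final value.

```python
# The nine standard TPC-C tables; ordering is an assumption for illustration.
TABLES = [
    "customer", "district", "history", "item", "new_order",
    "order", "order_line", "stock", "warehouse",
]


def scatter_statements_buggy(tables):
    """Each task closes over the loop variable itself, not its value."""
    tasks = []
    for table in tables:
        # Bug: `table` is looked up when the task runs, after the loop ended.
        tasks.append(lambda: f"ALTER TABLE {table} SCATTER")
    return [task() for task in tasks]


def scatter_statements_fixed(tables):
    """Each task gets its own copy of the table name."""
    tasks = []
    for table in tables:
        # Fix: bind the current value as a default argument.
        tasks.append(lambda table=table: f"ALTER TABLE {table} SCATTER")
    return [task() for task in tasks]


if __name__ == "__main__":
    print(sorted(set(scatter_statements_buggy(TABLES))))
    print(len(set(scatter_statements_fixed(TABLES))))
```

In the buggy variant every statement targets the last table in the list, so only the smallest table is scattered. Once each task binds its own table name, all 9 tables are scattered, which matches the jump in rebalancing work described above.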
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on master@3b612692db93aa7c87493705e1fad85c9c664f6c:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #46387 (C-test-failure, O-roachtest, O-robot, branch-release-19.2, release-blocker)
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #46343 (C-test-failure, O-roachtest, O-robot, branch-release-19.1, release-blocker)
See this test on roachdash
powered by pkg/cmd/internal/issues