-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvserver: TestSnapshotsToDrainingNodes failed #133517
Comments
Looks similar to #117064. |
Yup, seems like we're off by 6 like we were in #117064. I'll try and stress this to see if it reproduces, but I'll remove the release blocker from this one. |
No failures after stressing for 1000 runs, ~2 hours. I'm going to change this test to not make these tight assertions in the hopes that it deflakes things. |
This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes cockroachdb#133517 <what was there before: Previously, ...> <why it needed to change: This was inadequate because ...> <what you did about it: To address this, this patch ...> Release note: None
…133690 #133693 133234: workload: tpcc consistency check added flag as-of. r=srosenberg,nameisbhaskar,vidit-bhat a=shailendra-patel While running the consistency checker on the tpcc database with an active tpcc workload, the consistency check fails with a retryable error, such as restart transaction:`TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:` To fix this, added a new flag `as-of` which allows to run consistency check using `AS OF SYSTEM TIME`. Epic: none Release note: None 133347: crossclsuter/logical: add settings/stats to ldr ingest chunking r=dt a=dt 133607: sql: check object type when revoking privilege r=rafiss a=rafiss fixes #131157 Release note (bug fix): Fix an unhandled error that could occur when using `REVOKE ... ON SEQUENCE FROM ... user` on an object that is not a sequence. 133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss fixes #133437 Release note: None 133616: roachtest: validate token return in perturbation/* tests r=kvoli a=andrewbaptist This commit adds validation that all RAC tokens are returned on all stable nodes at the end of the test. Fixes: #133410 Release note: None 133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli `%` was not escaped, causing it to be substituted with values which were meant to go later. e.g., from: ``` node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!] ... ``` To ``` node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%] ... ``` --- The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert that the distribution of node cpu is bounded +- 20%. The previous metric: ``` sys.cpu.combined.percent_normalized ``` Would occasionally over-report the CPU, as greater than 100% (>1.0), which is impossible. Use the host CPU instead, which will look at the machines CPU utilization, rather than any cockroach processes. ``` sys.cpu.host.combined.percent_normalized ``` Part of: #133004 Part of: #133054 Part of: #132019 Part of: #133223 Part of: #132633 Release note: None 133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg Saves 0.3%cpu on sysbench. Fixes #133088. Release note: None Epic: None 133686: rac2: order testingRCRange.mu before RaftMu in tests r=sumeerbhola a=kvoli `testingRCRange.mu` was being acquired, and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with different ordering (reversed). This was a test only issue with `TestRangeController`. Order `testingRCRange.mu` before `RaftMu` in `admit()`. Fixes: #133650 Release note: None 133690: roachtest: always pass a Context to queries r=kvoli a=andrewbaptist Queries can hang if there is no context passed to them. In roachtests, a context can be cancelled if there is a VM preemption. It is always better to use the test context and avoid this risk. This change updates the perturbation/* tests to always pass a context. Fixes: #133625 Release note: None 133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes #133517 Release note: None Co-authored-by: Shailendra Patel <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Andrew Baptist <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Arul Ajmani <[email protected]>
133347: crossclsuter/logical: add settings/stats to ldr ingest chunking r=dt a=dt 133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss fixes #133437 Release note: None 133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli `%` was not escaped, causing it to be substituted with values which were meant to go later. e.g., from: ``` node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!] ... ``` To ``` node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%] ... ``` --- The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert that the distribution of node cpu is bounded +- 20%. The previous metric: ``` sys.cpu.combined.percent_normalized ``` Would occasionally over-report the CPU, as greater than 100% (>1.0), which is impossible. Use the host CPU instead, which will look at the machines CPU utilization, rather than any cockroach processes. ``` sys.cpu.host.combined.percent_normalized ``` Part of: #133004 Part of: #133054 Part of: #132019 Part of: #133223 Part of: #132633 Release note: None 133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg Saves 0.3%cpu on sysbench. Fixes #133088. Release note: None Epic: None 133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes #133517 Release note: None Co-authored-by: David Taylor <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Arul Ajmani <[email protected]>
Based on the specified backports for linked PR #133693, I applied the following new label(s) to this issue: branch-release-23.1, branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes #133517 Release note: None
This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes #133517 Release note: None
This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test. Closes #133517 Release note: None
kv/kvserver.TestSnapshotsToDrainingNodes failed on release-24.3 @ 3839fb143f98e5009a994ad58fb6b82dc7badc55:
Parameters:
attempt=1
race=true
run=3
shard=26
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-43654
The text was updated successfully, but these errors were encountered: