rac2: order testingRCRange.mu before RaftMu in tests #133686
Conversation
Passes 1k runs under `dev test pkg/kv/kvserver/kvflowcontrol/rac2 --filter TestRangeController/handle_raft_event --deadlock --stress`:

```
...
//pkg/kv/kvserver/kvflowcontrol/rac2:rac2_test    PASSED in 1.5s
  Stats over 1000 runs: max = 1.5s, min = 0.7s, avg = 0.8s, dev = 0.1s
```
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
-- commits
line 2 at r1:
the code comment in this PR says raftMu is ordered before testingRCRange.mu (which is the correct phrasing).
```go
// We need to ensure that r.mu isn't held before (and while) holding
// RaftMu, in order to order the locks correctly (RaftMu before
// testingRCRange.mu).
```
pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go
line 508 at r1 (raw file):
```go
}

func (r *testingRCRange) admit(ctx context.Context, storeID roachpb.StoreID, av AdmittedVector) {
```
I think we need a code comment where `testingRCRange.mu` is declared on what the lock ordering is. And possibly include why -- I think it is because methods like `testingRCRange.{SendMsgApp,SendMsgAppRaftMuLocked}` need to acquire `testingRCRange.mu` and are called with `raftMu` held.
pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go
line 534 at r1 (raw file):
```go
// Ensure that Match doesn't lag behind the highest index in the
// AdmittedVector.
repl.info.Match = max(repl.info.Match, v)
```
Shouldn't we be doing this updating of `repl`, which lives in `r.mu.r.replicaSet`, in the closure above while holding `r.mu`?
`testingRCRange.mu` was being acquired and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with the reverse ordering used elsewhere. This was a test-only issue with `TestRangeController`. Order `RaftMu` before `testingRCRange.mu` in `admit()`.

Fixes: cockroachdb#133650

Release note: None
1f226e8
to
5d74d15
Compare
TYFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
Previously, sumeerbhola wrote…
the code comment in this PR says raftMu is ordered before testingRCRange.mu (which is the correct phrasing).
```go
// We need to ensure that r.mu isn't held before (and while) holding
// RaftMu, in order to order the locks correctly (RaftMu before
// testingRCRange.mu).
```
Fixed. The lock ordering notation / wording always gives me the run around.
pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go
line 508 at r1 (raw file):
Previously, sumeerbhola wrote…
I think we need a code comment where `testingRCRange.mu` is declared on what the lock ordering is. And possibly include why -- I think it is because methods like `testingRCRange.{SendMsgApp,SendMsgAppRaftMuLocked}` need to acquire `testingRCRange.mu` and are called with `raftMu` held.
Yeah that is why. Added.
pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go
line 534 at r1 (raw file):
Previously, sumeerbhola wrote…
Shouldn't we be doing this updating of `repl`, which lives in `r.mu.r.replicaSet`, in the closure above while holding `r.mu`?
Yes we should, interesting that it didn't trip the race detector (which I didn't run locally because it doesn't work with deadlock).
Fixed.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)
TYFTR!

bors r=sumeerbhola
…133690 #133693

133234: workload: tpcc consistency check added flag as-of. r=srosenberg,nameisbhaskar,vidit-bhat a=shailendra-patel

While running the consistency checker on the tpcc database with an active tpcc workload, the consistency check fails with a retryable error, such as restart transaction: `TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError`. To fix this, added a new flag `as-of`, which allows running the consistency check using `AS OF SYSTEM TIME`.

Epic: none
Release note: None

133347: crosscluster/logical: add settings/stats to ldr ingest chunking r=dt a=dt

133607: sql: check object type when revoking privilege r=rafiss a=rafiss

fixes #131157

Release note (bug fix): Fix an unhandled error that could occur when using `REVOKE ... ON SEQUENCE FROM ... user` on an object that is not a sequence.

133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss

fixes #133437
Release note: None

133616: roachtest: validate token return in perturbation/* tests r=kvoli a=andrewbaptist

This commit adds validation that all RAC tokens are returned on all stable nodes at the end of the test.

Fixes: #133410
Release note: None

133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli

`%` was not escaped, causing it to be substituted with values which were meant to go later, e.g. from:

```
node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!] ...
```

to

```
node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%] ...
```

The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert that the distribution of node CPU is bounded within +-20%. The previous metric:

```
sys.cpu.combined.percent_normalized
```

would occasionally over-report the CPU as greater than 100% (>1.0), which is impossible. Use the host CPU instead, which looks at the machine's CPU utilization rather than any cockroach process:

```
sys.cpu.host.combined.percent_normalized
```

Part of: #133004
Part of: #133054
Part of: #132019
Part of: #133223
Part of: #132633
Release note: None

133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg

Saves 0.3% CPU on sysbench.

Fixes #133088.
Release note: None
Epic: None

133686: rac2: order testingRCRange.mu before RaftMu in tests r=sumeerbhola a=kvoli

`testingRCRange.mu` was being acquired and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with a different (reversed) ordering. This was a test-only issue with `TestRangeController`. Order `testingRCRange.mu` before `RaftMu` in `admit()`.

Fixes: #133650
Release note: None

133690: roachtest: always pass a Context to queries r=kvoli a=andrewbaptist

Queries can hang if there is no context passed to them. In roachtests, a context can be cancelled if there is a VM preemption. It is always better to use the test context and avoid this risk. This change updates the perturbation/* tests to always pass a context.

Fixes: #133625
Release note: None

133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani

This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes. This patch loosens assertions to deflake the test.

Closes #133517
Release note: None

Co-authored-by: Shailendra Patel <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Build failed (retrying...)
Encountered an error creating backports. Some common things that can go wrong:

- You might need to create your backport manually using the backport tool.

```
error creating merge commit from 5d74d15 to blathers/backport-release-24.3-133686: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
```

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
blathers backport 24.3
Encountered an error creating backports. Some common things that can go wrong:

- You might need to create your backport manually using the backport tool.

```
error creating merge commit from 5d74d15 to blathers/backport-release-24.3-133686: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
```

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
blathers backport 24.3
`testingRCRange.mu` was being acquired and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with a different (reversed) ordering. This was a test-only issue with `TestRangeController`. Order `testingRCRange.mu` before `RaftMu` in `admit()`.

Fixes: #133650

Release note: None