Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rac2: order testingRCRange.mu before RaftMu in tests #133686

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Oct 29, 2024

testingRCRange.mu was being acquired, and held before acquiring RaftMu in testingRCRange.admit(), which conflicted with different ordering (reversed). This was a test only issue with TestRangeController.

Order testingRCRange.mu before RaftMu in admit().

Fixes: #133650
Release note: None

@kvoli kvoli self-assigned this Oct 29, 2024
@kvoli kvoli added the backport-24.3.x Flags PRs that need to be backported to 24.3 label Oct 29, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli
Copy link
Collaborator Author

kvoli commented Oct 29, 2024

Passes 1k runs under --stress --deadlock:

dev test pkg/kv/kvserver/kvflowcontrol/rac21-vs-fcTestRangeController/handle_raft_event --deadlock --stress
...
//pkg/kv/kvserver/kvflowcontrol/rac2:rac2_test                           PASSED in 1.5s
  Stats over 1000 runs: max = 1.5s, min = 0.7s, avg = 0.8s, dev = 0.1s

@kvoli kvoli marked this pull request as ready for review October 29, 2024 16:12
@kvoli kvoli requested a review from a team as a code owner October 29, 2024 16:12
@kvoli kvoli requested review from pav-kv and sumeerbhola October 29, 2024 16:12
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


-- commits line 2 at r1:
the code comment in this PR says raftMu is ordered before testingRCRange.mu (which is the correct phrasing).

		// We need to ensure that r.mu isn't held before (and while) holding
		// RaftMu, in order to order the locks correctly (RaftMu before
		// testingRCRange.mu).

pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go line 508 at r1 (raw file):

}

func (r *testingRCRange) admit(ctx context.Context, storeID roachpb.StoreID, av AdmittedVector) {

I think we need a code comment where testingRCRange.mu is declared on what the lock ordering is. And possibly include why -- I think it is because methods like testingRCRange.{SendMsgApp,SendMsgAppRaftMuLocked} need to acquire testingRCRange.mu and are called with raftMu held.


pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go line 534 at r1 (raw file):

		// Ensure that Match doesn't lag behind the highest index in the
		// AdmittedVector.
		repl.info.Match = max(repl.info.Match, v)

shouldn't we be doing this updating of repl which lives in r.mu.r.replicaSet in the closure above while holding r.mu?

`testingRCRange.mu` was being acquired, and held before acquiring
`RaftMu` in `testingRCRange.admit()`, which conflicted with different
ordering (reversed). This was a test only issue with
`TestRangeController`.

Order `RaftMu` before `testingRCRange.mu` in `admit()`.

Fixes: cockroachdb#133650
Release note: None
@kvoli kvoli force-pushed the 241029.rac2-test-deadlock branch from 1f226e8 to 5d74d15 Compare October 29, 2024 16:59
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)


-- commits line 2 at r1:

Previously, sumeerbhola wrote…

the code comment in this PR says raftMu is ordered before testingRCRange.mu (which is the correct phrasing).

		// We need to ensure that r.mu isn't held before (and while) holding
		// RaftMu, in order to order the locks correctly (RaftMu before
		// testingRCRange.mu).

Fixed. The lock ordering notation / wording always gives me the run around.


pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go line 508 at r1 (raw file):

Previously, sumeerbhola wrote…

I think we need a code comment where testingRCRange.mu is declared on what the lock ordering is. And possibly include why -- I think it is because methods like testingRCRange.{SendMsgApp,SendMsgAppRaftMuLocked} need to acquire testingRCRange.mu and are called with raftMu held.

Yeah that is why. Added.


pkg/kv/kvserver/kvflowcontrol/rac2/range_controller_test.go line 534 at r1 (raw file):

Previously, sumeerbhola wrote…

shouldn't we be doing this updating of repl which lives in r.mu.r.replicaSet in the closure above while holding r.mu?

Yes we should, interesting that it didn't trip the race detector (which I didn't run locally because it doesn't work with deadlock).

Fixed.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)

@kvoli
Copy link
Collaborator Author

kvoli commented Oct 29, 2024

TYFTR!

bors r=sumeerbhola

craig bot pushed a commit that referenced this pull request Oct 29, 2024
…133690 #133693

133234: workload: tpcc consistency check added flag as-of. r=srosenberg,nameisbhaskar,vidit-bhat a=shailendra-patel

While running the consistency checker on the tpcc database with an active tpcc workload, the consistency check fails with a retryable error, such as restart transaction:`TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:`
To fix this, added a new flag `as-of` which allows to run consistency check using `AS OF SYSTEM TIME`.

Epic: none
Release note: None

133347: crossclsuter/logical: add settings/stats to ldr ingest chunking r=dt a=dt



133607: sql: check object type when revoking privilege r=rafiss a=rafiss

fixes #131157
Release note (bug fix): Fix an unhandled error that could occur when using `REVOKE ... ON SEQUENCE FROM ... user` on an object that is not a sequence.

133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss

fixes #133437
Release note: None

133616: roachtest: validate token return in perturbation/* tests r=kvoli a=andrewbaptist

This commit adds validation that all RAC tokens are returned on all stable nodes at the end of the test.

Fixes: #133410

Release note: None

133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli

`%` was not escaped, causing it to be substituted with values which
were meant to go later.

e.g., from:

```
node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!]
...
```

To

```
node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%]
...
```

---

The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert
that the distribution of node cpu is bounded +- 20%. The previous metric:

```
sys.cpu.combined.percent_normalized
```

Would occasionally over-report the CPU, as greater than 100% (>1.0),
which is impossible. Use the host CPU instead, which will look at the
machines CPU utilization, rather than any cockroach processes.

```
sys.cpu.host.combined.percent_normalized
```

Part of: #133004
Part of: #133054
Part of: #132019
Part of: #133223
Part of: #132633
Release note: None

133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg

Saves 0.3%cpu on sysbench.

Fixes #133088.

Release note: None
Epic: None


133686: rac2: order testingRCRange.mu before RaftMu in tests r=sumeerbhola a=kvoli

`testingRCRange.mu` was being acquired, and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with different ordering (reversed). This was a test only issue with `TestRangeController`.

Order `testingRCRange.mu` before `RaftMu` in `admit()`.

Fixes: #133650
Release note: None

133690: roachtest: always pass a Context to queries r=kvoli a=andrewbaptist

Queries can hang if there is no context passed to them. In roachtests, a context can be cancelled if there is a VM preemption. It is always better to use the test context and avoid this risk. This change updates the perturbation/* tests to always pass a context.

Fixes: #133625

Release note: None

133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani

This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None

Co-authored-by: Shailendra Patel <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
@craig
Copy link
Contributor

craig bot commented Oct 29, 2024

Build failed (retrying...):

@craig craig bot merged commit c14a15d into cockroachdb:master Oct 29, 2024
22 of 23 checks passed
Copy link

blathers-crl bot commented Oct 29, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5d74d15 to blathers/backport-release-24.3-133686: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli
Copy link
Collaborator Author

kvoli commented Oct 30, 2024

blathers backport 24.3

Copy link

blathers-crl bot commented Oct 30, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5d74d15 to blathers/backport-release-24.3-133686: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli
Copy link
Collaborator Author

kvoli commented Oct 30, 2024

blathers backport 24.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.3.x Flags PRs that need to be backported to 24.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvserver/kvflowcontrol/rac2: TestRangeController failed
3 participants