Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestReplicaLatchingOptimisticEvaluationKeyLimit failed #135197

Open
cockroach-teamcity opened this issue Nov 14, 2024 · 4 comments
Open
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Nov 14, 2024

kv/kvserver.TestReplicaLatchingOptimisticEvaluationKeyLimit failed with artifacts on release-24.3 @ e0cfe10898e0a6e14c65cc2f224f16f79b0561a8:

Fatal error:

panic: test timed out after 15m0s
running tests:
	TestReplicaLatchingOptimisticEvaluationKeyLimit (14m17s)
	TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true (14m17s)
	TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:1_interferes:false} (14m17s)

Stack:

goroutine 132135 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x30c
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x38
Log preceding fatal error

=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/be0807728adc72fa72837d95e44d8976/logTestReplicaLatchingOptimisticEvaluationKeyLimit3913151085
    test_log_scope.go:76: use -show-logs to present logs inline
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:0_interferes:true}
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:1_interferes:false}

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-44397

@cockroach-teamcity cockroach-teamcity added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Nov 14, 2024
@arulajmani
Copy link
Collaborator

arulajmani commented Nov 14, 2024

Reminds me of #123986 and #130973 at first glance.

@arulajmani
Copy link
Collaborator

No, can't be that -- the test is entirely different. Here, we ensure that the write goes through and is blocked after acquiring latches. In the variant that's failing, we construct a BatchRequest which has 4 gets for keys a, b, c, and d. We write to key B, and issue the read batch with limit 1. The expectation is that it'll go through the optimistic evaluation path and not conflict on the write's latches, yet, from the logs, we see:

W241114 17:44:26.683072 131378 kv/kvserver/spanlatch/manager.go:605 ⋮ [s1,r1/1:‹/M{in-ax}›] 15  have been waiting 15s to acquire read latch ‹b›@1731606251.682441150,0 for request Get [‹"a"›], Get [‹"b"›], Get [‹"c"›], Get [‹"d"›], [max_span_request_keys: 1], [target_bytes: 0], held by write latch ‹b›@1731606251.682376100,0 for request Put [‹"b"›]

From the failure mode, it looks like either we didn't attempt optimistic evaluation or we optimistic evaluation failed so we fell back to pessimistic evaluation and waited on latches.

arulajmani added a commit to arulajmani/cockroach that referenced this issue Nov 15, 2024
This test would fail opaquely previously, which wasn't helpful. This
patch sets up tracing on the read path to help investigate future
failures.

Informs cockroachdb#135197

Release note: None
@arulajmani
Copy link
Collaborator

I stressed this 11K times without issue. Given this has never failed and no active work has happened in this area recently, I'll remove the release blocker label here.

I've also sent out #135234, which should give us more visibility into future failures.

@arulajmani arulajmani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 15, 2024
@arulajmani arulajmani self-assigned this Dec 2, 2024
@arulajmani
Copy link
Collaborator

#135234 (comment)

craig bot pushed a commit that referenced this issue Dec 2, 2024
135234: kvserver: improve TestReplicaLatchingOptimisticEvaluationKeyLimit r=arulajmani a=arulajmani

This test would fail opaquely previously, which wasn't helpful. This patch sets up tracing on the read path to help investigate future failures.

Informs #135197

Release note: None

136293: sql/catalog/lease: Add diagnostics to TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces r=spilchen a=spilchen

This change enhances diagnostics for TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces. The test involves a concurrent query and an ALTER operation on the same table. The query starts first, acquiring the table descriptor lease, and is then suspended by the test. Next, the ALTER operation begins, creating a new version of the descriptor. It pauses at the end of its execution, waiting for only one version of the descriptor to remain. When the new descriptor version is detected, the query resumes, allowing the ALTER to complete.

In the failure case, the test did not detect the new version of the descriptor, even though the ALTER operation had already updated it and was waiting at waitForOneVersion. This change adds extra logging to capture the descriptor changes observed during the test, helping diagnose the issue if it recurs.

Epic: none
Closes #135777
Release note: none

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Matt Spilchen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team
Projects
None yet
Development

No branches or pull requests

2 participants