kv/kvserver: TestReplicaLatchingOptimisticEvaluationKeyLimit failed #135197

cockroach-teamcity · 2024-11-14T18:15:07Z

kv/kvserver.TestReplicaLatchingOptimisticEvaluationKeyLimit failed with artifacts on release-24.3 @ e0cfe10898e0a6e14c65cc2f224f16f79b0561a8:

Fatal error:

panic: test timed out after 15m0s
running tests:
	TestReplicaLatchingOptimisticEvaluationKeyLimit (14m17s)
	TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true (14m17s)
	TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:1_interferes:false} (14m17s)

Stack:

goroutine 132135 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x30c
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x38

Log preceding fatal error

=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/be0807728adc72fa72837d95e44d8976/logTestReplicaLatchingOptimisticEvaluationKeyLimit3913151085
    test_log_scope.go:76: use -show-logs to present logs inline
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:0_interferes:true}
=== RUN   TestReplicaLatchingOptimisticEvaluationKeyLimit/point-reads=true/{writeKey:b_limit:1_interferes:false}

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv _{This test on roachdash | Improve this report!}

Jira issue: CRDB-44397


arulajmani · 2024-11-14T23:24:11Z

Reminds me of #123986 and #130973 at first glance.

arulajmani · 2024-11-14T23:52:14Z

No, it can't be that; the test is entirely different. Here, we ensure that the write goes through and is blocked after acquiring latches. In the variant that's failing, we construct a BatchRequest with 4 gets for keys a, b, c, and d. We write to key b and issue the read batch with limit 1. The expectation is that the read goes through the optimistic evaluation path and doesn't conflict on the write's latches. Yet, from the logs, we see:

W241114 17:44:26.683072 131378 kv/kvserver/spanlatch/manager.go:605 ⋮ [s1,r1/1:‹/M{in-ax}›] 15  have been waiting 15s to acquire read latch ‹b›@1731606251.682441150,0 for request Get [‹"a"›], Get [‹"b"›], Get [‹"c"›], Get [‹"d"›], [max_span_request_keys: 1], [target_bytes: 0], held by write latch ‹b›@1731606251.682376100,0 for request Put [‹"b"›]

From the failure mode, it looks like either we didn't attempt optimistic evaluation, or optimistic evaluation failed, so we fell back to pessimistic evaluation and waited on latches.
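To make the expectation concrete, here is a minimal sketch of the conflict check the test relies on. It is not the kvserver implementation: `keysRead` and `interferes` are hypothetical helpers, and it assumes every Get returns exactly one key, so a limit of 1 means the batch only ever reads key a. The two cases mirror the subtests `{writeKey:b_limit:0_interferes:true}` and `{writeKey:b_limit:1_interferes:false}`.

```go
package main

import "fmt"

// keysRead returns the keys a batch of point reads actually touches under a
// key limit. A limit of 0 means "no limit". Hypothetical helper: it assumes
// each Get returns exactly one key, so the limit counts Gets directly.
func keysRead(keys []string, limit int) []string {
	if limit <= 0 || limit >= len(keys) {
		return keys
	}
	return keys[:limit]
}

// interferes reports whether the keys actually read overlap the key held by
// a concurrent write latch. In this simplified model, optimistic evaluation
// succeeds only when this returns false; otherwise the read has to fall back
// to waiting on latches.
func interferes(keys []string, limit int, writeKey string) bool {
	for _, k := range keysRead(keys, limit) {
		if k == writeKey {
			return true
		}
	}
	return false
}

func main() {
	batch := []string{"a", "b", "c", "d"}
	for _, limit := range []int{0, 1} {
		fmt.Printf("writeKey:b limit:%d interferes:%t\n",
			limit, interferes(batch, limit, "b"))
	}
}
```

Under this model, the limit-1 read stops after key a and never overlaps the write on b, which is why waiting on b's write latch is unexpected.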

This test would fail opaquely previously, which wasn't helpful. This patch sets up tracing on the read path to help investigate future failures.

Informs cockroachdb#135197

Release note: None

arulajmani · 2024-11-15T00:21:59Z

I stressed this 11K times without issue. Given that this test has never failed before and there has been no recent work in this area, I'll remove the release blocker label here.

I've also sent out #135234, which should give us more visibility into future failures.

arulajmani · 2024-12-02T14:58:32Z

#135234 (comment)

135234: kvserver: improve TestReplicaLatchingOptimisticEvaluationKeyLimit r=arulajmani a=arulajmani

This test would fail opaquely previously, which wasn't helpful. This patch sets up tracing on the read path to help investigate future failures.

Informs #135197

Release note: None

136293: sql/catalog/lease: Add diagnostics to TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces r=spilchen a=spilchen

This change enhances diagnostics for TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces. The test involves a concurrent query and an ALTER operation on the same table. The query starts first, acquiring the table descriptor lease, and is then suspended by the test. Next, the ALTER operation begins, creating a new version of the descriptor. It pauses at the end of its execution, waiting for only one version of the descriptor to remain. When the new descriptor version is detected, the query resumes, allowing the ALTER to complete.

In the failure case, the test did not detect the new version of the descriptor, even though the ALTER operation had already updated it and was waiting at waitForOneVersion. This change adds extra logging to capture the descriptor changes observed during the test, helping diagnose the issue if it recurs.

Epic: none

Closes #135777

Release note: none

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Matt Spilchen <[email protected]>

arulajmani mentioned this issue Nov 15, 2024

kvserver: improve TestReplicaLatchingOptimisticEvaluationKeyLimit #135234

Merged

arulajmani self-assigned this Dec 2, 2024
