kv/kvserver: TestReliableIntentCleanup failed #66895

cockroach-teamcity · 2021-06-25T18:40:50Z

kv/kvserver.TestReliableIntentCleanup failed with artifacts on master @ b6598d0101dc76c284e14f7809338ae14ce07f4f:

=== RUN   TestReliableIntentCleanup
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReliableIntentCleanup808306975
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestReliableIntentCleanup
    intent_resolver_integration_test.go:686: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReliableIntentCleanup808306975
--- FAIL: TestReliableIntentCleanup (684.84s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true
            --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true (372.38s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000
        --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000 (427.86s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true
    --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true (477.76s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push
    intent_resolver_integration_test.go:614: 
        	Error Trace:	intent_resolver_integration_test.go:404
        	            				intent_resolver_integration_test.go:614
        	            				intent_resolver_integration_test.go:674
        	            				subtest.go:34
        	Error:      	found stale intents
        	Test:       	TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push
        	Messages:   	count=1329 first={{"key\x00\rߠl"} id=a185e422 key="key\x00\x06\x84\x0f\xc4" pri=0.00548926 epo=0 ts=1624645357.969730944,0 min=1624645357.969730944,0 seq=213} last={{"key\x00\x0f\xff\xbe\x9f"} id=a185e422 key="key\x00\x06\x84\x0f\xc4" pri=0.00548926 epo=0 ts=1624645357.969730944,0 min=1624645357.969730944,0 seq=8871}
                        --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push (100.78s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel
                    --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel (202.64s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true
                --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true (369.00s)

Reproduce

To reproduce, try:

make stressrace TESTS=TestReliableIntentCleanup PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

GOFLAGS=-json

/cc @cockroachdb/kv erikgrinaker

Jira issue: CRDB-8279

Epic CRDB-27234


66959: roachtest: unskip import/tpcc/warehouses=4000/geo r=pbardea a=pbardea

This test completed successfully 5/5 times and it was skipped a long time ago (~2 years), unskipping and will keep an eye on it in case it starts OOMing again. Will continue running in background to get more successful runs.

Release note: None

67003: kvserver: remove TestRollbackSyncRangedIntentResolution r=tbg a=erikgrinaker

This test is flaky since intent resolution is non-deterministic, doesn't seem worth it to keep it around.

Release note: None

67005: kvserver: skip TestReliableIntentCleanup r=tbg a=erikgrinaker

Flakorama. Unclear if it'll be possible to salvage this test without significant changes to txn cleanup.

Touches #66895.

Release note: None

67006: kvserver: fix data race on replicaScanner.stopper r=tbg a=erikgrinaker

In #65781, a race was introduced on `replicaScanner.stopper`, since it is set by `Start()` and accessed by `RemoveReplica()` but `Start()` is called asynchronously. To avoid introducing a mutex around the stopper or refactoring the store construction, this adds a hack that writes the stopper directly to `replicaScanner.stopper` synchronously during `Store.Start()`.

Release note: None

/cc @cockroachdb/kv

Co-authored-by: Paul Bardea <[email protected]>

Co-authored-by: Erik Grinaker <[email protected]>
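For readers unfamiliar with the race described in 67006, here is a minimal sketch of the general pattern: write a field synchronously before launching the goroutine that depends on it, so later readers never race with an asynchronous write. This is not the actual CockroachDB code. The types `stopper`, `scanner`, and `store`, and the method `run`, are hypothetical stand-ins invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for the real stopper type.
type stopper struct{ name string }

// scanner is a hypothetical stand-in for replicaScanner.
type scanner struct {
	stopper *stopper
}

// run is the asynchronous part of startup. In this sketch it only reads
// s.stopper and never writes it.
func (s *scanner) run(wg *sync.WaitGroup) {
	defer wg.Done()
	_ = s.stopper.name
}

// RemoveReplica reads the stopper and reports whether it has been set.
func (s *scanner) RemoveReplica() bool { return s.stopper != nil }

// store is a hypothetical stand-in for Store.
type store struct{ scanner *scanner }

// Start writes the stopper synchronously, before launching the goroutine.
// The write happens-before the goroutine starts and before Start returns,
// so neither run nor a later RemoveReplica call races with it.
func (st *store) Start(stop *stopper) *sync.WaitGroup {
	st.scanner.stopper = stop
	wg := &sync.WaitGroup{}
	wg.Add(1)
	go st.scanner.run(wg)
	return wg
}

func main() {
	st := &store{scanner: &scanner{}}
	wg := st.Start(&stopper{name: "store-stopper"})
	fmt.Println(st.scanner.RemoveReplica())
	wg.Wait()
}
```

Had the field been assigned inside `run` instead, a concurrent `RemoveReplica` call would be an unsynchronized read against that write, which is the kind of access the race detector reports.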

nvanbenschoten · 2022-03-09T22:24:01Z

@erikgrinaker in #67005, you said "Unclear if it'll be possible to salvage this test without significant changes to txn cleanup." Do you recall what is going wrong here and why you're skeptical that this can be deflaked without changing txn cleanup?

erikgrinaker · 2022-03-10T09:45:54Z

Essentially, this test asserts that we always clean up intents and transaction records in all cases when a transaction is finalized. That turns out not to be the case -- I think especially when context cancellation was involved. It works most of the time, but not always. I think I concluded that to properly fix this we would have to have a persistent queue of transactions pending cleanup, instead of relying on the current constellation of actors that are responsible for it.

nvanbenschoten · 2022-03-10T15:21:49Z

Should we revive the test without context cancellation for now? It seems like valuable test coverage.

erikgrinaker · 2022-03-10T15:59:33Z

> Should we revive the test without context cancellation for now? It seems like valuable test coverage.

I'm pretty sure it was flaky in other cases too. Txn cleanup just seems fundamentally flaky. But we can give it a shot.

erikgrinaker · 2022-03-10T16:27:43Z

Looks like it'll take a bit of work to revive this. I'll pick it up for later.

erikgrinaker · 2023-03-14T18:43:11Z

Didn't get around to this, and likely won't any time soon. Moving back to KV.

erikgrinaker · 2023-06-30T20:29:48Z

Still flaky.

shralex · 2023-07-07T01:06:43Z

@arulajmani can you please take a look?

shralex · 2023-10-05T14:54:42Z

Instead of deflaking this test, which seems to be super difficult and may not be worth the time investment, can we delete the flaky parts of the test and keep the others?

erikgrinaker · 2023-10-05T15:00:35Z

> Instead of deflaking this test, which seems to be super difficult and may not worth the time investment

To be clear, it's not the test that's flaky, it's intent resolution. Whether it's a problem that intent resolution is flaky is a different matter, of course. But this test aims to assert that intent resolution is reliable, so if it isn't reliable and we don't care that it isn't, then I'm not sure why we need this test.

kvoli · 2023-12-01T15:33:27Z

Assigning a P3 -- see above comment.

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jun 25, 2021

erikgrinaker self-assigned this Jun 25, 2021

erikgrinaker mentioned this issue Jun 29, 2021

kvserver: skip TestReliableIntentCleanup #67005

Merged

tbg added the skipped-test label Dec 15, 2021

tbg added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 1, 2022

AlexTalks added the T-kv KV Team label Feb 18, 2022

tbg removed the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label May 17, 2022

tbg added the S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) label May 30, 2022

erikgrinaker added the T-kv-replication label May 31, 2022

erikgrinaker removed the T-kv-replication label Mar 14, 2023

erikgrinaker removed their assignment Mar 14, 2023

erikgrinaker self-assigned this Jun 27, 2023

This was referenced Jun 27, 2023

kvserver: fix txn/intent cleanup on client context cancellation #105615

Closed

kvserver: unskip and deflake TestReliableIntentCleanup #105614

Merged

craig bot closed this as completed in 1e222ff Jun 30, 2023

erikgrinaker reopened this Jun 30, 2023

erikgrinaker removed their assignment Jun 30, 2023

erikgrinaker added N-followup Needs followup. A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. and removed N-followup Needs followup. labels Jul 6, 2023

shralex assigned arulajmani Jul 7, 2023

nvanbenschoten unassigned arulajmani Jul 24, 2023

shralex assigned arulajmani Oct 5, 2023

kvoli added the P-3 Issues/test failures with no fix SLA label Dec 1, 2023

arulajmani removed their assignment Jul 8, 2024
