kv/kvserver: TestReliableIntentCleanup failed #66895

Open
cockroach-teamcity opened this issue Jun 25, 2021 · 11 comments · Fixed by #105614
Labels
A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) skipped-test T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 25, 2021

kv/kvserver.TestReliableIntentCleanup failed with artifacts on master @ b6598d0101dc76c284e14f7809338ae14ce07f4f:

=== RUN   TestReliableIntentCleanup
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReliableIntentCleanup808306975
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestReliableIntentCleanup
    intent_resolver_integration_test.go:686: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReliableIntentCleanup808306975
--- FAIL: TestReliableIntentCleanup (684.84s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true
            --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true (372.38s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000
        --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000 (427.86s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true
    --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true (477.76s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push
    intent_resolver_integration_test.go:614: 
        	Error Trace:	intent_resolver_integration_test.go:404
        	            				intent_resolver_integration_test.go:614
        	            				intent_resolver_integration_test.go:674
        	            				subtest.go:34
        	Error:      	found stale intents
        	Test:       	TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push
        	Messages:   	count=1329 first={{"key\x00\rߠl"} id=a185e422 key="key\x00\x06\x84\x0f\xc4" pri=0.00548926 epo=0 ts=1624645357.969730944,0 min=1624645357.969730944,0 seq=213} last={{"key\x00\x0f\xff\xbe\x9f"} id=a185e422 key="key\x00\x06\x84\x0f\xc4" pri=0.00548926 epo=0 ts=1624645357.969730944,0 min=1624645357.969730944,0 seq=8871}
                        --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel/abort=push (100.78s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel
                    --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true/finalize=cancel (202.64s)
=== RUN   TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true
                --- FAIL: TestReliableIntentCleanup/ForceSyncIntentResolution=true/numKeys=100000/singleRange=true/txn=true (369.00s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestReliableIntentCleanup PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/kv erikgrinaker

This test on roachdash | Improve this report!

Jira issue: CRDB-8279

Epic CRDB-27234

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jun 25, 2021
@erikgrinaker erikgrinaker self-assigned this Jun 25, 2021
craig bot pushed a commit that referenced this issue Jun 29, 2021
66959: roachtest: unskip import/tpcc/warehouses=4000/geo r=pbardea a=pbardea

This test completed successfully 5/5 times and it was skipped a long time
ago (~2 years), unskipping and will keep an eye on it in case it starts
OOMing again.

Will continue running in background to get more successful runs.

Release note: None

67003: kvserver: remove TestRollbackSyncRangedIntentResolution r=tbg a=erikgrinaker

This test is flaky since intent resolution is non-deterministic, doesn't
seem worth it to keep it around.

Release note: None

67005: kvserver: skip TestReliableIntentCleanup r=tbg a=erikgrinaker

Flakorama. Unclear if it'll be possible to salvage this test without
significant changes to txn cleanup.

Touches #66895.

Release note: None

67006: kvserver: fix data race on replicaScanner.stopper r=tbg a=erikgrinaker

In #65781, a race was introduced on `replicaScanner.stopper`, since it
is set by `Start()` and accessed by `RemoveReplica()` but `Start()` is
called asynchronously.

To avoid introducing a mutex around the stopper or refactoring the store
construction, this adds a hack that writes the stopper directly to
`replicaScanner.stopper` synchronously during `Store.Start()`.

Release note: None

/cc @cockroachdb/kv 

Co-authored-by: Paul Bardea
Co-authored-by: Erik Grinaker
@tbg tbg added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 1, 2022
@AlexTalks AlexTalks added the T-kv KV Team label Feb 18, 2022
@nvanbenschoten
Member

@erikgrinaker in #67005, you said "Unclear if it'll be possible to salvage this test without significant changes to txn cleanup." Do you recall what is going wrong here and why you're skeptical that this can be deflaked without changing txn cleanup?

@erikgrinaker
Contributor

erikgrinaker commented Mar 10, 2022

Essentially, this test asserts that we always clean up intents and transaction records in all cases when a transaction is finalized. That turns out not to be the case -- I think especially when context cancellation was involved. It works most of the time, but not always. I think I concluded that to properly fix this we would have to have a persistent queue of transactions pending cleanup, instead of relying on the current constellation of actors that are responsible for it.

@nvanbenschoten
Member

Should we revive the test without context cancellation for now? It seems like valuable test coverage.

@erikgrinaker
Contributor

> Should we revive the test without context cancellation for now? It seems like valuable test coverage.

I'm pretty sure it was flaky in other cases too. Txn cleanup just seems fundamentally flaky. But we can give it a shot.

@erikgrinaker
Contributor

Looks like it'll take a bit of work to revive this. I'll pick it up for later.

@tbg tbg removed the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label May 17, 2022
@tbg tbg added the S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) label May 30, 2022
@erikgrinaker
Contributor

Didn't get around to this, and likely won't any time soon. Moving back to KV.

@craig craig bot closed this as completed in 1e222ff Jun 30, 2023
@erikgrinaker
Contributor

Still flaky.

@erikgrinaker erikgrinaker reopened this Jun 30, 2023
@erikgrinaker erikgrinaker removed their assignment Jun 30, 2023
@erikgrinaker erikgrinaker added N-followup Needs followup. A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. and removed N-followup Needs followup. labels Jul 6, 2023
@shralex
Contributor

shralex commented Jul 7, 2023

@arulajmani can you please take a look?

@shralex
Contributor

shralex commented Oct 5, 2023

Instead of deflaking this test, which seems to be super difficult and may not be worth the time investment, can we delete the flaky parts of the test and keep the others?

@erikgrinaker
Contributor

erikgrinaker commented Oct 5, 2023

> Instead of deflaking this test, which seems to be super difficult and may not be worth the time investment

To be clear, it's not the test that's flaky, it's intent resolution. Whether it's a problem that intent resolution is flaky is a different matter, of course, but this test aims to assert that it's reliable. If it's not reliable, and we don't care that it's not reliable, then I'm not sure why we need this test.

@kvoli
Collaborator

kvoli commented Dec 1, 2023

Assigning a P3 -- see above comment.

@kvoli kvoli added the P-3 Issues/test failures with no fix SLA label Dec 1, 2023
@arulajmani arulajmani removed their assignment Jul 8, 2024
8 participants