-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvserver: TestReliableIntentCleanup failed #66895
Comments
66959: roachtest: unskip import/tpcc/warehouses=4000/geo r=pbardea a=pbardea This test completed successfully 5/5 times and it was skipped a long time ago (~2 years), unskipping and will keep an eye on it in case it starts OOMing again. Will continue running in background to get more successful runs. Release note: None 67003: kvserver: remove TestRollbackSyncRangedIntentResolution r=tbg a=erikgrinaker This test is flaky since intent resolution is non-deterministic, doesn't seem worth it to keep it around. Release note: None 67005: kvserver: skip TestReliableIntentCleanup r=tbg a=erikgrinaker Flakorama. Unclear if it'll be possible to salvage this test without significant changes to txn cleanup. Touches #66895. Release note: None 67006: kvserver: fix data race on replicaScanner.stopper r=tbg a=erikgrinaker In #65781, a race was introduced on `replicaScanner.stopper`, since it is set by `Start()` and accessed by `RemoveReplica()` but `Start()` is called asynchronously. To avoid introducing a mutex around the stopper or refactoring the store construction, this adds a hack that writes the stopper directly to `replicaScanner.stopper` synchronously during `Store.Start()`. Release note: None /cc @cockroachdb/kv Co-authored-by: Paul Bardea <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
@erikgrinaker in #67005, you said "Unclear if it'll be possible to salvage this test without significant changes to txn cleanup." Do you recall what is going wrong here and why you're skeptical that this can be deflaked without changing txn cleanup? |
Essentially, this test asserts that we always clean up intents and transaction records in all cases when a transaction is finalized. That turns out not to be the case -- I think especially when context cancellation was involved. It works most of the time, but not always. I think I concluded that to properly fix this we would have to have a persistent queue of transactions pending cleanup, instead of relying on the current constellation of actors that are responsible for it. |
Should we revive the test without context cancellation for now? It seems like valuable test coverage. |
I'm pretty sure it was flaky in other cases too. Txn cleanup just seems fundamentally flaky. But we can give it a shot. |
Looks like it'll take a bit of work to revive this. I'll pick it up for later. |
Didn't get around to this, and likely won't any time soon. Moving back to KV. |
Still flaky. |
@arulajmani can you please take a look ? |
Instead of deflaking this test, which seems to be super difficult and may not worth the time investment, can we delete the flaky parts of the test and keep the others ? |
To be clear, it's not the test that's flaky, it's intent resolution. Whether it's a problem that intent resolution is flaky is a different matter, of course, but this test aims to assert that it's reliable so if it's not reliable and we don't care that it's not reliable then I'm not sure why we need this test. |
Assigning a P3 -- see above comment. |
kv/kvserver.TestReliableIntentCleanup failed with artifacts on master @ b6598d0101dc76c284e14f7809338ae14ce07f4f:
Reproduce
To reproduce, try:
Parameters in this failure:
This test on roachdash | Improve this report!
Jira issue: CRDB-8279
Epic CRDB-27234
The text was updated successfully, but these errors were encountered: