kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046
Reproduced using stress, and added some additional instrumentation
There is a race that is a problem only for the test: the FlushEnd callback runs after the version edit is applied, but before the new read state is installed.
This helps unit tests that use the callback to trigger a read using an iterator with IterOptions.OnlyReadGuaranteedDurable. Informs cockroachdb/cockroach#77046
This should be fixed by #77131.
I had a test failure after 6000 runs. I am trying to reproduce with instrumentation, but no success so far.
Found it. There is a silly race in how the raftLogTruncator queues the durability callbacks. Failed run with additional instrumentation:
The existing code admitted the following interleaving between thread-1, which is running the async raft log truncation, and thread-2, which is running a new durabilityAdvancedCallback:
thread-1: executes queued := t.mu.queuedDurabilityCB and sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets t.mu.queuedDurabilityCB = true
thread-1: sets t.mu.runningTruncation = false and returns
Now the queued callback will never run. This can happen in tests that wait for truncation before doing the next truncation step, because they stop waiting once the truncation is observed on a Replica, which happens before any of the steps listed above for thread-1.
Fixes cockroachdb#77046
Release justification: Bug fix
Release note: None
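The interleaving described above goes away if reading queuedDurabilityCB and clearing runningTruncation happen in the same critical section. The Go sketch below illustrates that idea only; it is not the actual patch. The truncator type and the onDurabilityAdvanced, finishPass, and runLoop names are hypothetical, and only the two flags named in the description are modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// truncator is a hypothetical stand-in for raftLogTruncator. Only the two
// flags mentioned in the description are modeled here.
type truncator struct {
	mu struct {
		sync.Mutex
		runningTruncation  bool
		queuedDurabilityCB bool
	}
	runs int // truncation passes performed; touched only by the loop
}

// onDurabilityAdvanced models a new durability callback. It returns true if
// the caller must start the truncation loop, and false if a loop is already
// running, in which case the callback has been queued instead.
func (t *truncator) onDurabilityAdvanced() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.mu.runningTruncation {
		t.mu.queuedDurabilityCB = true
		return false
	}
	t.mu.runningTruncation = true
	return true
}

// finishPass is called by the loop after each pass. The queued flag is read
// and runningTruncation is cleared under one lock acquisition, so a callback
// cannot slip in between the two steps. It returns true if another pass is
// needed.
func (t *truncator) finishPass() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	queued := t.mu.queuedDurabilityCB
	t.mu.queuedDurabilityCB = false
	if !queued {
		t.mu.runningTruncation = false
	}
	return queued
}

// runLoop performs passes until no callback is queued. duringPass, if
// non-nil, is invoked once during the first pass to simulate a callback
// arriving while the truncation is running.
func (t *truncator) runLoop(duringPass func()) {
	for {
		t.runs++
		if duringPass != nil {
			duringPass()
			duringPass = nil
		}
		if !t.finishPass() {
			return
		}
	}
}

func main() {
	t := &truncator{}
	if t.onDurabilityAdvanced() {
		// A second callback arrives mid-pass and must not be lost.
		t.runLoop(func() { t.onDurabilityAdvanced() })
	}
	fmt.Println("passes:", t.runs) // prints "passes: 2"
}
```

The second pass is the queued callback being honored. In the buggy interleaving, the equivalent of finishPass was split across two lock acquisitions, so the flag could be set after it was read and before the loop exited.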
75285: spanconfig: introduce the ProtectedTSReader interface r=adityamaru,nvanbenschoten,ajwerner a=arulajmani
See individual commits for details.
Release justification: low risk, high benefit changes to existing functionality.

76929: settings: Add syntax for cluster settings r=raduberinde,rafiss a=ajstorm
Before this commit, there was no syntax to SET or SHOW cluster settings which exist for a given tenant. This commit adds the following syntax:
* ALTER TENANT <id> SET CLUSTER SETTING <setting> = <value>
* ALTER TENANT ALL SET CLUSTER SETTING <setting> = <value>
* ALTER TENANT <id> RESET CLUSTER SETTING <setting>
* ALTER TENANT ALL RESET CLUSTER SETTING <setting>
* SHOW CLUSTER SETTING <setting> FOR TENANT <id>
* SHOW [ALL] CLUSTER SETTINGS FOR TENANT <id>
Note that the syntax is added but the underlying commands are currently unimplemented. The implementation of these commands will come with a subsequent commit.
Release note (sql change): Added syntax for modifying cluster settings at the tenant level.
Informs: #73857.

76943: Unary Complement execution has different results when the parameters are different r=otan a=ecwall
fixes #74493
Release note (sql change): Return ambiguous unary operator error for ambiguous input like ~'1' which can be interpreted as an integer (resulting in -2) or a bit string (resulting in 0).
Release justification: Improves a confusing error message saying that an operator is invalid instead of ambiguous.

77064: spanconfigkvsubscriber,kvserver: fix KVSubscriber bug r=arulajmani a=arulajmani
We had a bug in the KVSubscriber where we were invoking a copy of the handler instead of the stored handler. This meant that we'd never treat handlers as "initialized". As a result, we would always invoke them with the everything span and visit all replicas on the stores in reaction to span config updates. See datadriven test diffs for an illustration.
Fixing the above led to unearthing an interesting bug in how we were deciding to enqueue replicas in the split queue. Previously, if we received a span config update that implied a split and the update corresponded to the right-hand side post split, we would skip enqueuing the replica in the split queue. The assumption was that we'd get an update corresponding to the LHS of the split for the same replica and that update would enqueue the replica. This doesn't always hold true though. For example, consider the case when a new table is created and must be split from its (left) adjacent table's range. This only results in a single update, corresponding to the new table's span, which is the right-hand side post split. This patch moves to nudging the split queue for all updates, not just left-hand side updates, for the reason above.
Release note: None
Release justification: bug fixes in new functionality

77245: kvserver: fix race in durability callback queueing in raftLogTruncator r=erikgrinaker a=sumeerbhola
The existing code admitted the following interleaving between thread-1, which is running the async raft log truncation, and thread-2, which is running a new durabilityAdvancedCallback:
thread-1: executes queued := t.mu.queuedDurabilityCB and sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets t.mu.queuedDurabilityCB = true
thread-1: sets t.mu.runningTruncation = false and returns
Now the queued callback will never run. This can happen in tests that wait for truncation before doing the next truncation step, because they stop waiting once the truncation is observed on a Replica, which happens before any of the steps listed above for thread-1.
Fixes #77046
Release justification: Bug fix
Release note: None

Co-authored-by: arulajmani <[email protected]>
Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
This test is skipped again, as a result of cockroachdb/pebble#2064. Pebble can hold the ingested files open, in which case the truncation will not delete the sideloaded files, and the test will notice that and fail. I haven't been seeing this on master, but I see it in #89632 (comment).
Was this unskipped? It doesn't appear to be skipped on master.
Hmm, my bad - I thought I had skipped it in #89632 (comment), but apparently not. Either way, the root cause was cockroachdb/pebble#2064, which has since been fixed, so assuming that pebble PR got picked up, we're good here!
Thanks, yeah, pebble was bumped to include it, so I'll close this out.
Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=4459715&buildTypeId=Cockroach_UnitTests