kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

adityamaru · 2022-02-25T17:44:51Z

TestRaftSSTableSideloadingTruncation/loosely-coupled=true
    raft_log_queue_test.go:923: condition failed to evaluate within 45s: expected firstIndex == 14, got 11
        goroutine 29970 [running]:
        runtime/debug.Stack()
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x5711eb8, 0xc000e24b60}, 0x0, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x5711eb8, 0xc000e24b60}, 0x1)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.waitForTruncationForTesting(0x0, 0x0, 0x0, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_log_queue_test.go:923 +0x65
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestRaftSSTableSideloadingTruncation.func1(0xc000e24b60, 0xb7)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_sideload_test.go:860 +0x633
        github.com/cockroachdb/cockroach/pkg/testutils.RunTrueAndFalse.func1(0xffdf20)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/subtest.go:23 +0x24
        testing.tRunner(0xc000e24b60, 0xc00369cc48)
        	/usr/local/go/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1306 +0x35a

Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=4459715&buildTypeId=Cockroach_UnitTests

The text was updated successfully, but these errors were encountered:

sumeerbhola · 2022-02-26T00:43:56Z

Reproduced using stress, and added some additional instrumentation

truncating to index < 14
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, 13
tryEnactTruncations enactIndex: -1, raftAppliedIndex: 0
    raft_log_queue_test.go:923: condition failed to evaluate within 45s: expected firstIndex == 14, got 11

There is a race which is only a problem for the test: the FlushEnd callback is after the version edit, but before the new read state is installed.
https://github.com/cockroachdb/pebble/blob/7e5c8ee1daa4cea7bb03e237e6d1a65b9f0cd3a3/compaction.go#L1455-L1466
Moving the FlushEnd after the read state installation fixes the problem (ran for 10min, while before it was failing after 1min).

This helps unit tests that use the callback to trigger a read using an iterator with IterOptions.OnlyReadGuaranteedDurable. Informs cockroachdb/cockroach#77046

sumeerbhola · 2022-03-01T14:44:31Z

This should be fixed by #77131.
No failure after > 4000 runs, so I am closing the issue.

sumeerbhola · 2022-03-01T16:42:43Z

I had a test failure after 6000 runs. I am trying to reproduce with instrumentation, but no success so far.
I am still inclined to believe that the failure is a problem with the test and not the actual code -- the test is dependent on a single flush callback for liveness because there won't be any more writes, and if the "durable" observation made in that callback is somehow stale, then the test will fail.

sumeerbhola · 2022-03-01T19:05:52Z

Found it. There is a silly race in how the raftLogTruncator queues the durability callbacks.

Failed run with additional instrumentation:
(rai is the RaftAppliedIndex. Ignore lui)

=== RUN   TestRaftSSTableSideloadingTruncation/loosely-coupled=true
addPendingTruncation: 13
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 24, lui: 0
addPendingTruncation: 14
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 25, lui: 0
addPendingTruncation: 15
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 26, lui: 0
addPendingTruncation: 16
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 27, lui: 0
addPendingTruncation: 17
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 28, lui: 0
addPendingTruncation: 18
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 29, lui: 0
addPendingTruncation: 19
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 30, lui: 0
addPendingTruncation: 20
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 31, lui: 0
addPendingTruncation: 21
durabilityAdvancedCallback run false, queued true

The existing code admitted the following interleaving between thread-1, running the async raft log truncation, and thread-2 which is running a new durabilityAdvancedCallback. thread-1: executes queued := t.mu.queuedDurabilityCB and sees queued is false thread-2: sees t.mu.runningTruncation is true and sets t.mu.queuedDurabilityCB = true thread-1: Sets t.mu.runningTruncation = false and returns Now the queued callback will never run. This can happen in tests that wait for truncation before doing the next truncation step, because they will stop waiting once the truncation is observed on a Replica, which happens before any of the steps listed above for thread-1. Fixes cockroachdb#77046 Release justification: Bug fix Release note: None

75285: spanconfig: introduce the ProtectedTSReader interface r=adityamaru,nvanbenschoten,ajwerner a=arulajmani See individual commits for details. Release justification: low risk, high benefit changes to existing functionality. 76929: settings: Add syntax for cluster settings r=raduberinde,rafiss a=ajstorm Before this commit, there was no syntax to SET or SHOW cluster settings which exist for a given tenant. This commit adds the following syntax: * ALTER TENANT <id> SET CLUSTER SETTING <setting> = <value> * ALTER TENANT ALL SET CLUSTER SETTING <setting> = <value> * ALTER TENANT <id> RESET CLUSTER SETTING <setting> * ALTER TENANT ALL RESET CLUSTER SETTING <setting> * SHOW CLUSTER SETTING <setting> FOR TENANT <id> * SHOW [ALL] CLUSTER SETTINGS FOR TENANT <id> Note that the syntax is added but the underlying commands are currently unimplemented. The implementation of these commands will come with a subsequent commit. Release note (sql change): Added syntax for modifying cluster settings at the tenant level. Informs: #73857. 76943: Unary Complement execution has different results when the parameters are different r=otan a=ecwall fixes #74493 Release note (sql change): Return ambiguous unary operator error for ambiguous input like ~'1' which can be interpreted as an integer (resulting in -2) or a bit string (resulting in 0). Release justification: Improves a confusing error message saying that an operator is invalid instead of ambiguous. 77064: spanconfigkvsubscriber,kvserver: fix KVSubscriber bug r=arulajmani a=arulajmani We had a bug in the KVSubscriber where we were invoking a copy of the handler instead of the handler stored. This meant that we'd never treat handlers as "initialized". As a result, we would always invoke them with the everything span, and as a result, visit all replicas on the stores in reaction to span config updates. See datadriven test diffs for an illustration. Fixing the above lead to unearthing an interesting bug in how we were deciding to enqueue replicas in the split queue. Previously, if we received a span config update that implied a split and the update corresponded to the right-hand side post split, we would skip enqueuing the replica in the split queue. The assumption was that we'd get an update corresponding to the LHS of the split for the same replica and that update would enqueue the replica. This doesn't always hold true though. For example, consider the case when a new table is created and must be split from its (left) adjacent table's range. This only results in a single update, corresponding to the new table's span, which is the right-hand side post split. This patch moves to nudging the split queue for all updates, not just left-hand side updates, for the reason above. Release note: None Release justification: bug fixes in new functionality 77245: kvserver: fix race in durability callback queueing in raftLogTruncator r=erikgrinaker a=sumeerbhola The existing code admitted the following interleaving between thread-1, running the async raft log truncation, and thread-2 which is running a new durabilityAdvancedCallback. thread-1: executes queued := t.mu.queuedDurabilityCB and sees queued is false thread-2: sees t.mu.runningTruncation is true and sets t.mu.queuedDurabilityCB = true thread-1: Sets t.mu.runningTruncation = false and returns Now the queued callback will never run. This can happen in tests that wait for truncation before doing the next truncation step, because they will stop waiting once the truncation is observed on a Replica, which happens before any of the steps listed above for thread-1. Fixes #77046 Release justification: Bug fix Release note: None Co-authored-by: arulajmani <[email protected]> Co-authored-by: Adam Storm <[email protected]> Co-authored-by: Evan Wall <[email protected]> Co-authored-by: sumeerbhola <[email protected]>

tbg · 2022-10-27T14:37:05Z

This test is skipped again, as a result of cockroachdb/pebble#2064. It can happen that pebble is holding the ingested files open, so the truncation will not delete the sideloaded files, and the test will notice that and fail.

I haven't been seeing this on master, but I see it in #89632 (comment).

jbowens · 2022-11-04T22:32:07Z

was this unskipped? it doesn't appear to be skipped on master

tbg · 2022-11-07T08:30:25Z

Hmm, my bad - I had skipped it in #89632 (comment) (I thought) but apparently not. Either way, the root cause was cockroachdb/pebble#2064 which is since fixed so assuming that pebble PR got picked up we're good here!

jbowens · 2022-11-07T14:42:28Z

Thanks, yeah, pebble was bumped to include it, so I'll close this out.

adityamaru added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 25, 2022

adityamaru assigned sumeerbhola Feb 25, 2022

blathers-crl bot added the T-storage Storage Team label Feb 25, 2022

adityamaru mentioned this issue Feb 25, 2022

spanconfigreconciler{ccl}: apply system span config diffs to the store #76948

Merged

sumeerbhola mentioned this issue Feb 26, 2022

db: produce the FlushEnd event after installing readState cockroachdb/pebble#1540

Merged

sumeerbhola mentioned this issue Mar 1, 2022

kv/kvserver: TestProactiveRaftLogTruncate failed #77178

Closed

sumeerbhola closed this as completed Mar 1, 2022

sumeerbhola reopened this Mar 1, 2022

sumeerbhola mentioned this issue Mar 1, 2022

kvserver: fix race in durability callback queueing in raftLogTruncator #77245

Merged

craig bot closed this as completed in 3ba6dcb Mar 2, 2022

tbg reopened this Oct 27, 2022

tbg added C-test-failure Broken test (automatically or manually discovered). and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Oct 27, 2022

tbg added the skipped-test label Oct 27, 2022

tbg mentioned this issue Oct 27, 2022

go.mod: bump raft #89632

Merged

3 tasks

jbowens assigned jbowens and unassigned sumeerbhola Oct 31, 2022

jbowens closed this as completed Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

adityamaru commented Feb 25, 2022

sumeerbhola commented Feb 26, 2022

sumeerbhola commented Mar 1, 2022

sumeerbhola commented Mar 1, 2022

sumeerbhola commented Mar 1, 2022

tbg commented Oct 27, 2022

jbowens commented Nov 4, 2022

tbg commented Nov 7, 2022

jbowens commented Nov 7, 2022

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

Comments

adityamaru commented Feb 25, 2022

sumeerbhola commented Feb 26, 2022

sumeerbhola commented Mar 1, 2022

sumeerbhola commented Mar 1, 2022

sumeerbhola commented Mar 1, 2022

tbg commented Oct 27, 2022

jbowens commented Nov 4, 2022

tbg commented Nov 7, 2022

jbowens commented Nov 7, 2022