Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

Closed
adityamaru opened this issue Feb 25, 2022 · 8 comments · Fixed by #77245
Closed

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky #77046

adityamaru opened this issue Feb 25, 2022 · 8 comments · Fixed by #77245
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). skipped-test T-storage Storage Team

Comments

@adityamaru
Copy link
Contributor

TestRaftSSTableSideloadingTruncation/loosely-coupled=true
    raft_log_queue_test.go:923: condition failed to evaluate within 45s: expected firstIndex == 14, got 11
        goroutine 29970 [running]:
        runtime/debug.Stack()
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin({0x5711eb8, 0xc000e24b60}, 0x0, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:60 +0x5f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon({0x5711eb8, 0xc000e24b60}, 0x1)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x4a
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.waitForTruncationForTesting(0x0, 0x0, 0x0, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_log_queue_test.go:923 +0x65
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestRaftSSTableSideloadingTruncation.func1(0xc000e24b60, 0xb7)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_sideload_test.go:860 +0x633
        github.com/cockroachdb/cockroach/pkg/testutils.RunTrueAndFalse.func1(0xffdf20)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/subtest.go:23 +0x24
        testing.tRunner(0xc000e24b60, 0xc00369cc48)
        	/usr/local/go/src/testing/testing.go:1259 +0x102
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1306 +0x35a

Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=4459715&buildTypeId=Cockroach_UnitTests

@adityamaru adityamaru added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 25, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Feb 25, 2022
@sumeerbhola
Copy link
Collaborator

Reproduced using stress, and added some additional instrumentation

truncating to index < 14
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, 13
tryEnactTruncations enactIndex: -1, raftAppliedIndex: 0
    raft_log_queue_test.go:923: condition failed to evaluate within 45s: expected firstIndex == 14, got 11

There is a race which is only a problem for the test: the FlushEnd callback is after the version edit, but before the new read state is installed.
https://github.com/cockroachdb/pebble/blob/7e5c8ee1daa4cea7bb03e237e6d1a65b9f0cd3a3/compaction.go#L1455-L1466
Moving the FlushEnd after the read state installation fixes the problem (ran for 10min, while before it was failing after 1min).

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 26, 2022
This helps unit tests that use the callback to trigger a
read using an iterator with IterOptions.OnlyReadGuaranteedDurable.

Informs cockroachdb/cockroach#77046
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 27, 2022
This helps unit tests that use the callback to trigger a
read using an iterator with IterOptions.OnlyReadGuaranteedDurable.

Informs cockroachdb/cockroach#77046
sumeerbhola added a commit to cockroachdb/pebble that referenced this issue Feb 27, 2022
This helps unit tests that use the callback to trigger a
read using an iterator with IterOptions.OnlyReadGuaranteedDurable.

Informs cockroachdb/cockroach#77046
@sumeerbhola
Copy link
Collaborator

This should be fixed by #77131.
No failure after > 4000 runs, so I am closing the issue.

@sumeerbhola
Copy link
Collaborator

I had a test failure after 6000 runs. I am trying to reproduce with instrumentation, but no success so far.
I am still inclined to believe that the failure is a problem with the test and not the actual code -- the test is dependent on a single flush callback for liveness because there won't be any more writes, and if the "durable" observation made in that callback is somehow stale, then the test will fail.

@sumeerbhola
Copy link
Collaborator

Found it. There is a silly race in how the raftLogTruncator queues the durability callbacks.

Failed run with additional instrumentation:
(rai is the RaftAppliedIndex. Ignore lui)

=== RUN   TestRaftSSTableSideloadingTruncation/loosely-coupled=true
addPendingTruncation: 13
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 24, lui: 0
addPendingTruncation: 14
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 25, lui: 0
addPendingTruncation: 15
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 26, lui: 0
addPendingTruncation: 16
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 27, lui: 0
addPendingTruncation: 17
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 28, lui: 0
addPendingTruncation: 18
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 29, lui: 0
addPendingTruncation: 19
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 30, lui: 0
addPendingTruncation: 20
durabilityAdvancedCallback run true, queued false
durabilityAdvanced
tryEnactTruncations 1
tryEnactTruncations 1, enactIndex: 0, rai: 31, lui: 0
addPendingTruncation: 21
durabilityAdvancedCallback run false, queued true

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Mar 1, 2022
The existing code admitted the following interleaving between
thread-1, running the async raft log truncation, and thread-2
which is running a new durabilityAdvancedCallback.

thread-1: executes queued := t.mu.queuedDurabilityCB and
 sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets
 t.mu.queuedDurabilityCB = true
thread-1: Sets t.mu.runningTruncation = false and returns

Now the queued callback will never run. This can happen in tests
that wait for truncation before doing the next truncation step,
because they will stop waiting once the truncation is observed
on a Replica, which happens before any of the steps listed above
for thread-1.

Fixes cockroachdb#77046

Release justification: Bug fix

Release note: None
craig bot pushed a commit that referenced this issue Mar 2, 2022
75285: spanconfig: introduce the ProtectedTSReader interface  r=adityamaru,nvanbenschoten,ajwerner a=arulajmani

See individual commits for details. 

Release justification: low risk, high benefit changes to existing
functionality.

76929: settings: Add syntax for cluster settings r=raduberinde,rafiss a=ajstorm

Before this commit, there was no syntax to SET or SHOW cluster settings which
exist for a given tenant. This commit adds the following syntax:

* ALTER TENANT <id> SET CLUSTER SETTING <setting> = <value>
* ALTER TENANT ALL SET CLUSTER SETTING <setting> = <value>
* ALTER TENANT <id> RESET CLUSTER SETTING <setting>
* ALTER TENANT ALL RESET CLUSTER SETTING <setting>
* SHOW CLUSTER SETTING <setting> FOR TENANT <id>
* SHOW [ALL] CLUSTER SETTINGS FOR TENANT <id>

Note that the syntax is added but the underlying commands are currently
unimplemented. The implementation of these commands will come with a subsequent
commit.

Release note (sql change): Added syntax for modifying cluster settings at the
tenant level.

Informs: #73857.

76943: Unary Complement execution has different results when the parameters are different r=otan a=ecwall

fixes #74493

Release note (sql change): Return ambiguous unary operator error for ambiguous input
like ~'1' which can be interpreted as an integer (resulting in -2) or a bit string
(resulting in 0).

Release justification: Improves a confusing error message saying that an operator is
invalid instead of ambiguous.

77064: spanconfigkvsubscriber,kvserver: fix KVSubscriber bug r=arulajmani a=arulajmani

We had a bug in the KVSubscriber where we were invoking a copy of the
handler instead of the handler stored. This meant that we'd never treat
handlers as "initialized". As a result, we would always invoke them with
the everything span, and as a result, visit all replicas on the stores
in reaction to span config updates. See datadriven test diffs for an
illustration.

Fixing the above lead to unearthing an interesting bug in how we were
deciding to enqueue replicas in the split queue. Previously, if we
received  a span config update that implied a split and the update
corresponded to the right-hand side post split, we would skip enqueuing
the replica in the split queue. The assumption was that we'd get an
update corresponding to the LHS of the split for the same replica
and that update would enqueue the replica. This doesn't always hold true
though. For example, consider the case when a new table is created and
must be split from its (left) adjacent table's range. This only results
in a single update, corresponding to the new table's span, which is the
right-hand side post split.

This patch moves to nudging the split queue for all updates, not just
left-hand side updates, for the reason above.

Release note: None
Release justification: bug fixes in new functionality

77245: kvserver: fix race in durability callback queueing in raftLogTruncator r=erikgrinaker a=sumeerbhola

The existing code admitted the following interleaving between
thread-1, running the async raft log truncation, and thread-2
which is running a new durabilityAdvancedCallback.

thread-1: executes queued := t.mu.queuedDurabilityCB and
 sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets
 t.mu.queuedDurabilityCB = true
thread-1: Sets t.mu.runningTruncation = false and returns

Now the queued callback will never run. This can happen in tests
that wait for truncation before doing the next truncation step,
because they will stop waiting once the truncation is observed
on a Replica, which happens before any of the steps listed above
for thread-1.

Fixes #77046

Release justification: Bug fix

Release note: None

Co-authored-by: arulajmani <[email protected]>
Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
@craig craig bot closed this as completed in 3ba6dcb Mar 2, 2022
@tbg
Copy link
Member

tbg commented Oct 27, 2022

This test is skipped again, as a result of cockroachdb/pebble#2064. It can happen that pebble is holding the ingested files open, so the truncation will not delete the sideloaded files, and the test will notice that and fail.

I haven't been seeing this on master, but I see it in #89632 (comment).

@tbg tbg reopened this Oct 27, 2022
@tbg tbg added C-test-failure Broken test (automatically or manually discovered). and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Oct 27, 2022
@tbg tbg mentioned this issue Oct 27, 2022
3 tasks
@jbowens jbowens assigned jbowens and unassigned sumeerbhola Oct 31, 2022
@jbowens
Copy link
Collaborator

jbowens commented Nov 4, 2022

was this unskipped? it doesn't appear to be skipped on master

@tbg
Copy link
Member

tbg commented Nov 7, 2022

Hmm, my bad - I had skipped it in #89632 (comment) (I thought) but apparently not. Either way, the root cause was cockroachdb/pebble#2064 which is since fixed so assuming that pebble PR got picked up we're good here!

@jbowens
Copy link
Collaborator

jbowens commented Nov 7, 2022

Thanks, yeah, pebble was bumped to include it, so I'll close this out.

@jbowens jbowens closed this as completed Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). skipped-test T-storage Storage Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants