Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: use AdminSplit in splitTestRange #72383

Merged
merged 10 commits into from
Nov 8, 2021

Conversation

tbg
Copy link
Member

@tbg tbg commented Nov 3, 2021

splitTestRange was previously reaching into the store to finagle a
split. It is used by around a dozen tests. Prototyping around #72374
has shown that these tests frequently need patching up whenever we
adjust (improve) the store's replica handling.

This is a time suck and besides, we also want to be able to test
the Store from within the kvserver (not kvserver_test) package.
So if we can make that happen, and can use AdminSplit, that would
be preferrable.

AdminSplit requires them to run a (somewhat) distributed multi-range
transaction. A first split would hit a single range, but after that the
split batch hits at least two ranges (meta2 and splitKey), and so we
need nontrivial DistSender-like functionality. Splits are also
nontrivial distributed transactions and so we need a TxnCoordSender.
Experience suggests that it's better to use the "real thing" and to
make sure it's configurable enough to fit the use case, rather than
whipping up half-baked replacements.

Luckily, it turned out that DistSender and TxnCoordSender are already
up to the task, and this commit adopts them in
createTestStoreWithoutStart, and changes splitTestRange to use
AdminSplit.

Release note: None

@tbg tbg requested a review from a team as a code owner November 3, 2021 13:33
@tbg tbg requested a review from a team November 3, 2021 13:33
@cockroach-teamcity
Copy link
Member

This change is Reviewable

craig bot pushed a commit that referenced this pull request Nov 4, 2021
72392: kvserver: simplify testContext r=erikgrinaker a=tbg

See individual commits for details. I just spent some time cleaning up
`createTestStore` (#72383) and I noticed that `testContext` is much more
heavily used and has accumulated quite a bit of detritus.

I hope to ultimately make it use `createTestStore` but that will be for
another day.

----

- kvserver: remove two uses of bootstrapRangeOnly
- kvserver: remove TestReplicaLaziness
- kvserver: remove testContext.bootstrapMode
- kvserver: drop useless on-disk engine in a test
- kvserver: simplify testContext
- kvserver: fix up TestRaftSSTableSideloading
- kvserver: prevent pre-populating `testContext.eng`

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg requested a review from erikgrinaker November 4, 2021 11:28
pkg/kv/kvserver/replica_command.go Outdated Show resolved Hide resolved
@tbg tbg force-pushed the store-replicas-clarify branch 3 times, most recently from 5b2ec2d to 9c33094 Compare November 4, 2021 13:42
@tbg tbg requested a review from erikgrinaker November 4, 2021 13:42
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow sorry for posting such a junk PR! It's actually not all junk but I submitted a bunch of stuff that came from my initial attempts at getting splits to work. Removed these extra bits.

bors r=erikgrinaker

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replica_command.go, line 333 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Why this extra call?

Detritus. At one point I was getting mismatch errors here and I wanted to know why. Since rr doesn't work on my platform the easiest way to be able to step into ContainsKey with the right args is to duplicate it here. Removed.

@craig
Copy link
Contributor

craig bot commented Nov 4, 2021

Build failed:

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was wondering about the DistSender emulation in the mock senders and stuff, but figured you'd added it temporarily for a reason. Nice to see it gone. :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@tbg
Copy link
Member Author

tbg commented Nov 4, 2021

bors r-

Apparently it was needed for something? Urgh, going to take a look.

@tbg
Copy link
Member Author

tbg commented Nov 4, 2021

Oh ok that makes sense. I fixed up the tests that use createTestStore, but there are tests using testContext that also call splitTestRange. So I need to do the thing I wanted to do anyway, which is have testContext use the real distributed txn stack as well. Okay, let's see if I can do that.

@tbg tbg force-pushed the store-replicas-clarify branch from 9c33094 to 5f9ea7c Compare November 4, 2021 16:58
@tbg tbg requested a review from a team as a code owner November 4, 2021 16:58
@tbg
Copy link
Member Author

tbg commented Nov 4, 2021

Pretty satisfying - got wrapped up with this just as I'm brushing up against the end of my day. I think this is fit for review though some small thing will certainly be wrong with it. testContext.StartWithStoreConfigAndVersion has become a very thin wrapper around createTestStoreWithoutStart, which I am quite happy about.

@erikgrinaker erikgrinaker self-requested a review November 4, 2021 20:27
tbg added 9 commits November 4, 2021 22:54
`splitTestRange` was previously reaching into the store to finagle a
split. It is used by around a dozen tests. Prototyping around cockroachdb#72374
has shown that these tests frequently need patching up whenever we
adjust (improve) the store's replica handling.

This is a time suck and besides, we also want to be able to test
the Store from within the `kvserver` (not `kvserver_test`) package.
So if we can make that happen, and can use AdminSplit, that would
be preferrable.

AdminSplit requires them to run a (somewhat) distributed multi-range
transaction. A first split would hit a single range, but after that the
split batch hits at least two ranges (meta2 and splitKey), and so we
need nontrivial DistSender-like functionality. Splits are also
nontrivial distributed transactions and so we need a TxnCoordSender.
Experience suggests that it's better to use the "real thing" and to
make sure it's configurable enough to fit the use case, rather than
whipping up half-baked replacements.

Luckily, it turned out that DistSender and TxnCoordSender are already
up to the task, and this commit adopts them in
`createTestStoreWithoutStart`, and changes `splitTestRange` to use
`AdminSplit`.

Release note: None
…ious commit]

This was a very satisfying cleanup. The commit history won't pass CI on
the individual commits, so this commit and all before it need to be
flattened. I'll do this once approved to make reviewing easier as the
changes are mostly non-overlapping (but they overlap through
`splitTestRange`).

Release note: None
The test use key prefixes that collide with kvs from cluster bootstrap
if the replica were properly bootstrapped. This will be the case in a
few commits, so give this test its own prefix to operate under.

Release note: None
These tests work today because the store they're operating under
is strung together and is missing lots of moving parts. Once we
use a more fully initialized stores these tests are prime candidates
for being flaky.

The real solution would be to to add testing knobs so that we could
investigate, in the test, the reasons for gossip, and to make sure
they are all valid reasons. But we are unlikely to break anything
in this area and are going to remove gossip of the system config
altogether in cockroachdb#70560.

So just remove these tests, which prevents them from getting in the
way of refactors of the test harness.

Release note: None
The old code would fail on a properly initialized store where there are
dozens of values in the initial system config.

Release note: None
Once the store is properly initialized, the system config will have
additional entries.

Release note: None
Separated intents are fully baked in now, so this randomization makes
less sense. We continue to randomize in some tests, such as
TestMVCCHistories, so some coverage is retained.

The randomization caused nondeterminism during refactors of testContext
(where it was then affecting tests that were written with separated
intents hard-coded to "on") and so there was an immediate reason to
remove it.

Release note: None
This test was duplicating parts of `createTestStoreWithoutStart` and it
was the one bogus hold-out user of `testSenderFactory`, which we now
get to remove.

Release note: None
@tbg tbg force-pushed the store-replicas-clarify branch from 5f9ea7c to 139cf34 Compare November 4, 2021 21:54
@tbg tbg removed the request for review from a team November 4, 2021 22:21
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM at a glance. CI seems green too.

@tbg
Copy link
Member Author

tbg commented Nov 6, 2021

tftr!

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Nov 6, 2021

Build failed:

@tbg
Copy link
Member Author

tbg commented Nov 6, 2021 via email

It wasn't true that the first gossiped config we'd see would be the one
triggered by the lease.

Interestingly, I found that this test was flaky on master as well but it
isn't [detected] during nightly stress:

```
=== RUN   TestReplicaGossipConfigsOnLease
    test_log_scope.go:79: test logs captured to: /tmp/logTestReplicaGossipConfigsOnLease261191452
    test_log_scope.go:80: use -show-logs to present logs inline
    replica_test.go:1204: unexpected gossip of system config: values:<key:"\222" value:<raw_bytes:"\000\000\000\000\001T" timestamp:<> > >
    replica_test.go:1253: -- test log scope end --
test logs left over in: /tmp/logTestReplicaGossipConfigsOnLease261191452
--- FAIL: TestReplicaGossipConfigsOnLease (0.03s)
```

I checked the last [nightly] and the package did get stressed, with 20
iterations passing. Somewhat odd that I can reproduce this with a
targeted stress pretty much instantly, and it's not showing up for
years in CI.

[detected]: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+TestReplicaGossipConfigsOnLease+is%3Aclosed
[nightly]: https://teamcity.cockroachdb.com/viewLog.html?buildId=3693551&buildTypeId=Cockroach_Nightlies_Stress&tab=buildLog&_focus=383#_state=226

Release note: None
@tbg
Copy link
Member Author

tbg commented Nov 8, 2021

I went through 200 iters of the kvserver package (thx roachprod-stress) without issues. The one flake I saw during CI is now fixed; it was also flaky on master before.

bors r=erikgrinaker

@tbg
Copy link
Member Author

tbg commented Nov 8, 2021

bors r-

ha, overlooked the "21 failures" line of my nonstandard invocation of make roachprod-stress 🙈

@craig
Copy link
Contributor

craig bot commented Nov 8, 2021

Canceled.

@tbg
Copy link
Member Author

tbg commented Nov 8, 2021

bors r=erikgrinaker
158 runs so far, 0 failures, over 49m15s
I hadn't passed -p previously, so I was overloading the machines (probably).

@craig
Copy link
Contributor

craig bot commented Nov 8, 2021

Build succeeded:

@craig craig bot merged commit f29118b into cockroachdb:master Nov 8, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Dec 8, 2022
Some tests do add peers and so we'll see some use of the
raft transport. This doesn't have to work properly, just
not crash, which this achieves.

This is essentially a re-do of cockroachdb#69730 which was lost in cockroachdb#72383.

Release note: None
tbg added a commit that referenced this pull request Dec 20, 2022
Some tests do add peers and so we'll see some use of the
raft transport. This doesn't have to work properly, just
not crash, which this achieves.

This is essentially a re-do of #69730 which was lost in #72383.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants