
roachtest: add synctest #31187

Merged
merged 5 commits into from
Oct 10, 2018

Conversation

tbg
Member

@tbg tbg commented Oct 10, 2018

This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs
the synctest cli command against a nemesis that injects random I/O
errors.

The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine,
directory.

This is used by the roachtest, but it is also useful to run manually in user
deployments in which we suspect a failure to persist data to disk.

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from sentry.

Release note: None

@tbg tbg requested a review from a team as a code owner October 10, 2018 14:08
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested a review from bdarnell October 10, 2018 14:09
tbg added a commit to tbg/roachprod that referenced this pull request Oct 10, 2018
Contributor

@bdarnell bdarnell left a comment

Which errors does this turn up?

Reviewed 1 of 1 files at r3, 2 of 2 files at r4, 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/synctest.go, line 25 at r5 (raw file):

#!/bin/bash

if [ $1 == "on" ]; then

Where's the actual fault injection?


pkg/cmd/roachtest/synctest.go, line 33 at r5 (raw file):

		Name:   "synctest",
		Nodes:  nodes(1),
		Stable: true, // DO NOT COPY to new tests

Is this stable? What exactly does a passing result mean? Is it worth adding this in the roachtest suite or should we just leave it to be run manually?


pkg/cmd/roachtest/synctest.go, line 47 at r5 (raw file):

			t.Status("running synctest")
			c.Run(ctx, n, "./cockroach debug synctest {store-dir}/faulty ./nemesis.sh")

I don't see where nemesis.sh is written.

@tbg tbg force-pushed the cli/debug-synctest branch 3 times, most recently from f41634e to a1036f8 Compare October 10, 2018 20:15
tbg added 4 commits October 10, 2018 22:16
That better reflects what it's doing and I'm adding an actual synctest
utility.

Release note: None
To allow creating throwaway engines.

Release note: None
This is code that simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to
disk synchronously) until an I/O error occurs, at which point it
re-opens the instance to verify that all persisted writes are still
there. If the RocksDB instance was permanently corrupted, it switches to
a new, pristine, directory.

This is to be used in combination with an upcoming roachtest that uses
charybdefs to inject failures, but it can also be used manually in user
deployments in which we suspect there is a failure to persist data to
disk.

Release note: None
@tbg tbg force-pushed the cli/debug-synctest branch 6 times, most recently from 39c6a76 to 9f12513 Compare October 10, 2018 20:33
This new roachtest sets up a charybdefs on a single (Ubuntu) node and
runs the `synctest` cli command against a nemesis that injects random
I/O errors.

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from sentry.

Release note: None
@tbg tbg force-pushed the cli/debug-synctest branch from 9f12513 to 01d6bda Compare October 10, 2018 20:34
Member Author

@tbg tbg left a comment

Note the lines containing "Corruption" (where we fried the RocksDB dir). The remaining syscall errors are injected and the only real info is where in RocksDB they bubble up.

error after seq 1023 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/001055.sst: Permission denied
error after seq 1023: IO error: While appending to file: /mnt/data1/cockroach/faulty/3/001055.sst: Permission denied
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=1024).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 1217 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
./nemesis off: Clearing all faults conditions
error after seq 1217 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
error after seq 1217 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
error after seq 1217: IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/3 corrupted: could not open rocksdb instance: Corruption: Can't access /001253.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/3/001253.sst: No such file or directory

verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 241 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
error after seq 241 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
error after seq 241: IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=242).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 309 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000250: Link number out of range
error after seq 309: IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000250: Link number out of range
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/4 corrupted: could not open rocksdb instance: Corruption: Can't access /000321.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/4/000321.sst: No such file or directory

verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 167 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
./nemesis off: Clearing all faults conditions
error after seq 167 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
error after seq 167 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
error after seq 167: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=168).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 191 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000199.log: File too large
error after seq 192 (trying 0 additional writes): <nil>
error after seq 193: <nil>
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=193).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 245 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
error after seq 245 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
error after seq 245: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=245).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 283 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
./nemesis off: Clearing all faults conditions
error after seq 283 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
error after seq 283 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
error after seq 283: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/5 corrupted: could not open rocksdb instance: Corruption: Can't access /000307.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/5/000307.sst: No such file or directory

verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 145 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
error after seq 145 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
error after seq 145: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=146).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 171 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/MANIFEST-000153: Illegal seek
error after seq 171: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/MANIFEST-000153: Illegal seek
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=171).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 325 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
./nemesis off: Clearing all faults conditions
error after seq 325 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
error after seq 325 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
error after seq 325: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=326).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 487 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000505.log: Block device required
./nemesis off: Clearing all faults conditions
error after seq 488 (trying 1 additional writes): <nil>
error after seq 489 (trying 0 additional writes): <nil>
error after seq 490: <nil>
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=490).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 879 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000906.sst: No locks available
error after seq 879: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000906.sst: No locks available
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=880).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 1089 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
./nemesis off: Clearing all faults conditions
error after seq 1089 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
error after seq 1089 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
error after seq 1089: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=1090).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/synctest.go, line 25 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Where's the actual fault injection?

I forgot to copy-paste in the actual nemesis script that I used for testing this.


pkg/cmd/roachtest/synctest.go, line 33 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this stable? What exactly does a passing result mean? Is it worth adding this in the roachtest suite or should we just leave it to be run manually?

It means that the synctest cli returned zero, which means that it ran for 10 minutes without detecting a failure. I'd like to run this nightly; otherwise it'll immediately catch rust. If it turns out to be flaky for the wrong reasons, we can skip it.


pkg/cmd/roachtest/synctest.go, line 47 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't see where nemesis.sh is written.

I forgot to copy-paste in the actual nemesis script that I used for testing this.

Contributor

@bdarnell bdarnell left a comment

Reviewed 1 of 1 files at r8, 1 of 2 files at r9, 3 of 3 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg
Member Author

tbg commented Oct 10, 2018 via email

craig bot pushed a commit that referenced this pull request Oct 10, 2018
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs
the `synctest` cli command against a nemesis that injects random I/O
errors.

The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine,
directory.

This is used by the roachtest, but it is also useful to run manually in user
deployments in which we suspect a failure to persist data to disk.

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down
before it processed the split at "b". Teach the test to wait for the LHS
replica on store2 to process the split before blocking Raft traffic to
it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

This makes the testdata work on MySQL 8.0.2+, where the timestamp type no longer has the implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
@craig
Contributor

craig bot commented Oct 10, 2018

Build succeeded

@craig craig bot merged commit 01d6bda into cockroachdb:master Oct 10, 2018
@tbg tbg deleted the cli/debug-synctest branch October 11, 2018 07:18