roachtest: add synctest #31187
Which errors does this turn up?
Reviewed 1 of 1 files at r3, 2 of 2 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/synctest.go, line 25 at r5 (raw file):
#!/bin/bash
if [ $1 == "on" ]; then
Where's the actual fault injection?
pkg/cmd/roachtest/synctest.go, line 33 at r5 (raw file):
Name: "synctest",
Nodes: nodes(1),
Stable: true, // DO NOT COPY to new tests
Is this stable? What exactly does a passing result mean? Is it worth adding this in the roachtest suite or should we just leave it to be run manually?
pkg/cmd/roachtest/synctest.go, line 47 at r5 (raw file):
t.Status("running synctest")
c.Run(ctx, n, "./cockroach debug synctest {store-dir}/faulty ./nemesis.sh")
I don't see where nemesis.sh is written.
f41634e to a1036f8
That better reflects what it's doing and I'm adding an actual synctest utility. Release note: None
To allow creating throwaway engines. Release note: None
Release note: None
This is code that simulates a Raft log and can be directed at a filesystem that is being hit with random failures.
The workload essentially writes ascending keys (flushing each one to disk synchronously) until an I/O error occurs, at which point it re-opens the instance to verify that all persisted writes are still there. If the RocksDB instance was permanently corrupted, it switches to a new, pristine directory.
This is to be used in combination with an upcoming roachtest that uses charybdefs to inject failures, but it can also be used manually in user deployments in which we suspect there is a failure to persist data to disk.
Release note: None
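The workload described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `fakeStore`, `put`, `verify`, and `run` are hypothetical names, and an in-memory map with randomly injected errors stands in for RocksDB on a faulty filesystem.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// errInjected stands in for an I/O error surfaced by the faulty filesystem.
var errInjected = errors.New("injected I/O error")

// fakeStore is a hypothetical stand-in for the storage engine. It records
// which sequence numbers would be found on disk after a re-open.
type fakeStore struct {
	rng      *rand.Rand
	failProb float64
	durable  map[int64]bool
}

func newFakeStore(seed int64, failProb float64) *fakeStore {
	return &fakeStore{
		rng:      rand.New(rand.NewSource(seed)),
		failProb: failProb,
		durable:  map[int64]bool{},
	}
}

// put simulates a synchronous write. A nil return means the write is durable.
// A failed write may or may not have reached disk.
func (s *fakeStore) put(seq int64) error {
	if s.rng.Float64() < s.failProb {
		if s.rng.Intn(2) == 0 {
			s.durable[seq] = true
		}
		return errInjected
	}
	s.durable[seq] = true
	return nil
}

// verify simulates re-opening the store: every acknowledged sequence number
// in [0, acked] must be present. It returns the next sequence number to write.
func (s *fakeStore) verify(acked int64) (int64, error) {
	for seq := int64(0); seq <= acked; seq++ {
		if !s.durable[seq] {
			return 0, fmt.Errorf("lost acknowledged write %d", seq)
		}
	}
	next := acked + 1
	if s.durable[next] {
		// The write that returned an error made it to disk anyway.
		next++
	}
	return next, nil
}

// run writes ascending sequence numbers until an error occurs (or maxWrites
// is reached), then verifies, and repeats this for the given number of rounds.
// It returns the highest acknowledged sequence number.
func run(s *fakeStore, rounds, maxWrites int) (int64, error) {
	next, acked := int64(0), int64(-1)
	for i := 0; i < rounds; i++ {
		for w := 0; w < maxWrites; w++ {
			if err := s.put(next); err != nil {
				break
			}
			acked = next
			next++
		}
		n, err := s.verify(acked)
		if err != nil {
			return acked, err
		}
		next = n
	}
	return acked, nil
}

func main() {
	s := newFakeStore(1, 0.01)
	acked, err := run(s, 10, 1000)
	if err != nil {
		fmt.Println("durability violation:", err)
		return
	}
	fmt.Printf("all acknowledged writes up to seq %d survived\n", acked)
}
```

The sketch leaves out the switch to a fresh directory after permanent corruption; it only models the write-until-error and verify-after-reopen cycle.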
39c6a76 to 9f12513
This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs the `synctest` cli command against a nemesis that injects random I/O errors. This hasn't found anything, but it's fun to watch and also shows us a number of errors that we know and love from sentry. Release note: None
9f12513 to 01d6bda
Note the lines containing "Corruption" (where we fried the RocksDB dir). The remaining syscall errors are injected and the only real info is where in RocksDB they bubble up.
error after seq 1023 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/001055.sst: Permission denied
error after seq 1023: IO error: While appending to file: /mnt/data1/cockroach/faulty/3/001055.sst: Permission denied
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=1024).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 1217 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
./nemesis off: Clearing all faults conditions
error after seq 1217 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
error after seq 1217 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
error after seq 1217: IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/3 corrupted: could not open rocksdb instance: Corruption: Can't access /001253.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/3/001253.sst: No such file or directory
verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 241 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
error after seq 241 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
error after seq 241: IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000008: Too many open files
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=242).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 309 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000250: Link number out of range
error after seq 309: IO error: While appending to file: /mnt/data1/cockroach/faulty/4/MANIFEST-000250: Link number out of range
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/4 corrupted: could not open rocksdb instance: Corruption: Can't access /000321.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/4/000321.sst: No such file or directory
verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 167 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
./nemesis off: Clearing all faults conditions
error after seq 167 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
error after seq 167 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
error after seq 167: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000174.sst: Resource temporarily unavailable
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=168).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 191 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/000199.log: File too large
error after seq 192 (trying 0 additional writes): <nil>
error after seq 193: <nil>
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=193).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 245 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
error after seq 245 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
error after seq 245: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000206: File too large
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=245).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 283 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
./nemesis off: Clearing all faults conditions
error after seq 283 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
error after seq 283 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
error after seq 283: IO error: While appending to file: /mnt/data1/cockroach/faulty/5/MANIFEST-000264: Unknown error 41
./nemesis off: Clearing all faults conditions
RocksDB directory /mnt/data1/cockroach/faulty/5 corrupted: could not open rocksdb instance: Corruption: Can't access /000307.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/5/000307.sst: No such file or directory
verifying existing sequence numbers...done (seq=0).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
./nemesis off: Clearing all faults conditions
error after seq 145 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
error after seq 145 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
error after seq 145: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000152.sst: No such device
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=146).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 171 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/MANIFEST-000153: Illegal seek
error after seq 171: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/MANIFEST-000153: Illegal seek
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=171).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 325 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
./nemesis off: Clearing all faults conditions
error after seq 325 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
error after seq 325 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
error after seq 325: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000342.sst: Level 3 reset
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=326).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 487 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000505.log: Block device required
./nemesis off: Clearing all faults conditions
error after seq 488 (trying 1 additional writes): <nil>
error after seq 489 (trying 0 additional writes): <nil>
error after seq 490: <nil>
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=490).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 879 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000906.sst: No locks available
error after seq 879: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/000906.sst: No locks available
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=880).
Writing new entries:
./nemesis on: Restricting random IO restricted to specific syscalls and 1% error probability
error after seq 1089 (trying 2 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
./nemesis off: Clearing all faults conditions
error after seq 1089 (trying 1 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
error after seq 1089 (trying 0 additional writes): IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
error after seq 1089: IO error: While appending to file: /mnt/data1/cockroach/faulty/6/001120.sst: Bad address
./nemesis off: Clearing all faults conditions
verifying existing sequence numbers...done (seq=1090).
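To pick the "Corruption" lines out of a log like the one above and see which injected errors show up, a small filter along these lines could be used. This is only a sketch: `summarize` is a hypothetical helper, and it assumes the line shapes seen in this particular output.

```go
package main

import (
	"fmt"
	"strings"
)

// summarize returns the lines that report a corrupted RocksDB directory and a
// count of injected I/O errors keyed by their trailing error message.
func summarize(lines []string) (corrupted []string, injected map[string]int) {
	injected = make(map[string]int)
	for _, line := range lines {
		switch {
		case strings.Contains(line, "Corruption"):
			corrupted = append(corrupted, line)
		case strings.HasPrefix(line, "error after seq") && strings.Contains(line, "IO error:"):
			// The error message is whatever follows the last ": ".
			idx := strings.LastIndex(line, ": ")
			injected[line[idx+2:]]++
		}
	}
	return corrupted, injected
}

func main() {
	// Sample lines copied from the log above.
	lines := []string{
		"error after seq 1217: IO error: While appending to file: /mnt/data1/cockroach/faulty/3/MANIFEST-001056: Block device required",
		"./nemesis off: Clearing all faults conditions",
		"RocksDB directory /mnt/data1/cockroach/faulty/3 corrupted: could not open rocksdb instance: Corruption: Can't access /001253.sst: IO error: while stat a file for size: /mnt/data1/cockroach/faulty/3/001253.sst: No such file or directory",
		"error after seq 192 (trying 0 additional writes): <nil>",
	}
	corrupted, injected := summarize(lines)
	fmt.Println("corrupted directories:", len(corrupted))
	for msg, n := range injected {
		fmt.Printf("%d x %s\n", n, msg)
	}
}
```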
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/synctest.go, line 25 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Where's the actual fault injection?
I forgot to copy-paste in the actual nemesis script that I used for testing this.
pkg/cmd/roachtest/synctest.go, line 33 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Is this stable? What exactly does a passing result mean? Is it worth adding this in the roachtest suite or should we just leave it to be run manually?
It means that the `synctest` cli returned zero, which means that it ran for 10 minutes without detecting a failure. I'd like to run this nightly; otherwise it'll immediately catch rust. If it turns out to be flaky for the wrong reasons, we can skip it.
pkg/cmd/roachtest/synctest.go, line 47 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't see where nemesis.sh is written.
I forgot to copy-paste in the actual nemesis script that I used for testing this.
Reviewed 1 of 1 files at r8, 1 of 2 files at r9, 3 of 3 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained
TFTR!
bors r=bdarnell
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf
Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off.
This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.
It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:
1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem
@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.
Fixes #30613.
Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf
This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs the `synctest` cli command against a nemesis that injects random I/O errors.
The synctest command is new. It simulates a Raft log and can be directed at a filesystem that is being hit with random failures.
The workload essentially writes ascending keys (flushing each one to disk synchronously) until an I/O error occurs, at which point it re-opens the instance to verify that all persisted writes are still there. If the RocksDB instance was permanently corrupted, it switches to a new, pristine, directory.
This is used in the roachtest, but is also useful for manual use in user deployments in which we suspect there is a failure to persist data to disk.
This hasn't found anything, but it's fun to watch and also shows us a number of errors that we know and love from sentry.
Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch
This test could deadlock if the LHS replica on store2 was shut down before it processed the split at "b". Teach the test to wait for the LHS replica on store2 to process the split before blocking Raft traffic to it.
Fixes #31096. Fixes #31149. Fixes #31160. Fixes #31167.
Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt
this makes the testdata work on mysql 8.0.2+, where the timestamp type no longer has the implicit defaults.
Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell
Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Build succeeded
This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs
the `synctest` cli command against a nemesis that injects random I/O errors.
The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.
The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine
directory.
This is used in the roachtest, but is also useful for manual use in user
deployments in which we suspect there is a failure to persist data to disk.
This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from sentry.
Release note: None