-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: disk stall detector does not fire on dmsetup suspend
#94373
Comments
Crashing the process when a stall is detected happens here, within a callback on a Pebble EventListener: cockroach/pkg/storage/pebble.go Lines 1062 to 1079 in 6af6c6d
The cockroach/pkg/storage/pebble.go Lines 985 to 991 in 6af6c6d
EventListener.DiskSlow invocation wouldn't get stuck in the logging event listener's attempt to write the event to the log file.
|
We could just reverse the order, right? If it is a crashing event, then we don't mind that the Pebble logs miss that event. |
Does the |
Yeah, I'm going to try to grab a stack trace from the above roachtest during the stall to confirm.
Looks like it
|
Repro:
|
I think we also might need to run |
But later on it configures the logging dir as either |
Ah, thanks—maybe this explains it. If charybdefs only delays syscalls for 50ms, we'll eventually return and crash the process.
|
@jbowens : that's another good idea: potentially call |
Grabbing stack traces from the running node, there's a different or additional issue. When we open a new WAL, we call But before we begin using the new WAL, we wrap it with But in the time since disk-health checking was added we've accumulated additional VFS middleware, and the disk-health checking VFS is not the top of the stack. This leaves these non-
The interface assertion contortion here is way too brittle, and I think we should elevate what we need up to the |
If I'm understanding this suggestion correctly, it is trying to solve for ensuring that the goroutine watching for stalled disk IOPs has a runnable thread. Do we have any evidence that is a problem? The Go runtime would have a serious bug if there is a scenario where a runnable goroutine is not run for a significant period of time. As @jbowens is discovering, I suspect we simply have other more basic bugs in the stalled IOPs detection. |
I did some reductionist testing yesterday to look at the go runtime's behavior if you have many goroutines doing IO and then you starve it out. In the limit, the 1.19 runtime will allow only 10,000 OS threads for executing (blocked) syscalls and then panic once it hits the limit. I was never able to break the HTTP service with blocked disk IO because, as I understand it, the epoll loop has a dedicated OS thread that proceeds to dispatch into the rest of the runtime. |
To be defensive, sequence the EventListener responsible for crashing the process during a disk stall first, before the Pebble logging event listener. Informs cockroachdb#94373. Epic: None Release note: None
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as first-class methods. Previously, the file created through vfs.NewSyncingFile would perform analagous I/O operations (eg, `Fdatasync`, `sync_file_range`, `Fallocate`) outside the context of the interface. This easily allowed the accidental loss of the disk-health checking over these operations by minor tweaks to the interface of the disk-health checking implementation or by adding intermediary VFS wrappers between the disk-health checking FS and the syncing file. See cockroachdb/cockroach#94373 where the introduction of an additional VFS wrapper resulted in these I/O operations being uncovered by disk-stall detection.
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs cockroachdb#94373. Epic: None Release note: None
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as first-class methods. Previously, the file created through vfs.NewSyncingFile would perform analagous I/O operations (eg, `Fdatasync`, `sync_file_range`, `Fallocate`) outside the context of the interface. This easily allowed the accidental loss of the disk-health checking over these operations by minor tweaks to the interface of the disk-health checking implementation or by adding intermediary VFS wrappers between the disk-health checking FS and the syncing file. See cockroachdb/cockroach#94373 where the introduction of an additional VFS wrapper resulted in these I/O operations being uncovered by disk-stall detection.
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs cockroachdb#94373. Epic: None Release note: None
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as first-class methods. Previously, the file created through vfs.NewSyncingFile would perform analagous I/O operations (eg, `Fdatasync`, `sync_file_range`, `Fallocate`) outside the context of the interface. This easily allowed the accidental loss of the disk-health checking over these operations by minor tweaks to the interface of the disk-health checking implementation or by adding intermediary VFS wrappers between the disk-health checking FS and the syncing file. See cockroachdb/cockroach#94373 where the introduction of an additional VFS wrapper resulted in these I/O operations being uncovered by disk-stall detection.
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as first-class methods. Previously, the file created through vfs.NewSyncingFile would perform analagous I/O operations (eg, `Fdatasync`, `sync_file_range`, `Fallocate`) outside the context of the interface. This easily allowed the accidental loss of the disk-health checking over these operations by minor tweaks to the interface of the disk-health checking implementation or by adding intermediary VFS wrappers between the disk-health checking FS and the syncing file. See cockroachdb/cockroach#94373 where the introduction of an additional VFS wrapper resulted in these I/O operations being uncovered by disk-stall detection.
95622: backupccl,storage: add logs around manifest handling and ExportRequest pagination r=stevendanna a=adityamaru backupccl: add logging to backup manifest handling Release note: None storage: log the ExportRequest pagination reason Release note: None Epic: None 95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see #95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs #94373. Epic: None Release note: None 95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler This patch introduces a simple roachtest that runs in a shared-process tenant. This test imports a 500 tpcc workload (about 30 GB of replicated data), and runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster with local ssds. A future patch could complicate the test by running schema changes or other bulk operations. Fixes #95990 Release note: None 96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer. Supported constraints include Checks, FK, and UniqueWithoutIndex. Dropping PK or Unique constraints will fall back to legacy schema changer, which in turn spits out an "not supported yet" error. Epic: None 96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11 Previously, the optimizer did not plan inverted index scans for filters having an integer as the index for the fetch value in a filter alongside the "contains" or the "contained by" operator. To address this, we now build JSON arrays from fetch value expressions with integer indexes. From these JSON arrays, inverted spans are built for constraining scans over inverted indexes. With these changes chains of both integer and string fetch value operators are now supported alongside the "contains" and the "contained by" operators. (e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}'). Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301) Fixes: #94667 Release note (performance improvement): The optimizer now plans inverted index scans for queries that filter by JSON fetch value operators (->) with integer indices alongside the "contains" or the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}' or json_col->0 <@ '{"b": "c"}' 96235: sem/tree: add support for producing vectorized data from strings r=cucaroach a=cucaroach tree.ValueHandler exposes raw machine type hooks that are used by vec_handler to build coldata.Vec's. Epic: CRDB-18892 Informs: #91831 Release note: None 96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`. Fixes #96326 Release note: None 96366: release: skip nil GitHub events r=celiala a=rail Previously, we referenced `*event.Event`, but in some cases the event objects are `nil`. This PR skips the nil GitHub event objects. Epic: none Release note: None Co-authored-by: adityamaru <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Shivam Saraf <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Drew Kimball <[email protected]> Co-authored-by: Rail Aliiev <[email protected]>
95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see #95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs #94373. Epic: None Release note: None 95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler This patch introduces a simple roachtest that runs in a shared-process tenant. This test imports a 500 tpcc workload (about 30 GB of replicated data), and runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster with local ssds. A future patch could complicate the test by running schema changes or other bulk operations. Fixes #95990 Release note: None 96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11 Previously, the optimizer did not plan inverted index scans for filters having an integer as the index for the fetch value in a filter alongside the "contains" or the "contained by" operator. To address this, we now build JSON arrays from fetch value expressions with integer indexes. From these JSON arrays, inverted spans are built for constraining scans over inverted indexes. With these changes chains of both integer and string fetch value operators are now supported alongside the "contains" and the "contained by" operators. (e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}'). Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301) Fixes: #94667 Release note (performance improvement): The optimizer now plans inverted index scans for queries that filter by JSON fetch value operators (->) with integer indices alongside the "contains" or the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}' or json_col->0 <@ '{"b": "c"}' 96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`. Fixes #96326 Release note: None Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Shivam Saraf <[email protected]> Co-authored-by: Drew Kimball <[email protected]>
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs cockroachdb#94373. Epic: None Release note: None
@jbowens Can we close this out now? The motivating benchmark is now showing disk stalls handled correctly: https://roachperf.crdb.dev/?filter=&view=failover%2Fnon-system%2Fdisk-stall&tab=gce. |
Yeah, let's close it out. |
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs cockroachdb#94373. Epic: None Release note: None
To be defensive, sequence the EventListener responsible for crashing the process during a disk stall first, before the Pebble logging event listener. Informs cockroachdb#94373. Epic: None Release note: None
The pebble logger could block if we're experiencing a slow / stalling disk. If the call to the pebble logger is synchronous from the EventListener passed into Pebble, it could end up slowing down Pebble's internal disk health checks as those rely on EventListener methods being quick to run. This change updates the logging event listener to asynchronously call the logger on a DiskSlow event. Related to cockroachdb#94373. Epic: none Release note: None.
#94240 added a roachtest that measures pMax latency during a leaseholder disk stall. The expectation was that the disk stall detector would fire after 20 seconds and restart the node, but this never happened. The disk was confirmed to be stalled by running
touch /mnt/data1/foo && sync
, and the relevant cluster settings were confirmed to use the defaults.Repro:
Or alternatively just run the
failover/non-system/disk-stall
roachtest, and inspect e.g. n4 once it's stalled:Jira issue: CRDB-22860
The text was updated successfully, but these errors were encountered: