storage: disk stall detector does not fire on dmsetup suspend #94373

Closed
erikgrinaker opened this issue Dec 28, 2022 · 14 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Dec 28, 2022

#94240 added a roachtest that measures pMax latency during a leaseholder disk stall. The expectation was that the disk stall detector would fire after 20 seconds and restart the node, but this never happened. The disk was confirmed to be stalled by running touch /mnt/data1/foo && sync, and the relevant cluster settings were confirmed to use the defaults.

Repro:

# Set up a simple linear mapping (pass-through) of the underlying block device and mount it.
$ sudo umount /mnt/data1
$ echo "0 $(sudo blockdev --getsz /dev/nvme0n1) linear /dev/nvme0n1 0" | sudo dmsetup create data1
$ sudo mount /dev/mapper/data1 /mnt/data1

# Start CRDB.
$ roachprod start foo

# Suspend IO.
$ sudo dmsetup suspend --noflush --nolockfs data1

# Resume IO.
$ sudo dmsetup resume --noflush --nolockfs data1

Alternatively, run the failover/non-system/disk-stall roachtest and inspect e.g. n4 once it's stalled:

$ roachtest run failover/non-system/disk-stall --debug-always

Jira issue: CRDB-22860

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Dec 28, 2022
@jbowens
Collaborator

jbowens commented Jan 24, 2023

Crashing the process when a stall is detected happens here, within a callback on a Pebble EventListener:

DiskSlow: func(info pebble.DiskSlowInfo) {
	maxSyncDuration := MaxSyncDuration.Get(&p.settings.SV)
	fatalOnExceeded := MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)
	if info.Duration.Seconds() >= maxSyncDuration.Seconds() {
		atomic.AddInt64(&p.diskStallCount, 1)
		// Note that the below log messages go to the main cockroach log, not
		// the pebble-specific log.
		if fatalOnExceeded {
			log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
				info.Path, redact.Safe(info.Duration.Seconds()))
		} else {
			log.Errorf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
				info.Path, redact.Safe(info.Duration.Seconds()))
		}
		return
	}
	atomic.AddInt64(&p.diskSlowCount, 1)
},

The EventListener installed on Pebble wraps two event listeners, the default logging one and the crashing one:

el := pebble.TeeEventListener(
	pebble.MakeLoggingEventListener(pebbleLogger{
		ctx:   logCtx,
		depth: 2, // skip over the EventListener stack frame
	}),
	p.makeMetricEtcEventListener(ctx),
)

The crashing one is installed second, so I don't see anything preventing the EventListener.DiskSlow invocation from getting stuck in the logging event listener's attempt to write the event to the log file.
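To make the hazard concrete, here is a minimal standalone sketch (toy types, not the actual Pebble EventListener API) of how sequentially tee'd DiskSlow callbacks behave when the first one blocks on a stalled log write:

package main

import (
	"fmt"
	"time"
)

// diskSlowFunc mirrors the shape of a DiskSlow callback: the disk-health
// checking goroutine invokes it when an I/O operation exceeds a threshold.
type diskSlowFunc func(duration time.Duration)

// tee invokes a, then b -- the same sequencing a tee'd event listener uses.
func tee(a, b diskSlowFunc) diskSlowFunc {
	return func(d time.Duration) {
		a(d) // if this blocks (e.g. a log write to the stalled disk), b never runs
		b(d)
	}
}

func main() {
	logging := func(d time.Duration) {
		select {} // stand-in for a log write stuck on the stalled disk
	}
	crashing := func(d time.Duration) {
		fmt.Println("would crash the process after", d)
	}

	handler := tee(logging, crashing)
	go handler(30 * time.Second) // the crashing callback is never reached

	time.Sleep(time.Second)
	fmt.Println("process still alive; stall detector never fired")
}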

@sumeerbhola
Collaborator

We could just reverse the order, right? If it is a crashing event, then we don't mind that the Pebble logs miss that event.

@petermattis
Collaborator

Does the charybdefs-based disk-stall roachtest only stall the store directory and not the logging directory?

@jbowens
Collaborator

jbowens commented Jan 24, 2023

Yeah, I'm going to try to grab a stack trace from the above roachtest during the stall to confirm.

Does the charybdefs-based disk-stall roachtest only stall the store directory and not the logging directory?

Looks like it

	c.Run(ctx, n, "sudo charybdefs {store-dir}/faulty -oallow_other,modules=subdir,subdir={store-dir}/real")

@sean-
Collaborator

sean- commented Jan 24, 2023

Repro:

  1. Create a 3x AZ cluster with 3x nodes per AZ.
  2. Move the crdb process into a cgroup.
  3. Determine the device names of the stores mounted on each crdb node (e.g., via lsblk).
  4. Run any KV workload that spreads the work across all three nodes (50/50 read/write).
  5. On one (or all) of the nodes in one of the AZs, run the following:

DEVICE_NAMES="data1|data2"
disklist=$(lsblk | egrep -i "${DEVICE_NAMES}" | awk '{print $1}')

# Throttle read/write bandwidth for each store device to 1 byte/sec.
i=1
for diskname in $disklist
do
  echo "$(lsblk -np /dev/${diskname} | awk '{print $2}') $i" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
  echo "$(lsblk -np /dev/${diskname} | awk '{print $2}') $i" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
done

# Throttle read/write IOPS for each store device to 1 op/sec.
i=1
for diskname in $disklist
do
  echo "$(lsblk -np /dev/${diskname} | awk '{print $2}') $i" > /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device
  echo "$(lsblk -np /dev/${diskname} | awk '{print $2}') $i" > /sys/fs/cgroup/blkio/blkio.throttle.read_iops_device
done

@jbowens
Collaborator

jbowens commented Jan 24, 2023

We could just reverse the order, right?

I think we also might need to run DiskSlow handlers in a separate goroutine. Otherwise, it seems like a stalled log write could halt the goroutine that's responsible for monitoring a file's writes. The first 'DiskSlow' event for a non-fatal duration may never return.
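For what it's worth, a minimal sketch of that direction, reusing the toy callback type from the sketch above rather than the actual Pebble types (the change referenced later in this thread took a similar shape by having the logging event listener call the logger asynchronously on DiskSlow):

package main

import (
	"fmt"
	"time"
)

type diskSlowFunc func(duration time.Duration)

// asyncDiskSlow hands the wrapped callback off to its own goroutine, so a
// blocked log write cannot stall the goroutine that monitors file I/O.
// A real implementation would also want to bound or deduplicate these
// goroutines so a long stall doesn't spawn one per detection tick.
func asyncDiskSlow(fn diskSlowFunc) diskSlowFunc {
	return func(d time.Duration) {
		go fn(d)
	}
}

func main() {
	stuckLogger := asyncDiskSlow(func(d time.Duration) {
		select {} // stand-in for a log write stuck on the stalled disk
	})
	stuckLogger(30 * time.Second) // returns immediately
	fmt.Println("monitoring goroutine stays free to keep checking, and to crash the process")
}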

@petermattis
Collaborator

Does the charybdefs-based disk-stall roachtest only stall the store directory and not the logging directory?

Looks like it

But later on it configures the logging dir as either faulty/logs or real/logs. That roachtest tests the matrix of stalled logging dir and stalled store dir. See registerDiskStalledDetection.

@jbowens
Collaborator

jbowens commented Jan 24, 2023

But later on it configures the logging dir as either faulty/logs or real/logs. That roachtest tests the matrix of stalled logging dir and stalled store dir. See registerDiskStalledDetection.

Ah, thanks—maybe this explains it. If charybdefs only delays syscalls for 50ms, we'll eventually return and crash the process.

	// NB: charybdefs' delay nemesis introduces 50ms per syscall. It would
	// be nicer to introduce a longer delay, but this works.
	tooShortSync := 40 * time.Millisecond

@sean-
Collaborator

sean- commented Jan 24, 2023

@jbowens : that's another good idea: potentially call runtime.LockOSThread for the goroutine handling DiskSlow so its underlying OS thread can't be preempted for other work that will be caught up in the disk stall.

@jbowens
Collaborator

jbowens commented Jan 24, 2023

Grabbing stack traces from the running node, there's a different or additional issue. When we open a new WAL, we call (vfs.FS).Create or (vfs.FS).ReuseForWrite to obtain a file handle for our new WAL. Since the VFS passed into Options.FS is wrapped with disk health checking, this file handle has disk-health checking.

But before we begin using the new WAL, we wrap it with vfs.NewSyncingFile to add periodic syncing to reduce latency spikes. The syncing file performs I/O outside the vfs.File interface (eg, Fdatasync), and this I/O needs to be timed. So the syncing file does a contortion: it checks whether the file it's wrapping is a disk-health checking file and, if so, wraps the operation with the timing: https://github.com/cockroachdb/pebble/blob/4199154043c56ed233b670d28114561073009c50/vfs/syncing_file.go#L62-L71

But in the time since disk-health checking was added we've accumulated additional VFS middleware, and the disk-health checking VFS is not the top of the stack. This leaves these non-vfs.File I/O ops untimed.

goroutine 96 [syscall, 21 minutes]:
syscall.Syscall(0x46f65f?, 0x476351?, 0xc001531b11?, 0xc0027e0d80?)
	GOROOT/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Fdatasync(0x46f9a7?)
	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:712 +0x30
github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync.func1()
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file_linux.go:63 +0x2a
github.com/cockroachdb/pebble/vfs.NewSyncingFile.func1(0x1080c0f?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:73 +0x1a
github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync(0xc001cb9e00)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file_linux.go:62 +0x9c
github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0x4f2c97?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:136 +0x46
github.com/cockroachdb/pebble/record.(*LogWriter).syncWithLatency(0xc0001dc500)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:545 +0x43
github.com/cockroachdb/pebble/record.(*LogWriter).flushPending(0xc0001dc500, {0xc001531b11, 0x220, 0x4f7}, {0xc001066000, 0x0, 0x4fdec6?}, 0x16e, 0x16d)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:532 +0x1dc
github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc0001dc500, {0x4ecce60, 0xae7af58})
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:466 +0x358
runtime/pprof.Do({0x6c413f8?, 0xc000128000?}, {{0xc000080880?, 0xa159a74074462db8?, 0x67f6ce90e736b243?}}, 0xc0012237c0)
	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble/record.NewLogWriter.func2()
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:351 +0x5c
created by github.com/cockroachdb/pebble/record.NewLogWriter
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:350 +0x456

The interface assertion contortion here is way too brittle, and I think we should elevate what we need up to the vfs.File interface like we did in cockroachdb/pebble#2262.
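A toy illustration of why the assertion-based plumbing is brittle (made-up types, not Pebble's actual ones): any wrapper sitting between the syncing file and the concrete disk-health checking file defeats the type assertion, and the operation silently falls back to the untimed path.

package main

import "fmt"

// Toy stand-ins for the VFS layering; not the actual Pebble types.
type file interface{ Sync() error }

type baseFile struct{}

func (baseFile) Sync() error { return nil }

// healthCheckingFile wraps a file and knows how to time operations for the
// stall detector.
type healthCheckingFile struct{ file }

func (f healthCheckingFile) timedOp(op func() error) error {
	// Pretend a stall timer is armed around op here.
	return op()
}

// middlewareFile is any later-added wrapper (encryption-at-rest, counting, ...).
type middlewareFile struct{ file }

// syncData mimics the syncing file's contortion: it only times the operation
// if the file it wraps happens to be the concrete health-checking type.
func syncData(f file) {
	if hf, ok := f.(healthCheckingFile); ok {
		_ = hf.timedOp(hf.Sync)
		fmt.Println("timed: the stall detector covers this op")
		return
	}
	_ = f.Sync() // untimed: a stall here is invisible to the detector
	fmt.Println("untimed: the stall detector misses this op")
}

func main() {
	hc := healthCheckingFile{baseFile{}}
	syncData(hc)                 // timed
	syncData(middlewareFile{hc}) // the assertion fails once middleware wraps it
}

Promoting SyncData, SyncTo, and Preallocate onto vfs.File itself (as in cockroachdb/pebble#2262 and the commits below) removes the assertion entirely: the idea being that every wrapper forwards those methods, so the health-checking layer stays in the path regardless of stacking order.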

@jbowens jbowens self-assigned this Jan 24, 2023
@petermattis
Collaborator

@jbowens : that's another good idea: potentially call runtime.LockOSThread for the go routine handling DiskSlow so its underlying sys thread can't be preempted for other work that will be caught up in the disk stall.

If I'm understanding this suggestion correctly, it is trying to ensure that the goroutine watching for stalled disk IOPs always has a runnable thread. Do we have any evidence that this is a problem? The Go runtime would have a serious bug if there were a scenario where a runnable goroutine is not run for a significant period of time. As @jbowens is discovering, I suspect we simply have other, more basic bugs in the stalled IOPs detection.

@bobvawter
Contributor

I did some reductionist testing yesterday to look at the Go runtime's behavior when many goroutines are doing I/O and the disk is starved out. In the limit, the 1.19 runtime will allow only 10,000 OS threads for executing (blocked) syscalls and then panic once it hits the limit. I was never able to break the HTTP service with blocked disk I/O because, as I understand it, the epoll loop has a dedicated OS thread that proceeds to dispatch into the rest of the runtime.

stall.tgz
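For reference, a rough sketch of the kind of thread-exhaustion experiment described above (this is an assumption about the setup; the actual harness is in stall.tgz). Goroutines blocked in raw, non-pollable syscalls each pin an OS thread, and the runtime is expected to abort once the default 10,000-thread cap is exceeded:

package main

import "syscall"

func main() {
	// A pipe created via syscall.Pipe is a plain blocking fd that is not
	// registered with the Go netpoller, so every goroutine blocked inside
	// syscall.Read occupies its own OS thread.
	fds := make([]int, 2)
	if err := syscall.Pipe(fds); err != nil {
		panic(err)
	}

	for i := 0; i < 11000; i++ {
		go func() {
			buf := make([]byte, 1)
			_, _ = syscall.Read(fds[0], buf) // blocks forever; nothing ever writes
		}()
	}

	// Once more than the default 10,000 threads are needed, the runtime is
	// expected to abort with a thread-exhaustion fatal error rather than keep
	// going; the cap can be raised with runtime/debug.SetMaxThreads.
	select {}
}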

jbowens added a commit to jbowens/cockroach that referenced this issue Jan 25, 2023
To be defensive, sequence the EventListener responsible for crashing the
process during a disk stall first, before the Pebble logging event listener.

Informs cockroachdb#94373.
Epic: None
Release note: None
jbowens added a commit to jbowens/pebble that referenced this issue Jan 25, 2023
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as
first-class methods. Previously, the file created through vfs.NewSyncingFile
would perform analogous I/O operations (eg, `Fdatasync`, `sync_file_range`,
`Fallocate`) outside the context of the interface. This easily allowed the
accidental loss of the disk-health checking over these operations by minor
tweaks to the interface of the disk-health checking implementation or by adding
intermediary VFS wrappers between the disk-health checking FS and the syncing
file.

See cockroachdb/cockroach#94373 where the introduction of an additional VFS
wrapper resulted in these I/O operations being uncovered by disk-stall
detection.
jbowens added a commit to jbowens/cockroach that referenced this issue Jan 31, 2023
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see cockroachdb#95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs cockroachdb#94373.
Epic: None
Release note: None
jbowens added a commit to cockroachdb/pebble that referenced this issue Jan 31, 2023
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as
first-class methods. Previously, the file created through vfs.NewSyncingFile
would perform analogous I/O operations (eg, `Fdatasync`, `sync_file_range`,
`Fallocate`) outside the context of the interface. This easily allowed the
accidental loss of the disk-health checking over these operations by minor
tweaks to the interface of the disk-health checking implementation or by adding
intermediary VFS wrappers between the disk-health checking FS and the syncing
file.

See cockroachdb/cockroach#94373 where the introduction of an additional VFS
wrapper resulted in these I/O operations being uncovered by disk-stall
detection.
jbowens added a commit to jbowens/cockroach that referenced this issue Jan 31, 2023
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see cockroachdb#95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs cockroachdb#94373.
Epic: None
Release note: None
jbowens added a commit to jbowens/pebble that referenced this issue Jan 31, 2023
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as
first-class methods. Previously, the file created through vfs.NewSyncingFile
would perform analogous I/O operations (eg, `Fdatasync`, `sync_file_range`,
`Fallocate`) outside the context of the interface. This easily allowed the
accidental loss of the disk-health checking over these operations by minor
tweaks to the interface of the disk-health checking implementation or by adding
intermediary VFS wrappers between the disk-health checking FS and the syncing
file.

See cockroachdb/cockroach#94373 where the introduction of an additional VFS
wrapper resulted in these I/O operations being uncovered by disk-stall
detection.
jbowens added a commit to cockroachdb/pebble that referenced this issue Jan 31, 2023
Expand the vfs.File interface to expose SyncData, SyncTo and Preallocate as
first-class methods. Previously, the file created through vfs.NewSyncingFile
would perform analogous I/O operations (eg, `Fdatasync`, `sync_file_range`,
`Fallocate`) outside the context of the interface. This easily allowed the
accidental loss of the disk-health checking over these operations by minor
tweaks to the interface of the disk-health checking implementation or by adding
intermediary VFS wrappers between the disk-health checking FS and the syncing
file.

See cockroachdb/cockroach#94373 where the introduction of an additional VFS
wrapper resulted in these I/O operations being uncovered by disk-stall
detection.
craig bot pushed a commit that referenced this issue Feb 1, 2023
95622: backupccl,storage: add logs around manifest handling and ExportRequest pagination r=stevendanna a=adityamaru

backupccl: add logging to backup manifest handling

Release note: None

storage: log the ExportRequest pagination reason

Release note: None

Epic: None

95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens

Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see #95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs #94373.
Epic: None
Release note: None

95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler

This patch introduces a simple roachtest that runs in a shared-process tenant.
This test imports a 500 tpcc workload (about 30 GB of replicated data), and
runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster
with local ssds.

A future patch could complicate the test by running schema changes or other
bulk operations.

Fixes #95990

Release note: None

96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu

This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer.

Supported constraints include Checks, FK, and UniqueWithoutIndex.

Dropping PK or Unique constraints will fall back to legacy schema changer, which in turn spits out an "not supported yet" error.

Epic: None

96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11

Previously, the optimizer did not plan inverted index scans for filters
having an integer as the index for the fetch value in a filter alongside
the "contains" or the "contained by" operator.

To address this, we now build JSON arrays from fetch value expressions
with integer indexes. From these JSON arrays, inverted spans are built
for constraining scans over inverted indexes. With these changes chains
of both integer and string fetch value operators are now supported
alongside the "contains" and the "contained by" operators.
(e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}').

Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301)
Fixes: #94667

Release note (performance improvement): The optimizer now plans
inverted index scans for queries that filter by JSON fetch value
operators (->) with integer indices alongside the "contains" or
the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}'
or json_col->0 <@ '{"b": "c"}'

96235: sem/tree: add support for producing vectorized data from strings r=cucaroach a=cucaroach

tree.ValueHandler exposes raw machine type hooks that are used by
vec_handler to build coldata.Vec's.

Epic: CRDB-18892
Informs: #91831
Release note: None


96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball

This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`.

Fixes #96326

Release note: None

96366: release: skip nil GitHub events r=celiala a=rail

Previously, we referenced `*event.Event`, but in some cases the event objects are `nil`.

This PR skips the nil GitHub event objects.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
craig bot pushed a commit that referenced this issue Feb 1, 2023
95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens

Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see #95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs #94373.
Epic: None
Release note: None

95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler

This patch introduces a simple roachtest that runs in a shared-process tenant.
This test imports a 500 tpcc workload (about 30 GB of replicated data), and
runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster
with local ssds.

A future patch could complicate the test by running schema changes or other
bulk operations.

Fixes #95990

Release note: None

96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11

Previously, the optimizer did not plan inverted index scans for filters
having an integer as the index for the fetch value in a filter alongside
the "contains" or the "contained by" operator.

To address this, we now build JSON arrays from fetch value expressions
with integer indexes. From these JSON arrays, inverted spans are built
for constraining scans over inverted indexes. With these changes chains
of both integer and string fetch value operators are now supported
alongside the "contains" and the "contained by" operators.
(e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}').

Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301)
Fixes: #94667

Release note (performance improvement): The optimizer now plans
inverted index scans for queries that filter by JSON fetch value
operators (->) with integer indices alongside the "contains" or
the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}'
or json_col->0 <@ '{"b": "c"}'

96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball

This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`.

Fixes #96326

Release note: None

Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
jbowens added a commit to jbowens/cockroach that referenced this issue Feb 3, 2023
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see cockroachdb#95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs cockroachdb#94373.
Epic: None
Release note: None
@erikgrinaker
Contributor Author

@jbowens Can we close this out now? The motivating benchmark is now showing disk stalls handled correctly: https://roachperf.crdb.dev/?filter=&view=failover%2Fnon-system%2Fdisk-stall&tab=gce.

@jbowens
Collaborator

jbowens commented Feb 6, 2023

Yeah, let's close it out.

@jbowens jbowens closed this as completed Feb 6, 2023
nicktrav pushed a commit to nicktrav/cockroach that referenced this issue Feb 10, 2023
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk for longer than 50us (see cockroachdb#95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
write stall for an indefinite duration.

Informs cockroachdb#94373.
Epic: None
Release note: None
nicktrav pushed a commit to nicktrav/cockroach that referenced this issue Feb 25, 2023
To be defensive, sequence the EventListener responsible for crashing the
process during a disk stall first, before the Pebble logging event listener.

Informs cockroachdb#94373.
Epic: None
Release note: None
nicktrav pushed a commit to nicktrav/cockroach that referenced this issue Feb 25, 2023
The pebble logger could block if we're experiencing a slow
/ stalling disk. If the call to the pebble logger is synchronous
from the EventListener passed into Pebble, it could end up slowing
down Pebble's internal disk health checks as those rely on EventListener
methods being quick to run.

This change updates the logging event listener to asynchronously
call the logger on a DiskSlow event.

Related to cockroachdb#94373.

Epic: none

Release note: None.
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024