vfs: include start time in disk health checker timing stack traces #3009

Closed
jbowens opened this issue Oct 23, 2023 · 0 comments · Fixed by #3020

jbowens commented Oct 23, 2023

When a disk stall results in the termination of a process and GOTRACEBACK is set appropriately, the panic includes a dump of stack traces. These stack traces can be used to confirm the presence of a goroutine stuck in the reported syscall. However, there's no way to verify that the syscall in the stack trace is the same invocation that the disk health checker flagged: the earlier syscall could have completed and a new one begun. We could pass the operation's start time, as a unix timestamp in nanoseconds, as a parameter to the function performing the timing. The timestamp would then appear in the stack trace, allowing us to verify that it matches the alleged start of the disk stall.

jbowens added a commit to jbowens/pebble that referenced this issue Oct 27, 2023
Pass the start time in the form of nanoseconds since the unix epoch as a
parameter to timeDiskOp and timeFilesystemOp. This aids post-mortem debugging
when the disk-health checker fatals the process and GOTRACEBACK is set to dump
the stacks, including arguments. The start time argument will be printed in hex
form, allowing us to decode the start time of the operation. This can be used
to confirm that the timed operation was still inflight at the time stacks were
collected.

Close cockroachdb#3009.
@jbowens jbowens self-assigned this Oct 31, 2023
jbowens added a commit that referenced this issue Oct 31, 2023
Pass the start time in the form of nanoseconds since the unix epoch as a
parameter to timeDiskOp and timeFilesystemOp. This aids post-mortem debugging
when the disk-health checker fatals the process and GOTRACEBACK is set to dump
the stacks, including arguments. The start time argument will be printed in hex
form, allowing us to decode the start time of the operation. This can be used
to confirm that the timed operation was still inflight at the time stacks were
collected.

Close #3009.
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024