Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh-blobs): properly handle Drop in local pool during shutdown #2517

Merged
merged 29 commits into from
Jul 22, 2024

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Jul 16, 2024

Description

The tokio_util LocalPoolHandle does not properly handle Drop during shutdown. Its threads are just spawned as detached. So any drop impl that runs in a local pool thread will be stopped as soon as the process terminates. This can have some bad consequences if that drop operation performs IO, like closing files and committing database transactions.

Here is where the threads get spawned. The std::thread::JoinHandles are just dropped.
https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#381

Here is some discussion of the observed effects:
https://discord.com/channels/949724860232392765/1260571544414064670

LocalPoolHandle also, of course, is using an unbounded channel:
https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#372

Breaking Changes

Public interfaces using tokio_util::task::LocalPoolHandle will now use our own LocalPool/LocalPoolHandle.

Notes & open questions

Should we use an unbounded channel like tokio::spawn or LocalPoolHandle::spawn_pinned? Seems like a big footgun. But if not, we need to somehow handle when the queue is full.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@rklaehn rklaehn changed the title fix(iroh-bytes): properly handle Drop in local pool during shutdown fix(iroh-blobs)!: properly handle Drop in local pool during shutdown Jul 16, 2024
@dignifiedquire dignifiedquire added this to the v0.21.0 milestone Jul 17, 2024
tokio really loves their thread locals...
...so we can call shutdown on it
…lled

to avoid panicking the task, so it becomes easier to see the panics that
actually matter.
Also add "tokio style" panic when shutdown versions for backwards compat
@rklaehn rklaehn marked this pull request as ready for review July 19, 2024 11:03
@@ -454,7 +455,10 @@ where

async fn build_inner(self) -> Result<ProtocolBuilder<D>> {
trace!("building node");
let lp = LocalPoolHandle::new(num_cpus::get());
let lp = LocalPool::new(local_pool::Config {
panic_mode: PanicMode::LogAndContinue,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Womp womp, this is because of tokio-rs/tracing#2870, mostly

rt is confusing because we also have a normal runtime
we now have run, which is just a wrapped oneshot sender
iroh/src/node.rs Outdated Show resolved Hide resolved
@rklaehn rklaehn changed the title fix(iroh-blobs)!: properly handle Drop in local pool during shutdown fix(iroh-blobs): properly handle Drop in local pool during shutdown Jul 22, 2024
## Description

Make sure the local pool threads use the main tokio runtime instead of a
current_thread runtime per local pool thread.

This is mostly relevant for call to spawn_blocking and spawn from inside
local tasks. Before they would go to the single threaded runtime of that
thread, now they go to the blocking pool of the main runtime.

This means that the local futures are more tightly integrated with the
main runtime. Everything you can do from a spawned future, you should be
able to also do from a local future.

## Breaking Changes

Surprisingly, none

## Notes & open questions

Still not sure if this is OK to do, but given that even
tokio::runtime::Handle has a block_on fn, it seems intended usage.

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] ~~Tests if relevant.~~
- [x] ~~All breaking changes documented.~~
@rklaehn rklaehn enabled auto-merge July 22, 2024 13:21
@rklaehn rklaehn added this pull request to the merge queue Jul 22, 2024
Merged via the queue into main with commit b4506b2 Jul 22, 2024
26 checks passed
@rklaehn rklaehn deleted the safe-local-pool branch July 22, 2024 14:14
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
…#2559)

## Description

There's this test that was introduced in [the `local_pool`
PR](#2517).
It was ignored via `#[ignore = "todo"]`. Notably, it's *not flaky*, it
always fails.
Our flaky tests are run with `cargo nextest run --run-ignored all
[...]`. We can't be more specific with the `ignore`d tests. The only
options are `default`, `ignored-only` and `all`.
This kind of test is really hard to write. IIUC, `#[should_panic]` can
only test for the panic happening in the thread that the test is
initiated in, it doesn't detect panics that are thrown in threads
spawned from the test. I assume this is the reason writing this test was
abandoned.

Keeping this test with the `#[ignore = "todo"]` on it means we're always
running it in our flaky test suite, which is confusing. We thought this
test was flaky, but it's not. IMO it's better to comment it out/remove
it than to pollute our flaky test results.

## Breaking Changes

None

## Notes & open questions

In this PR I'm commenting this test. Should we remove it instead?
Or do people have ideas on how to make this test work?
Do we have an idea what we're *expecting* of our `LocalPool`
implementation? Should a panic on one of the threads cause a panic in
the `finish()` function?

## Change checklist

- [X] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- [X] All breaking changes documented.
rklaehn pushed a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
… (#2559)

## Description

There's this test that was introduced in [the `local_pool`
PR](n0-computer/iroh#2517).
It was ignored via `#[ignore = "todo"]`. Notably, it's *not flaky*, it
always fails.
Our flaky tests are run with `cargo nextest run --run-ignored all
[...]`. We can't be more specific with the `ignore`d tests. The only
options are `default`, `ignored-only` and `all`.
This kind of test is really hard to write. IIUC, `#[should_panic]` can
only test for the panic happening in the thread that the test is
initiated in, it doesn't detect panics that are thrown in threads
spawned from the test. I assume this is the reason writing this test was
abandoned.

Keeping this test with the `#[ignore = "todo"]` on it means we're always
running it in our flaky test suite, which is confusing. We thought this
test was flaky, but it's not. IMO it's better to comment it out/remove
it than to pollute our flaky test results.

## Breaking Changes

None

## Notes & open questions

In this PR I'm commenting this test. Should we remove it instead?
Or do people have ideas on how to make this test work?
Do we have an idea what we're *expecting* of our `LocalPool`
implementation? Should a panic on one of the threads cause a panic in
the `finish()` function?

## Change checklist

- [X] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- [X] All breaking changes documented.
rklaehn pushed a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
… (#2559)

## Description

There's this test that was introduced in [the `local_pool`
PR](n0-computer/iroh#2517).
It was ignored via `#[ignore = "todo"]`. Notably, it's *not flaky*, it
always fails.
Our flaky tests are run with `cargo nextest run --run-ignored all
[...]`. We can't be more specific with the `ignore`d tests. The only
options are `default`, `ignored-only` and `all`.
This kind of test is really hard to write. IIUC, `#[should_panic]` can
only test for the panic happening in the thread that the test is
initiated in, it doesn't detect panics that are thrown in threads
spawned from the test. I assume this is the reason writing this test was
abandoned.

Keeping this test with the `#[ignore = "todo"]` on it means we're always
running it in our flaky test suite, which is confusing. We thought this
test was flaky, but it's not. IMO it's better to comment it out/remove
it than to pollute our flaky test results.

## Breaking Changes

None

## Notes & open questions

In this PR I'm commenting this test. Should we remove it instead?
Or do people have ideas on how to make this test work?
Do we have an idea what we're *expecting* of our `LocalPool`
implementation? Should a panic on one of the threads cause a panic in
the `finish()` function?

## Change checklist

- [X] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- [X] All breaking changes documented.
matheus23 pushed a commit that referenced this pull request Nov 14, 2024
…2517)

## Description

The tokio_util LocalPoolHandle does not properly handle Drop during
shutdown. Its threads are just spawned as detached. So any drop impl
that runs in a local pool thread will be stopped as soon as the process
terminates. This can have some bad consequences if that drop operation
performs IO, like closing files and committing database transactions.

Here is where the threads get spawned. The `std::thread::JoinHandle`s
are just dropped.

https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#381

Here is some discussion of the observed effects:
https://discord.com/channels/949724860232392765/1260571544414064670

LocalPoolHandle also, of course, is using an unbounded channel:

https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#372

## Breaking Changes

Public interfaces using tokio_util::task::LocalPoolHandle will now use
our own LocalPool/LocalPoolHandle.

## Notes & open questions

Should we use an unbounded channel like tokio::spawn or
LocalPoolHandle::spawn_pinned? Seems like a big footgun. But if not, we
need to somehow handle when the queue is full.

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants