fix(iroh-blobs): properly handle Drop in local pool during shutdown #2517

rklaehn · 2024-07-16T15:56:53Z

Description

The tokio_util LocalPoolHandle does not properly handle Drop during shutdown. Its threads are spawned detached, so any Drop impl that runs on a local pool thread is cut off as soon as the process terminates. This can have bad consequences if that drop operation performs IO, like closing files or committing database transactions.

Here is where the threads get spawned. The std::thread::JoinHandles are just dropped.
https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#381
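
The difference between detached and joined worker threads can be sketched with plain std threads. This is only an illustration under simplifying assumptions, not the LocalPool from this PR: `JoiningPool`, `CommitOnDrop` and `run_demo` are hypothetical names, and a queue of closures stands in for local async tasks. The point is the `Drop` impl, which closes the queue and then joins every worker, so a slow Drop running on a worker thread gets to finish.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

type Task = Box<dyn FnOnce() + Send + 'static>;

/// Stand-in for a resource whose Drop performs IO (e.g. a commit).
struct CommitOnDrop(Arc<AtomicUsize>);

impl Drop for CommitOnDrop {
    fn drop(&mut self) {
        // Simulate slow IO; a detached thread could be cut off here.
        thread::sleep(Duration::from_millis(20));
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical pool that joins its workers when dropped.
struct JoiningPool {
    sender: Option<mpsc::Sender<Task>>,
    workers: Vec<JoinHandle<()>>,
}

impl JoiningPool {
    fn new(threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Task>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // Hold the lock only while receiving, not while running.
                    let next = receiver.lock().unwrap().recv();
                    match next {
                        Ok(task) => task(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    fn spawn(&self, task: impl FnOnce() + Send + 'static) {
        if let Some(sender) = &self.sender {
            let _ = sender.send(Box::new(task));
        }
    }
}

impl Drop for JoiningPool {
    fn drop(&mut self) {
        // Closing the queue lets workers drain it and leave their loops.
        drop(self.sender.take());
        // Joining is the step that detached threads never get.
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Runs four tasks and returns how many "commits" completed.
fn run_demo() -> usize {
    let commits = Arc::new(AtomicUsize::new(0));
    let pool = JoiningPool::new(2);
    for _ in 0..4 {
        let guard = CommitOnDrop(Arc::clone(&commits));
        pool.spawn(move || {
            let _guard = guard; // dropped on the worker thread
        });
    }
    drop(pool); // blocks until every worker has exited
    commits.load(Ordering::SeqCst)
}

fn main() {
    println!("commits completed: {}", run_demo());
}
```

If the `join` loop in `Drop` were removed, the workers would be detached in the same way, and nothing would guarantee that the pending `CommitOnDrop` values finish before the process exits.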

Here is some discussion of the observed effects:
https://discord.com/channels/949724860232392765/1260571544414064670

LocalPoolHandle also uses an unbounded channel:
https://docs.rs/tokio-util/latest/src/tokio_util/task/spawn_pinned.rs.html#372

Breaking Changes

Public interfaces using tokio_util::task::LocalPoolHandle will now use our own LocalPool/LocalPoolHandle.

Notes & open questions

Should we use an unbounded channel like tokio::spawn or LocalPoolHandle::spawn_pinned? Seems like a big footgun. But if not, we need to somehow handle the case where the queue is full.
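
To make the trade-off concrete, here is a small sketch of what "handling a full queue" means with a bounded channel. It uses `std::sync::mpsc::sync_channel` purely for illustration (the PR itself does not necessarily use this channel type), and `enqueue_without_consumer` is a hypothetical helper. With a bounded queue, the caller has to decide what to do with a task that does not fit, which an unbounded queue never asks.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to enqueue `count` task ids into a bounded queue that nobody drains.
/// Returns (accepted, rejected).
fn enqueue_without_consumer(capacity: usize, count: u32) -> (usize, usize) {
    let (sender, receiver) = sync_channel::<u32>(capacity);
    let (mut accepted, mut rejected) = (0, 0);
    for task_id in 0..count {
        match sender.try_send(task_id) {
            Ok(()) => accepted += 1,
            // Queue is full: the task is handed back and the caller must
            // choose to drop it, retry later, or report an error.
            Err(TrySendError::Full(_task)) => rejected += 1,
            // Receiver is gone: the pool has shut down.
            Err(TrySendError::Disconnected(_task)) => break,
        }
    }
    drop(receiver);
    (accepted, rejected)
}

fn main() {
    let (accepted, rejected) = enqueue_without_consumer(2, 5);
    println!("accepted {accepted}, rejected {rejected}");
}
```

An unbounded channel avoids the `Full` branch entirely, at the cost of letting the queue grow without limit when producers outpace the workers.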

Change checklist

Self-review.
Documentation updates following the style guide, if relevant.
Tests if relevant.
All breaking changes documented.

rklaehn · 2024-07-19T11:09:13Z

iroh/src/node/builder.rs

@@ -454,7 +455,10 @@ where

    async fn build_inner(self) -> Result<ProtocolBuilder<D>> {
        trace!("building node");
-        let lp = LocalPoolHandle::new(num_cpus::get());
+        let lp = LocalPool::new(local_pool::Config {
+            panic_mode: PanicMode::LogAndContinue,


Womp womp, this is because of tokio-rs/tracing#2870, mostly

## Description

Make sure the local pool threads use the main tokio runtime instead of a current_thread runtime per local pool thread.

This is mostly relevant for calls to spawn_blocking and spawn from inside local tasks. Before, they would go to the single-threaded runtime of that thread; now they go to the blocking pool of the main runtime.

This means that the local futures are more tightly integrated with the main runtime. Everything you can do from a spawned future, you should also be able to do from a local future.

## Breaking Changes

Surprisingly, none

## Notes & open questions

Still not sure if this is OK to do, but given that even tokio::runtime::Handle has a block_on fn, it seems to be intended usage.

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [x] ~~Tests if relevant.~~
- [x] ~~All breaking changes documented.~~

…#2559)

## Description

There's this test that was introduced in [the `local_pool` PR](#2517). It was ignored via `#[ignore = "todo"]`. Notably, it's *not flaky*, it always fails.

Our flaky tests are run with `cargo nextest run --run-ignored all [...]`. We can't be more specific with the `ignore`d tests. The only options are `default`, `ignored-only` and `all`.

This kind of test is really hard to write. IIUC, `#[should_panic]` can only test for a panic happening in the thread that the test is initiated in; it doesn't detect panics that are thrown in threads spawned from the test. I assume this is the reason writing this test was abandoned.

Keeping this test with the `#[ignore = "todo"]` on it means we're always running it in our flaky test suite, which is confusing. We thought this test was flaky, but it's not. IMO it's better to comment it out/remove it than to pollute our flaky test results.

## Breaking Changes

None

## Notes & open questions

In this PR I'm commenting out this test. Should we remove it instead? Or do people have ideas on how to make this test work? Do we have an idea what we're *expecting* of our `LocalPool` implementation? Should a panic on one of the threads cause a panic in the `finish()` function?

## Change checklist

- [X] Self-review.
- ~~[ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.~~
- ~~[ ] Tests if relevant.~~
- [X] All breaking changes documented.
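
The `#[should_panic]` limitation described above can be reproduced with plain std threads. The sketch below is not the commented-out test and does not use `LocalPool`; `child_panicked`, `join_and_propagate` and `propagated_to_caller` are hypothetical helpers. It shows that a panic on a spawned thread only surfaces as an `Err` from `join()`, and that reaching the calling thread requires re-raising it explicitly, which is one possible answer to the `finish()` question.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::thread::{self, JoinHandle};

/// Returns true if the child thread panicked.
/// The calling thread itself does not panic.
fn child_panicked() -> bool {
    let handle = thread::spawn(|| {
        panic!("panic on a worker thread");
    });
    // join() reports the child's panic as Err instead of re-raising it.
    handle.join().is_err()
}

/// Hypothetical finish-like helper: re-raises a worker panic on the caller.
fn join_and_propagate(handle: JoinHandle<()>) {
    if let Err(payload) = handle.join() {
        panic::resume_unwind(payload);
    }
}

/// Returns true if the worker's panic reached the calling thread.
fn propagated_to_caller() -> bool {
    let handle = thread::spawn(|| {
        panic!("second worker panic");
    });
    // catch_unwind stands in for what #[should_panic] would observe.
    panic::catch_unwind(AssertUnwindSafe(move || join_and_propagate(handle))).is_err()
}

fn main() {
    println!("child panicked: {}", child_panicked());
    println!("propagated to caller: {}", propagated_to_caller());
}
```

Both worker panics print their messages to stderr through the default panic hook; only the second one is turned into a panic on the calling thread.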

rklaehn changed the title ~~fix(iroh-bytes): properly handle Drop in local pool during shutdown~~ fix(iroh-blobs)!: properly handle Drop in local pool during shutdown Jul 16, 2024

rklaehn force-pushed the safe-local-pool branch from 7fe0dc0 to e80e82f Compare July 16, 2024 16:07

Use our own local pool with proper drop impl

8f3469c

rklaehn force-pushed the safe-local-pool branch from e80e82f to 8f3469c Compare July 16, 2024 18:58

rklaehn added 3 commits July 17, 2024 11:43

Implement cancellation

1ec6f58

Use just FuturesUnordered instead of that weird LocalSet/JoinSet shit

faa68b3

Use run_detached in rpc and provider

480627b

dignifiedquire added this to the v0.21.0 milestone Jul 17, 2024

ramfox assigned rklaehn Jul 17, 2024

rklaehn added 8 commits July 17, 2024 19:56

Add back the stupid localset

9bc546e

tokio really loves their thread locals...

Move local pool handle to non-shared part of node

3f84782

...so we can call shutdown on it

Remove panic handling via flume channels

a66cd49

Convoluted shit to cancel the outer task when the inner task is cancelled to avoid panicking the task, so it becomes easier to see the panics that actually matter.

fe69752

Share Drop and shutdown impl

2de20e5

Rename all fns to try_... versions

5ff5a19

Also add "tokio style" panic when shutdown versions for backwards compat

Some renaming, also fix shutdown

82c61a5

Merge branch 'main' into safe-local-pool

e1958f1

rklaehn marked this pull request as ready for review July 19, 2024 11:03

Use LocalPool::single()

e804645

rklaehn commented Jul 19, 2024

rklaehn added 3 commits July 19, 2024 14:13

rename rt to local_pool or local_pool_handle

b3b5ada

rt is confusing because we also have a normal runtime

clippy

79732b3

reduce use of spawn_pinned tokio compat

1a9d802

just use spawn or run

rklaehn force-pushed the safe-local-pool branch from 516a719 to 1a9d802 Compare July 19, 2024 11:52

rklaehn added 2 commits July 19, 2024 15:16

Remove last usage of spawn_pinned

a3db594

we now have run, which is just a wrapped oneshot sender

Test: don't shut down local pool

064e132

dignifiedquire reviewed Jul 19, 2024

iroh-blobs/src/util/local_pool.rs (outdated, resolved)

dignifiedquire reviewed Jul 19, 2024

iroh-blobs/src/util/local_pool.rs (outdated, resolved)

dignifiedquire reviewed Jul 19, 2024

iroh/src/node.rs (outdated, resolved)

Even more drastic attempt to keep the local tasks alive

8ab1633

rklaehn added 8 commits July 19, 2024 16:01

Undo experiments

6c66992

Add more logging for pool(s) shutdown

c14f3af

remove unwind

648ccc2

Use old version of spawn_pinned

e7b1ff8

Merge branch 'main' into safe-local-pool

2409165

fix hot loop due to join_next returning None

b80ed23

Some renaming

4957de4

Docs fixes

a2231ba

rklaehn requested a review from dignifiedquire July 22, 2024 09:35

Merge branch 'main' into safe-local-pool

89178d8

rklaehn changed the title ~~fix(iroh-blobs)!: properly handle Drop in local pool during shutdown~~ fix(iroh-blobs): properly handle Drop in local pool during shutdown Jul 22, 2024

rklaehn enabled auto-merge July 22, 2024 13:21

dignifiedquire approved these changes Jul 22, 2024

rklaehn added this pull request to the merge queue Jul 22, 2024

Merged via the queue into main with commit b4506b2 Jul 22, 2024
26 checks passed

rklaehn deleted the safe-local-pool branch July 22, 2024 14:14

matheus23 mentioned this pull request Jul 29, 2024

test(iroh-blobs): comment out ignored test (that is not a flaky test) #2559

Merged

2 tasks
