Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh-blobs): use async_channel instead of flume for local_pool #2533

Merged
merged 6 commits into from
Jul 23, 2024

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Jul 23, 2024

Description

During soft shutdown of the local pool, a Finish message is sent to all threads. On main, this occasionally hangs. Further investigation showed that this is a message that is being sent but not received despite being in the channel. Adding a simple timeout to the select! so the flume recv call is executed again fixes it.

See discussion in https://discord.com/channels/949724860232392765/950683937661935667/1265205285618847744

So it seems that there is a bug in flume that occasionally leads to notifications being dropped.

This PR just does a 1:1 replacement of flume with async_channel.

Before this change, I can get the test_shutdown test to fail easily by running it 1000 times:

for i in $(seq 1 1000); do cargo test --release -p iroh-blobs test_shutdown -- --nocapture >> log.txt; done

Result:

...
sending shutdown message
sending shutdown message
sending shutdown message
sending shutdown message
test util::local_pool::tests::test_shutdown has been running for over 60 seconds

After this change, I can not get test_finish (renamed because it tests finish, not shutdown) to fail at all even after several 1000 tests.

Breaking Changes

None

Notes & open questions

Can somebody talk me out of this? I would prefer to keep flume, but the evidence above seems conclusive...

Note: why not tokio::sync::mpsc::channel? I need a mpmc channel. The handle can be cloned, and can send to any of the n worker threads.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@rklaehn rklaehn marked this pull request as ready for review July 23, 2024 10:53
Copy link
Contributor

@dignifiedquire dignifiedquire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it helps, lets do it

@divagant-martian
Copy link
Contributor

since it's shutdown, could this be related to flume's lack of cancel-safety?

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 23, 2024

since it's shutdown, could this be related to flume's lack of cancel-safety?

Could be. But what I am seeing is not a message getting lost during recv, but just a message being "stuck". Adding a tokio::time::sleep(...) to the select! statement for testing "fixes" it by waking up the future, then the message gets delivered.

@rklaehn rklaehn enabled auto-merge July 23, 2024 15:39
@rklaehn rklaehn added this pull request to the merge queue Jul 23, 2024
Merged via the queue into main with commit 9052905 Jul 23, 2024
26 checks passed
matheus23 pushed a commit that referenced this pull request Nov 14, 2024
…2533)

## Description

During soft shutdown of the local pool, a Finish message is sent to all
threads. On main, this occasionally hangs. Further investigation showed
that this is a message that is being sent but not received despite being
in the channel. Adding a simple timeout to the select! so the flume recv
call is executed again fixes it.

See discussion in
https://discord.com/channels/949724860232392765/950683937661935667/1265205285618847744

So it seems that there is a bug in flume that occasionally leads to
notifications being dropped.

This PR just does a 1:1 replacement of flume with async_channel.

Before this change, I can get the test_shutdown test to fail easily by
running it 1000 times:

```
for i in $(seq 1 1000); do cargo test --release -p iroh-blobs test_shutdown -- --nocapture >> log.txt; done
```

Result:
```
...
sending shutdown message
sending shutdown message
sending shutdown message
sending shutdown message
test util::local_pool::tests::test_shutdown has been running for over 60 seconds
```

After this change, I can not get test_finish (renamed because it tests
finish, not shutdown) to fail at all even after several 1000 tests.

## Breaking Changes

None

## Notes & open questions

Can somebody talk me out of this? I would prefer to keep flume, but the
evidence above seems conclusive...

Note: why not tokio::sync::mpsc::channel? I need a mpmc channel. The
handle can be cloned, and can send to any of the n worker threads.

## Change checklist

- [x] Self-review.
- [x] ~~Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- [x] ~~Tests if relevant.~~
- [x] ~~All breaking changes documented.~~
@rklaehn rklaehn deleted the fix-shutdown branch November 20, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants