-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(iroh-blobs): use async_channel instead of flume for local_pool #2533
Conversation
also remove the backup timer
just to minimize the diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it helps, lets do it
since it's shutdown, could this be related to flume's lack of cancel-safety? |
Could be. But what I am seeing is not a message getting lost during recv, but just a message being "stuck". Adding a tokio::time::sleep(...) to the select! statement for testing "fixes" it by waking up the future, then the message gets delivered. |
…2533) ## Description During soft shutdown of the local pool, a Finish message is sent to all threads. On main, this occasionally hangs. Further investigation showed that this is a message that is being sent but not received despite being in the channel. Adding a simple timeout to the select! so the flume recv call is executed again fixes it. See discussion in https://discord.com/channels/949724860232392765/950683937661935667/1265205285618847744 So it seems that there is a bug in flume that occasionally leads to notifications being dropped. This PR just does a 1:1 replacement of flume with async_channel. Before this change, I can get the test_shutdown test to fail easily by running it 1000 times: ``` for i in $(seq 1 1000); do cargo test --release -p iroh-blobs test_shutdown -- --nocapture >> log.txt; done ``` Result: ``` ... sending shutdown message sending shutdown message sending shutdown message sending shutdown message test util::local_pool::tests::test_shutdown has been running for over 60 seconds ``` After this change, I can not get test_finish (renamed because it tests finish, not shutdown) to fail at all even after several 1000 tests. ## Breaking Changes None ## Notes & open questions Can somebody talk me out of this? I would prefer to keep flume, but the evidence above seems conclusive... Note: why not tokio::sync::mpsc::channel? I need a mpmc channel. The handle can be cloned, and can send to any of the n worker threads. ## Change checklist - [x] Self-review. - [x] ~~Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.~~ - [x] ~~Tests if relevant.~~ - [x] ~~All breaking changes documented.~~
Description
During soft shutdown of the local pool, a Finish message is sent to all threads. On main, this occasionally hangs. Further investigation showed that this is a message that is being sent but not received despite being in the channel. Adding a simple timeout to the select! so the flume recv call is executed again fixes it.
See discussion in https://discord.com/channels/949724860232392765/950683937661935667/1265205285618847744
So it seems that there is a bug in flume that occasionally leads to notifications being dropped.
This PR just does a 1:1 replacement of flume with async_channel.
Before this change, I can get the test_shutdown test to fail easily by running it 1000 times:
Result:
After this change, I can not get test_finish (renamed because it tests finish, not shutdown) to fail at all even after several 1000 tests.
Breaking Changes
None
Notes & open questions
Can somebody talk me out of this? I would prefer to keep flume, but the evidence above seems conclusive...
Note: why not tokio::sync::mpsc::channel? I need a mpmc channel. The handle can be cloned, and can send to any of the n worker threads.
Change checklist
Documentation updates following the style guide, if relevant.Tests if relevant.All breaking changes documented.