Prevent deadlock if sender/receiver is forgotten #49

Merged: 3 commits merged into smol-rs:master on Sep 27, 2022

Conversation

@sbarral (Contributor) commented Aug 9, 2022

Fixes #45.

This is mostly similar to PR #16 by another author, but its motivation is correctness rather than the purported performance advantage. Differences with PR #16:

  • it adds correctness tests for the case of forgotten sender and forgotten receiver,
  • the commit is based on 1.7.0,
  • the listener is not discarded when the future completes: as far as I understand, this would only help in the rare case where a sender/receiver completes after a spurious wake-up (without consuming its notification); if that understanding is correct, the fixed cost of the extra conditional does not carry its weight.

I benchmarked various scenarios and did not observe a meaningful difference in performance.

That being said, we cannot rule out degraded performance in extreme cases. In theory it could happen with a very large number of fast senders combined with a fast receiver loop: a receiver that rapidly empties the channel would generate many notifications, but if the first sender to be polled consumes all the capacity before the remaining senders run, most notifications would be wasted. A symmetric situation could occur with many receivers and a fast sender loop.
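
To make the failure mode concrete, here is a rough sketch of the forgotten-sender scenario (illustrative only, not the PR's actual test code; it assumes the async-channel and futures-lite crates):

use std::{mem, thread, time::Duration};

use futures_lite::future::{block_on, poll_once};

fn main() {
    let (s, r) = async_channel::bounded::<i32>(1);
    block_on(s.send(1)).unwrap(); // the channel is now full

    // A send blocks on the full channel and is then forgotten. Before this PR,
    // the single notification produced by the next recv() is delivered to this
    // forgotten future and then lost.
    let mut forgotten = Box::pin(s.send(2));
    assert!(block_on(poll_once(&mut forgotten)).is_none());
    mem::forget(forgotten);

    // Another sender blocks on the same full channel in a separate thread.
    let sender = thread::spawn(move || block_on(s.send(3)).unwrap());

    // The receiver frees a slot. Before this PR the blocked sender above may
    // never be woken, so the join below can hang; with this PR it completes.
    thread::sleep(Duration::from_millis(100));
    assert_eq!(block_on(r.recv()).unwrap(), 1);
    sender.join().unwrap();
}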

@sbarral (Contributor, Author) commented Aug 9, 2022

Here are the results of the only benchmark that showed significant differences.

This benchmark is a beast, so I can't easily share it (I made it for my own purposes). In a nutshell, it is a "funnel" benchmark: each receiver is connected to 13 senders, each of which sends a very small payload (a usize) in a tight loop, and the receiver does basically nothing except receive these payloads in a tight loop. The benchmark runs 61 such channels simultaneously to ensure that all executor threads are busy at all times. The figures below were obtained with tokio using 4 threads.
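
(For readers who want to try something similar: the following is not the author's benchmark, just a minimal sketch of the same "funnel" shape, assuming tokio with 4 worker threads and async-channel; the constants and output are illustrative only.)

use std::time::Instant;

const CHANNELS: usize = 61;
const SENDERS_PER_CHANNEL: usize = 13;
const MSGS_PER_SENDER: usize = 100_000;

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .build()
        .unwrap();

    let capacity = 100; // channel capacity; the figures below sweep this from 1 to 10000
    let start = Instant::now();
    rt.block_on(async {
        let mut receivers = Vec::new();
        for _ in 0..CHANNELS {
            let (s, r) = async_channel::bounded::<usize>(capacity);
            // 13 senders per channel, each sending small payloads in a tight loop.
            for _ in 0..SENDERS_PER_CHANNEL {
                let s = s.clone();
                tokio::spawn(async move {
                    for i in 0..MSGS_PER_SENDER {
                        s.send(i).await.unwrap();
                    }
                });
            }
            drop(s); // only the spawned senders keep the channel open
            // One receiver per channel, draining it in a tight loop.
            receivers.push(tokio::spawn(async move {
                while r.recv().await.is_ok() {}
            }));
        }
        for r in receivers {
            r.await.unwrap();
        }
    });
    let total = (CHANNELS * SENDERS_PER_CHANNEL * MSGS_PER_SENDER) as f64;
    println!("{:.3} msg/µs", total / start.elapsed().as_micros() as f64);
}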

Number of messages per microsecond as a function of channel capacity before the commit (higher is better):

        capacity=1        2.390 msg/µs
        capacity=10      13.094 msg/µs
        capacity=100     24.756 msg/µs
        capacity=1000    28.006 msg/µs
        capacity=10000   28.008 msg/µs

The same after the commit:

        capacity=1        2.342 msg/µs
        capacity=10       4.964 msg/µs
        capacity=100     30.280 msg/µs
        capacity=1000    30.839 msg/µs
        capacity=10000   33.092 msg/µs

So it's a mixed bag, really; overall one could even argue that performance is better after the change. But in my opinion "funnel-type" benchmarks are a bit artificial anyway (it is unusual to have both a very fast receiver and very fast senders), so I would not put too much weight on these results.

@unixpickle

I hit the bug that this fixes. It drove me nuts for days and was tough to find. Please merge this!

tests/bounded.rs Outdated

fn ms(ms: u64) -> Duration {
    Duration::from_millis(ms)
}

fn poll_now<F: Future + Unpin>(f: &mut F) -> Poll<F::Output> {
Member:

You could probably just use poll_once from futures_lite here.

Contributor Author:

You are right. I discarded this solution because the future must be kept alive, so it can't be consumed; it didn't occur to me that passing a mutable reference would work just as well, since it implements Future too...
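
(For reference, a small sketch of the point being made; illustrative, not the test's actual code, and it assumes async-channel and futures-lite. poll_once consumes the future it is handed, but since &mut F implements Future whenever F: Future + Unpin, passing a mutable reference polls the future once while keeping it alive for later polls.)

use futures_lite::future::{block_on, poll_once};

fn main() {
    block_on(async {
        let (s, r) = async_channel::bounded::<i32>(1);
        s.send(1).await.unwrap();

        // The channel is full, so this send stays pending after one poll...
        let mut pending = Box::pin(s.send(2));
        assert!(poll_once(&mut pending).await.is_none());

        // ...but the future itself is still alive and can be polled again
        // once the receiver has freed a slot.
        assert_eq!(r.recv().await.unwrap(), 1);
        assert!(poll_once(&mut pending).await.is_some());
    });
}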

@@ -966,14 +965,7 @@ impl<'a, T> Send<'a, T> {
        let msg = self.msg.take().unwrap();
        // Attempt to send a message.
        match self.sender.try_send(msg) {
            Ok(()) => {
Member:

Why did you get rid of this one?

Contributor Author:

Hmm, I'm not totally sure what your question is about.
The same thing is done for the Recv future. The original mechanism was meant to implement a "slow" notification chain in order to limit the number of simultaneous in-flight notifications: only one notification was sent at a time, which made it necessary to unconditionally send another notification when an operation completed, whether that notification was needed or not.
This is no longer necessary since all notifications are now sent, whether or not one is already in flight.
Note that PR #16 did exactly the same thing and the original author seemed to agree it was correct. And all tests pass, which is another indication that it should be OK :-)
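
(Editorial note: the crux here is event-listener's notification semantics. The sketch below is illustrative, not the crate's actual code; it assumes event-listener 2.x, where notify(n) caps the number of in-flight notifications at n, while notify_additional(n) always wakes n more listeners.)

use event_listener::Event;

fn main() {
    let event = Event::new(); // stands in for the channel's send_ops/recv_ops

    let listener_a = event.listen(); // e.g. a blocked (and possibly forgotten) sender
    let listener_b = event.listen(); // another blocked sender

    // Old scheme: at most one notification in flight. A second notify(1) is a
    // no-op while the first is unconsumed, so the woken operation had to
    // forward a notification itself; a forgotten future breaks that chain.
    event.notify(1);
    event.notify(1); // no effect: one notification is already pending

    // New scheme: every freed slot produces its own notification, whether or
    // not one is already in flight, so no forwarding is needed.
    event.notify_additional(1);

    drop((listener_a, listener_b));
}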

unixpickle added a commit to unixpickle/learning-rust that referenced this pull request Sep 24, 2022
I was getting burned by this issue, hanging the scraper:
smol-rs/async-channel#49
@notgull (Member) left a comment

Thanks!

@notgull (Member) commented Sep 27, 2022

@sbarral Do you have any other benchmarks, e.g. basic SPSC or MPSC? I'd like to make sure this PR doesn't regress the common case before I merge it.

@sbarral (Contributor, Author) commented Sep 27, 2022

> @sbarral Do you have any other benchmarks, e.g. basic SPSC or MPSC? I'd like to make sure this PR doesn't regress the common case before I merge it.

The "funnel" benchmark mentioned above is what I would think of as the archetypical MPSC benchmark, and is indeed the one that shows that there is some trade-off.

Truth be told, I didn't really expect any difference in the SPSC case, since there can then be only one registered listener on either side (1 sender / 1 receiver). But just to be sure, I changed the parameters of the "funnel" benchmark to use a single sender, making it effectively SPSC. This test spawns 61 sender/receiver pairs on tokio and measures the throughput.

Number of messages per microsecond as a function of channel capacity on v1.7.1, before the commit (higher is better):

        capacity=1        4.257 msg/µs
        capacity=10      15.226 msg/µs
        capacity=100     22.409 msg/µs
        capacity=1000    24.989 msg/µs
        capacity=10000   24.930 msg/µs

The same after the commit:

        capacity=1        4.353 msg/µs
        capacity=10      18.416 msg/µs
        capacity=100     30.980 msg/µs
        capacity=1000    32.917 msg/µs
        capacity=10000   32.954 msg/µs

So that's a nice surprise... My best guess is that this comes from the deletion of the additional notification you asked about. In the SPSC case this extra notification is never needed, but even when there is no interested listener it introduces a full fence (see here). This would also explain why there is no difference for capacity=1: that case was special-cased before the commit so as to avoid the additional notification.

@notgull (Member) commented Sep 27, 2022

I don't see any reason not to merge this, thanks!

@notgull notgull merged commit 736bc72 into smol-rs:master Sep 27, 2022
@sbarral (Contributor, Author) commented Oct 28, 2022

While there is still time to revert this commit, I wanted to highlight a caveat which I initially missed so that the decision to proceed with this PR (or not) is fully informed.

While I continue to believe that this PR fixes a real flaw, I realize now that the behaviour before this PR was superior in one respect: it guaranteed strict fairness (edit: not totally true, see below) by making sure that senders and receivers were unblocked in the same order they were blocked. After this PR they are still notified in fair order, but nothing guarantees that they will be scheduled fairly. Unfortunately, this strict fairness was also the direct cause of issue #45.

A nice idea is tokio's MPSC strategy, which I only became aware of recently. It sends all notifications right away, as in this PR, and does not strictly enforce fairness, but it only lets a sender push a value if doing so leaves enough free slots for the senders that were notified earlier but not yet scheduled. This prevents deadlocks like the one in issue #45 while still providing fairness for all practical purposes.
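
(Editorial sketch of the permit idea described above, not tokio's actual implementation: the reservation is approximated with tokio's FIFO Semaphore so that a late sender cannot steal a slot from one that was already woken but not yet scheduled; all names here are made up.)

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use tokio::sync::Semaphore;

struct PermitChannel<T> {
    queue: Mutex<VecDeque<T>>,
    slots: Semaphore, // one permit per free slot, handed out in FIFO order
}

impl<T> PermitChannel<T> {
    fn new(capacity: usize) -> Arc<Self> {
        Arc::new(Self {
            queue: Mutex::new(VecDeque::new()),
            slots: Semaphore::new(capacity),
        })
    }

    async fn send(&self, msg: T) {
        // Reserve a slot before pushing; the permit is "forgotten" so the slot
        // stays reserved until the receiver explicitly frees it.
        self.slots.acquire().await.unwrap().forget();
        self.queue.lock().unwrap().push_back(msg);
    }

    fn try_recv(&self) -> Option<T> {
        let msg = self.queue.lock().unwrap().pop_front()?;
        // Freeing a slot hands the permit to the longest-waiting sender.
        self.slots.add_permits(1);
        Some(msg)
    }
}

Because the semaphore hands out permits in FIFO order, a sender that calls send() later cannot overtake one that is already queued for a permit, which keeps things fair in practice without relying on a notification chain.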

Now, I don't really see a way to implement tokio's strategy with the current architecture, so in practice we will probably have to choose between accepting easy deadlocks and giving up strict fairness.

Edit: worth noting, however, that the behaviour before this PR was not actually strictly fair either: it would still let a non-notified sender/receiver steal a slot from a notified sender/receiver.

Successfully merging this pull request may close these issues: Bounded channel fully deadlocked when sender is forgotten (#45).