[libp2p-swarm] Permit more controlled backpressure. #1586

romanb · 2020-05-23T10:19:15Z

This is a more elaborate continuation from #1585 that proposes some first backward-compatible backpressure facilities between the Network and a NetworkBehaviour via the Swarm. Please note that the size of the diff is a bit blown out of proportion only because of an indented match block (and a lot of added commentary).

`Network` to `NetworkBehaviour`

When network connections become slow to process events, which can currently either happen because ConnectionHandler::inject_event is slow or because the task scheduler is busy or overwhelmed, the Swarm starts being unable to deliver new events emitted by the NetworkBehaviour to the respective connection(s). When that happens, it first tries to propagate backpressure to the NetworkBehaviour by calling NetworkBehaviour::inject_connections_busy, i.e. giving the behaviour the event back to be emitted again at a later time. If the behaviour rejects that event (i.e. does not implement backpressure), the Swarm buffers the pending_event and still allows network I/O and consumption of connection events by the NetworkBehaviour via NetworkBehaviour::inject_event to continue despite the buffered event, subject to the configured max_event_lead, which controls the maximum number of calls to NetworkBehaviour::inject_event that may be made before a call to NetworkBehaviour::poll must happen. When max_event_lead is reached and the pending_event is still not consumed, the Swarm waits for the connection (handler) to signal readiness to consume the event, i.e. exerting backpressure on the entire Network. The point where max_event_lead is reached is essentially the point at which the NetworkBehaviour in turn exerts backpressure on the Network, because it cannot consume more events without being allowed to make progress via NetworkBehaviour::poll.

Important: By default, max_event_lead is 1 and the default implementation of NetworkBehaviour::inject_connections_busy is Err(event), meaning these changes are fully backward-compatible and without changing the new configuration option for max_event_lead or implementing NetworkBehaviour::inject_connections_busy, the existing behaviour is preserved, allowing a gradual transition and experimentation.

Missing: What I think is really missing to make this backpressure pipeline complete is a backward-incompatible change to ConnectionHander::inject_event by allowing it to outright return (i.e. reject) the given event and thus exert backpressure on the background task which in turn would propagate backpressure back through the Network to the Swarm by not consuming new events from its inbound channel until the inject_event accepted the event.

`NetworkBehaviour` to `Network`

The NetworkBehaviour has no direct means for backpressure on NetworkBehaviour::inject_event as it is expected from the contract of the trait that whenever the behaviour returns Poll::Pending, it is ready to receive more events, and the max_event_lead bounds by how much the number of calls to NetworkBehaviour::inject_event may exceed those to NetworkBehaviour::poll. Once max_event_lead is reached backpressure propagates back to the Network. In this way, the max_event_lead configures the threshold for backpressure from the NetworkBehaviour towards the Network.

Important: I changed the polling behaviour of the swarm such that it always polls the NetworkBehaviour until it returns Poll::Pending, which signals that the behaviour is waiting for progress and input from the Network (or the network connections signal backpressure, in which case the behaviour is also no longer polled). This is supposed to be an improvement in two ways. Firstly, if a single call to NetworkBehaviour::inject_event happens to result in more than 1 (possibly many) events to be ready to be emitted from NetworkBehaviour::poll, then the behaviour is drained of these events before the next call to inject_event, thus preventing potentially unbounded buffer growth in the behaviour if calls to inject_event and poll always happen roughly at the same frequency. Secondly, polling the behaviour until it is Pending allows it to catch up after a previous increase of the event_lead due to some network slowness. I clarified the documentation on the matter that a behaviour that from some point onwards never again emits Poll::Pending is usually misbehaved, as this is a behaviour that, from some point onwards, no longer needs any input or progress from the Network. I also think that it makes for relatively easy to understand semantics of how the Swarm does the polling, as it makes clear that the NetworkBehaviour takes the lead and the underlying Network is only an input/output facility driven on demand. There may be strange cases where a behaviour is very high on output but low on input, in which case it may be beneficial to sprinkle Poll::Pending in its poll output intentionally to permit network progress, but even if it doesn't, then there is still the safeguard that the Swarm stops polling the behaviour if the network connections signal backpressure.

…r is busy.

Establish an upper bound for now, this is a separate issue.

swarm/src/lib.rs

tomaka · 2020-05-26T11:33:29Z

As I mentioned in the other PR, I'm very reluctant about this change.

To be clear, NetworkBehaviour is not supposed to be asynchronous at all, and the idea is that the state of the Swarm and the state of the NetworkBehaviour are in sync. The fact that we call NetworkBehaviour::poll after every single call to NetworkBehaviour::inject_* is completely intended. This way, the various "events queues" that implementations of NetworkBehaviour do maintain in practice are more of a "delayed return type" rather than an actual queue of events.

In other words, in an ideal world, the various NetworkBehaviour::inject_* methods would return which actions the swarm has to perform in response to the event. This return value has been "moved" to become the return value of NetworkBehaviour::poll (with the guarantee that poll would be called immediately after) because it makes it seemingly easier to implement the trait.

I do believe that this PR considerably complicates the overall design, and I don't really understand what making NetworkBehaviours asynchronous brings that the bounded channels to background task don't already bring.

romanb · 2020-05-26T11:46:09Z

The fact that we call NetworkBehaviour::poll after every single call to NetworkBehaviour::inject_* is completely intended.

I find this problematic for the reasons mentioned in the PR.

I do believe that this PR considerably complicates the overall design, [..]

Interesting, because I think it offers a good trade-off between allowing opt-in improvements on the potential pending_event bottleneck and added complexity, while making the whole polling more robust due to backpressure in both directions and clear documentation on how it works. Of course there are certain complications because the intention here is precisely to make it possible to improve upon the pending_event design. But there is really nothing I can add here that I didn't already write in the PR description. I will give up on this topic in libp2p-swarm then.

Roman S. Borschel added 7 commits May 20, 2020 14:42

[libp2p-swarm] Let network I/O continue if a single connection handle…

0dbf5c4

…r is busy.

Commentary

52a2f72

CI

ada175b

Permit more controlled backpressure in libp2p-swarm.

27100d3

Try to catch up with polling a bit more eagerly.

14c19c9

Cleanup

fa9930e

Simplify and better commentary.

66fb077

romanb mentioned this pull request May 23, 2020

[RFC][libp2p-swarm] Let network I/O run ahead of NetworkBehaviour polling. #1585

Closed

Roman S. Borschel added 3 commits May 23, 2020 20:14

CI

800329c

noise smoke tests stall with async-std 1.6.

22d59cc

Establish an upper bound for now, this is a separate issue.

noise smoke tests stall with async-std 1.6.

557eb9c

mxinden reviewed May 25, 2020

View reviewed changes

swarm/src/lib.rs Outdated Show resolved Hide resolved

Roman S. Borschel and others added 4 commits May 25, 2020 10:40

Refine comment.

450b364

Merge branch 'master' into swarm-backpressure

049593c

Merge branch 'master' into swarm-backpressure

0d29ef1

Merge branch 'master' into swarm-backpressure

4e19add

romanb closed this May 26, 2020

romanb mentioned this pull request Jun 2, 2020

[libp2p-swarm] Opt-in connection-oriented backpressure for NetworkBehaviours. #1602

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[libp2p-swarm] Permit more controlled backpressure. #1586

[libp2p-swarm] Permit more controlled backpressure. #1586

romanb commented May 23, 2020 •

edited

Loading

tomaka commented May 26, 2020 •

edited

Loading

romanb commented May 26, 2020 •

edited

Loading

[libp2p-swarm] Permit more controlled backpressure. #1586

[libp2p-swarm] Permit more controlled backpressure. #1586

Conversation

romanb commented May 23, 2020 • edited Loading

Network to NetworkBehaviour

NetworkBehaviour to Network

tomaka commented May 26, 2020 • edited Loading

romanb commented May 26, 2020 • edited Loading

romanb commented May 23, 2020 •

edited

Loading

`Network` to `NetworkBehaviour`

`NetworkBehaviour` to `Network`

tomaka commented May 26, 2020 •

edited

Loading

romanb commented May 26, 2020 •

edited

Loading