Don't poll network unnecessarily. #1977

dvc94ch · 2021-02-23T17:08:13Z

Currently we have customers with >100 nodes in their local network. We received reports that mdns was responsible for 25% to 75% of network traffic. As you can imagine our customers weren't very happy about that. This PR massively reduces the mdns bandwidth requirements.

Mdns needs two sockets, a send socket and a receive socket. The receive socket is listening on the mdns broadcast address. When you join a network and receive an IfEvent::Up event the send socket is used to send an mdns query to the broadcast address. Everyone listening on the broadcast address responds with their mdns records, including yourself, to the broadcast address, so everyone gets a fresh view of the network and the existing peers learn about your records.
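To make the bandwidth argument concrete, here is a rough back-of-the-envelope model of the traffic on the multicast group. It is only a sketch under stated assumptions, not a measurement: both function names are hypothetical, every node is assumed to answer every query with exactly one packet, and no response suppression is modelled.

```rust
/// Rough upper bound on mDNS packets per second when every node queries
/// on its own fixed interval.
///
/// Assumptions (not taken from the PR): each query is one packet, and each
/// of the `nodes` peers answers it with exactly one packet.
pub fn packets_per_second_periodic(nodes: u64, query_interval_secs: u64) -> f64 {
    let queries_per_second = nodes as f64 / query_interval_secs as f64;
    // One query packet plus one response per node.
    queries_per_second * (1.0 + nodes as f64)
}

/// Idealised steady state when any incoming query resets everyone's timer,
/// so roughly one query per interval is sent on the whole network.
pub fn packets_per_second_shared_timer(nodes: u64, query_interval_secs: u64) -> f64 {
    (1.0 + nodes as f64) / query_interval_secs as f64
}
```

Under these assumptions the periodic scheme grows quadratically with the number of nodes, while the shared-timer scheme grows linearly. Real networks will differ, so treat the output as an order-of-magnitude estimate only.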

To avoid the case where you join a network and your initial discovery message is lost, I re-added a timer that by default sends an mdns query to the broadcast address if there have been no incoming mdns queries in the last five minutes. This timeout is configurable.
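The idle-timer idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `QueryTimer` and its methods are hypothetical names, and the caller passes in the current time explicitly so the logic stays independent of any async runtime.

```rust
use std::time::{Duration, Instant};

/// Hypothetical idle timer: it fires only if no query was seen for `interval`.
pub struct QueryTimer {
    interval: Duration,
    deadline: Instant,
}

impl QueryTimer {
    pub fn new(now: Instant, interval: Duration) -> Self {
        Self { interval, deadline: now + interval }
    }

    /// Call whenever a query from any peer arrives on the receive socket.
    /// The deadline is pushed back, so we stay silent while others query.
    pub fn on_incoming_query(&mut self, now: Instant) {
        self.deadline = now + self.interval;
    }

    /// Returns `true` if we should send our own query now, and re-arms
    /// the timer for the next interval.
    pub fn poll(&mut self, now: Instant) -> bool {
        if now >= self.deadline {
            self.deadline = now + self.interval;
            true
        } else {
            false
        }
    }
}
```

With a 300 second interval, a query seen at second 299 moves the deadline to second 599, so the local node does not query at second 300.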

~~There is still one remaining issue and that is making if-watch pollable using manual futures. But that should minimally affect this PR other than requiring a version bump.~~

~~Companion PR libp2p/if-watch#7~~

rkuhn

Looks good as far as I understand it!

romanb

Thanks for the PR. To summarise in my own words after a first review, this PR does the following:

- It removes service.rs and the separate MdnsService, collapsing the (simplified) code into the Mdns behaviour while just moving other query types from service.rs to the new query.rs module.
- It makes the MDNS record TTL for response records configurable, with a default of 5 minutes as was previously fixed in a constant.
- It makes the interval at which multicast queries are sent out configurable with a default of 5 minutes, as opposed to the current frequency of 20 seconds, thereby additionally sending out a query whenever we join a new multicast group on some network interface to ensure timely discovery in most cases despite the long query interval. Lost queries are effectively compensated for by that regular query interval.

Does that sound about right as a summary?

I left some comments that probably need to be resolved but the direction looks good to me. I would prefer though if we keep the default query interval lower, say at 1 minute, documenting the fact that the larger your network the larger you may want to configure the query interval at the increased risk of delayed discoveries due to lost datagrams, but at the gain of less MDNS network traffic.

> This PR massively reduces the mdns bandwidth requirements.

It would be great for the record if you could put absolute numbers on these gains with these changes as observed on some networks, i.e. a rough before and after comparison.

examples/mdns-passive-discovery.rs

protocols/mdns/src/behaviour.rs

protocols/mdns/src/query.rs

dvc94ch · 2021-02-25T17:39:03Z

> It makes the interval at which multicast queries are sent out configurable with a default of 5 minutes, as opposed to the current frequency of 20 seconds, thereby additionally sending out a query whenever we join a new multicast group on some network interface to ensure timely discovery in most cases despite the long query interval. Lost queries are effectively compensated for by that regular query interval.

Well I'd actually make the default to never query the network (at an interval). It is probably not required at all. Once we deploy this in production we can play with the settings but I suspect we could make the interval much larger.

The only reason why it may be needed is because udp is an unreliable transport so packets may be dropped without noticing it. I think the 20s interval was required before mainly because we didn't have if-watch and were therefore not notified of network changes.

romanb · 2021-03-01T14:06:38Z

> I think the 20s interval was required before mainly because we didn't have if-watch and were therefore not notified of network changes.

I think so, too, and what is done here certainly seems more in the spirit of the libp2p MDNS spec, which states "When a peer starts (or detects a network change), it sends a query for all peers.". Nevertheless, we currently have the notion of TTLs for the discovered peer records, and that is why I would prefer to keep the query interval shorter than the default TTL: to avoid intermittent "expired" events that are immediately followed again by "discovered" events. Since the default TTL is 5 minutes, how about 4 minutes for the default query interval? Of course, since the TTL of a record is specified by the remote that sent it, if nodes are configured differently the local query interval may still be larger than the received record TTL. If we feel like it, we could dynamically update the interval based on the shortest TTL received that has not yet expired, but I think that goes beyond the scope of this PR and we can leave that for another time, if there is interest.
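The dynamic variant mentioned at the end could look roughly like this. It is only a sketch of the idea and not part of this PR: `query_interval_for`, `MIN_INTERVAL` and the `margin` parameter are all hypothetical, and the one second floor is an arbitrary choice to avoid querying in a tight loop.

```rust
use std::time::Duration;

/// Hypothetical lower bound so that a very short TTL never yields a
/// zero interval (the value is an arbitrary assumption).
pub const MIN_INTERVAL: Duration = Duration::from_secs(1);

/// Picks a query interval that stays below the shortest remaining TTL
/// among the records that have not expired yet.
///
/// `margin` is the safety gap between interval and TTL; the discussion
/// only asks for "shorter than", so the exact gap is an assumption.
pub fn query_interval_for(
    remaining_ttls: &[Duration],
    default_interval: Duration,
    margin: Duration,
) -> Duration {
    match remaining_ttls.iter().min() {
        // Stay below the shortest TTL, never exceed the configured default,
        // and never drop below the floor.
        Some(shortest) => shortest
            .saturating_sub(margin)
            .min(default_interval)
            .max(MIN_INTERVAL),
        // Nothing discovered yet: use the configured default.
        None => default_interval,
    }
}
```

For example, with a default interval of 4 minutes and a margin of 1 minute, a remote record with 2 minutes left would shorten the next query interval to 1 minute.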

dvc94ch · 2021-03-01T14:17:37Z

My personal opinion is that TTLs should be ignored. The application should store all known addresses and discard them if a dial failure occurs. If the dial failure is temporary, the address will be rediscovered at a later time. Most applications need an address book of some sort (substrate implements its own). I think we could consider adding a general behaviour for that to rust-libp2p that does the right thing for most cases. [0] is what ipfs-embed currently does, and users of ipfs-embed can add additional discovery mechanisms based on the gossipsub api (if you aren't using a dht you need a mechanism for peers to tell you about peers on their local subnet).

[0] https://github.com/ipfs-rust/ipfs-embed/blob/master/net/src/peers.rs
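As a minimal sketch of the address-book idea (not the ipfs-embed implementation linked above): `AddressBook` and its methods are hypothetical names, and peer ids and addresses are plain strings to keep the example self-contained, where real code would use `PeerId` and `Multiaddr`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical address book that ignores TTLs entirely.
#[derive(Default)]
pub struct AddressBook {
    addrs: HashMap<String, HashSet<String>>,
}

impl AddressBook {
    /// Record an address learned from any discovery mechanism.
    pub fn discovered(&mut self, peer: &str, addr: &str) {
        self.addrs
            .entry(peer.to_owned())
            .or_default()
            .insert(addr.to_owned());
    }

    /// Drop an address after a failed dial; forget the peer once it has
    /// no addresses left. A later discovery simply re-adds it.
    pub fn dial_failed(&mut self, peer: &str, addr: &str) {
        let now_empty = match self.addrs.get_mut(peer) {
            Some(set) => {
                set.remove(addr);
                set.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.addrs.remove(peer);
        }
    }

    /// Known addresses for a peer, sorted for stable output.
    pub fn addresses(&self, peer: &str) -> Vec<&str> {
        let mut out: Vec<&str> = self
            .addrs
            .get(peer)
            .map(|set| set.iter().map(String::as_str).collect())
            .unwrap_or_default();
        out.sort_unstable();
        out
    }
}
```

The design choice here is that a dial failure, not a timer, is what removes an address, so a quiet but reachable peer is never dropped.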

mxinden

> It removes service.rs and the separate MdnsService, collapsing the (simplified) code into the Mdns behaviour while just moving other query types from service.rs to the new query.rs module.

I am in favor of merging the two.

mxinden · 2021-03-01T14:31:09Z

protocols/mdns/src/behaviour.rs

@@ -107,6 +164,77 @@ impl Mdns {
    pub fn discovered_nodes(&self) -> impl ExactSizeIterator<Item = &PeerId> {
        self.discovered_nodes.iter().map(|(p, _, _)| p)
    }
+
+    fn inject_mdns_packet(&mut self, packet: MdnsPacket, params: &impl PollParameters) {
+        self.timeout.set_interval(self.interval);


Say there is one very noisy node on the network broadcasting an mdns packet every 1 minute. With a timeout of e.g. 5 minutes configured locally, the timeout would never fire, thus the local node would never broadcast a query and thus addresses from other nodes would expire locally, correct?

In case I am not mistaken with the assumption above, I would see two ways forward:

1. Do not reset the timeout, always sending out a query at each interval.
2. Remove the notion of TTLs for addresses as suggested by David.

I have yet to put more thoughts into this, thus please feel free to ignore the comment.

dvc94ch · 2021-03-01

This isn't quite accurate. If a peer sends a query, all other peers respond with their addresses to the multicast address. So you really only need one peer to make the query, as everyone will get the updates from everyone. This means that they will not expire locally.

romanb · 2021-03-01

I would prefer to not reset the timeout for now and leave the removal of the TTL to another PR.

dvc94ch · 2021-03-01

If the timeout is smaller than the TTL, it should still be safe to reset the timeout. As I said, it uses multicast, so everyone gets all queries and all responses.

mxinden · 2021-03-01

Thanks for clarifying @dvc94ch!
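The argument in this thread can be checked with a tiny simulation. It is a simplified model, not the behaviour code: `count_expirations` is a hypothetical helper, time advances in whole seconds, and every query by the noisy peer is assumed to make all peers re-announce to the multicast group, which refreshes the local record.

```rust
/// Simulates a local node that never queries while some other peer
/// queries every `query_every` seconds. Returns how many times a record
/// with the given `ttl` expired locally within `duration` seconds.
///
/// Assumption: each query triggers a multicast response that refreshes
/// the record; no packets are lost.
pub fn count_expirations(ttl: u64, query_every: u64, duration: u64) -> u64 {
    assert!(query_every > 0, "query_every must be non-zero");
    let mut expires_at = ttl; // record first seen at t = 0
    let mut expirations = 0;
    for t in 1..=duration {
        if t % query_every == 0 {
            // Refreshed by the multicast response to the other peer's query.
            expires_at = t + ttl;
        } else if t >= expires_at {
            expirations += 1;
            // Stay expired until the next refresh arrives.
            expires_at = u64::MAX;
        }
    }
    expirations
}
```

In this model, a peer querying every minute keeps a 5 minute TTL record alive indefinitely, whereas expirations only appear once the gap between queries exceeds the TTL.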

romanb · 2021-03-01T16:18:13Z

> My personal opinion is that TTLs should be ignored. The application should store all known addresses and discard them if a dial failure occurs. If the dial failure is temporary, the address will be rediscovered at a later time. Most applications need an address book of some sort (substrate implements its own). I think we could consider adding a general behaviour for that to rust-libp2p that does the right thing for most cases. [0] is what ipfs-embed currently does, and users of ipfs-embed can add additional discovery mechanisms based on the gossipsub api (if you aren't using a dht you need a mechanism for peers to tell you about peers on their local subnet).

That may be a good thing to do, I'd just prefer if we leave it to a possible follow-up PR, to not expand the scope here too much and to make reviewing of follow-up changes that remove the TTL easier. From my perspective as long as this PR keeps the query interval a bit lower than the TTL, it is already good to go.

romanb · 2021-03-02T09:02:09Z

I updated the CHANGELOG and will merge soon. Thanks for the PR @dvc94ch and the second review @mxinden.

dvc94ch added 3 commits February 23, 2021 17:58

Don't poll network unnecessarily.

33cb006

Fix ci.

96a29d1

Damn tokio.

d3aca54

rkuhn reviewed Feb 23, 2021

romanb reviewed Feb 25, 2021

dvc94ch added 4 commits March 1, 2021 13:53

Address review comments.

4eaa295

Merge remote-tracking branch 'upstream/master' into lower-mdns-bandwidth-requirements

d9552be

Update deps.

ed5f69a

Don't drop packet if socket is not writable.

a23b363

mxinden reviewed Mar 1, 2021

dvc94ch added 2 commits March 1, 2021 17:39

Increase TTL and rename to query_interval.

45bc8e7

Merge branch 'master' into lower-mdns-bandwidth-requirements

3cef328

romanb approved these changes Mar 1, 2021

romanb mentioned this pull request Mar 1, 2021

Update if-watch requirement from 0.1.8 to 0.2.0 #1979

Merged

mxinden approved these changes Mar 1, 2021

Update CHANGELOG.

4f5e52e

romanb merged commit b727efe into libp2p:master Mar 2, 2021

romanb mentioned this pull request Mar 22, 2021

Update to libp2p-0.36 paritytech/substrate#8420

Merged
