feat(gossipsub): use signed peer records on handle prune #3579
Conversation
Thanks! This is definitely an improvement by itself.
```rust
// No duplicates
assert_eq!(dials_set.len(), config.prune_peers());

// all dial peers must be in px
assert!(dials_set.is_subset(&px_peer_ids.iter().copied().collect::<HashSet<_>>()));
```
I think it would be good to assert the presence of the addresses too!
Any suggestions on how to retrieve the addresses from the `DialOpts` struct?
rust-libp2p/protocols/gossipsub/src/behaviour/tests.rs, lines 1896 to 1897 in 3f5dd63:
```rust
// TODO: How to extract addresses from DialOpts to assert them in the test?
NetworkBehaviourAction::Dial { opts } => opts.get_peer_id(),
```
Or should I make this fn public?
rust-libp2p/swarm/src/dial_opts.rs, lines 130 to 132 in d81f947:
```rust
pub(crate) fn get_addresses(&self) -> Vec<Multiaddr> {
    self.addresses.clone()
}
```
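For illustration only: if `get_addresses` were exposed to this test (for example by making it `pub`), the assertion could look roughly like the sketch below. `px_addresses` is a hypothetical map from peer id to the addresses taken from the px list; it is not in the current test code.

```rust
// Hypothetical sketch, not the current test code: collect (peer_id, addresses)
// pairs from the emitted Dial events.
let dials: Vec<(PeerId, Vec<Multiaddr>)> = gs
    .events
    .iter()
    .filter_map(|e| match e {
        NetworkBehaviourAction::Dial { opts } => {
            opts.get_peer_id().map(|p| (p, opts.get_addresses()))
        }
        _ => None,
    })
    .collect();

// Every dialled px peer should carry exactly the addresses from the prune message.
for (peer_id, addresses) in &dials {
    assert_eq!(addresses, &px_addresses[peer_id]);
}
```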
I am not really a fan of the style of the tests in gossipsub anyway. We should only be testing behaviours by running them in a `Swarm`, not by calling these lifecycle functions manually. We now have a new crate, `libp2p-swarm-test`, which makes it easy to set up swarms for testing.
How about we do the following:
- Implement "sending" of signed peer records by keeping a hashmap of records in the behaviour where we perform a lookup upon prune
- At the moment, we don't have a way of adding records to this hashmap so in reality, there won't ever be any
- Within the test, we can add a record (by accessing the field directly)
- We can then write a test using `libp2p-swarm-test` that ends up sending a `Prune` (which will then contain the `SignedPeerRecord`) to a node
- We can then assert that the `Swarm` establishes a connection to the node specified in the prune message
What do you think?
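A minimal sketch of the first bullet, purely for illustration. The struct and field names, and the idea of tests writing to the field directly, are assumptions of this sketch, not existing gossipsub API; the `SignedEnvelope` import path may also differ.

```rust
use std::collections::HashMap;

use libp2p_core::{PeerId, SignedEnvelope};

/// Illustrative only: signed peer records the behaviour could attach to
/// outgoing PRUNE messages. There is no public API to populate this yet,
/// so outside of tests (which can write to the field directly) it stays empty.
#[derive(Default)]
struct SignedRecordStore {
    records: HashMap<PeerId, SignedEnvelope>,
}

impl SignedRecordStore {
    /// Look up the record to attach to a PRUNE's px entry, if we have one.
    fn record_for(&self, peer: &PeerId) -> Option<&SignedEnvelope> {
        self.records.get(peer)
    }
}
```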
going this way seems reasonable to me
That makes sense. Though, if all tests in this package are written with a consistent style, should this PR stick to that style? We could then refactor them to use swarm-test in another PR.
> That makes sense. Though, if all tests in this package are written with a consistent style, should this PR stick to that style? We could then refactor them to use swarm-test in another PR.
Eventually yes but that is a massive effort. I don't mind mixing styles in a transition period if we get a better, more feature-oriented test in exchange.
Regarding changelog: Given that this is backwards-compatible, can you add an entry to the `# 0.45.0` section?
cargo-semver-checks CI step errors with
Any pointers on what could be the root issue? https://github.com/libp2p/rust-libp2p/actions/runs/4380836480/jobs/7668318386#step:14:409
Uhm, this is likely because the caches are stale. I'll delete them.
This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏
Let me know if you need a review from my end. Also adding @divagant-martian and @AgeManning for visibility and, if time permits, a review.
Yeah nice. Looks good. Just a small comment
```diff
 // mark as px peer
 self.px_peers.insert(peer_id);

 // dial peer
 self.events.push_back(NetworkBehaviourAction::Dial {
-    opts: DialOpts::peer_id(peer_id).build(),
+    opts: DialOpts::peer_id(peer_id).addresses(addresses).build(),
```
I know this should have been done earlier, probably on line 1617. I think we probably don't want to try to connect to all the peers; we probably only want to try to get up to `mesh_n_high`. I'd suggest we set `n` on line 1617 to be the min of `self.config.prune_peers()` and `self.mesh.get(&topic_hash).map(|x| x.len()).unwrap_or(0)`, or something like that. That may mean calculating this in `handle_prune()`, because we don't have the topic hash here.
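One possible reading of that suggestion, sketched inside `handle_prune` where `topic_hash` is in scope. This only illustrates the idea of combining `prune_peers()` with the remaining mesh capacity; whether `mesh_n_high` or `mesh_n_low` is the right threshold is discussed further below.

```rust
// Illustration only: cap the number of px peers we try to dial.
let current_mesh_len = self
    .mesh
    .get(&topic_hash)
    .map(|peers| peers.len())
    .unwrap_or(0);

// Never dial more peers than we are allowed to exchange (`prune_peers`) and
// never more than we need to fill the mesh back up to `mesh_n_high`.
let n = std::cmp::min(
    self.config.prune_peers(),
    self.config.mesh_n_high().saturating_sub(current_mesh_len),
);
```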
Agree with the needed change in general.
Note that `prune_peers`, according to our docs, configures the number of peers we send to others, not the number of peers we connect to. Either the code or the docs need to be adjusted. Going by the current docs, it should probably not be used here (or in `handle_prune`).
It's also not clear to me if handling the prune is the best place to connect to those peers. We already graft peers in the heartbeat when below `mesh_n_low`. We might want to connect to these peers to accelerate the process, or we might want to prioritize already connected peers since they are somewhat more trusted. In any case, both `mesh_n_high` and `mesh_n_low` should probably be used to select the number of peers from this set we want to try.
PeerExchange is meant to give you more peers in case you are only connected to a few, right?
Perhaps it would be better to just add those to some kind of list and let the grafting logic upon heartbeats do the connecting?
For the case where we don't have many peers, that would otherwise "fail" in that we can't connect to as many peers as we'd like.
Those are good points, and the path forward is a matter of opinion. Maintainers and @divagant-martian @AgeManning, could you comment on what makes more sense?
IMO connecting eagerly instead of accumulating keeps it simpler; we just limit the count to what the mesh needs. For reference, the Go impl eagerly schedules all peer exchanges onto the connect channel (with max cap MaxPendingConnections).
I've not touched gossipsub much so I'll leave it to @mxinden and @AgeManning to make a decision here.
> For reference, the Go impl eagerly schedules all peer exchanges onto the connect channel (with max cap MaxPendingConnections).
That is an interesting data point.
For the sake of unblocking this PR we can go with eager connection, since as Age mentioned, this needs to be explicitly turned on.
@AgeManning Should `do_px` be used to not consume px by default? Or should we add a new config param like `use_px` to split provision and consumption of px?
I think we should have a more explicit config. I think we should opt for `enable_peer_exchange()`, which enables both sending and receiving. I don't think users really need granularity over one or the other (we can add it later if it's requested).
I'd suggest renaming `do_px` to `enable_peer_exchange` and having it off by default.
@dapplion to achieve this you technically need to change nothing; this is already the case. Age's suggestion is a nice-to-have, but it would change the public API, so the version should be updated accordingly. I'm fine with doing or not doing this change.
We can do it in a non-breaking way by adding `enable_peer_exchange` on `Config` and deprecating the `do_px` function. Internally, the field can be renamed without issues.
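For illustration, a sketch of that non-breaking variant on the config builder. The method bodies, the renamed internal field, and the deprecation note are assumptions of this sketch, not the current libp2p-gossipsub API.

```rust
impl ConfigBuilder {
    /// Enables peer exchange: attach px peers to PRUNE messages we send and
    /// dial px peers from PRUNE messages we receive. Off by default.
    pub fn enable_peer_exchange(&mut self) -> &mut Self {
        self.config.enable_peer_exchange = true;
        self
    }

    /// Kept for backwards compatibility; forwards to the new name.
    #[deprecated(note = "Use `enable_peer_exchange` instead")]
    pub fn do_px(&mut self) -> &mut Self {
        self.enable_peer_exchange()
    }
}
```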
partial review since it's getting late in my timezone
Except for a risk of panic and the test updates, I think this is close to ready.
protocols/gossipsub/src/behaviour.rs (outdated):
```rust
let max_px_to_connect = self.config.mesh_n_low()
    - self.mesh.get(&topic_hash).map(|x| x.len()).unwrap_or(0);
```
This is likely to panic: the subtraction can underflow when the mesh already holds more than `mesh_n_low` peers, which panics in debug builds.
Also, I think you want `mesh_n_high` instead of `mesh_n_low`? Both are fine depending on what we want; @AgeManning suggested the high one, which would be interpreted as capacity rather than need.
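One way to sidestep the underflow, shown only as an illustration (using `mesh_n_high` per the comment above; the choice between the two thresholds is still open):

```rust
// Saturating subtraction stops at zero when the mesh is already at or
// above the threshold, so no px peers would be dialled in that case.
let max_px_to_connect = self
    .config
    .mesh_n_high()
    .saturating_sub(self.mesh.get(&topic_hash).map(|x| x.len()).unwrap_or(0));
```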
protocols/gossipsub/src/behaviour.rs (outdated):
```rust
//TODO: Once signed records are spec'd: Can we use peerInfo without any IDs if they have a
// signed peer record?
px.retain(|p| p.peer_id.is_some());

fn px_connect(&mut self, mut px: Vec<PeerInfo>, n: usize) {
    if px.len() > n {
        // only use at most prune_peers many random peers
```
Another comment line that is no longer in sync with the code.
```rust
    .collect();

// Exactly config.prune_peers() many random peers should be dialled
assert_eq!(dials.len(), config.prune_peers());
```
likely to fail?
Rebased to latest master to fix conflicts. Since the discussion is getting long I'll link the remaining points:
This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏
Closing PR for now since I have no capacity to address it. If there's a real need for this feature I'm happy to prioritise it again, but from offline discussions it seems like that is not the case. Sorry for polluting your PR inbox for a while!
Description
Gossipsub's PeerExchange functionality suggests peers to connect to upon pruning a node from the mesh. Currently, we cannot implement sending signed peer records to other nodes because we don't have a peer store (yet). This patch implements handling incoming signed peer records from other nodes in case we get pruned from the mesh.
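As a rough illustration of what consuming such a record could look like: decode the signed envelope carried in a PRUNE's px entry into a peer record and feed the contained addresses into the dial. The helper name and the exact decoding path are assumptions of this sketch, not the code in this patch.

```rust
use libp2p_core::{Multiaddr, PeerId, PeerRecord, SignedEnvelope};

/// Illustration: extract (peer_id, addresses) from a signed peer record
/// received in a PRUNE's px list. If the envelope's signature or payload
/// is invalid, decoding fails and the entry is skipped.
fn peer_from_signed_record(envelope: SignedEnvelope) -> Option<(PeerId, Vec<Multiaddr>)> {
    let record = PeerRecord::from_signed_envelope(envelope).ok()?;
    Some((record.peer_id(), record.addresses().to_vec()))
}
```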
Related #2398.
Notes
N/A
Links to any relevant issues
Open Questions
N/A
Change checklist