Connection Manager Overhaul #744
The initial post of the issue describes the main areas that need to be improved within the connection manager scope, as well as the order in which I think they should be tackled. Concrete solutions for some of the problems/features mentioned above still need to be polished. During the implementation of each milestone, a written artefact should accompany the implementation, both to align on the proposed solution and for documentation purposes. Considering the milestones table, I believe that Milestones 0-3 have a higher priority and it would be great to have them for releasing cc @jacobheun
This is really there to prevent the connection manager from culling connections beyond that lower bound. With protected connections this is less important. Once we can tag important peers the proactive dial strategy will change; right now it's just a crude "priority" dial. It would be helpful to flesh out what these proactive dial strategies actually are, and document them for clarity. Things like:
The first 3 here I think are the higher priority in terms of creating a solid set of base connections.
I don't think this is necessary and it's prone to be very wasteful. If we are proactively searching for peers that will have meaning to us (DHT/rendezvous) we don't need to do this ambient poking of the network, and keep track of who we've dialed. Active searching and connecting is the approach we should take. With larger networks active searching will still work, whereas blind dialing to check capabilities starts to fail quickly.
Short ttl decay tags would be great for this.
I'm concerned these might be bad indicators. A connection that belongs in my gossipsub mesh should probably be protected, regardless of the number of other protocols we use on that connection. We're not weighting the protocol itself, as lots of peers run gossipsub; we're weighting a specific peer due to its importance in that system. If a subsystem is exceeding its agreed allocation of connections, then we would look at disconnecting peers from it that no other system is using.
👍 In the majority of cases a ping on that connection should suffice, but we'll need to test this on the different transports. This is also really important for remote listening (webrtc-star, relay, etc).
This could be added later as needed. If too many peers are being protected it's likely just either a bug in a subsystem or user abuse of something like peering. If subsystems register for connection pools, that could be treated as the max for that system.
There was some initial discussion at libp2p/go-libp2p#238 for the polite disconnect protocol. General Note |
Thanks for your thoughts ❤️
Agreed, changed! Also modified the initial post based on your thoughts. Still need to flesh out the Watermarks observation better.
Here follow some thoughts on a WIP proposal for the Connection Manager Design. These notes focus mostly on the design to enable Proactive Dial and better connection trimming. Connection tags and gating might have some intersection here, but they are mostly isolated work, at least in terms of API and data structures, as the other components will only be consumers. cc @wemeetagain
Connection Manager + Registrar Design Proposal
Overview
Connection management can take place in a reactive or proactive fashion. This proposal focuses on a hybrid approach where the ConnectionManager component is responsible for reactive maintenance of connections, according to the configured pool size. The registrar component will receive topology registrations, where each topology handles proactive connection management by trying to guarantee that the number of connections is within the configured thresholds. Moreover, once peer and connection scoring is in place, the mentioned components will likely collaborate to create scores and ask for connections/disconnections.
The proactive management of components will replace the current
For an efficient and easy to use connection management, libp2p will need:
Other considerations:
Flows
First Start (with bootstrap discovery module)
When a libp2p node starts, it will need to bootstrap to the network and learn about peers that will enable it to fully operate (hopefully more distributed in the future). One of the common ways of doing this is via bootstrap nodes. These bootstrap nodes are important during the initial lifecycle of the node, but once the node gets to know other peers it should disconnect from them, as the bootstrap nodes will have a lot of requests from other peers. However, they should be disconnected only when enough other peers are connected.
It is worth mentioning that the above might not always be the case. For instance, if a bootstrap node is a relay and the node binds to it for incoming connections, this connection must be protected.
Subsequent Starts (with populated PeerStore)
When a libp2p node restarts, it will likely have persisted a set of peers previously discovered. The persisted data will include the known protocols of a peer, as well as its metadata. While this information is not always correct, as peers might change the protocols they run or might go offline, it provides enough value to be the first criterion. If the node can connect to enough peers for its requirements, it should not connect to the bootstrap nodes. Moreover, the node should look for peers running a relay and supporting HOP if it has autoRelay enabled.
Preemptive Disconnect
When a connection with a given node is not needed anymore (example: bootstrap node) or the maximum threshold is reached, a peer will be disconnected. In some cases, this peer might try to reconnect. While we do not have a disconnect protocol, we should guarantee that reconnect attempts from these peers are blocked and that when peers try to reconnect they use an exponential backoff and perhaps a configurable maxReconnectAttempts.
Remote disconnect
For several different reasons, a remote peer might disconnect. If this connection was important, the peer should try to reconnect with an exponential backoff and perhaps a configurable maxReconnectAttempts. The topology should be responsible for the reconnect.
Remote connect with max pool
If an inbound connection request is received and the current number of connections is already the MAX_VALUE, the inbound connection should be refused.
Libp2p Node Connection Lifecycle overview
The lifecycle of a libp2p node would be the following:
Please note that the first 2 steps can be skipped (or reduced) if the node had previously been running and already has peers stored in the PeerStore.
Implementation
Libp2p configuration
The global share of connections can be set in the libp2p connectionManager configuration. Libp2p should have sane defaults (which should evolve with the libp2p configuration effort, where we aim to provide ready to go libp2p configs for several scenarios/runtimes).
```js
const Libp2p = require('libp2p')

const libp2p = await Libp2p.create({
  // ...
  connectionManager: {
    maxConnections: 60,
    minConnections: 0,
    // TODO: Consider a number of connections that can only be used for libp2p core operations, like connect to rendezvous points, star servers, relays, ...
    // ... per https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#configuring-connection-manager
  },
  config: {
    pubsub: {
      // ... https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#customizing-pubsub
      topology: {
        min: 10,
        max: 30
      }
    },
    // core topologies configuration
  }
})
```
Libp2p core connectivity, such as connections to rendezvous points and to other peers used for listening purposes, should be protected by the relevant components / subsystems (Relay Listener, Rendezvous client).
Libp2p Connection Manager
```ts
// Interface sketch in TypeScript declaration style; `Connection` is the libp2p connection type.
declare class ConnectionManager {
  connections: Map<string, Connection[]>
  tags: Map<string, string>
  requestedConnections: number

  constructor (options: { max: number, min: number })

  // Reserve a share of the connection pool, e.g. for a topology
  requestConnectionSlots (amount: number): void
  protect (idStr: string): void
  // TODO: think better about release resources, timings...
  requestBurstConnections (amount: number): boolean
}
```
Connection Manager is responsible for:
Libp2p Discovery
When discovering peers, the context that resulted in the peer being discovered might be important for scoring and for configuring libp2p topologies.
```js
{
  peerId,
  multiaddrs,
  metadata: {
    context: Discovery.tag
    // Other important metadata
  }
}
```
This context will be useful for setting up decaying tags for bootstrap nodes, for example.
Registrar
Registrar should mediate the interactions between the topologies and the connection manager. In the beginning, it should request connectionManager slots for the requirements of each topology (maximum and minimum). It should tag connections used by the topologies to provide visibility to the connection manager for the reactive management of connections.
Libp2p Topologies
A topology will need to:
Multicodec Topologies
Libp2p protocols like Pubsub, DHT or application level protocols can create their own topology. When a topology is created, a
```js
const MulticodecTopology = require('libp2p-interfaces/src/topology/multicodec-topology')
// ...
const topology = new MulticodecTopology({
  min: 10,
  max: 30,
  multicodecs: [this.protocol],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})
this._registrarId = await this._libp2p.registrar.register(topology)
```
MetadataTopology
Libp2p will have to deal with less structured topologies, such as Bootstrap nodes. These modules should create topologies in their context and needed use case.
```js
const MetadataTopology = require('libp2p-interfaces/src/topology/metadata-topology')
// ...
const topology = new MetadataTopology({
  min: 10,
  max: 30,
  metadata: [this.metadata],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})
this._registrarId = await this._libp2p.registrar.register(topology)
// Unregister when not needed
this._libp2p.registrar.unregister(this._registrarId)
```
Modules like bootstrap should decide to register and unregister according to the PeerStore content and connection tags.
Flows
First Start
On startup, bootstrap metadata topology should kick in and connect to the bootstrap nodes. Once connections are established, these nodes should be protected while they are important. A decaying tag should be added to them. Once tags are dropped and the system minimum number of connections is reached, these connections can start to be dropped. Once all bootstrap connections are dropped the bootstrap metadata topology is unregistered. It can still probably listen for a connection manager event of low number of connections to act and restart? Removing the kept space for these connections will allow other subsystems to burst with the released resources.
Subsequent Starts (with populated PeerStore)
On subsequent starts, the bootstrap should only kick in if not enough peers exist after a given period of time.
Discover strategies
TBD
Tags + Decaying Tags
TBD
Connection Gating
TBD
Alternative Designs
Do not create the Metadata Topology
Probably there is no need for a metadata topology abstraction layer at this point, and bootstrap can handle itself.
"Token based" connection manager
The connection manager would be responsible for distributing tokens for each topology.
Challenges:
Future Work
References:
..on `Waku` with a default to 59s to send ping messages over relay to ensure the relay stream stays open. This is a workaround until [js-libp2p#744](libp2p/js-libp2p#744) is done as there are issues when TCP(?) timeouts and the stream gets closed.
..on `Waku` with a default to 5min to send ping messages over relay to ensure the relay stream stays open. This is a workaround until [js-libp2p#744](libp2p/js-libp2p#744) is done as there are issues when TCP(?) timeouts and the stream gets closed.
Port of https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/gater.go Adds a new configuration key `connectionGater` which allows denying the dialing of certain peers, individual multiaddrs and the creation of connections at certain points in the connection flow. Fixes: #175 Refs: #744 Refs: #769 Co-authored-by: mzdws <[email protected]>
related: #426 & ipfs/helia#182 & https://filecoinproject.slack.com/archives/C03K82MU486/p1689794990432059 It doesn't seem like connection-manager / auto-dial has any backoff capabilities currently. I did a quick search and only found backoff functionality in pubsub & pubsub-gossipsub: https://github.com/search?q=repo%3Alibp2p%2Fjs-libp2p%20backoff&type=code For browser libp2p functionality to work consistently without relying on a specific backend that supports our desired transports (see universal-connectivity needing a specific backend node) we need to optimize auto-dialing and connection attempts. |
As discussed in the Open Maintainers call 29-08-23, the scope of this issue is very broad and the connection manager has changed substantially since this was created. There were some valuable suggestions which have been referenced in other issues, namely:
Closing this as this has been broken down into more granular issues. |
Connection Manager Overhaul
This Issue is an EPIC to track the work related to the Connection Manager Overhaul. Each milestone context and initial thoughts are described next.
Background
As we land new features like the auto-relay and rendezvous as part of improving connectivity and discoverability in libp2p libp2p/js-libp2p#703, the connection manager overhaul becomes an important work stream to guarantee these protocols work as expected. In addition, this work will be important for some already implemented features/protocols like `webrtc-star` and `bootstrap`. Finally, this work is really important to enable the DHT work.
This overhaul should be an initial step towards the future ConnMgr v2.
Milestones Overview
0) Documentation - Baseline
1) Watermarks Observation - Proactive Dial
2) Keep Alive
3) Protect Connections - Connection Tags
4) Protect Connections - Decaying Tags
5) Watermarks Observation - Trimming
6) Connection Gater
7) Dial retry
8) Disconnect message
These milestones do not need to be worked on in the displayed sequence. For instance, Connection tags, Connection Gater and Keep Alive can be isolated and implemented.
Context
The Connection manager is responsible for managing all the connections a peer has over time. It allows users to enforce an upper bound on the total number of open connections. To avoid possible service disruptions, connections can be tagged with metadata and optionally "protected" to guarantee that essential connections are kept alive.
0) Documentation - Connection flows
Create a `DISCOVERABILITY_AND_CONNECTIVITY.md` document as a follow-up to the `GETTING_STARTED` document. After getting up to speed with how to configure and start libp2p in the getting started document, readers should move on to how to set up their peer/network according to their use case/environment, so that peers can be discovered and connections with them established.
This will be divided in two categories:
1) Watermarks observation
Proactive dial
The connection manager proactively dials known peers, in order to have a meaningful set of connections to enable a node to work as expected, according to each use case/environment.
We have been relying on the connection manager low watermark, so that the peer keeps a reasonable number of arbitrary connections. Once we introduce protected connections, as well as tagging important peers, the proactive dial strategy can be modified to keep trying to dial more meaningful peers.
Proactive dial strategies
The following dial strategies should exist:
n to them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
The above dial strategies should have sane defaults, but also support being overridden.
We should have an interval to double check whether we are connected to the most meaningful peers, as well as proactively dial on some events like peer discovery/disconnect.
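The periodic check described above could be sketched roughly as follows. This is only an illustration: `selectPeersToDial`, the `weight` field and the shape of the `known` list are hypothetical names, not part of the js-libp2p API, and the sketch assumes peers are ranked by a single numeric weight.

```javascript
// Hypothetical helper: choose which known peers to dial so that the number
// of connections reaches the low watermark, preferring higher weights.
function selectPeersToDial ({ connected, known, lowWatermark }) {
  const missing = lowWatermark - connected.size
  if (missing <= 0) return [] // already at or above the low watermark

  return known
    .filter(peer => !connected.has(peer.id)) // skip peers we already have
    .sort((a, b) => b.weight - a.weight)     // most meaningful first
    .slice(0, missing)
    .map(peer => peer.id)
}

// Example run, e.g. from an interval or a peer discovery/disconnect event
const connectedPeers = new Set(['peerA'])
const knownPeers = [
  { id: 'peerA', weight: 50 },
  { id: 'peerB', weight: 10 },
  { id: 'peerC', weight: 30 },
  { id: 'peerD', weight: 20 }
]
console.log(selectPeersToDial({ connected: connectedPeers, known: knownPeers, lowWatermark: 3 }))
```

The same function can be called from both the interval and the event handlers, so the selection logic lives in one place.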
TODO: different strategy for Startup/Persistence?
Subsystems should be able to ask the connection manager for a slice of the connection pool. A connection that belongs in my gossipsub mesh should probably be protected
Trim Connections
The connection manager trims less useful connections to be below a high watermark number.
2) Keep Alive
Currently, if a connection does not have anything going on for a while, it will time out and close.
Libp2p should guarantee that specific connections are alive. This is important for staying connected to peers that are important to us, at both the infrastructure and application layers. Remote listening (webrtc-star, relay, etc) is really important in this context.
Keep Alive should be used for protected peers via the API (Milestone 3) and Peers provided in the configuration.
In most cases, a ping on the connection should be enough, but this needs to be tested for each transport.
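A minimal sketch of such a keep-alive loop, assuming an async `ping(peerId)` function is supplied by the caller. `createKeepAlive` and its options are hypothetical names used for illustration only; how a ping is actually sent on each transport is exactly the part that still needs testing.

```javascript
// Hypothetical keep-alive helper: pings every registered peer on an interval.
function createKeepAlive ({ ping, intervalMs }) {
  const peers = new Set()
  let timer = null

  // Ping all registered peers once and resolve to the number of failed pings.
  // Failures are counted rather than thrown; reconnect policy is out of scope.
  async function tick () {
    const results = await Promise.allSettled([...peers].map(peerId => ping(peerId)))
    return results.filter(result => result.status === 'rejected').length
  }

  return {
    add: peerId => peers.add(peerId),
    remove: peerId => peers.delete(peerId),
    tick,
    start () { if (!timer) timer = setInterval(tick, intervalMs) },
    stop () { clearInterval(timer); timer = null }
  }
}

// Example: keep a relay connection alive (the ping here is a stand-in)
const keepAlive = createKeepAlive({
  ping: async peerId => { console.log(`ping ${peerId}`) },
  intervalMs: 30000
})
keepAlive.add('relayPeer')
keepAlive.tick() // a real node would call keepAlive.start() instead
```

Protected peers (Milestone 3) and peers provided in the configuration would be the natural inputs to `add`.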
3) Protect important connections
ConnManager tracks connections to peers, and allows consumers to associate metadata with each peer. This enables connections to be trimmed based on implementation-defined metadata per peer.
To see: #369
Connection tags
API
(based on go interface: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/manager.go)
tagPeer (peerId: PeerId, tag: string, weight: number) : void
untagPeer (peerId: PeerId, tag: string) : void
getTagInfo (peerId: PeerId) : TagInfo
protect (peerId: PeerId, tag: string)
unProtect (peerId: PeerId, tag: string)
isProtected (peerId: PeerId, tag: string)
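To make the API above more concrete, here is a minimal in-memory sketch. It assumes peers are identified by their string form, that `TagInfo` is a plain object with the tags and their summed value, that `unProtect` reports whether the peer is still protected under another tag, and that `isProtected` without a tag means "protected under any tag". These choices follow my reading of the go interface and are assumptions, not a settled design; `PeerTagStore` is a hypothetical name.

```javascript
class PeerTagStore {
  constructor () {
    this.tags = new Map()        // peer id -> Map of tag -> weight
    this.protections = new Map() // peer id -> Set of protection tags
  }

  tagPeer (peerId, tag, weight) {
    if (!this.tags.has(peerId)) this.tags.set(peerId, new Map())
    this.tags.get(peerId).set(tag, weight)
  }

  untagPeer (peerId, tag) {
    const peerTags = this.tags.get(peerId)
    if (!peerTags) return
    peerTags.delete(tag)
    if (peerTags.size === 0) this.tags.delete(peerId)
  }

  // Returns the peer's tags and the sum of their weights
  getTagInfo (peerId) {
    const peerTags = this.tags.get(peerId) || new Map()
    let value = 0
    for (const weight of peerTags.values()) value += weight
    return { tags: Object.fromEntries(peerTags), value }
  }

  protect (peerId, tag) {
    if (!this.protections.has(peerId)) this.protections.set(peerId, new Set())
    this.protections.get(peerId).add(tag)
  }

  // Returns true if the peer is still protected under a different tag
  unProtect (peerId, tag) {
    const peerProtections = this.protections.get(peerId)
    if (!peerProtections) return false
    peerProtections.delete(tag)
    if (peerProtections.size === 0) this.protections.delete(peerId)
    return peerProtections.size > 0
  }

  // With no tag given, checks for protection under any tag
  isProtected (peerId, tag) {
    const peerProtections = this.protections.get(peerId)
    if (!peerProtections) return false
    return tag === undefined ? peerProtections.size > 0 : peerProtections.has(tag)
  }
}

const tagStore = new PeerTagStore()
tagStore.tagPeer('peerA', 'gossipsub-mesh', 20)
tagStore.protect('peerA', 'relay-listener')
console.log(tagStore.getTagInfo('peerA'), tagStore.isProtected('peerA'))
```

Keeping protections separate from weighted tags means a protected peer is never a trim candidate, whatever its weight.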
Data structures
Integration with Trim connections
Connection tags will allow the trimming to become more intelligent at this stage. Peers should be iterated and the weight of the tags should be used as a first criterion.
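As a rough sketch of how tag weights could drive trimming: the function below picks the lowest-valued unprotected peers until the connection count is back at the high watermark. `selectPeersToTrim` and the `{ id, value, protected }` peer shape are hypothetical, and using the summed tag value as the only criterion is an assumption for illustration.

```javascript
// Hypothetical helper: pick peers to disconnect so that the number of
// connections drops back to the high watermark.
function selectPeersToTrim ({ peers, highWatermark }) {
  const excess = peers.length - highWatermark
  if (excess <= 0) return [] // nothing to trim

  return peers
    .filter(peer => !peer.protected)     // protected peers are never trimmed
    .sort((a, b) => a.value - b.value)   // lowest tag value goes first
    .slice(0, excess)
    .map(peer => peer.id)
}

const openPeers = [
  { id: 'peerA', value: 10, protected: false },
  { id: 'peerB', value: 0, protected: false },
  { id: 'peerC', value: 5, protected: true },
  { id: 'peerD', value: 1, protected: false },
  { id: 'peerE', value: 20, protected: false }
]
console.log(selectPeersToTrim({ peers: openPeers, highWatermark: 3 }))
```

If too many peers are protected, fewer than `excess` peers are returned and the node stays above the watermark, which matches the earlier observation that over-protection is likely a bug or misuse.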
4) Decaying tags
Note: Inspired by go-libp2p https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/decay.go
A decaying tag is one whose value automatically decays over time. The decay behaviour is encapsulated in a user-provided decaying function (DecayFn). The function is called on every tick (determined by the interval parameter), and returns either the new value of the tag, or whether it should be erased altogether.
We do not set values on a decaying function, but "bump" decaying tags by a delta value. This calls the BumpFn with the old value and the delta, to determine the new value.
While users should be able to provide their own functions, we should provide some preset functions to be used. Behaviours that are straightforward to implement include:
This is particularly important for scenarios like the Bootstrap discovery. When it starts, these connections are really important to get to know other peers. But as time passes and new connections exist, peers should disconnect from the bootstrap nodes.
API
setDecayingTag(tag: string, interval: time, decayFn: function, bumpFn: function)
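A minimal sketch of how a decaying tag with these two functions could behave. `DecayingTag`, `decayFixed` and `bumpSum` are hypothetical names; the decay function here returns `{ value, remove }`, which is one possible way to express "new value or erase", and `tick()` is called manually instead of from a timer so the behaviour is easy to follow.

```javascript
// Preset decay (assumed shape): subtract a fixed step on every tick and
// remove the tag once it reaches zero or below.
const decayFixed = step => value => {
  const next = value - step
  return { value: next, remove: next <= 0 }
}

// Preset bump: add the delta to the current value
const bumpSum = (value, delta) => value + delta

class DecayingTag {
  constructor ({ name, decayFn, bumpFn }) {
    this.name = name
    this.decayFn = decayFn
    this.bumpFn = bumpFn
    this.values = new Map() // peer id -> current tag value
  }

  bump (peerId, delta) {
    const current = this.values.get(peerId) || 0
    this.values.set(peerId, this.bumpFn(current, delta))
  }

  // Called once per interval by whoever owns the timer
  tick () {
    for (const [peerId, value] of this.values) {
      const { value: next, remove } = this.decayFn(value)
      if (remove) this.values.delete(peerId)
      else this.values.set(peerId, next)
    }
  }

  getValue (peerId) {
    return this.values.get(peerId) || 0
  }
}

// Bootstrap example: the tag starts high and fades out over three ticks
const bootstrapTag = new DecayingTag({ name: 'bootstrap', decayFn: decayFixed(10), bumpFn: bumpSum })
bootstrapTag.bump('bootstrapPeer', 30)
bootstrapTag.tick()
console.log(bootstrapTag.getValue('bootstrapPeer'))
```

In the bootstrap scenario, once the tag is erased the connection no longer carries any weight and becomes a normal trim candidate.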
5) Connection Gater
TODO: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/gater.go
Related: #175
6) Connection Retry
Retry a dial if it fails on a first attempt.
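One way this could look, borrowing the exponential backoff and configurable attempt limit suggested in the design comments on this issue. `dialWithRetry`, `backoffDelay` and the option names are hypothetical, and `dial` stands for any async function that performs the actual dial.

```javascript
// Delay before the next attempt: baseMs, 2x, 4x, ... capped at maxMs
function backoffDelay (attempt, { baseMs = 1000, maxMs = 60000 } = {}) {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Hypothetical wrapper: retries `dial` until it succeeds or attempts run out
async function dialWithRetry (dial, { maxAttempts = 3, baseMs = 1000, maxMs = 60000 } = {}) {
  let lastError
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await dial()
    } catch (err) {
      lastError = err
      // Wait before retrying, except after the final attempt
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelay(attempt, { baseMs, maxMs })
        await new Promise(resolve => setTimeout(resolve, delay))
      }
    }
  }
  throw lastError
}

// Delays for the first four retries with the default options
console.log([0, 1, 2, 3].map(attempt => backoffDelay(attempt)))
```

A production version would probably add jitter and distinguish retryable errors from permanent ones; both are left out here to keep the sketch short.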
7) Disconnect
Sometimes it will be possible to have flows where a peer A wants to disconnect from peer B because it has a lot of connections, all of them more important than the connection with peer B. However, peer B wants to be connected to peer A. A message should be exchanged so that peer B understands that it should not retry (for a given time?), and eventually a peer exchange. This needs to be spec'ed. Initial discussion at libp2p/go-libp2p#238
Notes
References