Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peer Management: Connection and Disconnection #914

Closed
17 of 26 tasks
fryorcraken opened this issue Aug 31, 2022 · 17 comments
Closed
17 of 26 tasks

Peer Management: Connection and Disconnection #914

fryorcraken opened this issue Aug 31, 2022 · 17 comments
Assignees
Labels
E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details track:restricted-run Restricted run track (Secure Messaging/Waku Product), e.g. filter, WebRTC

Comments

@fryorcraken
Copy link
Collaborator

fryorcraken commented Aug 31, 2022

Planned start date:
Due date:

Summary

In a browser environment, loss of connectivity may happen.

When it happens, this would impede the application to receive and send messages.

Strategies need to be designed to:

  • detect a loss of connection
  • Automated actions when resuming a connection
  • cope with lack of connectivity when trying to use a protocol: retry strategy, feedback to the app and guideline on how to handle

Proposed Solutions

The result of this issue would be a mix of:

  • Documentation to guide the developers in handling failure
  • Strategy embedded in the protocols (e.g. retry)
  • Helper library that can provide utility to developers (e.g. auto store query, auto filter renewal)

At this stage it is not sure what should go in waku core and what should be a library helper.

Acceptance Criteria

  • Upon start-up, app dial a reasonable set of peers from peer discovery methods (no persistence, only dial 1 bootstrap peer)
  • Upon start-up, if some peers are not reachable, app attempt fallback method (try 2nd bootstrap peer)
  • Following above rules, when disconnected to a remote peer, attempts to reconnect
  • Clear guidelines for developer to know when node is connected or disconnected Advanced docs for js-waku docs.waku.org#104 (comment)

Tasks

Tasks Moved to Out of Scope

Notes

libp2p/js-libp2p#744

RAID (Risks, Assumptions, Issues and Dependencies)

@fryorcraken fryorcraken added this to Waku Aug 31, 2022
@fryorcraken fryorcraken added track:restricted-run Restricted run track (Secure Messaging/Waku Product), e.g. filter, WebRTC track:production labels Aug 31, 2022
@fryorcraken fryorcraken moved this to Todo in Waku Aug 31, 2022
fryorcraken added a commit that referenced this issue Sep 1, 2022
This module will just consume a generate Waku and Waku Relay interfaces
so we already we want to extract it.

It is also one opinionated to handle connection management, other ways
might come with #914.
status-bors-ng bot added a commit that referenced this issue Sep 1, 2022
919: include wait_for_remote_peer in exports map r=fryorcraken a=fryorcraken

This module will just consume a generate Waku and Waku Relay interfaces so we already we want to extract it.

It is also one opinionated to handle connection management, other ways might come with #914.

Co-authored-by: fryorcraken.eth <[email protected]>
@fryorcraken fryorcraken changed the title Connection Management Peer Management Dec 13, 2022
@fryorcraken
Copy link
Collaborator Author

Need to re-groom this issue to take peer discovery in account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What ndes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable
    ```

@danisharora099 danisharora099 self-assigned this Jan 3, 2023
@danisharora099
Copy link
Collaborator

danisharora099 commented Jan 4, 2023

Since this issue is a considerable overhaul, I'd like to be sure of my interpretation of the same.

I think the biggest one is:

* detect a loss of connection

For this, my interpretation is that libp2p's connection manager doesn't emit an event when a peer disconnects/streams are closed (on peer:disconnect) (ref: status-im/status-web#288 (comment)) and that this will either be something we'll be handling ourselves (similar to status-im/status-web#288 or xmtp/xmtp-js#128) or an upstream change to libp2p (cont of libp2p/js-libp2p#939 / libp2p/js-libp2p#744)
Please let me know if that's a fair interpretation.

Secondly, an explanation on the following, of how this fits with the initially described scope of the PR, will also be greatly appreciated.

Need to re-groom this issue to take peer discovery in account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What ndes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable
    ```

Other than this, appreciate that the issue has been written very descriptively and will have a great impact!

cc @fryorcraken @felicio

@fryorcraken
Copy link
Collaborator Author

As mentioned. this is a considerable overhaul and would need to be tackled iteratively. Each iteration can increase the functionality/complexity of the manager while taking onboard feedback from dogfooding.

I suggest to focus the first iterations on using the peer discovery protocols dns discovery and peer exchange with the following logic:

  1. when using createLightNode, apply default values of wanting 2 light push nodes, 2 filters nodes and 1 store node (user can override values).
  2. DNS discovery bootstrap method is used, it discovers and emit nodes that are tagged as bootstrap
  3. Connection Manager processes the bootstrap node
    i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
    ii. Use peer exchange to discover more nodes
  4. Connection Manager discards/saves for later other bootstrap nodes
  5. peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange
  6. peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
  7. connect to more peer-exchange node to fulfill requirements set in (1) if needed.

(3) introduces an issue: the first node discovery by DNS discovery will always be the same given a given enr tree. Hence we may need to introduce some pseudo random shuffling logic. this should probably be done in the connection manager.

The final result above can already be split in several tasks/PRs.

Then, we can add error resilience on top of it, I'd recommend in the following order:

  1. Fail to dial a node: handle this and try another node (easy as all located in connection manager)
  2. protocol failed on a node (ie, light push query failed): disconnect node and try other node: this is harder because the protocol needs to feedback the failure to the manager which can then handle the disconnect/connection. Also the protocol should know whether the error is worthy a disconnection. This is not really feasible now (no differentiations in errors) but may be with the waku store protocol overhaul).
  3. Node disconnects: as you mentioned, we need to have feedback from libp2p for that so it may be more difficult.

@danisharora099
Copy link
Collaborator

danisharora099 commented Jan 9, 2023

  1. Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?

  • Tag the peer within the PeerStore
  • Dial

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

  • why do we want to “later remove this logic”?
  1. Connection Manager discards/saves for later other bootstrap nodes
  • How does the ConnectionManager do that? Do we mean the PeerStore/ AddressManager?
  • (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
  1. peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange
  2. peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
  • “that’s our second node” — this step is skipped if the initial node requirements are met, correct?
  • the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)
  1. Fail to dial a node: handle this and try another node (easy as all located in connection manager)
  • how can we reproduce a non-dialable node?
  • in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with autoDial:true (which is used by default), all nodes are tried to connect to)

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p
for eg:

  • discovery, tagging, dial attempts, etc are all handled implicitly

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)

Please let me know if I'm missing something

@fryorcraken
Copy link
Collaborator Author

  1. Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?

* Tag the peer within the PeerStore

The tagging is done by the discovery service:

await this.components.peerStore.tagPeer(

* Dial

The dialling is done by the connection manager:

libp2p.dial(peerId).catch((err) => {

We will have to make this trivial logic more complex in the future and have a proper ConnectionManager class/module.

By "process" I mean deciding whether to dial and dialing.

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

* why do we want to “later remove this logic”?

Because if all js-waku nodes connects to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes.
It would also mean that the bootstrap nodes would need to be high performing node to serve all js-waku nodes.

Instead, I propose for the bootstrap node to only be used for bootstrapping, ie, getting access to the network. meaning only used for discv5 and peer-exchange.

  1. Connection Manager discards/saves for later other bootstrap nodes
* How does the `ConnectionManager` do that? Do we mean the `PeerStore`/  `AddressManager`?

I don't know. The node will be tagged as bootstrap and store in peerStore so I guess that's good enough.
Or maybe we'll have to also store the nodes in the ConnectionManager. We'll have to decide at implementation time if the peerStore gives us enough.

The ConnectManager is the new module that handles most of the logic described here.

* (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
  1. peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange
  2. peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
* “that’s our second node” — this step is skipped if the initial node requirements are met, correct?

Yes. For the rest of the logic I assume the proposed default requirements are applied.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.

  1. Fail to dial a node: handle this and try another node (easy as all located in connection manager)
* how can we reproduce a non-dialable node?

Pass an invalid multiaddr.

* in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with `autoDial:true` (which is used by default), all nodes are tried to connect to)

We'll have to review in details the retry logic when it's time to implement.

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:

* discovery, tagging, dial attempts, etc are all handled implicitly

We should not dial every node we discover. This is why custom connection management is needed.

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)

Please let me know if I'm missing something

@danisharora099
Copy link
Collaborator

danisharora099 commented Jan 10, 2023

By "process" I mean deciding whether to dial and dialing.

How are we deducing whether to dial or not?
If there's a peer available other than bootstrap, we give it more preference and dial to it instead of the bootstrap peer - correct? (to increase decentralisation)

Because if all js-waku nodes connects to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high performing node to serve all js-waku nodes.
That makes sense! We should prioritize nodes found by discovery mechanisms other the bootstrapping. Noted.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.

We should not dial every node we discover. This is why custom connection management is needed.

My interpretation so far:
Our connection priority should be something along the lines of:

  • connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)
  • find more peers using other discovery mechanisms, via the bootstrap peer
  • connect to all new peers found until requirements are fulfilled

is that a fair interpretation?

@danisharora099
Copy link
Collaborator

Also, as concluded from #1117 (comment), libp2p by default autodials.

Considering the scope of this PR, it's best to default autoDial to false considering we don't want to connect to all discovered bootstrap nodes immediately.

@fryorcraken
Copy link
Collaborator Author

My interpretation so far: Our connection priority should be something along the lines of:

* connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)

* find more peers using other discovery mechanisms, via the bootstrap peer

* connect to all new peers found until requirements are fulfilled

is that a fair interpretation?

Yes.

@fryorcraken
Copy link
Collaborator Author

As discussed with @danisharora099 the first step was actually a focus on connections to node discovered, Need to update description to record this first step.

Next, would be exploratory work to understand what information we can get when a node is disconnected.

@fryorcraken fryorcraken changed the title Peer Management [Epic] Peer Management Apr 6, 2023
@fryorcraken fryorcraken added the milestone Tracks a subteam milestone label Apr 6, 2023
@fryorcraken fryorcraken changed the title [Epic] Peer Management [Milestone] Peer Management Apr 6, 2023
@fryorcraken
Copy link
Collaborator Author

fryorcraken commented Jun 29, 2023

Blocked by #1412

@danisharora099 danisharora099 removed the blocked This issue is blocked by some other work label Aug 10, 2023
@fryorcraken fryorcraken changed the title [Milestone] Peer Management: Connection and Disconnection [Epic] Peer Management: Connection and Disconnection Aug 24, 2023
@fryorcraken fryorcraken added epic Tracks a yearly team epic (only for waku-org/pm repo) E:2023-light-protocols and removed milestone Tracks a subteam milestone E:2023-light-protocols labels Aug 24, 2023
@fryorcraken fryorcraken added E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details and removed E:2023-peer-mgmt epic Tracks a yearly team epic (only for waku-org/pm repo) labels Sep 8, 2023
@fryorcraken fryorcraken changed the title [Epic] Peer Management: Connection and Disconnection Peer Management: Connection and Disconnection Sep 8, 2023
@danisharora099
Copy link
Collaborator

referring from #1412,

upon further investigation and taking into context previous findings, it's safe to conclude:

  • connection:close: monitor a permanent connection close between the local & remote node (which should usually be triggered along with peer:disconnect for cases where we only have one connection with the remote node)
  • peer:disconnect: monitor permanent disconnections with peers (this would imply not that no the underlying connection(s) have been permanently closed, and the only way to communicate with this peer again is to open a new connection/reconnect)
  • pings will fail when there are temporary network degradations or reachability issues. this does not mean that the underlying connection has been closed.

to address part of the acceptance criteria from this milestone,
the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

as we concluded with #1403 (comment), js-waku redials automatically after the 10 mins mark, which is not a conscious disconnection and requires reconnection so perhaps the above question is enforced by default already?

cc @fryorcraken

@danisharora099
Copy link
Collaborator

Weekly Update

@danisharora099
Copy link
Collaborator

danisharora099 commented Oct 10, 2023

the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

for this, we want to simply attempt reconnections


For library consumers,

For Filter,

  • we should open a subscription, and also send recurring SUBSCRIBE_PING to the node to ensure we're maintaing active subscriptions

    • if we don't have an active subscription, we should reinitiate a subscription with the node
  • in case of reconnection, SUBSCRIBER_PING should be used to check for active subscription on the service nodes' end

  • write explicit tests to cover this hypothetical as much as possible Advanced docs for js-waku docs.waku.org#104 (comment)

  • add something like manage your filter subscriptions to advanced js-waku docs Advanced docs for js-waku docs.waku.org#104 (comment)

@danisharora099
Copy link
Collaborator

danisharora099 commented Oct 13, 2023

Weekly Update

  • achieved: reached a conclusion tackling deliberate vs accidental disconnections, PRs opened to handle Filter subscriptions on disconnection/reconnections, iterative fixes on addressing multiple dial attempts for same peer, fixes around keep alive pings
  • next: getting reviews & merging these PRs which should enable us to close this epic 🥳

@danisharora099 danisharora099 moved this from In Progress to Code Review / QA in Waku Oct 13, 2023
@danisharora099
Copy link
Collaborator

danisharora099 commented Oct 23, 2023

Weekly Update

  • achieved: The Connection and Disconnection Peer Management epic has been closed

@github-project-automation github-project-automation bot moved this from Code Review / QA to Done in Waku Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details track:restricted-run Restricted run track (Secure Messaging/Waku Product), e.g. filter, WebRTC
Projects
Archived in project
Development

No branches or pull requests

3 participants