network/litep2p: Investigate low peer count for long-running node #4925

Open
lexnv opened this issue Jul 2, 2024 · 1 comment

lexnv commented Jul 2, 2024

Two Kusama nodes were started on 23rd April and left running.

{__name__="substrate_build_info", chain="ksmcc3", instance="localhost:9615", job="substrate_node", name="gray-vase-1131", version="1.10.0-1a45bd88348"}

The commit was based on a2a049d, from branch:

Both long-running nodes present a low peer count.

2024-07-02 11:34:18.796  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 1.4kiB/s ⬆ 2.9kiB/s    
2024-07-02 11:34:23.797  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 6 B/s ⬆ 6 B/s    
2024-07-02 11:34:28.797  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 1.0kiB/s ⬆ 1.4kiB/s    
2024-07-02 11:34:33.798  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 0 ⬆ 0   
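
To track how long a node stays at a low peer count, the informant lines above can be scanned for the `(N peers)` field. The following is a minimal sketch, assuming the log format shown in this issue; the function names and the threshold of 5 are illustrative choices, not part of Substrate.

```rust
/// Extracts the peer count from an informant line such as
/// "... Idle (1 peers), best: #23863740 ...".
fn peer_count(line: &str) -> Option<u32> {
    // Locate the " peers)" marker, then walk back to the opening parenthesis.
    let end = line.find(" peers)")?;
    let start = line[..end].rfind('(')? + 1;
    line[start..end].parse().ok()
}

/// Counts informant lines whose peer count is below `threshold`.
/// The threshold is an assumption of this sketch, not a Substrate setting.
fn low_peer_lines(log: &str, threshold: u32) -> usize {
    log.lines()
        .filter_map(peer_count)
        .filter(|&peers| peers < threshold)
        .count()
}

fn main() {
    let log = "\
2024-07-02 11:34:18.796  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 1.4kiB/s ⬆ 2.9kiB/s
2024-07-02 11:34:23.797  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #23863740 (0x8b9d…8738), finalized #23863736 (0x104a…6c67), ⬇ 6 B/s ⬆ 6 B/s";
    println!("lines below threshold: {}", low_peer_lines(log, 5));
}
```

Lines that do not contain a peer count are skipped, so the same function can be run over a full node log.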

dmitry-markin commented

What about inbound connections? Does the node have an unreachable p2p port? A single sync peer means that there are no inbound connections either.
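
One rough way to answer the reachability question is to attempt a plain TCP connection to the node's p2p port from a different host. This is a sketch under assumptions: the address is passed on the command line, the fallback `127.0.0.1:30333` uses the default Substrate p2p port, and `is_reachable` is a name made up for this example. A successful TCP handshake only shows that the port is open; it does not exercise the p2p protocol itself.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
fn is_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

fn main() {
    // Usage: pass "IP:PORT" as the first argument. The fallback is only a
    // placeholder for a node listening locally on the default p2p port.
    let target = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "127.0.0.1:30333".to_string());
    let addr: SocketAddr = target.parse().expect("expected IP:PORT");

    if is_reachable(addr, Duration::from_secs(3)) {
        println!("{addr} accepts TCP connections");
    } else {
        println!("{addr} is not reachable over TCP");
    }
}
```

Running the check from the node's own machine says little about inbound reachability, since NAT and firewall rules only apply to external traffic.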

github-merge-queue bot pushed a commit that referenced this issue Jul 20, 2024
This PR increments the BEEFY metric for having no peers to query a
justification from.
The metric is incremented when we submit a request to a known peer,
but that peer fails to provide a valid response and there are no
further peers to query.

While at it, add a few extra details to identify the number of active
peers and cached peers, together with the request error.

Part of:
- #4985
- #4925

---------

Signed-off-by: Alexandru Vasile <[email protected]>
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
lexnv added a commit to paritytech/litep2p that referenced this issue Aug 21, 2024
… error reporting (#206)

The purpose of this PR is to pave the way for making the Identify
protocol more robust. The protocol is currently linked to the low number
of peers and connectivity issues seen over a long period of time:
- paritytech/polkadot-sdk#4925

This PR adds a coherent `DialError` that exposes the minimal information
users need to know about dial failures.
- paritytech/polkadot-sdk#5239

A new litep2p event is added for reporting multiple dial errors that
occur on different protocols back to the user:

```rust
    /// A list of multiple dial failures.
    ListDialFailures {
        /// List of errors.
        ///
        /// Depending on the transport, the address might be different for each error.
        errors: Vec<(Multiaddr, DialError)>,
    },
```

This event eases the debugging of substrate connectivity issues. At the
same time, it can be used in a future PR to tell the Identify protocol
which self-reported peer addresses are unreachable:
- #203
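
As an illustration of how a consumer might use such an event for debugging, the sketch below groups the reported failures by error kind. It is self-contained and therefore uses stand-ins: `Multiaddr` is a plain `String` alias, the `DialError` variants and the `Event` enum are hypothetical simplifications, and `failures_by_kind` is a helper invented for this example. Only the `ListDialFailures { errors }` shape is taken from the snippet above.

```rust
use std::collections::HashMap;

// Stand-in for the real multiaddr type so the sketch compiles on its own.
type Multiaddr = String;

// Hypothetical subset of dial errors; the real `DialError` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum DialError {
    Timeout,
    Refused,
}

// Hypothetical event enum holding only the variant quoted above plus a catch-all.
enum Event {
    ListDialFailures { errors: Vec<(Multiaddr, DialError)> },
    Other,
}

/// Counts dial failures per error kind for one event.
fn failures_by_kind(event: &Event) -> HashMap<DialError, usize> {
    let mut counts = HashMap::new();
    match event {
        Event::ListDialFailures { errors } => {
            for (_address, error) in errors {
                *counts.entry(*error).or_insert(0) += 1;
            }
        }
        Event::Other => {}
    }
    counts
}

fn main() {
    let event = Event::ListDialFailures {
        errors: vec![
            ("/ip4/192.0.2.1/tcp/30333".to_string(), DialError::Timeout),
            ("/ip4/192.0.2.1/tcp/30334/ws".to_string(), DialError::Timeout),
            ("/ip6/2001:db8::1/tcp/30333".to_string(), DialError::Refused),
        ],
    };
    for (kind, count) in failures_by_kind(&event) {
        println!("{kind:?}: {count}");
    }
    // Events without dial failures contribute nothing.
    assert!(failures_by_kind(&Event::Other).is_empty());
}
```

Counts like these could feed a metric per error kind, while the addresses themselves would be the input for marking self-reported addresses as unreachable.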

### Next Steps
- Add more tests
- Warp sync and sync full nodes, since this touches individual
transports

### Future Work
- The overarching `litep2p::Error` needs a closer look and a
refactoring:
  - #204
  - #128
  
- ConnectionError event for individual transports can be simplified:
  - #205
  
- I've observed some inconsistencies in handling TCP vs WebSocket
connection timeouts. I believe that we can have another pass and share
even more code between them:
  - #70

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Sep 10, 2024
This release introduces several new features, improvements, and fixes to
the litep2p library. Key updates include enhanced error handling,
configurable connection limits, and a new API for managing public
addresses.

For a detailed set of changes, see [litep2p
changelog](https://github.com/paritytech/litep2p/blob/master/CHANGELOG.md#070---2024-09-05).

This PR makes use of:
- connection limits to optimize network throughput
- better errors that are propagated to substrate metrics 
- public addresses API to report healthy addresses to the Identify
protocol

### Warp sync time improvement

Measuring warp sync time is a bit inaccurate since the network is not
deterministic and we might end up using faster peers (peers with more
resources to handle our requests). However, I did not see warp sync
times of 16 minutes; instead, they have roughly stabilized between 8 and
10 minutes.

For measuring warp-sync time, I've used
[sub-triage-logs](https://github.com/lexnv/sub-triage-logs/?tab=readme-ov-file#warp-time).

### Litep2p

Phase | Time
 -|-
Warp  | 426.999999919s
State | 99.000000555s
Total | 526.000000474s

### Libp2p

Phase | Time
 -|-
Warp  | 731.999999837s
State | 71.000000882s
Total | 803.000000719s
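
To put the two totals side by side, the relative difference can be computed directly from the table values. This is a small sketch with helper names chosen for this example; since the figures above come from single runs on a non-deterministic network, the result is only indicative.

```rust
/// Parses a duration written as seconds with an `s` suffix, e.g. "526.000000474s".
fn parse_secs(text: &str) -> Option<f64> {
    text.trim().strip_suffix('s')?.parse().ok()
}

/// Percentage of time saved relative to `baseline`.
fn reduction_percent(baseline: f64, candidate: f64) -> f64 {
    (baseline - candidate) / baseline * 100.0
}

fn main() {
    // Totals copied from the Litep2p and Libp2p tables.
    let litep2p = parse_secs("526.000000474s").expect("valid duration");
    let libp2p = parse_secs("803.000000719s").expect("valid duration");

    println!("time saved: {:.1}%", reduction_percent(libp2p, litep2p));
    println!("speedup: {:.2}x", libp2p / litep2p);
}
```

For these totals the saving is roughly a third of the libp2p time, with the gain coming from the Warp phase; the State phase was slower in the litep2p run.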

Closes: #4986


### Low peer count

After exposing the `litep2p::public_addresses` interface, we can report
confirmed external addresses to litep2p. This should mitigate, or at
least improve, #4925.
I will keep the issue open to confirm this.


### Improved metrics

We are one step closer to exposing metrics similar to libp2p's:
#4681.

cc @paritytech/networking 

### Next Steps
- [x] Use public address interface to confirm addresses to identify
protocol

---------

Signed-off-by: Alexandru Vasile <[email protected]>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this issue Sep 11, 2024
lexnv added a commit that referenced this issue Nov 15, 2024