Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

networking/litep2p: Node running litep2p seems to be leaking memory #6149

Open
alexggh opened this issue Oct 21, 2024 · 5 comments
Open

networking/litep2p: Node running litep2p seems to be leaking memory #6149

alexggh opened this issue Oct 21, 2024 · 5 comments
Assignees

Comments

@alexggh
Copy link
Contributor

alexggh commented Oct 21, 2024

Looking over the dashboards on our kusama validators the memory on node that is running litep2p seems to be constantly increasing it is now at 12GiB, all other nodes are around 3-4 GiB and constant.

Screenshot 2024-10-21 at 12 01 07

https://grafana.teleport.parity.io/goto/Uoh4CQmHg?orgId=1

cc: @paritytech/networking

@lexnv lexnv self-assigned this Oct 21, 2024
@lexnv
Copy link
Contributor

lexnv commented Oct 21, 2024

Thanks Alex for raising this! 🙏

We also had this issue with addresses, although the memory consumed here is in the order of GiB (#5998).

There are a few places that come to mind where to look at next:

I would start by looking at litep2p and then move to substrate code

@lexnv lexnv added this to Networking Oct 21, 2024
@lexnv
Copy link
Contributor

lexnv commented Oct 21, 2024

Initial testing with debug logs built on top of branch lexnv/holistic-litep2p-test-dhtandpeerset shows memory leaks in litep2p:

  • transport manager peer state (from above)
  • TCP / WebSocket state tracking

Will let my node running over night and followup with patches

Screenshot 2024-10-21 at 19 32 21

@lexnv
Copy link
Contributor

lexnv commented Oct 22, 2024

We need metrics to filter out potential leaks (ie monotonically increasing state tracking is a concern).

We have around 3 separate leaks:

1. Transport Manager State Leak

Screenshot 2024-10-22 at 12 39 27
Screenshot 2024-10-22 at 12 38 16

2. TCP/WebSocket Pending Dials Leak

Screenshot 2024-10-22 at 12 40 57

Screenshot 2024-10-22 at 14 01 57

3. TCP/WebSocket Cancellation Logic Leak

Screenshot 2024-10-22 at 14 02 45
Screenshot 2024-10-22 at 14 03 38

Screenshot 2024-10-22 at 12 41 49

Screenshot 2024-10-22 at 12 43 50

@lexnv
Copy link
Contributor

lexnv commented Oct 22, 2024

Litep2p PRs

For more details and explained edge-cases when the leaks happen see:

@lexnv
Copy link
Contributor

lexnv commented Oct 23, 2024

lexnv added a commit to paritytech/litep2p that referenced this issue Oct 24, 2024
The identify protocol implementation leaked SubstreamIds and PeerIds via
the `pending_opens` hashMap.
Objects were only inserted in the `pending_opens`, however they were
never removed.

The only possible purpose of `pending_opens` is to double-check the
events coming from the service layer: `TransportEvent::SubstreamOpened`.
However this is not needed, as illustrated by the current
implementation.

Part of endeavors to fix memory leaks:
paritytech/polkadot-sdk#6149

### Testing Done

- custom patched litep2p to dump the internal state of identify protocol
running in kusama

cc @paritytech/networking

Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit to paritytech/litep2p that referenced this issue Oct 24, 2024
The purpose of the `pending_opens` field is to double check outbound
substream opens.
This was used to ensure that the substream ID was indeed opened to a
given peer ID.
However, this is not needed considering the `identify` implementation.

Further, the `pending_opens` was leaking `(SubstramId, PeerId)` tuples
in cases where the
substream opening would later fail. In other words, the implementation
did not remove the tuples on the `TransportEvent::SubstreamOpenFailure`
event.

Part of endeavors to fix memory leaks:
paritytech/polkadot-sdk#6149

### Testing Done
- custom patched litep2p to dump the internal state of identify protocol
running in kusama

The code is similar to the identify protocol. However, this leak was
more subtle and not of the magnitude of the `identify` protocol since
substream open failures are not that frequent:
 - #273
 
 cc @paritytech/networking

Signed-off-by: Alexandru Vasile <[email protected]>
0xbillw added a commit to CESSProject/polkadot-sdk that referenced this issue Dec 23, 2024
…stable2409` for a temporary apply to the downstream project.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

2 participants