Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Network connection leak #4679

Closed
arkpar opened this issue Jan 20, 2020 · 9 comments
Closed

Network connection leak #4679

arkpar opened this issue Jan 20, 2020 · 9 comments
Labels
I3-bug The node fails to follow expected behavior.
Milestone

Comments

@arkpar
Copy link
Member

arkpar commented Jan 20, 2020

Currently substrate is based on libp2p 0.14.0-alpha.1 which has a few issues:

  1. Network opens a lot of TCP connections. After running for a few hours on kusama the number of connections reported by lsof is over 2000.
  2. Each connection preallocates at least 176 kbytes of memory for buffers in the noise protocol.
    Which leads to memory usage in hundreds of megabytes.
  3. Each connection spawns a short-lived thread, which leads to a leak in sysinfo crate.

Each of these should be fixed as soon as possible. Substrate already has a limit for protocol connections. That limit must be enforced as a global limit for TCP connections, including Kademlia and all other protocols.

All buffers should be preallocated to some small size, like 1k. The rest should be allocated lazily.

@arkpar arkpar added the I3-bug The node fails to follow expected behavior. label Jan 20, 2020
@gnunicorn gnunicorn added this to the 2.0 milestone Jan 20, 2020
@twittner
Copy link
Contributor

  1. Network opens a lot of TCP connections. After running for a few hours on kusama the number of connections reported by lsof is over 2000.

If I am not mistaken most of those connections are not established yet, e.g.

> ls -l /proc/(pgrep polkadot)/fd | wc -l
1407
> lsof -n -p (pgrep polkadot) | rg ESTABLISHED | wc -l
185
> lsof -n -p (pgrep polkadot) | rg SYN_SENT | wc -l
1134
  1. Each connection spawns a short-lived thread, which leads to a leak in sysinfo crate.

These threads are actually not as short lived as one would hope as for unreachable destinations each runs until the maximum number of initial SYN retries has been reached (cf. man 7 tcp). E.g. on a Linux system:

tcp_syn_retries (integer; default: 5; since Linux 2.2)
    The  maximum  number  of times initial SYNs for an active TCP
    connection attempt will be retransmitted. This value should not
    be higher than 255.  The default value is 5, which corresponds
    to approximately 180 seconds.

@arkpar
Copy link
Member Author

arkpar commented Jan 20, 2020

If I am not mistaken most of those connections are not established yet, e.g.

This might be the case. However I've also put a counter of NoiseOuput instances that went to thousands after a few hours as well. It is not clear then why a connection that is not established yet requires preallocated buffers. And also why there are that many connection attempts in the first place.

These threads are actually not as short lived as one would hope as for unreachable destinations each runs until the maximum number of initial SYN retries has been reached (cf. man 7 tcp). E.g. on a Linux system:

tcp_syn_retries (integer; default: 5; since Linux 2.2)
    The  maximum  number  of times initial SYNs for an active TCP
    connection attempt will be retransmitted. This value should not
    be higher than 255.  The default value is 5, which corresponds
    to approximately 180 seconds.

By short-lived I mean any thread is not kept around for the lifetime of the application. Anyway that's enough to cause issues with sysinfo and generally waste system resources.

@twittner
Copy link
Contributor

If I am not mistaken most of those connections are not established yet, e.g.

This might be the case. However I've also put a counter of NoiseOuput instances that went to thousands after a few hours as well. It is not clear then why a connection that is not established yet requires preallocated buffers.

The buffers are not preallocated but require an established connection. I was using twittner/rust-libp2p@fdd5bb5 to count the number of NoiseOutput instances and it correlates strongly with the number of established connections, not the ones in SYN_SENT state.

And also why there are that many connection attempts in the first place.

I agree, this is puzzling. I suppose there are many more nodes discovered, but I have no answer to this yet.

@twittner
Copy link
Contributor

And also why there are that many connection attempts in the first place.

I agree, this is puzzling. I suppose there are many more nodes discovered, but I have no answer to this yet.

I think that during discovery we repeatedly attempt to connect to many unreachable addresses which is not too uncommon. Discovery is scheduled to run every 60s. When queries time out (after 60s), the background threads continue to run until std::net::TcpStream::connect completes, which is roughly after 3 minutes. Thus instead of just one we may end up with up to three connection attempts for every unreachable address at any given time.

@twittner
Copy link
Contributor

twittner commented Feb 5, 2020

paritytech/polkadot#810 switched to async-std's master branch which fixes the spawning of a background thread for TcpStream::connect. In the meantime this version has also been published as async-std v1.5.

@tomaka
Copy link
Contributor

tomaka commented Feb 5, 2020

Network opens a lot of TCP connections. After running for a few hours on kusama the number of connections reported by lsof is over 2000.

I've been debugging this today. It's going rather slowly, as I have to recompile Polkadot every single time.

I created a version of Polkadot where the Kademlia discovery mechanism is running at a quick rate (10 seconds instead of 60 seconds), but the results of discovery are not transmitted to the PSM.
The result is that we constantly stay between 7 and 9 TCP connections (one per bootnode, plus a few extra connections from time to time for Kademlia).

However, when the results of discovery are transmitted to the PSM but the substrate-specific protocol is prevented from opening new connections, the number of TCP connections explodes.

The first test confirms that Kademlia alone works properly and isn't responsible for this explosion.
The second test shows that the problem comes from connections that are maintained alive by the PSM, rather than new connections being open.

@tomaka
Copy link
Contributor

tomaka commented Feb 5, 2020

Another test where discovery is normal but the PSM is modified to reject all connections and never open new ones: connections explode.
This is weird, but is a good clue.

@tomaka
Copy link
Contributor

tomaka commented Feb 6, 2020

I'm still not sure how we reach 2000 connections, but if we increase the reputation penalty for not being responsive, the number of TCP connections is much closer to the number of peers we're connected with for Substrate.

const DISCONNECT_REPUTATION_CHANGE: i32 = -10;

For instance, with a value of -256, and 18 connected peers, I'm between 20 and 30 established TCP connections (including the one for the telemetry), and there's no explosion.

It's however in general difficult to reach peers (regardless of this issue) because many nodes are unresponsive.

EDIT: with 25 peers, around 34 to 44 connections. It's still a bit weird.

EDIT: 27 peers. 29 to 46 connections.

EDIT: 26 peers. 41 to 56 connections. It's really weird.
The difficulty of the diagnosis comes from the fact that once Kademlia opens a connection, then Substrate is free to use it for Substrate-related purposes (and vice-versa).
Additionally, the DHT is quite polluted by obsolete addresses and/or nodes of other chains, which makes lots of connection attempts fail.

EDIT: 35 peers. 45 to 62 connections.

gnunicorn added a commit that referenced this issue Apr 4, 2020
This PR refactors the metrics measuring and Prometheus exposing entity in sc-service into its own submodule and extends the parameters it exposes by:

- system load average (over one, five and 15min)
- the TCP connection state of the process (lsof), refs #5304
- number of tokio threads
- number of known forks
- counter for items in each unbounded queue (with internal unbounded channels)
- number of file descriptors opened by this process (*nix only at this point)
- number of system threads (*nix only at this point)

refs #4679

Co-authored-by: Max Inden <[email protected]>
Co-authored-by: Ashley <[email protected]>
@tomaka
Copy link
Contributor

tomaka commented Apr 9, 2020

The large number of connections has been fixed by #4889 and #5086, and after #5571 we now have Prometheus metrics that lets us track this, meaning that we should be able to notice if that happens again.

@tomaka tomaka closed this as completed Apr 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I3-bug The node fails to follow expected behavior.
Projects
None yet
Development

No branches or pull requests

4 participants