Network connection leak #4679
Comments
If I am not mistaken, most of those connections are not established yet, e.g.
> ls -l /proc/(pgrep polkadot)/fd | wc -l
1407
> lsof -n -p (pgrep polkadot) | rg ESTABLISHED | wc -l
185
> lsof -n -p (pgrep polkadot) | rg SYN_SENT | wc -l
1134
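The same breakdown by connection state can be approximated without lsof. The following is only a sketch, assuming a Linux host and text in the format of /proc/net/tcp, where the fourth column (`st`) holds the state as a hexadecimal code. The function names `state_name` and `count_tcp_states` are made up for illustration. Note that /proc/net/tcp lists every IPv4 socket in the network namespace, not just those of one process, so the numbers only match `lsof -p` when the node is the main network user on the machine.

```rust
use std::collections::BTreeMap;

/// Map the hexadecimal `st` column of /proc/net/tcp to a state name.
fn state_name(code: &str) -> &'static str {
    match code {
        "01" => "ESTABLISHED",
        "02" => "SYN_SENT",
        "03" => "SYN_RECV",
        "04" => "FIN_WAIT1",
        "05" => "FIN_WAIT2",
        "06" => "TIME_WAIT",
        "07" => "CLOSE",
        "08" => "CLOSE_WAIT",
        "09" => "LAST_ACK",
        "0A" => "LISTEN",
        "0B" => "CLOSING",
        _ => "UNKNOWN",
    }
}

/// Count sockets per TCP state in text formatted like /proc/net/tcp.
fn count_tcp_states(table: &str) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    // The first line is the column header; every other line is one socket.
    for line in table.lines().skip(1) {
        // Columns: sl, local_address, rem_address, st, ...
        if let Some(code) = line.split_whitespace().nth(3) {
            *counts.entry(state_name(code)).or_insert(0) += 1;
        }
    }
    counts
}
```

To use it, pass in the result of `std::fs::read_to_string("/proc/net/tcp")` and look up `"SYN_SENT"` or `"ESTABLISHED"` in the returned map.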
These threads are actually not as short-lived as one would hope: for unreachable destinations, each runs until the maximum number of initial SYN retries has been reached (cf. man 7 tcp). E.g. on a Linux system:
|
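For a rough sense of how long such a connection attempt can linger, here is a back-of-the-envelope sketch. It assumes the retransmission timeout starts at 1 second and doubles after every unanswered SYN, and it uses the default of 6 retries that current versions of man 7 tcp document for `tcp_syn_retries`. The function name is made up, and no upper cap on the timeout is modelled.

```rust
/// Seconds until a connect() to a silent host gives up, assuming the first
/// retransmission timeout is `initial_rto_secs` and that it doubles after
/// every unanswered SYN (no upper cap is modelled).
fn syn_give_up_secs(syn_retries: u32, initial_rto_secs: u64) -> u64 {
    // One wait after the initial SYN plus one wait after each retry.
    (0..=syn_retries).map(|i| initial_rto_secs << i).sum()
}
```

With 6 retries and a 1 second initial timeout this gives 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds, so under these assumptions an attempt to an unreachable address can outlive two 60 second discovery rounds.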
This might be the case. However I've also put a counter of
By short-lived I mean that no thread is kept around for the lifetime of the application. Anyway that's enough to cause issues with |
The buffers are not preallocated but require an established connection. I was using twittner/rust-libp2p@fdd5bb5 to count the number of
I agree, this is puzzling. I suppose there are many more nodes discovered, but I have no answer to this yet. |
I think that during discovery we repeatedly attempt to connect to many unreachable addresses which is not too uncommon. Discovery is scheduled to run every 60s. When queries time out (after 60s), the background threads continue to run until |
paritytech/polkadot#810 switched to async-std's master branch which fixes the spawning of a background thread for |
I've been debugging this today. It's going rather slowly, as I have to recompile Polkadot every single time. I created a version of Polkadot where the Kademlia discovery mechanism is running at a quick rate (10 seconds instead of 60 seconds), but the results of discovery are not transmitted to the PSM. However, when the results of discovery are transmitted to the PSM but the substrate-specific protocol is prevented from opening new connections, the number of TCP connections explodes. The first test confirms that Kademlia alone works properly and isn't responsible for this explosion. |
Another test where discovery is normal but the PSM is modified to reject all connections and never open new ones: connections explode. |
I'm still not sure how we reach 2000 connections, but if we increase the reputation penalty for not being responsive, the number of TCP connections is much closer to the number of peers we're connected with for Substrate. substrate/client/peerset/src/lib.rs Line 32 in 6e0a487
For instance, with a value of -256, and 18 connected peers, I'm between 20 and 30 established TCP connections (including the one for the telemetry), and there's no explosion. It's however in general difficult to reach peers (regardless of this issue) because many nodes are unresponsive.
EDIT: with 25 peers, around 34 to 44 connections. It's still a bit weird.
EDIT: 27 peers. 29 to 46 connections.
EDIT: 26 peers. 41 to 56 connections. It's really weird.
EDIT: 35 peers. 45 to 62 connections. |
This PR refactors the metrics measuring and Prometheus exposing entity in sc-service into its own submodule and extends the parameters it exposes by:
- system load average (over one, five and 15min)
- the TCP connection state of the process (lsof), refs #5304
- number of tokio threads
- number of known forks
- counter for items in each unbounded queue (with internal unbounded channels)
- number of file descriptors opened by this process (*nix only at this point)
- number of system threads (*nix only at this point)

refs #4679

Co-authored-by: Max Inden <[email protected]>
Co-authored-by: Ashley <[email protected]>
Currently substrate is based on libp2p 0.14.0-alpha.1, which has a few issues:
- […] lsof is over 2000, which leads to memory usage in the hundreds of megabytes.
- […] sysinfo crate.

Each of these should be fixed as soon as possible. Substrate already has a limit for protocol connections. That limit must be enforced as a global limit for TCP connections, including Kademlia and all other protocols.
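To make the idea of a global limit concrete, here is a minimal sketch using only the standard library. `ConnectionLimit` and `ConnectionPermit` are hypothetical names, not libp2p or Substrate types. The assumption is that every dialer and listener, whatever the protocol, would have to obtain a permit before opening a socket and hold it until the socket is closed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical process-wide cap on open connections
/// (not a libp2p or Substrate type).
struct ConnectionLimit {
    open: Arc<AtomicUsize>,
    max: usize,
}

/// Held for as long as a connection is open; dropping it frees the slot.
struct ConnectionPermit {
    open: Arc<AtomicUsize>,
}

impl ConnectionLimit {
    fn new(max: usize) -> Self {
        Self { open: Arc::new(AtomicUsize::new(0)), max }
    }

    /// Reserve a slot, or return `None` when the cap is already reached.
    fn try_acquire(&self) -> Option<ConnectionPermit> {
        let max = self.max;
        self.open
            // Atomically increment the counter only while it is below `max`.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| ConnectionPermit { open: Arc::clone(&self.open) })
    }

    /// Number of permits currently held.
    fn open_connections(&self) -> usize {
        self.open.load(Ordering::Acquire)
    }
}

impl Drop for ConnectionPermit {
    fn drop(&mut self) {
        // Release the slot when the connection goes away.
        self.open.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Tying the slot to a guard value means a failed or timed-out dial releases its slot as soon as the attempt is dropped, so pending SYN_SENT sockets would count against the same budget as established connections.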
All buffers should be preallocated to some small size, like 1k. The rest should be allocated lazily.
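As an illustration of that allocation policy only, the sketch below reserves 1 KiB up front and lets the buffer grow when a larger payload actually arrives. `ConnectionBuffer` and `INITIAL_BUFFER_BYTES` are hypothetical names; how libp2p sizes its buffers is not shown here.

```rust
/// Initial per-connection buffer size suggested in the text ("like 1k").
const INITIAL_BUFFER_BYTES: usize = 1024;

/// Hypothetical per-connection read buffer.
struct ConnectionBuffer {
    data: Vec<u8>,
}

impl ConnectionBuffer {
    /// Reserve only the small initial capacity.
    fn new() -> Self {
        Self { data: Vec::with_capacity(INITIAL_BUFFER_BYTES) }
    }

    /// Append bytes; `Vec` reallocates only when the payload outgrows
    /// the current capacity.
    fn push(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }
}
```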