bug: possible memory leak #1662
Comments
I just deployed latest master (statusteam/nim-waku:9e1432c9) in my minifleet with just relay enabled, in an attempt to pin down which protocol may be causing issues.
Can't reproduce the memory leak with the above setup: 20 peers with just relay and discovery enabled, 100 msg/second and 2 kB message size (actually more than the traffic we have in the fleet). Container memory usage is <50 MB over almost 1 day, so I would say the memory leak is not related to the relay protocol.
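For reference, a minimal sketch of how a container's memory usage can be sampled over a long run for this kind of check (the use of Docker and the container name nwaku are assumptions, not details of the setup above):

```python
# Minimal sketch: periodically sample a container's memory usage via
# `docker stats` to distinguish flat usage from unbounded growth.
# The container name and sampling interval are assumptions.
import subprocess
import time

CONTAINER = "nwaku"   # assumed container name
INTERVAL_S = 60       # sample once per minute

def mem_usage(container: str) -> str:
    """Return the MemUsage column from `docker stats --no-stream`."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    while True:
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), mem_usage(CONTAINER))
        time.sleep(INTERVAL_S)
```

One sample per minute over a day-long run is enough to tell a flat <50 MB profile like the one described above apart from unbounded growth.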
We've run overnight on wakuv2.test. I'm also running several other versions on a sandbox machine, but so far the leak is only observed on the fleets. The next step is to try to pin down the specific commit that caused the leak. Since we suspect that one of the submodule updates may be to blame, I'll deploy the commit just after the two submodule updates next.
@jm-clius I checked the current memory status. It seems the leak was already present in the v0.16.0 release: there is no noticeable difference between that release and later versions, all of which are currently using more than 1.2 GB of memory. Restarting ams3 with v0.15.0.
Thanks, @Ivansete-status! I'm starting HK with 43b3f1c, more or less halfway between the two releases, so we can narrow down the offending change.
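The halving strategy above is essentially a manual bisection; a minimal sketch of the equivalent git bisect bookkeeping follows (it assumes the two releases are available as tags v0.15.0 and v0.16.0, and each verdict still requires deploying the candidate commit and watching memory for hours):

```python
# Minimal sketch: drive a manual `git bisect` over the suspect commit range.
# Assumes the last known-good and first known-bad versions are available as
# the tags v0.15.0 and v0.16.0; each candidate must still be deployed and
# observed before it can be marked good or bad.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

git("bisect", "start")
git("bisect", "bad", "v0.16.0")
print(git("bisect", "good", "v0.15.0"))  # git suggests the next commit to test

# After deploying the suggested commit and observing memory for long enough:
#   git("bisect", "good")   # memory stayed flat
#   git("bisect", "bad")    # memory kept growing
# Repeat until git reports the first bad commit, then git("bisect", "reset").
```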
In 67db35e the HK node crashed at 1.06 GB |
After analysing the heaptrack reports, we identified the following points as possible memory-leak candidates:
Thanks for this! What is the meaning of the two values next to each other? Are those allocations and deallocations?
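As an aside, a minimal sketch of pulling the headline numbers out of a heaptrack text report (the summary labels below, such as "total memory leaked", are an assumption about the heaptrack_print output format, which can vary between versions):

```python
# Minimal sketch: grep the summary lines out of a `heaptrack_print` text report.
# The labels below are an assumption about the report format and may need
# adjusting for the heaptrack version in use.
import sys

SUMMARY_LABELS = (
    "calls to allocation functions",
    "peak heap memory consumption",
    "total memory leaked",
)

def summarize(report_path: str) -> None:
    with open(report_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if any(label in line for label in SUMMARY_LABELS):
                print(line.rstrip())

if __name__ == "__main__":
    summarize(sys.argv[1])  # e.g. a report saved with `heaptrack_print ... > report.txt`
```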
In 67db35e the HK node is at 1.29 GB. The memory leak is still present.
While we're investigating how these allocations could leak and whether it's due to nwaku or its submodules, I've done the following:
A quick summary of the findings:
Thanks for the summary, @alrevuelta! Regarding the correlation between the memory leak and the libp2p wss option: the image below shows that over the last two hours the US node has had wss disabled but still ~70 peers, and its memory usage is low:
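For a quick check without the dashboard, here is a minimal sketch of reading peer count and resident memory straight from a node's Prometheus metrics endpoint (the URL, port and metric names are assumptions and depend on the node's actual metrics configuration):

```python
# Minimal sketch: read peer count and resident memory from a node's Prometheus
# /metrics endpoint to eyeball the peers-vs-memory relationship.
# The endpoint URL and the metric names below are assumptions.
import re
import urllib.request

METRICS_URL = "http://localhost:8008/metrics"    # assumed metrics endpoint
PEERS_METRIC = "libp2p_peers"                    # assumed metric name
MEMORY_METRIC = "process_resident_memory_bytes"  # assumed metric name

def scrape(url: str) -> str:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8", errors="replace")

def metric_value(text: str, name: str) -> float | None:
    # Prometheus text format: `name{labels} value` or `name value`
    match = re.search(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)", text, re.M)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    body = scrape(METRICS_URL)
    print("peers:", metric_value(body, PEERS_METRIC))
    print("resident memory (bytes):", metric_value(body, MEMORY_METRIC))
```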
@Ivansete-status Ah, I see, my bad. I thought wss was enabled again in the star.
The next issue was reported in |
We now have an additional problem: many peers are connecting through wss.
(I checked that all connections with
Several items are being fixed here, but #1800 provides some memory improvement, coupled with better peer management techniques and gossipsub scoring. Monitoring memory usage.
Problem
A possible memory leak was introduced or activated in the wakuv2.test fleet, roughly between 30 March and 5 April.
Impact
Critical. Unbounded memory growth causes crashes and instability.
Screenshots/logs
nwaku version/commit hash
At least c26dcb2 and later
Additional context
This memory leak is not visible on status.test, which runs the same version of nwaku.