
Goroutine leak leads to OOM #6166

Closed
requilence opened this issue Apr 2, 2019 · 18 comments
Labels
kind/bug (A bug in existing code, including security flaws), topic/perf (Performance)

Comments

@requilence
Contributor

requilence commented Apr 2, 2019

Version information:

tried both 0.4.19 and latest master:

go-ipfs version: 0.4.20-dev-fd15c62
Repo version: 7
System version: amd64/darwin
Golang version: go1.11.4

Type:

bug

Description:

I created a fresh repo this morning. It was working well for some time, but now every time I run ipfs daemon I get a huge goroutine leak that leads to an OOM within a few minutes. I set HighWater = 60 and LowWater = 30 to make sure it doesn't depend on swarm size.
https://gist.github.com/requilence/8f81663a95bec7a4083e2600ff24aeda

I had the same problem a few days ago (and recreated the repo afterwards).

The list is far too long to check manually one by one. Maybe someone has an idea where this could be coming from?

@requilence
Contributor Author

requilence commented Apr 3, 2019

I have more details to share:
I added debug logging here:
https://github.com/ipfs/go-bitswap/blob/85e3f43f0b3b6859434b16a59c36bae6abf5d29e/peermanager/peermanager.go#L131

After 2 minutes of uptime I see:
PeerManager.getOrCreate(QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ) times = 3, len(pm.peerQueues) = 8299, len(uniquePeersMap) = 13437

I count the unique peers and how many times each one is seen this way:

	// uniquePeersMap (a map[peer.ID]int) and uniquePeersMapMutex (a sync.Mutex)
	// are package-level variables added for this instrumentation.
	uniquePeersMapMutex.Lock()
	times := uniquePeersMap[p] + 1
	uniquePeersMap[p] = times
	uniquePeersMapMutex.Unlock()

Please note that I have HighWater = 60 and LowWater = 30. Despite this, it connects to 8299 peers.

@requilence changed the title from "Goroutine leak" to "Goroutine leak leads to OOM" on Apr 3, 2019
@Stebalien
Member

@requilence could you create a dump as described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning? We have a tool called stackparse for exactly this.
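
(If you want to snapshot the dump repeatedly before the OOM hits, here is a minimal Go sketch that fetches it over the daemon's API, assuming the default API address 127.0.0.1:5001 and the net/http/pprof endpoint that the debug guide uses; curl against the same URL works just as well.)

	package main

	import (
		"io"
		"log"
		"net/http"
		"os"
	)

	func main() {
		// Assumed default API address; adjust if your Addresses.API differs.
		resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()

		out, err := os.Create("ipfs.stacks")
		if err != nil {
			log.Fatal(err)
		}
		defer out.Close()

		// Stream the full stack dump to disk; it can be tens of MB when leaking.
		if _, err := io.Copy(out, resp.Body); err != nil {
			log.Fatal(err)
		}
	}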

@Stebalien added the kind/bug label on Apr 3, 2019
@requilence
Contributor Author

requilence commented Apr 3, 2019

@Stebalien thanks. It was challenging to capture all of them before the OOM, as it gets worse and eats 3GB within a minute :-)
0.4.19.tar.gz

0.4.20@74d07eff35965a3f635d03aedaa43561c73679e2:
0.4.20.tar.gz

I have also added ipfs.stacks_grouped, captured with goroutine?debug=1, because the full stack dump is 64M.
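
(Side note for anyone staring at these dumps: the ?debug=1 output is already grouped into "N @ <addresses>" records, so a small script can rank the groups instead of reading 64M of stacks by hand. A rough Go sketch, assuming that grouped format and the ipfs.stacks_grouped filename from the attachment above:)

	package main

	import (
		"bufio"
		"fmt"
		"os"
		"sort"
		"strconv"
		"strings"
	)

	// group is one "N @ ..." record from a goroutine?debug=1 dump.
	type group struct {
		count int
		frame string // first symbolized frame, used as a label
	}

	func main() {
		f, err := os.Open("ipfs.stacks_grouped")
		if err != nil {
			panic(err)
		}
		defer f.Close()

		var groups []group
		sc := bufio.NewScanner(f)
		sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // header lines can be long
		for sc.Scan() {
			line := sc.Text()
			// Group headers look like "12345 @ 0x42ee4c 0x42ef5e ...".
			if fields := strings.Fields(line); len(fields) > 2 && fields[1] == "@" {
				if n, err := strconv.Atoi(fields[0]); err == nil {
					groups = append(groups, group{count: n})
					continue
				}
			}
			// Symbolized frames follow on "#"-prefixed lines; keep the first
			// one per group as a human-readable label.
			if strings.HasPrefix(line, "#") && len(groups) > 0 && groups[len(groups)-1].frame == "" {
				groups[len(groups)-1].frame = strings.TrimSpace(strings.TrimPrefix(line, "#"))
			}
		}
		if err := sc.Err(); err != nil {
			panic(err)
		}

		// Print the ten biggest goroutine groups first.
		sort.Slice(groups, func(i, j int) bool { return groups[i].count > groups[j].count })
		for i, g := range groups {
			if i == 10 {
				break
			}
			fmt.Printf("%8d  %s\n", g.count, g.frame)
		}
	}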

@Stebalien mentioned this issue on Apr 3, 2019
@Stebalien
Member

Could you post your config, minus your private keys? It looks like you're running a relay, which would explain all the peers.

Note: the connection manager tries to keep the number of connections within the target range, but it doesn't stop new connections from being created. That's what's killing your CPU (creating/removing connections). We definitely need better back-pressure; it looks like this is a bit of a runaway process.
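
(To make the trim-versus-block distinction concrete: a rough sketch of how the basic connection manager is wired up, against go-libp2p / go-libp2p-connmgr APIs of roughly this vintage, so import paths and signatures may differ in other releases. Nothing in it refuses a new connection; it only closes existing ones back down toward LowWater after HighWater has been exceeded and the grace period has passed.)

	package main

	import (
		"context"
		"log"
		"time"

		libp2p "github.com/libp2p/go-libp2p"
		connmgr "github.com/libp2p/go-libp2p-connmgr"
	)

	func main() {
		ctx := context.Background()

		// LowWater=30, HighWater=60 as in this report, with a short grace period.
		// The manager trims connections down to LowWater some time after the count
		// passes HighWater; it does not block new inbound or outbound connections.
		cm := connmgr.NewConnManager(30, 60, 20*time.Second)

		h, err := libp2p.New(ctx, libp2p.ConnectionManager(cm))
		if err != nil {
			log.Fatal(err)
		}
		defer h.Close()

		// Watch the live connection count overshoot HighWater between trims.
		for range time.Tick(5 * time.Second) {
			log.Printf("open connections: %d", len(h.Network().Conns()))
		}
	}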

@Stebalien added the topic/perf label on Apr 4, 2019
@requilence
Contributor Author

requilence commented Apr 4, 2019

@Stebalien
You are right, I have EnableRelayHop = true and EnableAutoRelay = true
https://gist.github.com/requilence/0d713de5a8e52d666830b696a10b6264

That's what's killing your CPU

Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know, a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should take around 390MB. Where could the rest come from?

@Stebalien
Member

You are right, I have EnableRelayHop = true

EnableAutoRelay is fine; it's EnableRelayHop that's causing everyone to use you as a relay.

Actually, the main problem is that it eats 3GB of RAM while the heap profile only shows about 500MB. As far as I know, a goroutine is pretty cheap (about 2KB of stack), so 200k goroutines should take around 390MB. Where could the rest come from?

It could be allocation velocity (#5530). Basically, we're allocating and deallocating really fast, so Go reserves a bunch of memory it thinks it might need. That's my best guess.
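
(One way to see that gap from inside a Go process is to compare what the heap actually holds with what the runtime has reserved from the OS. A generic sketch using only the standard library, not specific to go-ipfs:)

	package main

	import (
		"fmt"
		"runtime"
	)

	func main() {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)

		// HeapInuse is what live heap spans currently occupy; Sys is everything the
		// runtime has obtained from the OS (heap, goroutine stacks, GC metadata, ...).
		// A big gap between the two matches the "3GB resident vs ~500MB heap" picture
		// described above; StackSys shows how much of it is goroutine stacks.
		fmt.Printf("goroutines:    %d\n", runtime.NumGoroutine())
		fmt.Printf("heap in use:   %d MiB\n", m.HeapInuse/1024/1024)
		fmt.Printf("heap idle:     %d MiB\n", m.HeapIdle/1024/1024)
		fmt.Printf("stacks:        %d MiB\n", m.StackSys/1024/1024)
		fmt.Printf("total from OS: %d MiB\n", m.Sys/1024/1024)
	}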

@requilence
Contributor Author

EnableRelayHop that's causing everyone to use you as a relay.

It was intentional. I guess that after the EnableAutoRelay option was introduced, demand for relays increased dramatically while the supply of relays is still very thin, so this imbalance is the core reason.

@Stebalien
Member

Likely, yes. Basically, this is a combination of two issues:

  1. You're acting as a relay, so many nodes are trying to use you to talk to other peers.
  2. You have a very low connection limit, so you're rapidly killing these connections.

Ideally, the connection manager and relay would actually talk to each other and the relay would stop accepting new connections at some point... (libp2p/go-libp2p-circuit#65).

@Stebalien
Member

@requilence, has disabling relay helped?

@vyzo
Contributor

vyzo commented Apr 6, 2019

If you want to enable relay hop, you will need to set limits in the connection manager.
Otherwise you will quickly be inundated with connections (our relays currently have 40k-50k active connections), which will lead to OOMs.

@vyzo
Contributor

vyzo commented Apr 8, 2019

See also libp2p/go-libp2p-circuit#69.
We've identified the biggest culprit in relay memory usage, and this should make it much better.

@requilence
Contributor Author

@Stebalien disabling relay doesn't help, probably because I have already advertised my peer as a relay through the DHT and it needs some time to expire.

@requilence
Contributor Author

@vyzo sounds cool, I will try this patch on the leaking setup and come back here with the results.

@vyzo
Contributor

vyzo commented Apr 9, 2019

We have identified the goroutine buildup culprit as identify. There is a series of patches that should fix the issues:

@Stebalien
Member

@requilence could you try the latest master?

@leerspace
Contributor

I think I'm hitting an issue similar to this one, where at some point connection counts start climbing rapidly past the default HighWater threshold, but I don't have the exact same configuration: while I have EnableAutoRelay = true, I have EnableRelayHop = false; I also have QUIC enabled.

Should I create a separate issue, or would it be worth uploading the debug logs (e.g., heap dump, stacks, config, ipfs swarm peers snapshots, etc.) here?

@Stebalien
Member

@leerspace please file a new issue. Also, try disabling the DHT with --routing=dhtclient (your node may now be dialable where it wasn't before).

@Stebalien
Member

I'm going to close this issue as "solved" for now. If that's not the case, please yell and I'll reopen it.
