Potential leak of message queues. #341
I tried running one of the benchmarks on my local machine, where 3 peers fetch 1000 blocks from each other. Grepping the logs for wants / cancels sent by the message queue, it seems like it's sending about the right number of cancels:
We may need to do another custom build that outputs these logs on the staging server to see if the numbers add up there.
I'm concerned that we might be leaking the queues themselves. But maybe not.
That could happen if there are more Connect events than Disconnect events per peer that emanate from libp2p. Is that possible?
That shouldn't be the case. Guarantees:
We can pull more stats and see if this has changed.
This is how we handle Connected and Disconnected in PeerManager: go-bitswap/internal/peermanager/peermanager.go, lines 88-133 at 38114a6.
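For readers without the repo open, here's a minimal sketch of the reference-counting pattern those lines implement (types and names simplified and hypothetical; see the cited lines for the real code):

```go
package peermanager

import (
	"sync"

	peer "github.com/libp2p/go-libp2p-core/peer"
)

// messageQueue stands in for go-bitswap's per-peer MessageQueue
// (a stub for this sketch).
type messageQueue struct{}

func newMessageQueue(p peer.ID) *messageQueue { return &messageQueue{} }
func (mq *messageQueue) Startup()             {}
func (mq *messageQueue) Shutdown()            {}

type peerQueue struct {
	refcnt int
	mq     *messageQueue
}

// PeerManager keeps one message queue per connected peer.
type PeerManager struct {
	lk         sync.Mutex
	peerQueues map[peer.ID]*peerQueue
}

// Connected bumps the per-peer refcount, creating and starting the
// queue on the first connection.
func (pm *PeerManager) Connected(p peer.ID) {
	pm.lk.Lock()
	defer pm.lk.Unlock()

	pq, ok := pm.peerQueues[p]
	if !ok {
		pq = &peerQueue{mq: newMessageQueue(p)}
		pq.mq.Startup()
		pm.peerQueues[p] = pq
	}
	pq.refcnt++
}

// Disconnected drops the refcount and tears the queue down when the
// last connection goes away. If Connect events ever outnumber
// Disconnect events, refcnt never reaches zero and the queue leaks.
func (pm *PeerManager) Disconnected(p peer.ID) {
	pm.lk.Lock()
	defer pm.lk.Unlock()

	pq, ok := pm.peerQueues[p]
	if !ok {
		return
	}
	pq.refcnt--
	if pq.refcnt > 0 {
		return
	}
	pq.mq.Shutdown()
	delete(pm.peerQueues, p)
}
```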
When we send cancels we first check to make sure we previously sent a want to the peer: go-bitswap/internal/peermanager/peermanager.go, lines 164-174 at 38114a6, and go-bitswap/internal/peermanager/peerwantmanager.go, lines 146-177 at 38114a6.
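A hedged sketch of that guard (simplified, with hypothetical field names; the real logic is at the cited lines): a cancel is only queued to a peer whose sent-want sets still contain the CID, and the CID is cleared from those sets at the same time, so each want should yield at most one cancel per peer.

```go
package peermanager

import (
	cid "github.com/ipfs/go-cid"
	peer "github.com/libp2p/go-libp2p-core/peer"
)

// peerWants tracks the wants we believe are outstanding at one peer
// (hypothetical shape for this sketch).
type peerWants struct {
	wantBlocks *cid.Set
	wantHaves  *cid.Set
}

// cancelSender stands in for the per-peer message queue.
type cancelSender interface {
	AddCancels([]cid.Cid)
}

type peerWantManager struct {
	peerWants  map[peer.ID]*peerWants
	peerQueues map[peer.ID]cancelSender
}

// sendCancels queues a cancel only to peers we previously sent the
// want to, removing the CID from their sent sets as it goes.
func (pwm *peerWantManager) sendCancels(cancels []cid.Cid) {
	for p, pws := range pwm.peerWants {
		var toCancel []cid.Cid
		for _, c := range cancels {
			if !pws.wantBlocks.Has(c) && !pws.wantHaves.Has(c) {
				continue // we never sent this want to p
			}
			pws.wantBlocks.Remove(c)
			pws.wantHaves.Remove(c)
			toCancel = append(toCancel, c)
		}
		if len(toCancel) > 0 {
			pwm.peerQueues[p].AddCancels(toCancel)
		}
	}
}
```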
Definitely a leak. We're now up to ~1GiB of CIDs for cancels held in memory. I'm not seeing anything close for wants etc.
Ok, forcibly disconnecting all peers has fixed the issue, so we're clearly not leaking entire queues. However, we're still collecting cancels we should be removing.
@dirkmc has narrowed this down. It looks like we're backed up trying to send cancels to peers that aren't actually accepting our streams. That means we're:
Repeatedly... Then, in the send loop, we're really slow about actually aborting and giving up.
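To illustrate the kind of fix being suggested, here's a hedged sketch of a bounded send loop; the interface, retry count, and timeout are assumptions for the sketch, not go-bitswap's actual code:

```go
package peermanager

import (
	"context"
	"time"
)

// streamSender abstracts writing one message to the peer's stream
// (an assumption standing in for go-bitswap's message sender).
type streamSender interface {
	SendMsg(ctx context.Context, msg []byte) error
}

// Assumed knobs, not go-bitswap's actual values.
const (
	maxRetries  = 3
	sendTimeout = 10 * time.Second
)

type sendQueue struct {
	sender streamSender
}

// trySend bounds each attempt with a timeout and gives up after a few
// failures, instead of letting queued cancels pile up behind a stream
// the peer is never going to accept.
func (q *sendQueue) trySend(ctx context.Context, msg []byte) error {
	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		sctx, cancel := context.WithTimeout(ctx, sendTimeout)
		err = q.sender.SendMsg(sctx, msg)
		cancel()
		if err == nil {
			return nil
		}
	}
	// Peer isn't accepting our streams: return the error and let the
	// caller tear the queue down rather than holding CIDs forever.
	return err
}
```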
We could just be really busy sending cancels? But it really looks like we're leaking something somewhere. Assuming each CID takes up at most 100 bytes (should take ~50 at most):
With 5K peers, that's 500 wants per peer. That's a lot.
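A quick sanity check on that arithmetic, using the figures from this comment (a standalone snippet, not project code):

```go
package main

import "fmt"

// Back-of-the-envelope check: 5K peers with 500 pending wants each,
// at most 100 bytes per CID (figures taken from the comment above).
func main() {
	const (
		peers       = 5000
		wantsPer    = 500
		bytesPerCID = 100 // upper bound; ~50 is more realistic
	)
	cids := peers * wantsPer
	fmt.Printf("%d CIDs, ~%d MiB held\n", cids, cids*bytesPerCID/(1<<20))
	// Prints: 2500000 CIDs, ~238 MiB held
}
```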
PPROF: