Libp2p stop never resolves when shutting down node #6053
Comments
@nflaig Do you know if there was some special configuration passed to the node when this behavior triggered?
I don't think it is related to a specific configuration, it really just happens if the node has been running for a longer while. I had the bn running for 15 minutes now and didn't see the issue, but on two consecutive runs with an uptime of 12+ hours I did. I can't tell what the minimum amount of time the bn has to run is, or if it happens consistently.

My command:

```sh
./lodestar beacon \
  --dataDir /home/devops/goerli/data/beacon \
  --rest \
  --rest.namespace '*' \
  --rest.address "0.0.0.0" \
  --metrics \
  --execution.urls http://localhost:8551 \
  --jwt-secret /home/devops/goerli/data/jwtsecret \
  --logLevel info \
  --network goerli \
  --checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/" \
  --subscribeAllSubnets
```
Just to highlight the difference between this issue and #5642 and #5775: what happens now is that libp2p stop never resolves, so we basically have a hanging promise. The other two issues are related to an active handle that is left behind by libp2p after stop has already resolved. The active handle causes issues shutting down the main process, and since we switched to a worker, it prevents that from being terminated. Let me know if you need more details. As you can see from those issues, I spent quite some time debugging this as well.
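To make the distinction concrete, here is a minimal hypothetical sketch (unrelated to libp2p internals; all names are made up for illustration) of the two failure modes: a promise that never settles versus a resolved promise that leaves an active handle behind.

```typescript
// 1) Hanging promise (this issue): the executor never calls resolve/reject,
//    so anything awaiting it is stuck forever.
const hangingStop: Promise<void> = new Promise(() => {});

// 2) Active handle (#5642 / #5775): stop() resolves fine, but a leftover
//    timer/socket keeps the event loop alive so the process never exits.
const leftoverHandle = setInterval(() => {}, 60_000);
const resolvedStop: Promise<void> = Promise.resolve();

// A hanging promise loses every race against anything that settles:
Promise.race([
  hangingStop.then(() => "stop resolved"),
  new Promise<string>((resolve) => setImmediate(() => resolve("still pending"))),
]).then((winner) => console.log(winner)); // prints "still pending"

// Clearing the handle lets the process exit; a hanging promise offers no
// such escape hatch, which is why this issue is nastier than the other two.
clearInterval(leftoverHandle);
resolvedStop.then(() => console.log("stop resolved, handle cleared"));
```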
@achingbrain any ideas?
Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause? |
Maybe wrap
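One way to read the truncated suggestion above is to race the stop call against a timer so shutdown can proceed even if the underlying promise never settles. A minimal sketch, where `stopWithTimeout` is a hypothetical helper and not Lodestar's actual code:

```typescript
// Hypothetical helper: race stop() against a timer so shutdown can proceed
// even if the underlying promise never settles.
async function stopWithTimeout(stop: () => Promise<void>, ms: number): Promise<"stopped" | "timed out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed out">((resolve) => {
    timer = setTimeout(() => resolve("timed out"), ms);
  });
  try {
    return await Promise.race([stop().then(() => "stopped" as const), timeout]);
  } finally {
    // don't leave an active handle behind ourselves (see #5642 / #5775)
    if (timer !== undefined) clearTimeout(timer);
  }
}

// A stop() that never settles simulates the hang described in this issue.
const neverResolves = (): Promise<void> => new Promise(() => {});
stopWithTimeout(neverResolves, 100).then((result) => console.log(result)); // prints "timed out"
```

The trade-off shows up later in this thread: with a timeout, shutdown always completes but can mask the root cause; without one, a single hanging component blocks the whole process.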
Nothing specific, but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.
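If why-is-node-running is not an option, recent Node versions also expose a built-in (experimental) peek at what keeps the event loop alive. A sketch assuming Node >= 17.3; the cast is only there in case an older @types/node does not declare the API:

```typescript
// Create a handle that would keep the process alive, then inspect it.
const handle = setInterval(() => {}, 60_000);

// Experimental stdlib API (Node >= 17.3).
const active = (process as unknown as { getActiveResourcesInfo: () => string[] }).getActiveResourcesInfo();
console.log(active.includes("Timeout")); // the interval shows up as a "Timeout" resource

clearInterval(handle); // once cleared, the handle no longer blocks exit
```

Unlike why-is-node-running this only names the resource types, not the stack traces that created them, so it is a first triage step rather than a full diagnosis.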
It's not related to active handles.
The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?
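Since stop just iterates over components, one way to find the culprit is to time each component's stop and log the ones that take too long. A hypothetical diagnostic sketch (`Startable` and `stopAll` are made-up names, not libp2p's API):

```typescript
// Minimal shape of a stoppable component, for illustration only.
interface Startable {
  stop(): Promise<void> | void;
}

// Stop components one by one, warning about any that hang past a threshold,
// and return the names that actually finished stopping.
async function stopAll(components: Record<string, Startable>, warnAfterMs: number): Promise<string[]> {
  const stopped: string[] = [];
  for (const [name, component] of Object.entries(components)) {
    const warn = setTimeout(() => console.log(`still stopping after ${warnAfterMs}ms: ${name}`), warnAfterMs);
    await component.stop();
    clearTimeout(warn);
    stopped.push(name);
  }
  return stopped;
}

// Example: both components stop quickly, so no warning fires.
stopAll(
  {
    tcp: { stop: () => new Promise<void>((resolve) => setTimeout(resolve, 10)) },
    pubsub: { stop: () => {} },
  },
  5000
).then((stopped) => console.log(stopped.join(", "))); // prints "tcp, pubsub"
```

With a hanging component, the last "still stopping" line in the logs would point directly at it.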
The issue is not that easy to reproduce; right now I only have the logs mentioned here: #6053 (comment). I will run Lodestar with
@achingbrain I caught some DEBUG logs (libp2p-debug.log); I didn't see any. You can ignore the
It's not clear from the logs when shutdown is started (this line is not present), but it looks like it's still accepting incoming connections, so perhaps the tcp socket isn't being closed properly now. Does it happen with

Or could one of these lines be throwing and the error being swallowed?
I was capturing the logs in tmux, which seems to only capture the last 2k lines...
It has happened since #6015, which bumped
No, those are really harmless and just remove listeners. Code has not been modified there recently
I can try pinning
I haven't seen the issue so far when downgrading @libp2p/tcp to 8.0.7. It's been only 3 days so I can't tell for sure, but it seems highly probable that the issue was introduced in @libp2p/tcp v8.0.8.
@achingbrain it's been a week now and I haven't seen the issue, whereas before it would happen at least once a day on a long running beacon node process. If libp2p/js-libp2p#2058 was the only change which got included in v8.0.8 I'm pretty confident that it is the cause of the problem. |
The release notes for

The change in that PR was around TCP shutdown and, since lodestar is now failing to shut down, it seems like a thread worth pulling on. @nazarhussain @wemeetagain @maschad, can you please take a look and see if the changes in that PR have introduced this issue?
@achingbrain is libp2p/js-libp2p#2421 expected to fix this issue? I can still see timeouts due to

This issue is quite easy to reproduce now since we bumped our peer count to 100, as it became much more frequent. Was thinking maybe 5s is no longer sufficient for the timeout, but removing it (unstable...nflaig/remove-libp2p-stop-timeout) introduces this issue again where the process is just stuck, as the promise never resolves.
Issue currently blocked until we can get js-libp2p 2.0 back in without issues. |
Describe the bug
There seems to be an issue since we have updated libp2p in #6015. In some cases, the beacon node never shuts down and just hangs indefinitely. This does not happen all the time but if the node is running for a longer time it seems to happen quite frequently.
Based on the logs it is clear that the issue is the following call, which just never resolves (lodestar/packages/beacon-node/src/network/core/networkCore.ts, line 271 at c50db8f):

```ts
await this.libp2p.stop();
```
See libp2p-shutdown-issue-full.log
This was also brought up by Sea Monkey on discord, corresponding logs sea-monkey-shutdown-issue.log
Expected behavior
Libp2p stop should always resolve in a timely manner.
Steps to reproduce
Run beacon node on any network
Additional context
Operating system
Linux
Lodestar version or commit hash
unstable (ab2dfdd)