fix: ensure all listeners are properly closed on tcp shutdown #2058
Conversation
Force-pushed from 5c361be to 20b7929
Thanks for opening this. Can you please add some tests to demonstrate the problem this fixes and to ensure there are no regressions in the future.
@achingbrain Yes, definitely. I opened it to get initial feedback from team members and will add tests to make sure that the edge case is covered.
The approach LGTM
L160-161 is the important change. Before, this listener emitted the 'close' event when it hit its connection limit. If libp2p is closed at that point, the listener is never cleaned up.
Now, thanks to differentiating between a paused and a closed listener status, the 'close' event is only emitted when the listener is actually closed, and the listener is always cleaned up.
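For context, a minimal sketch of that idea. The class, status values, and method names below are illustrative assumptions, not the actual @libp2p/tcp listener code:

```ts
// Illustrative sketch: track whether the listener is merely paused
// (connection limit hit) or actually closed, and only emit 'close' in the
// latter case so the transport manager cleans the listener up exactly once.
import { EventEmitter } from 'node:events'

type ListenerStatus = 'started' | 'paused' | 'closed'

class ListenerSketch extends EventEmitter {
  private status: ListenerStatus = 'started'

  // hit the connection limit: stop accepting connections, but do NOT emit
  // 'close', so the transport manager keeps tracking this listener
  pause (): void {
    if (this.status === 'started') {
      this.status = 'paused'
    }
  }

  // resume listening once the connection count drops below the limit again
  resume (): void {
    if (this.status === 'paused') {
      this.status = 'started'
    }
  }

  // real shutdown: now it is safe to emit 'close' so the transport manager
  // removes the listener
  close (): void {
    if (this.status !== 'closed') {
      this.status = 'closed'
      this.emit('close')
    }
  }
}
```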
this.metrics?.status.update({
  [this.addr]: SERVER_STATUS_DOWN
})
Can you move L160 and L161 after this metrics call? Or else add a metric to know when the listener gets paused / unpaused.
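If moving those lines isn't desirable, a paused-state metric could look roughly like the following. SERVER_STATUS_PAUSED and the metric shape are hypothetical, modelled on the snippet above rather than taken from the codebase:

```ts
// Hypothetical sketch: report a distinct metric value when the listener is
// only paused, so dashboards can tell "paused" apart from "down".
const SERVER_STATUS_DOWN = 0
const SERVER_STATUS_PAUSED = 2 // assumed value, not from the real code

interface StatusMetric {
  update (values: Record<string, number>): void
}

class ListenerMetricsSketch {
  constructor (private readonly addr: string, private readonly status?: StatusMetric) {}

  onPause (): void {
    this.status?.update({ [this.addr]: SERVER_STATUS_PAUSED })
  }

  onClose (): void {
    this.status?.update({ [this.addr]: SERVER_STATUS_DOWN })
  }
}
```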
Force-pushed from 41cf61d to 631123f
@achingbrain Can we cut a release with this fix? This is causing some chaos in our CI pipeline.
@nazarhussain this test has failed quite a few times, is it passing for you locally?
Some weird behavior.
I assume there is some setup in another file which is not cleaned up properly, so I need to figure out which code is the culprit for that behavior.
These are the open handles when the tests get stuck; none of them are related to the code change in this PR. Trying to debug further.
@maschad Please trigger the workflow, I hope everything works fine now. The transport manager
That is strange, now everything is passing for me locally.
@wemeetagain @maschad Would you try running the tests locally on Linux if you have it? For me on Mac it passes all the time.
I ran this locally:
~/Code/js-libp2p/packages/libp2p$ npx aegir test -t node -f dist/test/circuit-relay/relay.node.js
build
> [email protected] build
> aegir build
[23:44:40] tsc [started]
[23:44:44] tsc [completed]
[23:44:44] esbuild [started]
[23:44:44] esbuild [completed]
test node.js
circuit-relay
flows with 1 listener
✔ should ask if node supports hop on protocol change (relay protocol) and add to listen multiaddrs (368ms)
✔ should only add discovered relays relayed addresses (382ms)
1) should not listen on a relayed address after we disconnect from peer
2 passing (1m)
1 failing
1) circuit-relay
flows with 1 listener
should not listen on a relayed address after we disconnect from peer:
Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/cayman/Code/js-libp2p/packages/libp2p/dist/test/circuit-relay/relay.node.js)
at listOnTimeout (node:internal/timers:573:17)
at processTimers (node:internal/timers:514:7)
Update: Verified that it's an issue between Mac and Linux; the same test passes on Mac but not on Linux for the same version of Node. Debugging it further.
That is really strange, all tests are passing (except for one, due to the VPS network) for me on the Linux VPS.
LGTM
LGTM, left some minor suggestions. Thanks again for these improvements!
Awesome, can we get this cut in a release? This bug is currently breaking some Lodestar CI tests.
@maschad @wemeetagain Thanks for reviewing and merging this PR.
@wemeetagain @nazarhussain master has been released
🙏 |
The listener implements logic for the server to operate in multiple states. When it hits its connection limit it stops, and later starts again; we can consider these states as pause and resume.
But due to the way TransportManager is implemented, as soon as the listener stops it is removed from the transport manager, and if at the same time the connection count drops back under the limit it starts listening again. Because the listener was already removed from the TransportManager, it is never closed when we close the libp2p instance.
This behavior can be triggered whenever we close two libp2p instances which are connected to each other with a connection limit of 1. As soon as the first instance disconnects, the second instance's connection count reaches 0, the listener is closed and the transport manager removes it; but because 0 is also the lower limit at which listening starts again, the listener starts listening again.
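For illustration only, here is a rough sketch of that scenario. The option names (closeServerOnMaxConnections, its listenBelow/closeAbove thresholds, and the rest of the libp2p config) are assumptions based on js-libp2p of that era and may differ between versions:

```ts
// Rough repro sketch (assumed APIs, not taken from this PR's tests):
// two nodes whose TCP listeners pause once they hold a single connection
// and resume once the connection count drops back to 0.
import { createLibp2p } from 'libp2p'
import { tcp } from '@libp2p/tcp'
import { noise } from '@chainsafe/libp2p-noise'
import { yamux } from '@chainsafe/libp2p-yamux'

async function createNode () {
  return createLibp2p({
    addresses: { listen: ['/ip4/127.0.0.1/tcp/0'] },
    transports: [tcp({
      // assumed option: pause the server at 1 connection, resume below 1
      closeServerOnMaxConnections: { closeAbove: 1, listenBelow: 1 }
    })],
    connectionEncryption: [noise()],
    streamMuxers: [yamux()]
  })
}

const nodeA = await createNode()
const nodeB = await createNode()

// connect the two nodes: each now holds one connection, so the TCP
// listeners pause (the servers stop accepting new connections)
await nodeA.dial(nodeB.getMultiaddrs()[0])

// stopping nodeA disconnects nodeB, whose connection count drops to 0 and
// whose listener resumes; before this fix the listener had already been
// removed from the TransportManager, so the following stop() never closed
// the resumed TCP server
await nodeA.stop()
await nodeB.stop()
```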
Resolves #440