fix: ensure all listeners are properly closed on tcp shutdown #2058

Merged
merged 14 commits into from
Oct 1, 2023

Conversation

nazarhussain
Contributor

@nazarhussain nazarhussain commented Sep 18, 2023

The listener implements logic that lets the server operate in multiple states. When connection limits are hit, it stops and later starts again. We can consider these states as pause and resume.

But due to the way TransportManager is implemented, as soon as the listener is stopped it is removed from the transport manager, and if the listener's connection-limit state changes in the meantime, it starts listening again.

Because the listener was already removed from the TransportManager, it is never closed when we close the libp2p instance.

This behavior can be triggered every time we close two libp2p instances that are connected to each other with a connection limit of 1. As soon as the first instance disconnects, the second instance's connection count reaches 0, so it closes the listener and the transport manager removes the listener; but because 0 is also the lower limit at which listening resumes, the listener starts listening again.
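
The sequence described above can be reproduced in a small, self-contained sketch. `ToyListener` and `ToyRegistry` are hypothetical stand-ins invented for illustration; they are not the real `@libp2p/tcp` listener or the libp2p TransportManager, and they only model the bookkeeping, not real sockets.

```typescript
type CloseHandler = () => void

// Hypothetical stand-in for a listener that pauses at a connection limit.
class ToyListener {
  listening = false
  private closeHandlers: CloseHandler[] = []

  onClose (handler: CloseHandler): void {
    this.closeHandlers.push(handler)
  }

  listen (): void {
    this.listening = true
  }

  // The problematic behavior: a pause is reported exactly like a real close.
  pauseAsClose (): void {
    this.listening = false
    this.notifyClose()
  }

  close (): void {
    this.listening = false
    this.notifyClose()
  }

  private notifyClose (): void {
    for (const handler of this.closeHandlers) handler()
  }
}

// Hypothetical stand-in for the transport manager's bookkeeping.
class ToyRegistry {
  private listeners = new Set<ToyListener>()

  add (listener: ToyListener): void {
    this.listeners.add(listener)
    // Forget the listener as soon as it reports a close.
    listener.onClose(() => { this.listeners.delete(listener) })
  }

  closeAll (): void {
    // Iterate over a copy because close() mutates the set.
    for (const listener of [...this.listeners]) listener.close()
  }
}

const registry = new ToyRegistry()
const leaked = new ToyListener()
registry.add(leaked)
leaked.listen()

leaked.pauseAsClose() // limit reached: the registry drops the listener
leaked.listen()       // count falls again: the listener resumes on its own
registry.closeAll()   // shutdown: the registry no longer knows about it

console.log(leaked.listening) // true: still listening after shutdown
```

In this toy model the leak disappears if pausing does not notify the close handlers, which is the direction the PR takes.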

Resolves #440

@nazarhussain nazarhussain requested a review from a team as a code owner September 18, 2023 11:35
@nazarhussain nazarhussain changed the title from "Fix the bug of listener intermediary state" to "fix: handle the listener intermediary state" Sep 18, 2023
@achingbrain
Member

Thanks for opening this.

Can you please add some tests to demonstrate the problem this fixes and to ensure there are no regressions in future?

@nazarhussain
Contributor Author

@achingbrain Yes, definitely. I opened it to get initial feedback from team members and will add tests to make sure that the edge case is covered.

Member

@wemeetagain wemeetagain left a comment

The approach LGTM

L160-161 is the important change. Before, this listener emitted the 'close' event when it hit its connection limit. If libp2p is closed at that point, the listener is never cleaned up.

Now, thanks to differentiating paused vs closed listener status, we emit the 'close' event only when the listener is actually closed, so the listener will always be cleaned up.
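
A minimal sketch of the paused vs closed distinction, assuming four status names and a counter in place of a real event emitter. The class, the status names and the `closeEvents` counter are hypothetical and chosen for illustration; the actual listener in `packages/transport-tcp/src/listener.ts` may be structured differently.

```typescript
// Hypothetical status names; the real listener's types may differ.
type Status = 'inactive' | 'active' | 'paused' | 'closed'

class PausableListener {
  status: Status = 'inactive'
  // Stands in for emitting a 'close' event.
  closeEvents = 0

  listen (): void {
    if (this.status === 'closed') throw new Error('listener is closed')
    this.status = 'active'
  }

  // Connection limit reached: stop accepting, but report nothing.
  pause (): void {
    if (this.status === 'active') this.status = 'paused'
  }

  // Connection count dropped: accept connections again.
  resume (): void {
    if (this.status === 'paused') this.status = 'active'
  }

  // Only a real close is terminal, and only it reports 'close'.
  close (): void {
    if (this.status === 'closed') return
    this.status = 'closed'
    this.closeEvents++
  }
}

const listener = new PausableListener()
listener.listen()
listener.pause()
listener.resume()
console.log(listener.status, listener.closeEvents) // active 0

listener.close()
listener.resume() // no effect: a closed listener never restarts
console.log(listener.status, listener.closeEvents) // closed 1
```

Because pausing never reports a close, whatever tracks the listener keeps its reference and can still close it at shutdown.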

Comment on lines 162 to 157
this.metrics?.status.update({
  [this.addr]: SERVER_STATUS_DOWN
})
Member

Can you move L160 and L161 after this metrics call? Or else add a metric to know when the listener gets paused / unpaused.

@wemeetagain
Member

@achingbrain can we cut a release with this fix? This is causing some chaos in our CI pipeline

@maschad maschad requested a review from a team September 21, 2023 16:11
@maschad
Member

maschad commented Sep 26, 2023

@nazarhussain this test has failed quite a few times, is it passing for you locally?

@nazarhussain
Contributor Author

nazarhussain commented Sep 27, 2023

> @nazarhussain this test has failed quite a few times, is it passing for you locally?

Some weird behavior.

  • If I run the whole test suite with npm run test:node, it runs fine but gets stuck at the end and never exits, though that test seems to pass.
  • If I run only that single file, it passes and exits fine.
  • If I run the whole test suite with .only on that test case, it passes and exits fine.
  • If I do the same with .skip on that test case, the suite passes but gets stuck on exit.

I assume there is some setup in another file that is not cleaned up properly, so I need to figure out which code is the culprit for that behavior.

> npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

  circuit-relay
    flows with 1 listener
      ✔ should ask if node supports hop on protocol change (relay protocol) and add to listen multiaddrs
      ✔ should only add discovered relays relayed addresses (71ms)
      ✔ should not listen on a relayed address after we disconnect from peer (1036ms)
      ✔ should try to listen on other connected peers relayed address if one used relay disconnects (84ms)
      ✔ should try to listen on stored peers relayed address if one used relay disconnects and there are not enough connected (109ms)
      ✔ should not fail when trying to dial unreachable peers to add as hop relay and replaced removed ones (62ms)
      ✔ should announce new addresses when using a peer as a relay (1021ms)
      ✔ should announce new addresses when using no longer using peer as a relay

351 passing (45s)
  2 pending
> npx aegir test -t node -f dist/test/circuit-relay/relay.node.js

build

> [email protected] build
> aegir build

[10:55:02] tsc [started]
[10:55:07] tsc [completed]
[10:55:07] esbuild [started]
[10:55:07] esbuild [completed]
test node.js

....
....
....

26 passing (8s)

@nazarhussain
Contributor Author

These are the open handles when the tests get stuck; none of them are related to the code change in this PR. Trying to debug further.

- File descriptors: (note: stdio always exists)
  - fd 1 (tty) (stdio)
  - fd 0 (tty)
  - fd 2 (tty) (stdio)
- Child processes
  - PID 54509
    - Entry point: node:internal/child_process:255
- Servers:
  - :::55637 (HTTP)
    - Listeners:
      - request: proxy @ file:///js-libp2p/node_modules/it-ws/dist/src/server.js:70
- Intervals:
  - (300000 ~ 5 min) (anonymous) @ file:///js-libp2p/packages/libp2p/dist/src/circuit-relay/server/reservation-store.js:30
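
A rough version of this kind of report can be produced with Node's built-in `process.getActiveResourcesInfo()`, an experimental API that lists the types of resources keeping the event loop alive. The sketch below is not the tool that generated the report above; the helper name `activeResourceCounts` and the forgotten interval are assumptions made for illustration.

```typescript
// Count the resources currently keeping the Node.js event loop alive,
// grouped by type (for example 'Timeout' or 'TCPServerWrap').
function activeResourceCounts (): Map<string, number> {
  const counts = new Map<string, number>()
  for (const type of process.getActiveResourcesInfo()) {
    counts.set(type, (counts.get(type) ?? 0) + 1)
  }
  return counts
}

// A forgotten interval, similar in spirit to the 5 minute one reported above.
const forgotten = setInterval(() => {}, 300_000)
const before = activeResourceCounts().get('Timeout') ?? 0

// Clearing the interval releases the resource so the process can exit.
clearInterval(forgotten)
const after = activeResourceCounts().get('Timeout') ?? 0

console.log({ before, after })
```

Calling such a helper in a global `after` hook of the test suite is one way to see which resource types remain once all tests have finished.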

@nazarhussain
Contributor Author

@maschad Please trigger the workflow; I hope everything works fine now. The transport manager's remove logic differed from the similar logic in close, which meant that not all listeners were closed for a transport.
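
The thread does not spell out how the two code paths differed. One common way for a close loop to miss listeners, shown here purely as an assumption and not as the confirmed cause, is iterating over an array that each `close()` call shrinks. `TrackedListener` and both helper functions are hypothetical.

```typescript
// Hypothetical listener: closing it removes it from the shared list,
// the way a 'close' event handler might.
class TrackedListener {
  closed = false
  private list: TrackedListener[]

  constructor (list: TrackedListener[]) {
    this.list = list
    list.push(this)
  }

  close (): void {
    this.closed = true
    const index = this.list.indexOf(this)
    if (index !== -1) this.list.splice(index, 1)
  }
}

// Fragile: the array shrinks during iteration, so entries get skipped.
function closeWhileIterating (list: TrackedListener[]): void {
  for (const listener of list) listener.close()
}

// Safer: drain the array, so mutation inside close() cannot skip anything.
function closeByDraining (list: TrackedListener[]): void {
  while (list.length > 0) {
    list.pop()?.close()
  }
}

function countOpen (all: TrackedListener[]): number {
  return all.filter(listener => !listener.closed).length
}

const listA: TrackedListener[] = []
const allA = [new TrackedListener(listA), new TrackedListener(listA), new TrackedListener(listA)]
closeWhileIterating(listA)
console.log(countOpen(allA)) // 1: the middle listener was skipped

const listB: TrackedListener[] = []
const allB = [new TrackedListener(listB), new TrackedListener(listB), new TrackedListener(listB)]
closeByDraining(listB)
console.log(countOpen(allB)) // 0
```

Keeping a single shared routine for "close every listener of this transport" avoids two code paths drifting apart in this way.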

@nazarhussain
Contributor Author

nazarhussain commented Sep 28, 2023

That is strange, now everything is passing for me locally.

~/projects/js-libp2p/packages/libp2p on nh/listener-bug *1 ......................................... at 16:11:06
> npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

build

> [email protected] build
> aegir build

[16:11:18] tsc [started]
[16:11:25] tsc [completed]
[16:11:25] esbuild [started]
[16:11:26] esbuild [completed]
test node.js


  Address Manager
    ✔ should not need any addresses
    ✔ should return listen multiaddrs on get
    ✔ should return announce multiaddrs on get
    ✔ should add observed addresses
    ✔ should allow duplicate listen addresses
    ✔ should dedupe added observed addresses
    ✔ should only set addresses once (1504ms)
    ✔ should strip our peer address from added obs

....
....
....
  peer job queue
    ✔ should have jobs


  351 passing (43s)
  2 pending

~/projects/js-libp2p/packages/libp2p on nh/listener-bug *1 .............................. took 1m 4s at 16:12:16

@wemeetagain @maschad Would you try running the tests locally on Linux, if you have it? For me on Mac they pass every time.

@wemeetagain wemeetagain changed the title from "fix: handle the listener intermediary state" to "fix: handle tcp listener paused state" Sep 29, 2023
@wemeetagain
Member

I ran this locally

~/Code/js-libp2p/packages/libp2p$ npx aegir test -t node -f dist/test/circuit-relay/relay.node.js
build

> [email protected] build
> aegir build

[23:44:40] tsc [started]
[23:44:44] tsc [completed]
[23:44:44] esbuild [started]
[23:44:44] esbuild [completed]
test node.js


  circuit-relay
    flows with 1 listener
      ✔ should ask if node supports hop on protocol change (relay protocol) and add to listen multiaddrs (368ms)
      ✔ should only add discovered relays relayed addresses (382ms)
      1) should not listen on a relayed address after we disconnect from peer


  2 passing (1m)
  1 failing

  1) circuit-relay
       flows with 1 listener
         should not listen on a relayed address after we disconnect from peer:
     Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/cayman/Code/js-libp2p/packages/libp2p/dist/test/circuit-relay/relay.node.js)
      at listOnTimeout (node:internal/timers:573:17)
      at processTimers (node:internal/timers:514:7)

@nazarhussain
Contributor Author

Update: Verified that the issue is a Mac/Linux difference; the same test passes on Mac but not on Linux for the same version of Node.

Debugging it further.

@nazarhussain
Contributor Author

That is really strange: all tests are passing for me on the Linux VPS (except for one, due to the VPS network).

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy
:~/libp2p$ node -v
v18.12.1

:~/libp2p$ npm -v
8.19.2
:~/libp2p$ npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

!!! Temporarily disabling code coverage

build


> [email protected] build
> aegir build

[09:48:30] tsc [started]
[09:48:33] tsc [completed]
[09:48:33] esbuild [started]
[09:48:33] esbuild [completed]
test node.js
....
....
....
  267 passing (2m)
  2 pending
  1 failing

  1) peer discovery scenarios
       MulticastDNS should discover all peers on the local network:
     Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/devops/nazar-dev/js-libp2p/packages/libp2p/dist/test/peer-discovery/index.node.js)
      at listOnTimeout (node:internal/timers:564:17)
      at processTimers (node:internal/timers:507:7)

Member

@wemeetagain wemeetagain left a comment

LGTM

Member

@maschad maschad left a comment

LGTM, left some minor suggestions. Thanks again for these improvements!

@maschad maschad self-assigned this Sep 30, 2023
@maschad maschad changed the title from "fix: handle tcp listener paused state" to "fix: ensure all listeners are properly closed on tcp shutdown" Sep 30, 2023
@maschad maschad merged commit b57bca4 into libp2p:master Oct 1, 2023
18 checks passed
@wemeetagain
Member

Awesome, can we get this cut in a release? This bug is currently breaking some Lodestar CI tests

@nazarhussain
Contributor Author

@maschad @wemeetagain Thanks for reviewing and merging this PR.

@nazarhussain nazarhussain deleted the nh/listener-bug branch October 2, 2023 04:59
@maschad
Member

maschad commented Oct 2, 2023

@wemeetagain @nazarhussain master has been released; the fix is in libp2p v0.46.12 and @libp2p/tcp v8.0.8.

@wemeetagain
Member

🙏

Successfully merging this pull request may close these issues.

[switch] start and stop accepting connections
4 participants