fix: ensure all listeners are properly closed on tcp shutdown #2058

nazarhussain · 2023-09-18T11:35:31Z

The listener implements logic that lets the server operate in multiple states. When connection limits are hit, it stops and later starts again. We can think of these states as pause and resume.

But because of the way TransportManager is implemented, as soon as the listener is stopped it is removed from the transport manager, and if the listener's connection count changes in the meantime, it starts listening again.

Because the listener was already removed from the TransportManager, it is never closed when we close the libp2p instance.

This behavior can be triggered every time we close two libp2p instances that are connected to each other with a connection limit of 1. As soon as the first instance disconnects, the second instance's connection count drops to 0; it closes the listener and the transport manager removes the listener, but because 0 is also the lower limit at which listening resumes, the listener starts listening again.
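The scenario above can be sketched in a few lines. This is a minimal model, not the `@libp2p/tcp` or `TransportManager` code: `SketchListener`, `SketchManager`, and the `emitCloseOnPause` flag are hypothetical names used only to contrast a listener that reports `close` when it merely pauses with one that reports it only on a real close.

```typescript
// Hypothetical sketch: none of these names exist in libp2p.
type Status = 'inactive' | 'active' | 'paused' | 'closed'

class SketchListener {
  status: Status = 'inactive'
  private connections = 0
  private readonly closeHandlers: Array<() => void> = []
  private readonly closeAbove: number
  private readonly listenBelow: number
  private readonly emitCloseOnPause: boolean

  constructor (closeAbove: number, listenBelow: number, emitCloseOnPause: boolean) {
    this.closeAbove = closeAbove // pause once this many connections are open
    this.listenBelow = listenBelow // resume once the count drops to this value
    this.emitCloseOnPause = emitCloseOnPause // true models the behaviour described as the bug
  }

  onClose (handler: () => void): void {
    this.closeHandlers.push(handler)
  }

  listen (): void {
    this.status = 'active'
  }

  connectionOpened (): void {
    this.connections++
    if (this.status === 'active' && this.connections >= this.closeAbove) {
      this.status = 'paused'
      if (this.emitCloseOnPause) this.emitClose()
    }
  }

  connectionClosed (): void {
    this.connections--
    if (this.status === 'paused' && this.connections <= this.listenBelow) {
      this.status = 'active' // resume: the server is listening again
    }
  }

  close (): void {
    this.status = 'closed'
    this.emitClose()
  }

  private emitClose (): void {
    for (const handler of this.closeHandlers) handler()
  }
}

// Stand-in for the transport manager: it forgets a listener on 'close'
// and can only shut down the listeners it still tracks.
class SketchManager {
  private readonly listeners = new Set<SketchListener>()

  add (listener: SketchListener): void {
    this.listeners.add(listener)
    listener.onClose(() => { this.listeners.delete(listener) })
    listener.listen()
  }

  stop (): void {
    // Copy first, because closing a listener removes it from the set.
    for (const listener of [...this.listeners]) listener.close()
  }
}

function statusAfterStop (emitCloseOnPause: boolean): Status {
  const listener = new SketchListener(1, 0, emitCloseOnPause)
  const manager = new SketchManager()
  manager.add(listener)
  listener.connectionOpened() // limit of 1 reached: listener pauses
  listener.connectionClosed() // count back to 0: listener resumes
  manager.stop()
  return listener.status
}

console.log(statusAfterStop(true)) // 'active': the listener was never closed
console.log(statusAfterStop(false)) // 'closed'
```

In this model the listener that emits `close` on pause ends up listening again after the manager has already dropped it, so `stop()` never reaches it.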

Resolves #440

achingbrain · 2023-09-18T12:43:09Z

Thanks for opening this.

Can you please add some tests to demonstrate the problem this fixes and to ensure there are no regressions in future.

nazarhussain · 2023-09-18T13:47:25Z

@achingbrain Yes, definitely. I opened it to get initial feedback from team members and will add tests to make sure that the edge case is covered.

wemeetagain

The approach LGTM

L160-161 is the important change. Before, this listener emitted the 'close' event when it hit its connection limit. If libp2p is closed at that point, the listener is never cleaned up.

Now, by differentiating the paused and closed listener statuses, we emit the 'close' event only when the listener is actually closed, so the listener is always cleaned up.

wemeetagain · 2023-09-19T13:37:47Z

packages/transport-tcp/src/listener.ts

        this.metrics?.status.update({
          [this.addr]: SERVER_STATUS_DOWN
        })


Can you move L160 and L161 after this metrics call? Or else add a metric to know when the listener gets paused / unpaused.

wemeetagain · 2023-09-21T14:45:42Z

@achingbrain can we cut a release with this fix? This is causing some chaos in our CI pipeline

maschad · 2023-09-26T15:51:03Z

@nazarhussain this test has failed quite a few times, is it passing for you locally?

nazarhussain · 2023-09-27T09:05:43Z

> @nazarhussain this test has failed quite a few times, is it passing for you locally?

Some weird behavior.

- If I run the whole test suite with npm run test:node, it runs fine but gets stuck at the end and never exits, though that test seems to pass.
- If I run only that single file, it passes and exits fine.
- If I run the whole test suite with .only for that test case, it passes and exits fine.
- If I do the same with .skip for the test case, the suite passes but gets stuck on exit.

I assume there is some setup in another file that is not cleaned up properly, so I need to figure out which code is the culprit for that behavior.

> npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

  circuit-relay
    flows with 1 listener
      ✔ should ask if node supports hop on protocol change (relay protocol) and add to listen multiaddrs
      ✔ should only add discovered relays relayed addresses (71ms)
      ✔ should not listen on a relayed address after we disconnect from peer (1036ms)
      ✔ should try to listen on other connected peers relayed address if one used relay disconnects (84ms)
      ✔ should try to listen on stored peers relayed address if one used relay disconnects and there are not enough connected (109ms)
      ✔ should not fail when trying to dial unreachable peers to add as hop relay and replaced removed ones (62ms)
      ✔ should announce new addresses when using a peer as a relay (1021ms)
      ✔ should announce new addresses when using no longer using peer as a relay

351 passing (45s)
  2 pending

> npx aegir test -t node -f dist/test/circuit-relay/relay.node.js

build

> [email protected] build
> aegir build

[10:55:02] tsc [started]
[10:55:07] tsc [completed]
[10:55:07] esbuild [started]
[10:55:07] esbuild [completed]
test node.js

....
....
....

26 passing (8s)

nazarhussain · 2023-09-28T08:40:12Z

These are the open handles when the tests get stuck; none of them are related to the code change in this PR. Trying to debug further.

- File descriptors: (note: stdio always exists)
  - fd 1 (tty) (stdio)
  - fd 0 (tty)
  - fd 2 (tty) (stdio)
- Child processes
  - PID 54509
    - Entry point: node:internal/child_process:255
- Servers:
  - :::55637 (HTTP)
    - Listeners:
      - request: proxy @ file:///js-libp2p/node_modules/it-ws/dist/src/server.js:70
- Intervals:
  - (300000 ~ 5 min) (anonymous) @ file:///js-libp2p/packages/libp2p/dist/src/circuit-relay/server/reservation-store.js:30
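
For context on why an entry under "Intervals" matters: in Node.js, a pending `setInterval` keeps the event loop alive until it is cleared (or unreferenced), so a component that starts one and is never stopped can prevent a test run from exiting. The sketch below illustrates the start/stop pairing with a hypothetical `SketchReservationStore`; it is not the code in `reservation-store.js`, and this thread does not establish that the interval was the actual cause of the hang.

```typescript
// Hypothetical sketch: a component that owns a periodic timer.
class SketchReservationStore {
  private interval?: ReturnType<typeof setInterval>

  start (): void {
    // While this interval is pending, the process will not exit on its own.
    this.interval = setInterval(() => {
      // periodic cleanup work would go here
    }, 300000)
  }

  stop (): void {
    // Clearing the interval releases the handle so the process can exit.
    clearInterval(this.interval)
    this.interval = undefined
  }

  isRunning (): boolean {
    return this.interval != null
  }
}

const store = new SketchReservationStore()
store.start()
console.log(store.isRunning()) // true
store.stop()
console.log(store.isRunning()) // false
```

A test that starts such a component needs a matching `stop()` in its teardown; otherwise the handle shows up in a list like the one above.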

nazarhussain · 2023-09-28T09:24:15Z

@maschad Please trigger the workflow; hopefully everything works fine now. The transport manager's remove logic differed from the similar logic in close, which meant that not all listeners for a transport were closed.

nazarhussain · 2023-09-28T14:13:01Z

That is strange, now everything is passing for me locally.

~/projects/js-libp2p/packages/libp2p on nh/listener-bug *1 ......................................... at 16:11:06
> npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

build

> [email protected] build
> aegir build

[16:11:18] tsc [started]
[16:11:25] tsc [completed]
[16:11:25] esbuild [started]
[16:11:26] esbuild [completed]
test node.js


  Address Manager
    ✔ should not need any addresses
    ✔ should return listen multiaddrs on get
    ✔ should return announce multiaddrs on get
    ✔ should add observed addresses
    ✔ should allow duplicate listen addresses
    ✔ should dedupe added observed addresses
    ✔ should only set addresses once (1504ms)
    ✔ should strip our peer address from added obs

....
....
....
  peer job queue
    ✔ should have jobs


  351 passing (43s)
  2 pending

~/projects/js-libp2p/packages/libp2p on nh/listener-bug *1 .............................. took 1m 4s at 16:12:16

@wemeetagain @maschad Could you try running the tests locally on Linux, if you have it? For me on Mac they pass every time.

wemeetagain · 2023-09-29T03:47:11Z

I ran this locally

~/Code/js-libp2p/packages/libp2p$ npx aegir test -t node -f dist/test/circuit-relay/relay.node.js
build

> [email protected] build
> aegir build

[23:44:40] tsc [started]
[23:44:44] tsc [completed]
[23:44:44] esbuild [started]
[23:44:44] esbuild [completed]
test node.js


  circuit-relay
    flows with 1 listener
      ✔ should ask if node supports hop on protocol change (relay protocol) and add to listen multiaddrs (368ms)
      ✔ should only add discovered relays relayed addresses (382ms)
      1) should not listen on a relayed address after we disconnect from peer


  2 passing (1m)
  1 failing

  1) circuit-relay
       flows with 1 listener
         should not listen on a relayed address after we disconnect from peer:
     Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/cayman/Code/js-libp2p/packages/libp2p/dist/test/circuit-relay/relay.node.js)
      at listOnTimeout (node:internal/timers:573:17)
      at processTimers (node:internal/timers:514:7)

nazarhussain · 2023-09-29T09:06:50Z

Update: Verified that it's a difference between Mac and Linux; the same test passes on Mac but not on Linux for the same version of Node.

Debugging it further.

nazarhussain · 2023-09-29T14:39:47Z

That is really strange, all tests are passing for me on the Linux VPS (except for one, due to the VPS network).

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

:~/libp2p$ node -v
v18.12.1

:~/libp2p$ npm -v
8.19.2

:~/libp2p$ npm run test:node

> [email protected] test:node
> aegir test -t node -f "./dist/test/**/*.{node,spec}.js" --cov

!!! Temporarily disabling code coverage

build


> [email protected] build
> aegir build

[09:48:30] tsc [started]
[09:48:33] tsc [completed]
[09:48:33] esbuild [started]
[09:48:33] esbuild [completed]
test node.js
....
....
....
  267 passing (2m)
  2 pending
  1 failing

  1) peer discovery scenarios
       MulticastDNS should discover all peers on the local network:
     Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/devops/nazar-dev/js-libp2p/packages/libp2p/dist/test/peer-discovery/index.node.js)
      at listOnTimeout (node:internal/timers:564:17)
      at processTimers (node:internal/timers:507:7)

wemeetagain

LGTM

maschad

LGTM, ~~left some minor suggestions~~. Thanks again for these improvements!

wemeetagain · 2023-10-02T02:17:48Z

Awesome, can we get this cut in a release? This bug is currently breaking some Lodestar CI tests

nazarhussain · 2023-10-02T04:59:05Z

@maschad @wemeetagain Thanks for reviewing and merging this PR.

maschad · 2023-10-02T13:58:29Z

@wemeetagain @nazarhussain master has been released ~~it should appear in libp2p v0.46.12 and @libp2p/tcp v8.0.8 once CI is complete~~ it is in libp2p v0.46.12 and @libp2p/tcp v8.0.8

wemeetagain · 2023-10-02T14:21:28Z

🙏

nazarhussain requested a review from a team as a code owner September 18, 2023 11:35

nazarhussain force-pushed the nh/listener-bug branch from 5c361be to 20b7929 Compare September 18, 2023 11:37

nazarhussain changed the title ~~Fix the bug of listener intermediary state~~ fix: handle the listener intermediary state Sep 18, 2023

wemeetagain reviewed Sep 19, 2023

nazarhussain mentioned this pull request Sep 20, 2023

fix: make the network listener more stable ChainSafe/lodestar#5962

Closed

nazarhussain marked this pull request as draft September 20, 2023 17:29

nazarhussain added 3 commits September 21, 2023 12:38

Fix the bug of listner intermediary state

b23dd6a

Update the listner

e47e041

Add unit tests for the connection limit

631123f

nazarhussain force-pushed the nh/listener-bug branch from 41cf61d to 631123f Compare September 21, 2023 10:38

nazarhussain marked this pull request as ready for review September 21, 2023 10:38

nazarhussain requested a review from wemeetagain September 21, 2023 10:39

wemeetagain approved these changes Sep 21, 2023

Fix the lint errors

2e5346a

maschad requested a review from a team September 21, 2023 16:11

maschad requested changes Sep 21, 2023

Update code with feedback

620cd80

nazarhussain requested a review from maschad September 23, 2023 13:36

Merge branch 'master' into nh/listener-bug

a05200c

Merge branch 'master' into nh/listener-bug

e59dc54

Fix the transport manager to close all listeners

ea620dc

Merge branch 'master' into nh/listener-bug

57d19f2

wemeetagain changed the title ~~fix: handle the listener intermediary state~~ fix: handle tcp listener paused state Sep 29, 2023

nazarhussain added 2 commits September 29, 2023 11:51

Update the listner to close all connections later

d4259ae

Merge branch 'master' into nh/listener-bug

70c9037

Update the listener close handler

a6722a9

wemeetagain approved these changes Sep 30, 2023

maschad approved these changes Sep 30, 2023

maschad added 2 commits September 30, 2023 16:07

Update packages/libp2p/src/transport-manager.ts

83f5340

Update packages/libp2p/src/transport-manager.ts

fdd3e96

maschad self-assigned this Sep 30, 2023

maschad changed the title ~~fix: handle tcp listener paused state~~ fix: ensure all listeners are properly closed on tcp shutdown Sep 30, 2023

maschad merged commit b57bca4 into libp2p:master Oct 1, 2023
18 checks passed

nazarhussain deleted the nh/listener-bug branch October 2, 2023 04:59

nflaig mentioned this pull request Oct 19, 2023

Libp2p stop never resolves when shutting down node ChainSafe/lodestar#6053

Open

This was referenced Jan 18, 2024

chore(main): release 1.0.0 #2365

Closed

chore(main): release 1.0.0 #2366

Closed

This was referenced Feb 21, 2024

Process does not terminate after calling stop() ipfs/helia#309

Closed

fix: tcp server close race condition #2421

Merged

achingbrain mentioned this pull request Oct 23, 2024

fix(@libp2p/tcp): race condition in onSocket #2763

Merged
