bug: nodes don't respond after a while #1831

Closed
Ivansete-status opened this issue Jun 28, 2023 · 9 comments · Fixed by #1848
Labels
bug Something isn't working critical This issue needs critical attention E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details

Comments

@Ivansete-status
Collaborator

Ivansete-status commented Jun 28, 2023

Problem

In wakuv2.prod, none of the nodes were responding to wss requests.

@fryorcraken reported that requests from js-waku (https://examples.waku.org/light-js/) were not being handled properly. That issue was reported at status-im/infra-nim-waku#77. While there was initially a firewall issue, in the end we found that the nwaku nodes themselves were blocked.

For some reason, the node could not accept any more requests on port 8000.

@jakubgs discovered the following:

[email protected]:~ % d exec -it nim-waku-v2 sh 
/ # netstat -lpnt | grep 8000
tcp      101      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      1/wakunode
/ # netstat -pnt | grep 8000 | wc -l
67
/ # netstat -pnt | grep 8000 | grep CLOSE_WAIT | wc -l
67
/ # netstat -pnt | grep 8000 | grep CLOSE_WAIT | head -n5
tcp      298      0 172.17.1.3:8000         77.181.42.28:42406      CLOSE_WAIT  -
tcp      298      0 172.17.1.3:8000         77.181.42.28:50646      CLOSE_WAIT  -
tcp      408      0 172.17.1.3:8000         164.92.69.24:33722      CLOSE_WAIT  -
tcp      408      0 172.17.1.3:8000         164.92.69.24:33734      CLOSE_WAIT  -
tcp      298      0 172.17.1.3:8000         165.232.90.54:47278     CLOSE_WAIT  -

and the following, where we can see that the p2p port seems to be blocked as well:

 > sudo nmap -Pn -p8000,30303 node-01.gc-us-central1-a.wakuv2.prod.statusim.net
Starting Nmap 7.93 ( https://nmap.org ) at 2023-06-28 10:03 CEST
Nmap scan report for node-01.gc-us-central1-a.wakuv2.prod.statusim.net (34.121.100.108)
Host is up.
rDNS record for 34.121.100.108: 108.100.121.34.bc.googleusercontent.com

PORT      STATE    SERVICE
8000/tcp  filtered http-alt
30303/tcp filtered unknown

Nmap done: 1 IP address (1 host up) scanned in 3.14 seconds
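
The netstat check above can be scripted so it is easy to repeat. CLOSE_WAIT means the remote peer has closed the connection while the local process has not yet closed its end of the socket. The following is a minimal sketch, assuming the `netstat -pnt` column layout shown above; the function name `count_states` and the `SAMPLE` text are illustrative and not part of nwaku or the fleet tooling.

```python
from collections import Counter


def count_states(netstat_output: str, port: int) -> Counter:
    """Count TCP states for rows whose local address uses `port`.

    Assumed column layout (as in the output above):
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_address = fields[3]
        if local_address.rsplit(":", 1)[-1] == str(port):
            states[fields[5]] += 1
    return states


# Sample rows copied from the netstat output above.
SAMPLE = """\
tcp      298      0 172.17.1.3:8000         77.181.42.28:42406      CLOSE_WAIT  -
tcp      298      0 172.17.1.3:8000         77.181.42.28:50646      CLOSE_WAIT  -
tcp      408      0 172.17.1.3:8000         164.92.69.24:33722      CLOSE_WAIT  -
"""

print(count_states(SAMPLE, 8000))
```

In practice the text would come from running netstat inside the container. A CLOSE_WAIT count that keeps growing and never drains is the signal of interest.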

Impact

The node does not serve any requests until it is restarted.

To reproduce

If you can reproduce the behavior, steps to reproduce it:

  1. Go to https://examples.waku.org/light-js/
  2. Add the address '/dns4/node-01.do-ams3.wakuv2.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAmPLe7Mzm8TsYUubgCAW1aJoeFScxrLj8ppHFivPo97bUZ'
  3. Click "dial" many times.
  4. Close the browser.
  5. (At this point, sockets in the CLOSE_WAIT state may appear in the node's container.)
  6. Repeat this many times.

@fryorcraken - kindly elaborate more on this if I missed any point on how to replicate the issue.

nwaku version/commit hash

v0.18.0-13-g44f9d8
We encountered this issue in all three wakuv2.prod nodes.

@fryorcraken
Collaborator

I have marked this bug as critical as it is preventing js-waku nodes from connecting to the fleet.
It also blocks dogfooding of the various efforts to make js-waku nodes less reliant on the wakuv2 fleet.

@fryorcraken
Collaborator

fryorcraken commented Jun 30, 2023

@fryorcraken - kindly elaborate more on this if I missed any point on how to replicate the issue.

As specified in status-im/infra-nim-waku#77, one can simply use websocat to test the connection:

websocat -v wss://node-01.do-ams3.wakuv2.prod.statusim.net:8000

Do note that, as mentioned there, the ws port was not the only one affected: status-im/infra-nim-waku#77 (comment)

Also, why was this not detected in Consul?

@SionoiS SionoiS self-assigned this Jun 30, 2023
@Menduist
Contributor

Might be related to status-im/nimbus-eth2#5004
Check the logs to see if the accept loop is crashing

@Menduist
Contributor

Actually, that is 99% likely the cause; just bump libp2p and it will be fixed.

@SionoiS
Contributor

SionoiS commented Jun 30, 2023

Thanks @Menduist for the info. I was looking at that code change, wondering if it was related. I will try with the new version of libp2p.

@Ivansete-status
Collaborator Author

Btw, this issue is also happening on the wakuv2.test fleet.

@Ivansete-status Ivansete-status moved this to In Progress in Waku Jul 2, 2023
@SionoiS SionoiS removed their assignment Jul 3, 2023
@Ivansete-status
Collaborator Author

Checking the HK node from the wakuv2.test fleet.

An nwaku node was started locally and could establish a relay connection through port 30303.
A js-waku node was started locally but couldn't establish a relay connection through port 8000.

On the other hand, both ports are open:

$ nmap -Pn -p30303,8000 node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net
Starting Nmap 7.80 ( https://nmap.org ) at 2023-07-04 11:01 CEST
Nmap scan report for node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net (47.242.210.73)
Host is up (0.33s latency).

PORT      STATE SERVICE
8000/tcp  open  http-alt
30303/tcp open  unknown
10:36:58 ~$ 
$ curl -v telnet://node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net:30303
*   Trying 47.242.210.73:30303...
* Connected to node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net (47.242.210.73) port 30303 (#0)

10:38:38 ~$ 
$ curl -v telnet://node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net:8000
*   Trying 47.242.210.73:8000...
* Connected to node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net (47.242.210.73) port 8000 (#0)

So we have an issue with websockets.
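
The pattern above (the TCP connect succeeds, but nothing answers at the application level) is what one would expect from a listening socket whose process has stopped calling accept: the kernel still completes the TCP handshake and queues the connection. On Linux, Recv-Q on a LISTEN socket normally reports the number of such queued connections, so the value of 101 in the earlier netstat output would be consistent with this. Below is a minimal local sketch of that behaviour. It says nothing about nwaku internals; `probe` and the `stalled` listener are hypothetical names used only for illustration.

```python
import socket
from typing import Tuple


def probe(host: str, port: int, timeout: float = 1.0) -> Tuple[bool, bool]:
    """Return (tcp_connected, got_reply) for a single probe."""
    try:
        client = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return (False, False)
    with client:
        client.settimeout(timeout)
        try:
            client.sendall(b"hello\n")
            return (True, len(client.recv(1)) > 0)
        except OSError:
            # Timeout or reset: the connection was made but nothing answered.
            return (True, False)


# A listener that never calls accept(), standing in for a stalled accept loop.
stalled = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stalled.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
stalled.listen(8)
stalled_port = stalled.getsockname()[1]

result = probe("127.0.0.1", stalled_port)
print(result)
stalled.close()
```

This is why a plain port scan or `curl telnet://` check can report the port as open while websocket clients still hang.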

@Ivansete-status
Collaborator Author

The reason why the nwaku node stops accepting new incoming connections is that an exception is re-raised from:

wstransport.nim#L288

which, in turn, causes the main accept loop in switch.nim to stop.

I will apply @Menduist's suggestion to fix that.
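
To illustrate the failure mode described above, here is a minimal sketch in Python rather than Nim. It is not the nim-libp2p code and not the actual fix; every name in it (`HandshakeError`, `upgrade`, `fragile_accept_loop`, `resilient_accept_loop`, `run_demo`) is hypothetical. It only shows the general idea: if an error from one incoming connection escapes the accept loop, no later connection is ever served, whereas handling the error per connection keeps the loop alive.

```python
import asyncio


class HandshakeError(Exception):
    """Stand-in for a failure while upgrading one incoming connection."""


async def upgrade(conn_id: int, bad: set) -> int:
    await asyncio.sleep(0)  # yield to the event loop, as a real handshake would
    if conn_id in bad:
        raise HandshakeError(f"connection {conn_id} failed")
    return conn_id


async def fragile_accept_loop(incoming, bad, served):
    """Any per-connection error escapes and ends the whole loop."""
    for conn_id in incoming:
        served.append(await upgrade(conn_id, bad))


async def resilient_accept_loop(incoming, bad, served):
    """Per-connection errors are handled, so later connections still get served."""
    for conn_id in incoming:
        try:
            served.append(await upgrade(conn_id, bad))
        except HandshakeError:
            continue  # drop this connection only


def run_demo():
    incoming, bad = range(5), {2}  # connection 2 fails its handshake
    fragile, resilient = [], []
    try:
        asyncio.run(fragile_accept_loop(incoming, bad, fragile))
    except HandshakeError:
        pass  # the loop is gone; connections 3 and 4 are never served
    asyncio.run(resilient_accept_loop(incoming, bad, resilient))
    return fragile, resilient


print(run_demo())
```

In the fragile variant a single bad client is enough to stop service for everyone after it, which matches the symptom of a node that stays unresponsive until it is restarted.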

cc - @SionoiS, @fryorcraken, @jm-clius

@jm-clius
Contributor

jm-clius commented Jul 5, 2023

Great investigative work!

@fryorcraken fryorcraken added E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details and removed E:2023-light-protocols labels Sep 22, 2023
5 participants