
Native WebSocket Connection is unreliable #949

Closed
D4nte opened this issue Apr 20, 2022 · 12 comments
@D4nte
Contributor

D4nte commented Apr 20, 2022

I have done some test automation to check the behaviour of the fleets when using js-waku in Node.js.

It does some basic checks (a rough sketch of one iteration follows the list):

  1. Start several js-waku nodes and connect them to different nodes within a fleet (the targeted fleet is configurable),
  2. Send messages via Relay and check that all js-waku nodes receive the messages sent by the other nodes via Relay,
  3. Send messages via Light Push and check that all js-waku nodes receive the messages sent by the other nodes via Relay,
  4. Send messages via Light Push and check that all messages are returned by all nodes.
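
For readers unfamiliar with the suite, one iteration of this flow looks roughly like the sketch below. It is not the actual test code: the peer id and run id are placeholders, and the js-waku calls (Waku.create, waku.dial, waku.relay, waku.lightPush) reflect my understanding of the js-waku API of that era, which may differ from the suite.

```ts
import { Waku, WakuMessage } from "js-waku";

// Hypothetical sketch of one test iteration. Placeholders: <peer-id>, <run-id>.
const contentTopic = "/waku-tests/1/relay-test-<run-id>/utf8";

// Native wss uses port 8000 on the fleet nodes; the websockify path uses port 443.
const nativeWss =
  "/dns4/node-01.do-ams3.wakuv2.prod.statusim.net/tcp/8000/wss/p2p/<peer-id>";

async function runOnce(): Promise<string[]> {
  const waku = await Waku.create();
  await waku.dial(nativeWss);

  // Record everything received on the test content topic via Relay.
  const received: string[] = [];
  waku.relay.addObserver((msg: WakuMessage) => {
    if (msg.payloadAsUtf8) received.push(msg.payloadAsUtf8);
  }, [contentTopic]);

  // Publish via Relay (check 2) or via Light Push (checks 3 and 4).
  const msg = await WakuMessage.fromUtf8String(
    "sent via node-01.do-ams3.wakuv2.prod.statusim.net - <run-id>",
    contentTopic
  );
  await waku.relay.send(msg);
  // await waku.lightPush.push(msg);

  return received; // the test then asserts that the other nodes' messages appear here
}
```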

Here are the findings after running the tests several times from my laptop:

  1. Behaviour is consistently good on the Prod fleet when connecting via websockify (port 443): no message is ever missing.
  2. Behaviour is rarely good on the Prod fleet when connecting via native wss (port 8000): some messages are missing.

Here are some sample outputs; hopefully they are self-explanatory:

You can trigger the flow manually, directly from GitHub: https://github.com/status-im/waku-tests/actions/workflows/run.yml.

# Run 1
  1) Prod Fleet Native WSS
       Run Waku Test Suite
         Relay:
     AssertionError: Node connected to node-01.do-ams3.wakuv2.prod.statusim.net did not receive message sent via node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net: expected [] to include 'sent via node-01.ac-cn-hongkong-c.wak…'
      at Context.<anonymous> (file:///home/fryorcraken/src/status-im/waku-tests/index.js:116:52)

# Run 2
  1) Prod Fleet Native WSS
       Run Waku Test Suite
         Relay:
     AssertionError: Node connected to node-01.do-ams3.wakuv2.prod.statusim.net did not receive message sent via node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net: expected [] to include 'sent via node-01.ac-cn-hongkong-c.wak…'
      at Context.<anonymous> (file:///home/fryorcraken/src/status-im/waku-tests/index.js:116:52)

  2) Prod Fleet Native WSS
       Run Waku Test Suite
         Send via Light Push, receive via Relay:
     AssertionError: Node connected to node-01.do-ams3.wakuv2.prod.statusim.net did not receive message sent via node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net: expected [ Array(1) ] to include 'sent via node-01.ac-cn-hongkong-c.wak…'
      at Context.<anonymous> (file:///home/fryorcraken/src/status-im/waku-tests/index.js:192:52)

# Run 3

  1) Prod Fleet Native WSS
       Run Waku Test Suite
         Relay:
     AssertionError: Node connected to node-01.do-ams3.wakuv2.prod.statusim.net did not receive message sent via node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net: expected [] to include 'sent via node-01.ac-cn-hongkong-c.wak…'
      at Context.<anonymous> (file:///home/fryorcraken/src/status-im/waku-tests/index.js:116:52)

  2) Prod Fleet Native WSS
       Run Waku Test Suite
         Send via Light Push, receive via Relay:
     AssertionError: Node connected to node-01.do-ams3.wakuv2.prod.statusim.net did not receive message sent via node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net: expected [ Array(1) ] to include 'sent via node-01.ac-cn-hongkong-c.wak…'
      at Context.<anonymous> (file:///home/fryorcraken/src/status-im/waku-tests/index.js:192:52)
@jm-clius
Contributor

Thanks, @D4nte. On our side we will need to repeat the tests and capture the fleet logs from the same time window in order to understand why the messages go missing and to confirm that this is indeed related to websockets.

cc @dryajov

@jm-clius
Contributor

jm-clius commented Apr 22, 2022

Some very preliminary results from an investigation

TL;DR: the example below appears to indicate that, for at least one failing relay test, the message was in fact relayed but perhaps did not trigger the relay callback/observer in time, or was not published to the connected js-waku node because the latter was not considered a full-message peer.

Steps taken

  1. Ran the suite against the test fleet's native wss port: https://github.com/status-im/waku-tests/runs/6133637106?check_suite_focus=true

  2. Relay test failed with:

Relay:
     AssertionError: Node connected to node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net did not receive message sent via node-01.gc-us-central1-a.wakuv2.test.statusim.net: expected [ Array(1) ] to include 'sent via node-01.gc-us-central1-a.wak…'
      at Context.<anonymous> (file:///home/runner/work/waku-tests/waku-tests/index.js:116:52)

The run id for the Relay test is: f84eee35-10c4-4464-a7a4-4ddc13a63d2f, implying a content topic of /waku-tests/1/relay-test-f84eee35-10c4-4464-a7a4-4ddc13a63d2f/utf8

  3. Connected to node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net and retrieved latest history:
./build/wakunode2 --storenode:/ip4/47.242.210.73/tcp/30303/p2p/16Uiu2HAkvWiyFsgRhuJEb9JfjYxEkoHLgnUQmr1N5mKWnYjxYRVm
  4. History on that content topic includes all the expected messages - indicating that the "missing" message has indeed been relayed:
{"payload":[115,101,110,116,32,118,105,97,32,110,111,100,101,45,48,49,46,97,99,45,99,110,45,104,111,110,103,107,111,110,103,45,99,46,119,97,107,117,118,50,46,116,101,115,116,46,115,116,97,116,117,115,105,109,46,110,101,116,32,45,32,102,56,52,101,101,101,51,53,45,49,48,99,52,45,52,52,54,52,45,97,55,97,52,45,52,100,100,99,49,51,97,54,51,100,50,102],"contentTopic":"/waku-tests/1/relay-test-f84eee35-10c4-4464-a7a4-4ddc13a63d2f/utf8","version":0,"timestamp":1650653602926000000,"proof":{"proof":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"merkleRoot":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"epoch":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareX":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareY":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"nullifier":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
{"payload":[115,101,110,116,32,118,105,97,32,110,111,100,101,45,48,49,46,100,111,45,97,109,115,51,46,119,97,107,117,118,50,46,116,101,115,116,46,115,116,97,116,117,115,105,109,46,110,101,116,32,45,32,102,56,52,101,101,101,51,53,45,49,48,99,52,45,52,52,54,52,45,97,55,97,52,45,52,100,100,99,49,51,97,54,51,100,50,102],"contentTopic":"/waku-tests/1/relay-test-f84eee35-10c4-4464-a7a4-4ddc13a63d2f/utf8","version":0,"timestamp":1650653602927000000,"proof":{"proof":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"merkleRoot":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"epoch":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareX":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareY":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"nullifier":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
{"payload":[115,101,110,116,32,118,105,97,32,110,111,100,101,45,48,49,46,103,99,45,117,115,45,99,101,110,116,114,97,108,49,45,97,46,119,97,107,117,118,50,46,116,101,115,116,46,115,116,97,116,117,115,105,109,46,110,101,116,32,45,32,102,56,52,101,101,101,51,53,45,49,48,99,52,45,52,52,54,52,45,97,55,97,52,45,52,100,100,99,49,51,97,54,51,100,50,102],"contentTopic":"/waku-tests/1/relay-test-f84eee35-10c4-4464-a7a4-4ddc13a63d2f/utf8","version":0,"timestamp":1650653602927000000,"proof":{"proof":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"merkleRoot":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"epoch":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareX":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"shareY":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"nullifier":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}

Note that two messages have exactly the same timestamp: 1650653602927000000, in case this may cause some inconsistencies

@D4nte
Contributor Author

D4nte commented May 3, 2022

> Note that two messages have exactly the same timestamp: 1650653602927000000, in case this may cause some inconsistencies

The two messages with exactly the same timestamp are likely to be the same message coming from the same js-waku node but received through a different path.

Just in case I added a delay: waku-org/waku-tests@aa0109e
Let's see if it helps.
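
For context, the shape of such a change is roughly the following (a sketch only, not the linked commit; the helper name and the 1 s value are hypothetical):

```ts
// Pause between successive sends so that messages published by different
// js-waku nodes are less likely to end up with identical timestamps.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function sendAll(senders: Array<() => Promise<void>>): Promise<void> {
  for (const send of senders) {
    await send();      // e.g. waku.relay.send(message) in the test suite
    await delay(1000); // hypothetical 1 s gap between sends
  }
}
```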

@D4nte
Contributor Author

D4nte commented May 4, 2022

> Note that two messages have exactly the same timestamp: 1650653602927000000, in case this may cause some inconsistencies
>
> The two messages with exactly the same timestamp are likely to be the same message coming from the same js-waku node but received through a different path.
>
> Just in case I added a delay: status-im/waku-tests@aa0109e Let's see if it helps.

I reverted this change and it caused some timeouts. I haven't looked into providing a new change. Let me know if it's needed.

@D4nte D4nte added the bug Something isn't working label May 9, 2022
@jm-clius jm-clius self-assigned this May 9, 2022
@jm-clius jm-clius added this to the Release 0.10 milestone May 18, 2022
@jm-clius
Contributor

Started looking into this again. Summary so far:

  1. As far as I can tell, only the tests where relay is used to retrieve messages sometimes fail. The "retrieve via store" tests passed during each experimental run.
  2. Similarly, using store queries I can determine that all messages are received by all fleet nodes for every test (at least the ones I checked manually) - it's only on the js-waku relay connections where messages are either not sent, or not sent in time (a sketch of this check follows the list).
  3. When trying to set up a local replica of this test, I remembered that js-waku requires domain certificates to work (which I don't have). I've set up a local nwaku-only environment that mimics the test environment (a fleet of 3 nodes, with individual nwaku nodes connecting via websocket to mimic the js-waku connections) and so far I haven't been able to reproduce the issue. Will continue working on this tomorrow.
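
The store check in point 2 was presumably done from nwaku; purely for illustration, a similar check from js-waku could look roughly like the sketch below (waku.store.queryHistory and payloadAsUtf8 are assumptions about the js-waku API of that era):

```ts
import { Waku } from "js-waku";

// Query the connected fleet node's store for the test content topic and verify
// that every expected payload was persisted, i.e. that the message did reach
// the fleet even when the js-waku relay observer never saw it.
async function allMessagesStored(
  waku: Waku,
  contentTopic: string,
  expected: string[]
): Promise<boolean> {
  const stored = await waku.store.queryHistory([contentTopic]);
  const payloads = stored.map((msg) => msg.payloadAsUtf8);
  return expected.every((text) => payloads.includes(text));
}
```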

@D4nte could we perhaps compile and deploy the waku.connect fleet with trace logging enabled and try to run a full scale test there?

@jm-clius
Contributor

In local tests, the websocket connections seem to time out 10 minutes after being established, despite the connections being in use and keep-alive (5 min pings) being enabled. It is unclear whether this affects the tests, which should run in under 10 minutes. Continuing the investigation.

@jm-clius
Contributor

jm-clius commented May 24, 2022

It seems the activity flag for websocket connections is not set as expected, which may at least partly explain this issue. @Menduist

This would only affect connections older than 10 minutes, so it is not clear whether this is what causes the issues here.
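
To make the suspected mechanism concrete, here is a conceptual sketch (in TypeScript, not nim-libp2p code) of how an idle reaper interacts with an activity flag that the websocket transport never sets; the interval length is illustrative only:

```ts
interface TrackedConnection {
  activity: boolean; // expected to be set on every read/write
  close(): void;
}

const trackedConnections: TrackedConnection[] = [];

// What the transport is expected to do whenever data flows on the connection.
function markActive(conn: TrackedConnection): void {
  conn.activity = true;
}

// Periodic sweep: close connections that showed no activity since the last
// sweep, then clear the flag for the next interval. If the websocket transport
// never calls markActive, even a busy connection gets reaped a few intervals
// after it was established - matching the ~10 min timeouts observed above.
function reapIdle(): void {
  for (const conn of trackedConnections) {
    if (!conn.activity) conn.close();
    conn.activity = false;
  }
}

setInterval(reapIdle, 5 * 60 * 1000); // illustrative 5 min sweep interval
```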

@Menduist
Contributor

Fix for the timeout is here: vacp2p/nim-libp2p#721

@D4nte
Contributor Author

D4nte commented May 26, 2022

@jm-clius you don't need a certificate for local tests; you can just use plain websockets (see the sketch below).

For the waku.connect fleet: feel free to do as you wish; if you need me to do something, please let me know.

> Similarly, using store queries I can determine that all messages are received by all fleet nodes for every test (at least the ones I checked manually) - it's only on the js-waku relay connections where messages are either not sent, or not sent in time.

Indeed, I agree that this only affects "receive via Relay". I am happy to icebox this and proceed with enabling DNS discovery and increasing js-waku's default number of connections before revisiting this issue.
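
To illustrate the point: a local nwaku node can listen on a plain (non-TLS) websocket and be dialled from js-waku with a /ws multiaddr rather than /wss, so no domain certificate is involved. The port, peer id and js-waku calls below are assumptions, as in the earlier sketches:

```ts
import { Waku } from "js-waku";

async function dialLocalNode(): Promise<Waku> {
  // Plain websocket multiaddr of a locally running nwaku node; no TLS, so no
  // certificate is required. The port and peer id are placeholders.
  const localWs = "/ip4/127.0.0.1/tcp/8000/ws/p2p/<local-peer-id>";

  const waku = await Waku.create();
  await waku.dial(localWs);
  return waku;
}
```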

@jm-clius
Contributor

jm-clius commented Jun 1, 2022

Logs reveal that this is indeed related to timing in the test: the fleet node starts receiving messages related to the test while it is still in the process of finalising the connection to the relay js-waku peers. As a result, the js-waku peer sometimes fails to receive the first test messages. Adding a short delay between establishing the connections and publishing the first test messages should fix the tests (see the sketch below).
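
The test-side fix therefore looks roughly like the following, continuing the earlier sketch (the waitForRemotePeer name and the 5 s value are assumptions; the actual change is the waku-tests PR referenced later in the thread):

```ts
import { Waku, WakuMessage } from "js-waku";

// After dialling the fleet node, wait until a relay peer is available, then
// wait a little longer before publishing, so the fleet node has finished
// setting up the relay connection for this peer.
const settle = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function publishFirstMessage(waku: Waku, msg: WakuMessage): Promise<void> {
  await waku.waitForRemotePeer(); // method name of that js-waku era is an assumption
  await settle(5000);             // hypothetical 5 s settling delay
  await waku.relay.send(msg);
}
```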

@jm-clius
Contributor

jm-clius commented Jun 7, 2022

@D4nte I think this issue can be closed as the failing tests were unrelated to websocket performance? Since the last websocket fixes, I'm not aware of any remaining websocket issues and reliability tests pass.

@jm-clius
Contributor

Fixed in v0.10 and tests pass, after waku-org/waku-tests#2
