Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test-net-connect-reset-until-connected #43446

Closed
lpinca opened this issue Jun 16, 2022 · 7 comments · Fixed by #46781
Closed

Investigate flaky test-net-connect-reset-until-connected #43446

lpinca opened this issue Jun 16, 2022 · 7 comments · Fixed by #46781
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@lpinca
Copy link
Member

lpinca commented Jun 16, 2022

Test

test-net-connect-reset-until-connected

Platform

smartos

Console output

not ok 2579 parallel/test-net-connect-reset-until-connected
  ---
  duration_ms: 120.177
  severity: fail
  exitcode: -15
  stack: |-
    timeout

Build links

Additional information

No response

@lpinca lpinca added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Jun 16, 2022
panva added a commit to panva/node that referenced this issue Jun 16, 2022
@panva
Copy link
Member

panva commented Jun 16, 2022

Seems to have gone on strike June 9th, only on smartos20-64

https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos20-64/buildTimeTrend

@panva
Copy link
Member

panva commented Jun 16, 2022

Unfortunately doesn't seem to be just test-net-connect-reset-until-connected, i'm observing a tendency to timeout test-report-fatal-error, and test-domain-no-error-handler-abort-on-uncaught-* too on smartos20-64

@lpinca
Copy link
Member Author

lpinca commented Jun 16, 2022

Yes, some other tests are also timing out. I think there is something wrong with that machine/system. @nodejs/build is it possible to restart it?

@panva
Copy link
Member

panva commented Jun 18, 2022

ping @nodejs/build this build continues to be a problem

@Trott
Copy link
Member

Trott commented Jun 18, 2022

I saw this test failing over and over in #43366 so rebased against the main branch to see if it fixed it, and it did (based on a single run anyway). So maybe try that if all else fails?

nodejs-github-bot pushed a commit that referenced this issue Jun 19, 2022
Refs: #43446

PR-URL: #43449
Refs: #43446
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
@PupilTong
Copy link
Contributor

PupilTong commented Jun 19, 2022

Not sure if there is something wrong in #43112 since the server side RST test test/parallel/test-net-server-reset.js works.

Recent days I'm waiting the ISP starts my services plan. I will investigate the root casuse once my internet service is back (Expected Jul 4.).

:D

@F3n67u
Copy link
Member

F3n67u commented Jun 28, 2022

This test is also flaky on freebsd, could I mark it as flaky on freebsd also?

Reason parallel/test-net-connect-reset-until-connected
Type JS_TEST_FAILURE
Failed PR 5 (#42929#43427#43544#43554#43561)
Appeared test-digitalocean-freebsd12-x64-2, test-digitalocean-freebsd12-x64-1
First CI https://ci.nodejs.org/job/node-test-pull-request/44833/
Last CI https://ci.nodejs.org/job/node-test-pull-request/44876/

not ok 2139 parallel/test-net-connect-reset-until-connected

duration_ms: 120.81
severity: fail
exitcode: -15
stack: |-
timeout
...

Refs: https://github.com/nodejs/reliability/issues/314

targos pushed a commit that referenced this issue Jul 12, 2022
Refs: #43446

PR-URL: #43449
Refs: #43446
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
targos pushed a commit that referenced this issue Jul 31, 2022
Refs: #43446

PR-URL: #43449
Refs: #43446
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
Refs: #43446

PR-URL: nodejs/node#43449
Refs: nodejs/node#43446
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
batrla pushed a commit to batrla/node that referenced this issue Feb 22, 2023
Fixes a race condition in test that causes the test to randomly timeout on Solaris 11.4, SmartOS and potentially also FreeBSD. The client resets the connection using conn.resetAndDestroy(). This call is asynchronous and if it's effect occurs before server's listening socket accepts the connection, the test hangs. The fix is to put a synchronization barrier that resets the connection only after it is established on both server and client side.

Below is a little bit more about the root cause. I show positive (test works) and negative (when test hangs) scenarios. The output contains only relevant system / library calls that were collected using truss(1). Without the fix the test randomly hangs. With the fix the test completes thousands of runs without single issue.

Race condition scenario:
```
connect(23, 0x7FFFBFFF7F10, 32, SOV_XPG4_2)Err#150 EINPROGRESS
^ client socket connects to server
close(23)= 0
^ client closes the socket too early
accept(21, 0x00000000, 0x00000000, SOV_DEFAULT)Err#130 ECONNABORTED
accept(21, 0x00000000, 0x00000000, SOV_DEFAULT)Err#11 EAGAIN
^ accept on listening socket fails
... test hangs and times out...
```

Working (good) scenario:
```
connect(23, 0x7FFFBFFF7F00, 32, SOV_XPG4_2)Err#150 EINPROGRESS
^ client socket connects to server
accept(21, 0x00000000, 0x00000000, SOV_DEFAULT)= 24
^ server accepts client connection on listening socket
close(23)= 0
^ client socket closes
read(24, 0x046B6010, 65536)Err#131 ECONNRESET
^ test gets so much wanted error while reading on accepted FD
close(24)= 0
^ accepted FD closes
... test completes, passes ...
```

Fixes: nodejs#43446
batrla pushed a commit to batrla/node that referenced this issue Feb 23, 2023
nodejs-github-bot pushed a commit that referenced this issue Feb 25, 2023
Fixes: #43446
PR-URL: #46781
Reviewed-By: theanarkh <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
targos pushed a commit that referenced this issue Mar 13, 2023
Fixes: #43446
PR-URL: #46781
Reviewed-By: theanarkh <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
targos pushed a commit that referenced this issue Mar 14, 2023
Fixes: #43446
PR-URL: #46781
Reviewed-By: theanarkh <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
BethGriggs pushed a commit that referenced this issue Mar 27, 2023
Fixes: #43446
PR-URL: #46781
Reviewed-By: theanarkh <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
danielleadams pushed a commit that referenced this issue Apr 11, 2023
Fixes: #43446
PR-URL: #46781
Reviewed-By: theanarkh <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants