Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed test: test-child-process-fork-regr-gh-2847 #3635

Closed
jbergstroem opened this issue Nov 2, 2015 · 18 comments
Closed

Failed test: test-child-process-fork-regr-gh-2847 #3635

jbergstroem opened this issue Nov 2, 2015 · 18 comments
Labels
arm Issues and PRs related to the ARM platform. child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.

Comments

@jbergstroem
Copy link
Member

Looks like this test is starting to fail in our windows environment: fail 1, fail 2

not ok 11 test-child-process-fork-regr-gh-2847.js
# events.js:141
# throw er; // Unhandled 'error' event
# ^
# 
# Error: read ECONNRESET
# at exports._errnoException (util.js:860:11)
# at TCP.onread (net.js:544:26)
@jbergstroem jbergstroem added the windows Issues and PRs related to the Windows platform. label Nov 2, 2015
@mscdex mscdex added the child_process Issues and PRs related to the child_process subsystem. label Nov 3, 2015
@Fishrock123
Copy link
Contributor

Also failed in #3631 (comment), I'm not sure if that is one of those two runs.

The test was added in 36b969f by @indutny and then modified in e9e837c by @mhdawson.

@mhdawson
Copy link
Member

@gireeshpunathil can you take a look at the windows failures since you worked on the the most recent change.

@gireeshpunathil
Copy link
Member

@mhdawson , sure, I will debug to see what the issue is.

@gireeshpunathil
Copy link
Member

When run in many iterations, I am able to see it failing with the original test case as well (36b969f). So this is not a regression.

There are three types of errors observed in total. With e9e837c (a) and (b) seem to have eliminated, but (c) persists.

(a)

events.js:141
      throw er; // Unhandled 'error' event
      ^
Error: connect ECONNREFUSED 127.0.0.1:12346
    at Object.exports._errnoException (util.js:915:11)
    at exports._exceptionWithHostPort (util.js:938:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1065:14)

(b)

events.js:141
      throw er; // Unhandled 'error' event
      ^
Error: write EMFILE
    at exports._errnoException (util.js:915:11)
    at ChildProcess.target._send (internal/child_process.js:606:18)
    at internal/child_process.js:469:16
    at Array.forEach (native)
    at ChildProcess.<anonymous> (internal/child_process.js:468:13)
    at emitTwo (events.js:92:20)
    at ChildProcess.emit (events.js:172:7)
    at handleMessage (internal/child_process.js:686:10)
    at Pipe.channel.onread (internal/child_process.js:440:11)

(c)

events.js:141
      throw er; // Unhandled 'error' event
Error: read ECONNRESET
    at exports._errnoException (util.js:915:11)
    at TCP.onread (net.js:544:26)

The summary of the test case is to validate that the worker process being closed have its _channel field nullified. To re-inforce this assertion, a number of requests are sent from the master. The problem with the test case is that it does not handle the failure scenarios which can occur when the server is shutdown at arbitrary time intervals:
i) When an nth connection request is issued, the (n-1)th request could have resulted in a worker closure, followed by server shutdown. This results in failure (a)
ii) When an nth send request is issued, the (n-1)th request could have resulted in a worker closure, followed by server shutdown. This results in failure (b)
iii) Even when (i) and (ii) are false, the server socket could be closed in between several TCP protocol message transfers which constitute the connect or send. This results in failure (c).

Tweaking test case further, I see that even sending just two messages to the worker is sufficient to cause (c). More interestingly, as the error type (c) is coming from the TCP stack, detached from the net connect, send APIs, their error callbacks are incapable of catching this error. This essentially means that we cannot reliably pass messages, or suppress the errors.

Any suggestions to improve the test case is welcome.

Trott added a commit to Trott/io.js that referenced this issue Nov 24, 2015
Trott added a commit to Trott/io.js that referenced this issue Nov 25, 2015
See nodejs#3635 for details and failure
examples.

Ref: nodejs#3635
PR-URL: nodejs#4005
Reviewed-By: Colin Ihrig <[email protected]>
Trott added a commit that referenced this issue Dec 1, 2015
See #3635 for details and failure
examples.

Ref: #3635
PR-URL: #4005
Reviewed-By: Colin Ihrig <[email protected]>
Trott added a commit that referenced this issue Dec 4, 2015
See #3635 for details and failure
examples.

Ref: #3635
PR-URL: #4005
Reviewed-By: Colin Ihrig <[email protected]>
Trott added a commit that referenced this issue Dec 5, 2015
See #3635 for details and failure
examples.

Ref: #3635
PR-URL: #4005
Reviewed-By: Colin Ihrig <[email protected]>
Trott added a commit that referenced this issue Dec 17, 2015
See #3635 for details and failure
examples.

Ref: #3635
PR-URL: #4005
Reviewed-By: Colin Ihrig <[email protected]>
Trott added a commit that referenced this issue Dec 23, 2015
See #3635 for details and failure
examples.

Ref: #3635
PR-URL: #4005
Reviewed-By: Colin Ihrig <[email protected]>
@evanlucas
Copy link
Contributor

@rvagg
Copy link
Member

rvagg commented Feb 4, 2016

@santigimeno
Copy link
Member

Could you paste the errors? Thanks

@Trott Trott added arm Issues and PRs related to the ARM platform. test Issues and PRs related to the tests. labels Feb 4, 2016
@Trott
Copy link
Member

Trott commented Feb 4, 2016

Oh, right, the CI is locked for now. It's just this:

not ok 7 test-child-process-fork-regr-gh-2847.js
# TIMEOUT
  ---
  duration_ms: 120.61

Trott added a commit to Trott/io.js that referenced this issue Feb 4, 2016
A few tests have started failing on Raspberry Pi devices in CI.
https://ci.nodejs.org/job/node-test-binary-arm/943/

Ref: nodejs#4830
Ref: nodejs#3635
Ref: nodejs#4526
@santigimeno
Copy link
Member

It reproduces more or less consistently in my OS X box. I'll try to look into it

@santigimeno
Copy link
Member

I've tried fixing this. See #5121

stefanmb pushed a commit to stefanmb/node that referenced this issue Feb 23, 2016
Windows is still sometimes failing with ECONNRESET. Bring back the
handling of this error as was initially introduced in PR nodejs#4442.

PR-URL: nodejs#5179
Reviewed-By: Rich Trott <[email protected]>
Fixes: nodejs#3635
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
A few tests have started failing on Raspberry Pi devices in CI.
https://ci.nodejs.org/job/node-test-binary-arm/943/

PR-URL: #5082
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Roman Klauke <[email protected]>
Ref: #4830
Ref: #3635
Ref: #4526
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
The test would sometimes time out on some platforms. Take the initial
version of the test and instead of sending 100 handles, just send 2.
It still fails when run with the code before the change was introduced
and passes afterwards.

Remove test from parallel.status.

PR-URL: #5121
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>
Fixes: #3635
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
Windows is still sometimes failing with ECONNRESET. Bring back the
handling of this error as was initially introduced in PR #4442.

PR-URL: #5179
Reviewed-By: Rich Trott <[email protected]>
Fixes: #3635
scovetta pushed a commit to scovetta/node that referenced this issue Apr 2, 2016
See nodejs#3635 for details and failure
examples.

Ref: nodejs#3635
PR-URL: nodejs#4005
Reviewed-By: Colin Ihrig <[email protected]>
scovetta pushed a commit to scovetta/node that referenced this issue Apr 2, 2016
A few tests have started failing on Raspberry Pi devices in CI.
https://ci.nodejs.org/job/node-test-binary-arm/943/

PR-URL: nodejs#5082
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Roman Klauke <[email protected]>
Ref: nodejs#4830
Ref: nodejs#3635
Ref: nodejs#4526
refack added a commit to refack/node that referenced this issue May 19, 2017
According to the explanation in nodejs#3635#issuecomment-157714683
And as a continuation to nodejs#5422 we also ignore EMFILE
"No more file descriptors are available,so no more files can be opened"

PR-URL: nodejs#12698
Fixes: nodejs#10286
Refs: nodejs#3635 (comment)
Refs: nodejs#5178
Refs: nodejs#5179
Refs: nodejs#4005
Refs: nodejs#5121
Refs: nodejs#5422
Refs: nodejs#12621 (comment)
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
MylesBorins pushed a commit that referenced this issue Jul 17, 2017
According to the explanation in #3635#issuecomment-157714683
And as a continuation to #5422 we also ignore EMFILE
"No more file descriptors are available,so no more files can be opened"

PR-URL: #12698
Fixes: #10286
Refs: #3635 (comment)
Refs: #5178
Refs: #5179
Refs: #4005
Refs: #5121
Refs: #5422
Refs: #12621 (comment)
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
gireeshpunathil added a commit that referenced this issue May 29, 2018
In test-child-process-fork-closed-channel-segfault.js, race condition
is observed between the server getting closed and the worker sending
a message. Accommodate the potential errors.

Earlier, the same race was observed between the client and server
and was addressed through ignoring the relevant errors through error
handler. The same mechanism is re-used for worker too.

The only difference is that the filter is applied at the callback
instead of at the worker's error listener.

Refs: #3635 (comment)
Fixes: #20836
PR-URL: #20973

Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Matheus Marchini <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Bartosz Sosnowski <[email protected]>
addaleax pushed a commit that referenced this issue May 31, 2018
In test-child-process-fork-closed-channel-segfault.js, race condition
is observed between the server getting closed and the worker sending
a message. Accommodate the potential errors.

Earlier, the same race was observed between the client and server
and was addressed through ignoring the relevant errors through error
handler. The same mechanism is re-used for worker too.

The only difference is that the filter is applied at the callback
instead of at the worker's error listener.

Refs: #3635 (comment)
Fixes: #20836
PR-URL: #20973

Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Matheus Marchini <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Bartosz Sosnowski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arm Issues and PRs related to the ARM platform. child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

No branches or pull requests

9 participants