
AIX: Intermittent failure on pseudo-tty/test-tty-wrap #9728

Closed
mhdawson opened this issue Nov 21, 2016 · 29 comments · Fixed by #28129
Labels
aix: Issues and PRs related to the AIX platform.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
tty: Issues and PRs related to the tty subsystem.

Comments

@mhdawson
Member

  • Version: master and 4.x
  • Platform: AIX
  • Subsystem: pseudo-tty

We've seen a few intermittent failures in the pseudo-tty/test-tty-wrap test today:

https://ci.nodejs.org/job/node-test-commit-aix/1998/nodes=aix61-ppc64/
https://ci.nodejs.org/job/node-test-commit-aix/1988/nodes=aix61-ppc64/

length differs.
expect=2
actual=0
patterns:
pattern = ^hello\ world\ 1$
pattern = ^hello\ world\ 2$
outlines:
not ok 1242 pseudo-tty/test-tty-wrap
  ---
  duration_ms: 0.150
  severity: fail
  stack: |-
  ...
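
For reference, the failure can be read as follows (an illustrative sketch, not the actual harness code): each expected line in the test's expected-output file is treated as a regular expression, and "length differs." means the number of output lines does not match the number of patterns; here the child produced no output at all.

# Illustrative sketch of the reported check (not the actual tools/test.py logic):
# each expected line is a regular expression and the line counts must match.
import re

patterns = [r'^hello\ world\ 1$', r'^hello\ world\ 2$']  # expected output lines
outlines = []  # what the failing run produced: no output at all

if len(outlines) != len(patterns):
    print('length differs.')
    print('expect=%d' % len(patterns))
    print('actual=%d' % len(outlines))
else:
    for pattern, line in zip(patterns, outlines):
        assert re.match(pattern, line), '%r does not match %r' % (line, pattern)
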
@mhdawson
Member Author

mhdawson commented Nov 21, 2016

@mhdawson changed the title from "AIX: Intermittent failure on" to "AIX: Intermittent failure on pseudo-tty/test-tty-wrap" Nov 21, 2016
@mscdex added the aix label Nov 21, 2016
@Trott
Member

Trott commented Nov 21, 2016

/cc @Fishrock123

@Trott added the tty label Nov 21, 2016
@Fishrock123
Contributor

I am very confused. What? How?

@Fishrock123
Contributor

Fishrock123 commented Nov 21, 2016

Oh I see, no output is produced? Oh dear.

I guess you've triggered the exit bug with load alone?? ¯\_(ツ)_/¯

@gibfahn
Member

gibfahn commented Nov 22, 2016

cc/ @nodejs/platform-aix @Akpotohwo

@mhdawson
Member Author

Passed 200 runs in stress build.

@Akpotohwo

Failed 24 times in 3100 runs on one of our AIX machines. It's the same failure as shown earlier.

@mhdawson
Member Author

Good to hear you can recreate it.

@mhdawson
Member Author

mhdawson commented Dec 8, 2016

Failed again in the build today. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/2433/

@Akpotohwo any update on the investigation?

@mhdawson
Member Author

@Trott
Member

Trott commented Dec 23, 2016

Again today:

https://ci.nodejs.org/job/node-test-commit-aix/2736/nodes=aix61-ppc64/console:

length differs.
expect=2
actual=0
patterns:
pattern = ^hello\ world\ 1$
pattern = ^hello\ world\ 2$
outlines:
not ok 1288 pseudo-tty/test-tty-wrap
  ---
  duration_ms: 0.158
  severity: fail
  stack: |-

@Trott
Member

Trott commented Dec 23, 2016

> I guess you've triggered the exit bug with load alone?? ¯\_(ツ)_/¯

@Fishrock123 (or anyone else that knows): Can you elaborate a bit on "the exit bug"? Is there a link or something?

@Trott
Member

Trott commented Dec 23, 2016

Should this be marked as flaky on AIX similar to no_dropped_stdio and no_interleaved_stdio? Is it likely failing for the same reasons?

@mhdawson
Member Author

mhdawson commented Jan 3, 2017

@Trott Given that we have seen it relatively regularly, I'll go ahead and mark it as flaky.

@mhdawson
Member Author

mhdawson commented Jan 4, 2017

@gibfahn I'll submit a PR to exclude it; can you follow up on where our investigation stands on this one?

mhdawson added a commit to mhdawson/io.js that referenced this issue Jan 4, 2017
We have had nodejs#9728
open for a while but the frequency of the failures
seems to be such that we should mark it as flaky
while we continue to investigate.
@mhdawson
Member Author

mhdawson commented Jan 4, 2017

@gibfahn
Member

gibfahn commented Jan 4, 2017

PR: #10618

mhdawson added a commit that referenced this issue Jan 5, 2017
We have had #9728
open for a while but the frequency of the failures
seems to be such that we should mark it as flaky
while we continue to investigate.

PR-URL: #10618
Reviewed-by: Colin Ihrig <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@gibfahn
Member

gibfahn commented Jan 16, 2017

@Akpotohwo if you're looking at pseudo-tty test failures on AIX it might be worth looking at #7973 as well.

italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 18, 2017
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 19, 2017
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 24, 2017
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 27, 2017
@gireeshpunathil
Member

gireeshpunathil commented Feb 16, 2017

I investigated further and drew these inferences:

There are two problems with the pseudo-tty tests:
i) child (node) data getting truncated at the parent (python),
ii) a hang in the parent.

This test (tty-wrap) mostly hits (i) and occasionally (ii).

Root cause for (i):

Passing behavior:

2687288:        16056741: kopen("/dev/pts/4", O_RDWR)           = 14
...
2687288:        16056741: kwrite(14, " h e l l o   w o r l d  ".., 14) = 14
4850562: 10945517: kread(3, " h e l l o   w o r l d  ".., 9999) = 15
2687288:        16056741: kwrite(14, " h e l l o   w o r l d  ".., 14) = 14
4850562: 10945517: kread(3, " h e l l o   w o r l d  ".., 9999) = 15

Failing behavior:

852692: kopen("/dev/pts/4", O_RDWR)                     = 14
...
852692: kwrite(14, " h e l l o   w o r l d  ".., 14)    = 14
852692: kwrite(14, " h e l l o   w o r l d  ".., 14)    = 14
852692: _exit(0)
4325988: 13435549: kread(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 9999) Err#5  EIO

By the time the parent gets to read, the child has terminated completely and its side of the fd is closed. The subsequent read by the parent results in EIO, equivalent to a broken pipe.
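
For context, here is a minimal sketch of the kind of pty read loop involved (illustrative only, not the actual tools/test.py code). EIO on the master is normally taken as end-of-output once the slave side is closed; on AIX the EIO can arrive before the buffered data has been drained, which matches the truncated output above.

# Illustrative pty drain loop (not the actual tools/test.py code). On Linux,
# buffered data remains readable after the child exits; on AIX the read can
# fail with EIO before the buffer is drained, so the output appears truncated.
import errno
import os

def drain_pty(master_fd):
    chunks = []
    while True:
        try:
            data = os.read(master_fd, 9999)
        except OSError as e:
            if e.errno == errno.EIO:  # slave side closed; treat as end of output
                break
            raise
        if not data:
            break
        chunks.append(data)
    return b''.join(chunks)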

The root cause of the hang has not been identified, but it can be recreated with a small python + C test case:

bash-4.3$ cat parent.py
#!/usr/bin/env python
import errno
import os
import pty
from subprocess import Popen, STDOUT

master_fd, slave_fd = pty.openpty()
proc = Popen(['./a.out'],stdout=slave_fd, close_fds=True)
os.close(slave_fd)
data = os.read(master_fd, 512)
print('got ' + repr(data))
bash-4.3$ cat child.cc
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main()
{
        printf("hello world\n");
        exit(0);
}
bash-4.3$ 

This issue has been raised with the python community (python bug 29545), but no one seems to have picked it up yet.

For (i): I see an issue with the test case. It assumes that the parent can read from the fd irrespective of the child's life span and exit status; that happens to always work on Linux. AIX uses different process scheduling policies than Linux and is affected by this. We have seen many issues in other tests where the failure is a function of the order in which the child and the parent execute soon after the fork.

A 100 ms delay after the child's writes and before exit consistently resolves the issue. This test and no_dropped_stdio.js (#7973) are affected by it.
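
To make the proposal concrete against the small repro above, here is a hypothetical Python stand-in for child.cc with the delay added. The real change would go into the affected Node test files; this is only an illustration of the idea.

# Hypothetical Python stand-in for the repro's child.cc, illustrating the
# proposed fix: give the parent roughly 100 ms of leeway to drain the pty
# before the child exits and its side of the fd is closed.
import sys
import time

sys.stdout.write('hello world\n')
sys.stdout.flush()
time.sleep(0.1)  # 100 ms delay before exit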

@mhdawson @Fishrock123 @gibfahn - please let me know what you think, if you agree I will come up with a PR to this effect.

@gibfahn
Member

gibfahn commented Feb 16, 2017

@gireeshpunathil if a 100ms delay fixes this consistently on AIX then that seems reasonable to me.

@mhdawson
Member Author

Is there a way to have the parent signal the child when it has completed its write so that we can avoid a timeout? For example, the parent could write 'done' to the child's stdin once it has read the data. Delays are always problematic, so I want to be sure we don't have any other way to achieve the same result.

@gireeshpunathil
Member

@mhdawson - while it is technically possible for the parent to acknowledge the data and to require that the child receive the ack before exiting, the python parent (testcfg.py) is also the main driver for the rest of the tests in test/pseudo-tty; furthermore, it delegates the child-spawning work to tools/test.py, which is much more generic (it drives the whole test suite). So I suspect changing the parent logic to send an ack could have side effects elsewhere.
/cc @Fishrock123

@gireeshpunathil
Member

So the approach of synchronizing between parent and child has issues:

  1. The python driver's child-spawning routine (RunProcess()) is not designed to write back to the child.
  2. On the other hand, none of the children (the node test cases) are designed to live indefinitely or to listen for events / data from the parent.
  3. If we re-design this so that they communicate, the tests will hang. Here is why:
  • The only way the parent finishes reading data from the child is when it encounters EIO (once the child exits) and breaks out of the tight read loop (or when the timeout expires).
  • There is no other way for the parent to know that the child has finished writing, because each child writes different data in a different sequence.
  • If we add a process.stdin.on('data') handler in the child in expectation of an acknowledgement from the parent, that acknowledgement will never arrive: the parent keeps attempting to read until it times out, because the read error never occurs (the child is still alive).
  4. We can send a 'special data' marker from the child to indicate that writing has finished, but given that this would require changing all the children (node test cases) and the issue occurs only on AIX, I would rather address the race condition in the failing tests alone, by adding a reasonable delay before exit. (We have done this in many other test cases where the AIX-vs-other-OS disparity in process scheduling was visible.)

possibilities:

  1. Modify the python parent to read a 'special data' marker that terminates the read loop, then acknowledge the data reception back to the child. Modify every child program to append the 'special data' marker at the end of the test and then set up a read loop to receive the ACK before exiting (a rough sketch of the parent side follows possibility 2 below).

  2. Modify affected test cases to provide additional delay, supplying necessary leeway for the parent to read the data before exiting.
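
For concreteness, here is a minimal, hypothetical sketch of what the parent-side half of possibility 1 could look like. The sentinel and ACK markers are invented for illustration, it assumes the child's stdin is also attached to the pty slave, and it is not proposed as the actual tools/test.py change.

# Hypothetical sketch of possibility 1, parent side (not actual tools/test.py code).
# Assumes the child's stdin is attached to the pty slave so that data written to
# the master reaches it; the marker values are invented for illustration.
import os

SENTINEL = b'__TEST_DONE__\n'   # hypothetical end-of-output marker written by the child
ACK = b'__ACK__\n'              # hypothetical acknowledgement written by the parent

def read_child_output(master_fd):
    out = b''
    while SENTINEL not in out:          # read until the child says it has finished
        chunk = os.read(master_fd, 9999)
        if not chunk:                   # EOF without sentinel: child exited early
            break
        out += chunk
    os.write(master_fd, ACK)            # unblock the child so it can exit
    return out.replace(SENTINEL, b'')

The child-side change would be symmetric (write the sentinel, then wait for the ACK before exiting), which is exactly the churn across every test that makes possibility 2 attractive.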

I am fine either way; suggestions are much appreciated.
/cc @mhdawson @Fishrock123 @gibfahn

@gibfahn
Member

gibfahn commented Mar 2, 2017

I'm still +1 on adding a delay, it seems like a vastly simpler option.

MylesBorins pushed a commit that referenced this issue Mar 7, 2017
MylesBorins pushed a commit that referenced this issue Mar 9, 2017
@gibfahn
Member

gibfahn commented Mar 30, 2017

@gireeshpunathil so to clarify, would adding a delay fix the issue for pseudo-tty/test-tty-wrap, or do we need to get the underlying Python bug fixed first?

@gireeshpunathil
Member

@gibfahn - as clarified here, this issue is root-caused by the python bug in question and cannot be circumvented with a time delay. The python parent gets stuck in the system on an fd close call, with no way to recover.

@Trott
Member

Trott commented Jul 31, 2017

This hasn't recurred in some time. Feel free to re-open (or leave a comment requesting that it be re-opened) if you disagree, but I'm inclined to close at this time. I'm just tidying up and not acting on a super-strong opinion or anything like that.

@sam-github
Contributor

This is still referenced in

# being investigated under https://github.com/nodejs/node/issues/9728

If we close it, shouldn't we also remove the reference to it? Or perhaps change the ref to nodejs/build#1820?

Or perhaps I misunderstand the meaning of test-tty-wrap : FAIL, PASS? That seems to imply the test is considered successful whether it passes or fails, which covers all the options and is not so useful.

@sam-github added the flaky-test label Jun 7, 2019
@Trott
Member

Trott commented Jun 8, 2019

@sam-github Yes, you are correct: That entry should be removed. #28129

Trott added a commit to Trott/io.js that referenced this issue Jun 10, 2019
The test is believed to no longer be unreliable on AIX. Remove the flaky
designation from the appropriate status file.

Closes: nodejs#9728

PR-URL: nodejs#28129
Fixes: nodejs#9728
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Trivikram Kamat <[email protected]>
Reviewed-By: James M Snell <[email protected]>
BridgeAR pushed a commit that referenced this issue Jun 17, 2019