Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing process.stdout terminates SSH connections #13278

Closed
patrickjane opened this issue May 29, 2017 · 17 comments
Closed

Accessing process.stdout terminates SSH connections #13278

patrickjane opened this issue May 29, 2017 · 17 comments
Labels
question Issues that look for answers.

Comments

@patrickjane
Copy link

patrickjane commented May 29, 2017

  • Version: v7.9.0
  • Platform: Linux dev-obelisk-1 2.6.18-274.el5 deps: update openssl to 1.0.1j #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux (RHEL 5.7)
  • Subsystem: Process

If this is of any importance, the local machine where the SSH connection was opened from is running MacOS 10.12.4. The info above applies to the remote machine where I connect to via SSH.

I am trying to write a library which serves to execute remote commads via SSH. In my first example I was using a Node C++ wrapper around libssh, but after researching this issue for a long time I am down to a very simple node.js sample program which demonstrates the error.

Given the following test program:

const spawn = require('child_process').spawn;
const bat = spawn('ssh', ['user@host']);

setTimeout(() => {
    bat.stdin.write('node test.js\n');

    setTimeout(() => bat.stdin.write('pwd -P\n'), 2000);
}, 2000);

With the following contents of test.js on the remote host:

if (process.stdout) ;  // only this line

The program will disconnect the SSH session after the node test.js command is issued, resulting in the scheduled pwd -P to fail and throw an exception.
If you change the contents of test.js to just if (process) ; (or anything else not "touching" process.stdout/process.stderr), the test program will run properly and execute all remote commands without problems.

I have tried several commands, and none of the other commands I used (cp, cd, mkdir, ls, git, ssh, scp, ...) failed in this way, only running node.js scripts which use process.stdout/stderr seem to fail.
This is pretty unfortunate, because for my actual implementation (automatic build/deploy system) I must execute npm install, which however breaks the whole application because of the issue described above.

I have researched this somewhat in detail here: https://stackoverflow.com/questions/43972863/ssh-npm-seems-to-break-ssh-non-pty-shells and also here: https://stackoverflow.com/questions/43784514/channel-in-libssh-gets-closed-for-no-obvious-reason?noredirect=1#comment74611459_43784514 (this one includes a working minimal C++ program using libssh and also the SSH protocol level packet/debug output which might help in investigating the issue.

For me, as a non-SSH expert, it looks like that Node.js is trying to use a TTY, while the SSH connection it is running on doesn't have a TTY. Therefore the SSH server terminated the connection. So please note, the error happens only in this non-tty enviroment. Of course, if you open your terminal application and SSH to some machine, executing the above node script will work flawlessly.

@gireeshpunathil
Copy link
Member

Unable to reproduce with the steps you provided, on similar OS version and same node version.

#node ssh.js 
Pseudo-terminal will not be allocated because stdin is not a terminal.

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

node test.js

pwd
/home/gireesh

whoami
gireesh

Can you try the following code:

const spawn = require('child_process').spawn;
const bat = spawn('ssh', ['user@host']);

bat.stdout.on('data', function(d) {
  console.log(d.toString());
});
bat.stderr.on('data', function(d) {
  console.log(d.toString());
});

bat.on('error', function(error) {
  console.log('child err\'d with : ' + error);
});
bat.on('close', function(code, signal) {
  console.log('child exited with code: ' + code + ' and signal: ' + signal);
});

process.stdin.on('readable', () => {
  var chunk = process.stdin.read();
  if (chunk !== null) {
    bat.stdin.write(chunk);
  }
});

This way, you can provide commands repeatedly and interactively.

Also, in your failing scenario, pass --trace option to node to see how and why it is failing. (you need to capture the stdout of the child)

@patrickjane
Copy link
Author

On which remote linux are you testing? I was able to check the following versions of redhat enterprise linux:

  • RHEL 5.7 - breaks
  • RHEL 6.4 - works
  • RHEL 7.1 - works

Output with your example code (command node test.js(+enter) followed by ls(+enter)):

macuser@maclocalhost> node ssh.js
Pseudo-terminal will not be allocated because stdin is not a terminal.

node test.js
child exited with code: 0 and signal: null
ls
events.js:163
      throw er; // Unhandled 'error' event
      ^

Error: This socket has been ended by the other party
    at Socket.writeAfterFIN [as write] (net.js:305:12)
    at ReadStream.process.stdin.on (/Users/fial/Development/node-remoteshell/test.js:37:15)
    at emitNone (events.js:86:13)
    at ReadStream.emit (events.js:188:7)
    at emitReadable_ (_stream_readable.js:434:10)
    at emitReadable (_stream_readable.js:428:7)
    at readableAddChunk (_stream_readable.js:189:13)
    at ReadStream.Readable.push (_stream_readable.js:136:10)
    at TTY.onread (net.js:560:20)

I fail to post the trace output for it is over 65536 characters, and attaching it as a file shows "Something went really wrong, and we can't process that file". So I uploaded it here:
http://filehorst.de/d/bCjDbncp

@bnoordhuis
Copy link
Member

I expect that is a bug in ssh, not node. RHEL 5 ships openssh 4.3 from 2006. It has a huge list of known issues.

@gireeshpunathil
Copy link
Member

Mine is RHEL 6.6
Unfortunately, I don't have a RHEL 5.* to test and validate this.

The trace file is good, and proves that the child node exited gracefully.

So as @bnoordhuis pointed out, this can be a bug in the SSH itself. Probably

const bat = spawn('strace', ['-f', 'ssh', 'user@host']);

can confirm this?

@patrickjane
Copy link
Author

It will give me this, although I fail to interpret it correctly. The only thing which got my attention was
ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fffc6f2d5f0) = -1 ENOTTY (Inappropriate ioctl for device)

Full output:

node test.js
)        = 1 (in [4])

clock_gettime(0x7 /* CLOCK_??? */, {12522267, 263832795}) = 0
read(4, "node test.js\n", 16384)        = 13
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 263892138}) = 0

clock_gettime(0x7 /* CLOCK_??? */, {12522267, 263909672}) = 0
select(7, [3 4], [3], NULL, NULL)       = 1 (out [3])
clock_gettime(0x7 /* CLOCK_??? */,
{12522267, 263950670}) = 0
write(3, "\\\225M\235U\223\217\350\237)\\\302\343*\3301\372\356\301\270\322\314a\7=g|0xT\203\322"..., 48
) = 48
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 264042568}) = 0
clock_gettime(0x7 /* CLOCK_??? */,
{12522267, 264060244}) = 0
select(7, [3 4], [], NULL, NULL
)        = 1 (in [3])
clock_gettime(0x7 /* CLOCK_??? */,
{12522267, 323629665}) = 0
read(3, "\2746\303\250\232\10\354i\312_\217 \327\27\346\4\333\227\264>\334\361L\215\2t\374\246\277\330\261{", 8192) = 32
close(5)                                = 0
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 323735170}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 323752761}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 323793825}) = 0
read(3,
"\302\217}KGCF\241\307H\37\365u<\252\326'ms\352\207\226\205\373\27\307\231\261\206\277'\351"..., 8192) = 96
close(4)                                = 0
clock_gettime(0x7 /* CLOCK_??? */, {12522267, 323855905}) = 0

clock_gettime(0x7 /* CLOCK_??? */, {12522267, 323872201}) = 0
brk(0)                                  = 0x2b1b1007e000
close(6
)                                = 0
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
clock_gettime(0x7 /* CLOCK_??? */,
{12522267, 323958717}) = 0
write(3, "q55\266H\265\343k\325mr\344\16\250\23t\372\346V-0\316\314\311\224k\177r*\263_\26", 32) = 32
rt_sigaction(SIGWINCH, NULL,
{0x2b1b0e73f3e0, [], SA_RESTORER, 0x2b1b10e87650}, 8) = 0
rt_sigaction(SIGWINCH, {SIG_DFL, [], SA_RESTORER, 0x2b1b10e87650}, NULL, 8) = 0
write(3, "\f\303\236\336\206\323\223Y\1\177\320ml\363\277\vD\325\202!\242Y\326\312\326\350\325\305[\221&\232"..., 64
) = 64
ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fffc6f2d5f0) = -1 ENOTTY (Inappropriate ioctl for device)

fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(0, F_SETFL, O_RDWR)               = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS
, 0x7fffc6f2d5f0) = -1 ENOTTY (Inappropriate ioctl for device)
fcntl(1, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(1, F_SETFL, O_RDWR
)               = 0
ioctl(2, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fffc6f2d5f0) = -1 ENOTTY (Inappropriate ioctl for device)
fcntl(2, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(2, F_SETFL, O_RDWR
)               = 0
brk(0)                                  = 0x2b1b1007e000
shutdown(3, SHUT_RDWR
)                  = 0
close(3
)                                = 0

exit_group(0)                           = ?

+++ exited with 0 +++

child exited with code: 0 and signal: null

@gireeshpunathil
Copy link
Member

Nothing obvious found other than the ones which you pointed out. Will give it another try later.

@mscdex mscdex added the question Issues that look for answers. label May 29, 2017
@gibfahn
Copy link
Member

gibfahn commented May 29, 2017

I'm going to close this as it sounds like it's a RHEL 5.7 problem, and we don't support 5.7 (or even 5.11, which is I think the only RHEL 5 version that still has extended support from Red Hat). Node 8 won't run at all on RHEL 5.x, and it's quite possible that Node 6 will cease to work on it at some point.

I assume upgrading to RHEL 7.3 isn't an option for you, so feel free to continue updating this thread.

I see you also raised npm/npm#16608.

@gibfahn gibfahn closed this as completed May 29, 2017
@patrickjane
Copy link
Author

Yes, unfortunately I am bound to use RHEL 5.7 by customer requirement, so problably there is nothing that I can do.

@gireeshpunathil
Copy link
Member

Here is the sequence of SSH protocol as understood by me, based on a modern SSH

select(7, [3 4], [], NULL, NULL
// wait for user input

uname
// user issues the command

)        = 1 (in [4])
// one fd ready for read - fd 4, alias for stdin.

read(4, "uname\n", 16384)               = 6
//read the command

// want to pass the command to the remote
select(7, [3 4], [3], NULL, NULL)       = 1 (out [3])
// fd 3 ready to write

write(3, "\273\26L\300\201=\n\36C\336\17\205\224\275\206:v\20\255M4\375\232\337f\3 \177\220\200\33S"..., 48) = 48
// write to the remote

// wait for response
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
// fd 3 redy with response

read(3, "\16\253\4\0\253\272\275w44\216\373-\30[\326\355\16\306\226\262\361U\213\244\353[\252\247\261\372\226"..., 8192) = 48
// read the response

// write back the user
select(7, [3 4], [5], NULL, NULL)       = 1 (out [5])
write(5, "Linux\n", 6)                  = 6

Linux

// wait for next input
select(7, [3 4], [], NULL, NULL

The failing sequence which you showed above not only falling into an identifiable pattern, but also shows bad symptoms (close and ioctl calls) without evidence captured in the trace - probably decisions taken at the app (ssh) level not visible in the system calls.

Can you reverse the direction (linux as source and mac as target host) and check?

@patrickjane
Copy link
Author

This is RHEL 5.7 -> RHEL 6.4:
(and the connection works without disconnect)

select(7, [3 4], [], NULL, NULL
~/node/bin/node test.js
)        = 1 (in [4])
read(4, "~/node/bin/node test.js\n", 16384) = 24

select(7, [3 4], [3], NULL, NULL)       = 1 (out [3])

write(3, "{\270\270#0!y'\10l\342\303\347\376#\250\326\210[\345\354jI\2633\276\206\221\\N\340\263"..., 64
) = 64
select(7, [3 4], [], NULL, NULL

And this is RHEL 5.7 -> MacOS
(also works without problems)

select(7, [3 4], [], NULL, NULL
/usr/local/bin/node test.js
)        = 1 (in [4])
read(4, "/usr/local/bin/node test.js\n", 16384) = 28

select(7, [3 4], [3], NULL, NULL
)       = 1 (out [3])

write(3, "\366\365\264]\272?\240\274\262\336m\357\23\257\244\254\325\4\252\265\261\"2\350\337\232\231\30\231\23\377\16"..., 68
) = 68
select(7, [3 4], [], NULL, NULL

Also looks quite understandable to me, unlike the RHEL 5.7 output I posted earlier.

@gireeshpunathil
Copy link
Member

Looks like the old ssh closed its stream(s) in an unknown manner is how the problem has surfaced. For example, if I add a line to force close the stdin of the ssh child, I get the same sequence as that of yours:

setTimeout(() => {
    bat.stdin.write('node test.js\n');
    bat.stdin.end();
    setTimeout(() => bat.stdin.write('pwd -P\n'), 2000);
}, 2000);
read(4, "/home/gireesh/linux/ssh/node-v7."..., 16384) = 63
select(7, [3 4], [3], NULL, NULL)       = 1 (out [3])
write(3, "\231$\372\33p\375w'X8m\312\317W\202\230>oxc/\277\254\230\6\204\250\341\2175&*"..., 112) = 112
select(7, [3 4], [], NULL, NULL)        = 1 (in [4])
read(4, "", 16384)                      = 0
close(4)                                = 0
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
write(3, "x\2339\353b\31O\211C\370\347\35\210p\207&\353\230\3]a,D\6\207Mqk[(\177\"", 32) = 32
select(7, [3], [], NULL, NULL
)          = 1 (in [3])

read(3, "\367\212\3249/(L\216\242Q\367\216eH\216\273\342\314\357\375\236\3678<-\35\355\253*\2711H"..., 8192) = 128
close(5)                                = 0
close(6)                                = 0
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
write(3, "\326L\356\312\210\224\343\327\266\30\213N\264\34\332\v\304\16Gn\201\266o\371\224\223RqV\276\355K", 32) = 32
rt_sigaction(SIGWINCH, NULL, {0x7fc733743400, [], SA_RESTORER, 0x7fc73128f6a0}, 8) = 0
rt_sigaction(SIGWINCH, {SIG_DFL, [], SA_RESTORER, 0x7fc73128f6a0}, NULL, 8) = 0

write(3, "{vE\301=\330\327\204cWu\3501\v\334\23\200\341\257bT\252\27\0371z'\342\10\270\225\26"..., 64
) = 64

ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffe32ce820) = -1 EINVAL (Invalid argument)
fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)

fcntl(0, F_SETFL, O_RDWR)               = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS
, 0x7fffe32ce820) = -1 EINVAL (Invalid argument)
fcntl(1, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(1, F_SETFL, O_RDWR)               = 0

ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffe32ce820) = -1 EINVAL (Invalid argument)
fcntl(2, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(2, F_SETFL, O_RDWR)               = 0

shutdown(3, 2 /* send and receive */)   = 0
close(3
)                                = 0

exit_group(0)                           = ?

As a last experiment, can you trace the parent, much like you did it for the child? Just to make sure that node is not doing anything unexpected?

@patrickjane
Copy link
Author

I don't think you got the same output. There is some stuff missing, for example brk(0) but more importantly the ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS , 0x7fffc6f2d5f0) = -1 ENOTTY (Inappropriate ioctl for device) is missing.

I am by no means an expert in neither SSH nor all the system calls, but my assumption was from the beginning (back when I thought the issue is caused by libssh and I opened a ticket also on their bugtracker) that node is doing something which wants a TTY, but the SSH connection is not running on a TTY. And maybe the old SSH version has some kind of bad handling about this, in that it just closes the connection, while the newer versions won't close the connection.

At least I find it especially interesting, that all the other remote commands (which of course also heavily use stdout/stderr) just work fine, for example git.

When tracing the parent node, I would get this:

epoll_wait(5, {}, 1024, 0)              = 0
epoll_wait(5, node test.js
{{EPOLLIN, {u32=11, u64=11}}}, 1024, -1) = 1
read(11, "node test.js\n", 65536)       = 13
write(10, "node test.js\n", 13)         = 13
epoll_ctl(5, EPOLL_CTL_MOD, 10, {EPOLLIN, {u32=10, u64=10}}) = 0
epoll_wait(5, {}, 1024, 0)              = 0
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=10, u64=10}}, {EPOLLIN|EPOLLHUP, {u32=12, u64=12}}, {EPOLLIN|EPOLLHUP, {u32=14, u64=14}}}, 1024, -1) = 3
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=17185, si_status=0, si_utime=1, si_stime=0} ---
read(3, "*", 1)                         = 1
write(7, "\250_\274\1\0\0\0\0\21\0\0\0\0\0\0\0", 16) = 16
write(4, "*", 1)                        = 1
rt_sigreturn()                          = 3
read(10, "", 65536)                     = 0
epoll_ctl(5, EPOLL_CTL_DEL, 10, {0, {u32=0, u64=0}}) = 0
close(10)                               = 0
read(12, "", 65536)                     = 0
epoll_ctl(5, EPOLL_CTL_DEL, 12, {0, {u32=0, u64=0}}) = 0
close(12)                               = 0
read(14, "", 65536)                     = 0
epoll_ctl(5, EPOLL_CTL_DEL, 14, {0, {u32=0, u64=0}}) = 0
close(14)                               = 0
epoll_wait(5, {{EPOLLIN, {u32=6, u64=6}}}, 1024, -1) = 1
read(6, "\250_\274\1\0\0\0\0\21\0\0\0\0\0\0\0", 512) = 16
wait4(17185, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 17185
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
read(3, "*", 1)                         = 1
rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER, 0x2b73b76af130}, NULL, 8) = 0
write(4, "*", 1)                        = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(13, "child exited with code: 0 and si"..., 43child exited with code: 0 and signal: null
) = 43
epoll_wait(5, {}, 1024, 0)              = 0
epoll_wait(5,

@gireeshpunathil
Copy link
Member

sorry for being vague - I meant --trace for the parent: to make sure that we are not inadvertently closing the child (ssh)'s stream. In the above strace I can see that the parent is writing into fd 10 and subsequently closing it, but data from trace would provide real evidence.

@patrickjane
Copy link
Author

Ah okay. This should be what I already posted in my first answer, now?

Anyway heres the output again, starting from when I typed the remote command:
http://filehorst.de/d/bDFfGgiu

@gireeshpunathil
Copy link
Member

no, I meant the parent (the script which invokes ssh). the last trace was for the child (test.js) right?

@patrickjane
Copy link
Author

patrickjane commented May 30, 2017

Nope, Its node --trace ssh.js (and ssh.js will in turn SSH to the remote and invoke node test.js). But then yeah, my first response contained the trace of the child, youre right.

@gireeshpunathil
Copy link
Member

thanks for the clarification, will have a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Issues that look for answers.
Projects
None yet
Development

No branches or pull requests

5 participants