Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong output when piping to less #2713

Closed
asterite opened this issue Jun 1, 2016 · 78 comments · Fixed by #5017
Closed

Wrong output when piping to less #2713

asterite opened this issue Jun 1, 2016 · 78 comments · Fixed by #5017
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib

Comments

@asterite
Copy link
Member

asterite commented Jun 1, 2016

Extracted from Google Groups

A consistent way to reproduce it:

# foo.cr
string = <<-EOF
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
EOF
puts string

Then:

$ crystal build foo.cr
$ ./foo | less

The issue seems to be related to using fcntl with stdin/stdout/stderr in a shell. I found this but I can't seem to understand what's going on in this case.

@asterite asterite added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib labels Jun 1, 2016
@miketheman
Copy link
Contributor

#triage
This is an interesting one. I am definitely able to repro off current master 9cf6e06 .

Redirecting stderr to stdout produces the expected result:

$ ./foo 2>&1 | less

(also ./foo |& less - |& was added to Bash 4 as an abbreviation for 2>&1 )

So is there some part of the output that is getting sent to stderr instead of stdout?

@miketheman
Copy link
Contributor

I dug into this further, and using tee, and other character-counters, I can determine that the amount of data transmitted before the "breakage" is 1012 characters on 12 lines = 1024, which is a very suspicious number.

Using the unbuffer command from the expect package, I can get the full output, since this uses an intermediate pty. As mentioned above,

I haven't been able to find the reasoning for 1024 being the buffer limit on OSX, other than some references being based on fcntl kernel default settings.
http://lxr.free-electrons.com/source/include/uapi/asm-generic/fcntl.h#L188
http://lxr.free-electrons.com/ident?i=F_SETPIPE_SZ
http://man7.org/linux/man-pages/man2/fcntl.2.html

I got lost in digging through how it all ties together...

@miketheman
Copy link
Contributor

@chaniks
Copy link

chaniks commented Oct 2, 2016

@asterite I am sorry. I was kinda unable to reproduce this problem and thought they were just related. (Still not sure if I understand it right.)

Does this make any change?

In src/io/file_descriptor.cr:

# add in IO::FileDescriptor
def self.blocking(fd)
  fcntl(fd, LibC::F_GETFL) & LibC::O_NONBLOCK == 0
end

In src/kernel.cr:

STDIN  = IO::FileDescriptor.new(0, blocking: IO::FileDescriptor.blocking(0))
STDOUT = (IO::FileDescriptor.new(1, blocking: IO::FileDescriptor.blocking(1))).tap { |f| f.flush_on_newline = true }
STDERR = (IO::FileDescriptor.new(2, blocking: IO::FileDescriptor.blocking(2))).tap { |f| f.flush_on_newline = true }
## comment out following lines
#STDIN  = IO::FileDescriptor.new(0, blocking: LibC.isatty(0) == 0)
#STDOUT = (IO::FileDescriptor.new(1, blocking: LibC.isatty(1) == 0)).tap { |f| f.flush_on_newline = true }
#STDERR = (IO::FileDescriptor.new(2, blocking: LibC.isatty(2) == 0)).tap { |f| f.flush_on_newline = true }

Ahh, now I see what is bugging you guys.
I'll dig a little deeper.

@asterite
Copy link
Member Author

asterite commented Oct 2, 2016

The main issue is that we want to have non-blocking IO for stdin/stdout/stderr. For example this should work well:

spawn do
  loop do
    puts 1
    sleep 1
  end
end

spawn do
  loop do
    while line = gets
      puts line
    end
  end
end

sleep

So gets blocks the fiber and the print loop fiber can continue to run. Right now we are using fcntl on stdin/stdout/stderr, but on consoles they seem to be linked to each other, so we set some flags on one descriptor and those flags are also applied to other descriptors, and it causes unwanted behaviour.

@asterite
Copy link
Member Author

asterite commented Oct 2, 2016

Some more links with others running into this problem:

https://twistedmatrix.com/trac/ticket/3442
http://cr.yp.to/unix/nonblock.html

It seems that O_NONBLOCK is a property of file descriptions, not of file descriptors. This means that it's incorrect for any code to ever set it on a file descriptor that it did not just freshly create with open(), socket(), etc.

So setting O_NONBLOCK on stdin/stdout/stderr is wrong.

@drosehn
Copy link

drosehn commented Feb 8, 2017

It seems that all crystal programs still modify the behavior of STDIO. Even if I compile an empty crystal program (one which is nothing but comments), if I type in mail after running the program, mail acts as if it immediately received an EOF. If I add the line:

STDIN.blocking = true

as mentioned in #3674, then the problem does not happen. @asterite mentioned "So setting O_NONBLOCK on stdin/stdout/stderr is wrong". Does that mean that crystal sets blocking on all three of the standard descriptors?

@ysbaddaden
Copy link
Contributor

Yes. It shouldn't, otherwise accessible standard descriptors would block the event loop.

We must rework IO:: FileDescriptor to handle cases where O_NONBLOCK isn't set on the FD, and rely on timers and signals to avoid long pauses. It's not trivial to fix.

@drosehn
Copy link

drosehn commented Apr 6, 2017

A little add-on trivia: If you're trying to write a single source-file which works in both Ruby and crystal without any changes, and if this issue causes some grief for you, here's a slight change to the command. This should work-around the issue in crystal without causing any errors in ruby:

  STDIN.blocking = true if STDIN.class != IO

(or at least it worked for me, using Crystal 0.21.1 and macOS)

@faustinoaq
Copy link
Contributor

Some progress on this issue?

Currently I'm using some hacks shown in above comments.

@faustinoaq
Copy link
Contributor

@asterite Do you think we can fix this before next release ? 😅

monito

Are some other issues blocking this in particular ?

@asterite
Copy link
Member Author

@faustinoaq To implement this we need to implement STDIN, STDOUT and STDERR as blocking, but when they block, fire a Thread and wait there. So that kinda involves parallelism. And I'm not in charge of that, so I don't think this will be fixed soon.

@bew
Copy link
Contributor

bew commented Sep 22, 2017

Isn't it possible to reset blocking state of the fd when the program exits?

@RX14
Copy link
Contributor

RX14 commented Sep 22, 2017

Why not just make writing to stdout blocking? It's a simple fix, which has some other benefits. Such as using puts/p/pp doesn't change the execution of the program. If for some reason using blocking stdout is a major performance problem, they can get the current behaviour with 1 line of code. I don't see too many cases where it'd be a problem through.

@asterite
Copy link
Member Author

@RX14 What about this?

spawn do
  loop do
    puts 1
    sleep 1
  end
end

spawn do
  loop do
    while line = gets
      puts line
    end
  end
end

sleep

@RX14
Copy link
Contributor

RX14 commented Sep 22, 2017

What about if it was a busy loop instead of puts? What if stdout and stdin are extremely fast C applications which always empty/fill the buffer faster than we can fill/empty it? What if they're files and never block? Until we have preemptive scheduling we can't solve all these problems, so I suggest we don't try to.

@RX14
Copy link
Contributor

RX14 commented Mar 10, 2018

until we double fault, or just kill -9 will do it

@chocolateboy
Copy link
Contributor

For a segfault we have a segfault handler to print a message about the segfault (and then exit)

What do you have to handle a SIGKILL? :-)

@bew
Copy link
Contributor

bew commented Mar 10, 2018

What do you have to handle a SIGKILL? :-)

Not much, but if you SIGKILL your program, it's difficult to expect the programs interacting with it to not have problems ^^

And if you need to SIGKILL your program, I bet you have other problems more important than the blocking state of the outputs!

@RX14
Copy link
Contributor

RX14 commented Mar 10, 2018

lol I have to sigkill something or another almost every week, it's not rare in any sense. For example, if you get your crystal program stuck into a tight loop, SIGINT etc will never work since the signal handles are processed in the event loop which never gets reached since you're in a tight loop.

@bew
Copy link
Contributor

bew commented Mar 10, 2018

Ok now we need a scheduler that handle priority or sth like that, to make sure the signal handler's fiber is checked once in a while.. 😛

@RX14
Copy link
Contributor

RX14 commented Mar 10, 2018

The problem is not scheduling the problem is preemption.

@asterite
Copy link
Member Author

@martinos Ruby stdin is blocking. If you do gets, it will block the entire thread. I think printing to stdout and stderr is blocking too in Ruby.

@martinos
Copy link

martinos commented Mar 10, 2018

@asterite stdout is non-blocking by default, data is being buffered and flushed after a while. I have been bitten with that multiple times before. http://ruby-doc.org/core-2.3.1/IO.html#method-i-sync even when there was no system threading support.

$ ruby -e "puts STDOUT.sync" | cat
false

I don't know much about system programming, sorry if I say stuff that don't make sense.

@RX14
Copy link
Contributor

RX14 commented Mar 11, 2018

@martinos buffering is totally different to nonblocking, let me explain:

Buffering is a solution to the problem of many little writes. Calling into the kernel to perform a write(2) to stdout/stderr/the network/a file is actually rather expensive, so you store up a chunk or so of writes in memory and then submit that whole chunk to the kernel at once. This reduces number of write calls massively and increases performance, but it means that in some situations you need to flush the buffer.

Nonblocking IO is a different solution to a different problem. When you're doing network or file IO, the system can spend a lot of time waiting on hardware: for example waiting 10ms to fetch something from the database or 30ms to fetch something from disk. Wouldn't it be great if your computer could do something else while that was happening? Well, that's what threads are for. Threads are the OS's solution to this problem, you can spawn a bunch of threads and one thread can do work while another is waiting (blocks) on IO.

However, it turns out that threads have some performance problems of their own on modern machines. Switching between threads involves a lot of housekeeping in the kernel, so there's an alternative that operates entirely in userland called green threads. They're similar to threads but have a lot less features but are a lot lot faster. So back to IO: when you make a read(2) call to the kernel, and the kernel realises that it has nothing to do before some data arrives from the disk or network it will reschedule another thread to do some work while it waits. However, in green threads the operating system doesn't know anything about the green threads so it can't do that. The solution is called nonblocking IO.

You can set a flag on the file descriptor called O_NONBLOCK which tells the kernel that the application is able to handle the case in which there is no data waiting. In this case, when you call read(2) and there is no data waiting, the kernel will instead return EAGAIN, which is an "error" flag which means "there's no data for you to read right now". The application can detect this and reschedule to another green thread to do some work while the kernel is waiting. When there is no more other work to do, the application can make a select call into the kernel to ask the kernel "which of these file descriptors are ready to read" so the application knows exactly which green threads are ready to retry the read call and read some more data and get some more work done.

This bug is caused by crystal setting the O_NONBLOCK flag on STDOUT/STDIN. This works fine when the application is crystal which uses green threads and is aware of O_NONBLOCK and EAGAIN, but when crystal exits STDIN still has the O_NONBLOCK flag set. If you for example launch vim it won't be aware that STDIN is still in nonblocking mode and that read can return EAGAIN. Setting O_NONBLOCK essentially modifies the behaviour of STDIN in a way that other applications don't understand, which obviously causes the issues documented in this thread.

There are two clear solutions:

  • Clear the O_NONBLOCK flag on STDIN, STDOUT and STDERR when the application exits
  • Never set O_NONBLOCK on those three at all.

The former has better performance and changes current behaviour the least, but the second option is far more robust. This is because running some code when your application exits isn't always possible in every single scenario (i.e. kill -9 as mentioned above).

Hope that helps!

@martinos
Copy link

@RX14, that was a great explanation! I really appreciate.

I would say that in when Ruby was 1.8 it only had green threads and I have never had issues when using SIGKILL with IO#sync = false. This is why I thought that their model was good. I tried to read the source code but I haven't programmed C for the last 20 years so I can't tell if they were using O_NONBLOCK. Thanks again!

@straight-shoota
Copy link
Member

Please take what I say with a grain of salt, I'm not really confident in these matters but I'm trying to analyze the situation and possible scenarios to proceed. Correct me if I'm wrong.

Of these two solutions mentioned by @RX14, number 2 can only be implemented with parallelism and if done so, it will be slightly less performant. It would be the cleaner and standards compliant solution, but unfortunately, it can't be implemented until Crystal supports parallelism.

Now, the first question is, what should be the final goal to solve this? Should standard streams be blocking or might it be worth to avoid the performance impact and stay with non-blocking IO even if we have parallelism?
I guess it won't be justified given that it's just max three streams that will need a thread to wait for unblocking.
But that's a decision that we should be clear about.

The second question is about the immediate solution.
If we can contain the impact of setting 0_NONBLOCKING on standard streams that it will only leak out of the Crystal context in case of a segfault or SIG_KILL, I guess this is fine.

SIG_KILL is only meant as a last resort and since it stops the process immediately, one cannot expect that it terminates cleanly. This should usually not leave non-blocking standard streams, but this seems like a worthy trade-off at least as an intermediary solution.

@asterite
Copy link
Member Author

The problem is that when you run your program in the shell, the real file descriptors for stdin, stdout and stderr are sometimes shared (can't remember where in stack overflow I read this, I think there's a link somewhere in this github issue). So changing one to be blocking or non blocking will change the other one too. It's a little mess. In short, it seems it's never safe to modify the state of these things, so the only good solution is to have them be blocking (the default) and use threads. This is what Go does.

@bew
Copy link
Contributor

bew commented Mar 11, 2018

Maybe this one: https://cr.yp.to/unix/nonblock.html
Nice explanation here too: https://twistedmatrix.com/trac/ticket/3442

(From your comment @asterite)

@asterite
Copy link
Member Author

@RX14
Copy link
Contributor

RX14 commented Mar 11, 2018

number 2 can only be implemented with parallelism

simply incorrect. We would only need threads to simulate nonblocking IO on a blocking FD. Just don't do that just make it block like all Files already do and nobody complains.

@straight-shoota
Copy link
Member

straight-shoota commented Mar 11, 2018

Right, I had it in mind but forgot when writing it down explicitly:

Number 2 can only be implemented with parallelism if it is to be non-blocking for the crystal program.

@skuznetsov
Copy link
Contributor

skuznetsov commented Mar 31, 2018

I did the following test:

added 280 lines to test on the top of this thread and some spacing to the lines to see any changes in output and it was working just fine.

then I added code from Ary and executed them together. I stated to see weird behaviour.
If you go back and forth pages before you are going to the last page - crazy lines are there.
But... if you go to the last page and then go back in pages to the top you can see that pages started to show correctly.

So data is not lost.

Here is my short version of the code:

string = <<-EOF
012345678901234567890123456789 01234567890123456789012345678901234567890123456789
0123456789012345678901234567890 1234567890123456789012345678901234567890123456789
0123456789012345678901234567890123 4567890123456789012345678901234567890123456789
012345678901234567890123456789012345 67890123456789012345678901234567890123456789
012345678901234567890123456789012345 67890123456789012345678901234567890123456789
012345678901234567890123456789012345 67890123456789012345678901234567890123456789
0123456789012345678901234567890123456 7890123456789012345678901234567890123456789
0123456789012345678901234567890123456 7890123456789012345678901234567890123456789
012345678901234567890123456789012345678 90123456789012345678901234567890123456789
0123456789012345678901234567890123456789 0123456789012345678901234567890123456789
012345678901234567890123456789012345678901 23456789012345678901234567890123456789
012345678901234567890123456789012345678901 23456789012345678901234567890123456789
012345678901234567890123456789012345678901 23456789012345678901234567890123456789
0123456789012345678901234567890123456789012 3 456789012345678901234567890123456789
0123456789012345678901234567890123456789012345 6789012345678901234567890123456789
0123456789012345678901234567890123456789012345678 9012345678901234567890123456789
0123456789012345678901234567890123456789012345678 9012345678901234567890123456789
0123456789012345678901234567890123456789012345678 9012345678901234567890123456789
012345678901234567890123456789012345678901234567 89012345678901234567890123456789
012345678901234567890123456789012345678901234567 89012345678901234567890123456789
012345678901234567890 123456789012345678901234567 89012345678901234567890123456789
01234567890123456789 0123456789012345678901234567 89012345678901234567890123456789
0123456789012345678 90123456789012345678901234567 89012345678901234567890123456789
01234567890123456789 0123456789012345678901234567 89012345678901234567890123456789
012345678901234567890 1234567890123456789012345678 9012345678901234567890123456789
0123456789012345678 901234567890123456789012345678 9012345678901234567890123456789
0123456789012345678901 2345678901234567890123456789 012345678901234567890123456789
01234567890123456789 012345678901234567890123456789012345678901234567890123456789
01234567890123456789 012345678901234567890123456789012345678901234567890123456789
012345678901234567890 12345678901234567890123456789012345678901234567890123456789
EOF
puts string

1000.times do |i|
puts "Hello #{i}, " * 8
end

@skuznetsov
Copy link
Contributor

Interesting effect is with chains like this:

./test_bug | tee test_bug.log

file content is good and as expected but screen output is crazy

./test_bug | cat

shows little bit of output and then crashes with stack trace.

I think it can help you understand what is going on.

@faustinoaq
Copy link
Contributor

Yay! Looks like this issue is fixed on v0.25.0 🎉

[main@pc ~]$ crystal --version
Crystal 0.25.0 (2018-06-12)

LLVM: 6.0.0
Default target: x86_64-pc-linux-gnu
[main@pc ~]$ shards --version
Shards 0.8.1 (2018-06-19)
[main@pc ~]$ lsb_release -a
LSB Version:    n/a
Distributor ID: ManjaroLinux
Description:    Manjaro Linux
Release:        17.1.10
Codename:       Hakoila
[main@pc ~]$ uname -a
Linux pc 4.14.48-2-MANJARO #1 SMP PREEMPT Fri Jun 8 20:41:40 UTC 2018 x86_64 GNU/Linux
[main@pc ~]$ bash --version
GNU bash, version 4.4.19(1)-release (x86_64-unknown-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
[main@pc ~]$ cat foo.cr
# foo.cr
string = <<-EOF
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
EOF
puts string

[main@pc ~]$ crystal build foo.cr
[main@pc ~]$ ./foo | less
[main@pc ~]$ read

[main@pc ~]$ nano foo.cr
[main@pc ~]$ ./foo | grep 123
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789

@rdp
Copy link
Contributor

rdp commented Jul 2, 2018

Maybe I'm missing something,can you just select on stdin before reading from it to make that non-blocking and not have to change state, perhaps?

@asterite
Copy link
Member Author

@RX14 @ysbaddaden @straight-shoota @Timbus Is this issue fixed by #6518 ?

@Timbus
Copy link
Contributor

Timbus commented Aug 22, 2018

I can test when I get home (can't build master at work), but in theory this should definitely be fixed. Will confirm in ~7hrs if needed.

@Timbus
Copy link
Contributor

Timbus commented Aug 23, 2018

Yeah, I ran nearly every test case in this thread and it seems to be fixed in master.
The tee example above fails in 0.26, if you need a regression test to confirm it yourself.

@asterite
Copy link
Member Author

Fixed by #6518

@Timbus Thank you so much for this! ❤️ I really thought this couldn't be fixed without firing up threads to do the IO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib
Projects
None yet
Development

Successfully merging a pull request may close this issue.