
consider an option to capture stdout without printing it #377

Closed · BurntSushi opened this issue Apr 3, 2021 · 13 comments · Fixed by #509

@BurntSushi commented Apr 3, 2021

By default, Hyperfine will attach Stdio::null() to the stdout of the command it runs. Some programs, like greps, will detect this, alter how they execute and run more quickly as a result. For example, in the case of GNU grep, if it detects that stdout is /dev/null, then it will behave as if the -q/--quiet flag were present. This means GNU grep can exit as soon as a match is found. This is a perfectly legitimate optimization since its only observable behavior is its exit code, and its exit code is only influenced by whether a match exists or not (assuming no errors or interrupts occur).
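
As an aside, here is a minimal sketch of how such a detection can be done. It is only an illustration, not GNU grep's actual code, and the helper name stdout_is_dev_null is made up; it assumes a Unix system and compares the device and inode numbers of stdout with those of /dev/null:

// Illustration only (not GNU grep's source; helper name is made up): detect
// whether stdout is /dev/null by comparing device and inode numbers. Unix only.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn stdout_is_dev_null() -> std::io::Result<bool> {
    let null = fs::metadata("/dev/null")?;
    // /proc/self/fd/1 is a Linux-specific way to stat whatever stdout points
    // at; a portable implementation would fstat(STDOUT_FILENO) instead.
    let out = fs::metadata("/proc/self/fd/1")?;
    Ok(null.dev() == out.dev() && null.ino() == out.ino())
}

fn main() -> std::io::Result<()> {
    if stdout_is_dev_null()? {
        eprintln!("stdout is /dev/null: a grep-like tool could stop after the first match");
    } else {
        eprintln!("stdout is a real sink: every match has to be written out");
    }
    Ok(())
}

Running this with stdout redirected to /dev/null versus piped into another program exercises the two branches.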

What this means is that, e.g., time grep foo file | wc -l and hyperfine 'grep foo file' aren't quite doing the same thing. That's no fault of Hyperfine; however, it would be nice to have a way to easily work around it. As of today, here are some work-arounds:

  • Use the --show-output flag, which captures stdout. But this also prints stdout, which can be undesirable if there is a lot of output.
  • In the case of grep specifically, use a query that is not expected to match and give Hyperfine the --ignore-failure flag.

A better work-around might be to have some other flag, maybe --capture-output, that causes Hyperfine to capture stdout but then do nothing with it.

I was motivated to write this feature request based on an errant benchmark found in the wild: https://old.reddit.com/r/commandline/comments/mibsw8/alternative_to_grep_less/gt90lmm/

@sharkdp (Owner) commented Apr 3, 2021

Thank you for this request.

> By default, Hyperfine will attach Stdio::null() to the stdout of the command it runs. Some programs, like greps, will detect this, alter how they execute and run more quickly as a result. For example, in the case of GNU grep, if it detects that stdout is /dev/null, then it will behave as if the -q/--quiet flag were present.

Interesting. I haven't really seen any optimizations of this kind (specifically checking whether the output is attached to /dev/null). But I have seen quite a few people who were puzzled by benchmark results, due to changed behavior of programs when their STDOUT is not attached to an interactive terminal (e.g. switching off colorization, which can make the program significantly faster). In such a case, you either have to force the program to keep the old behavior (--color=always), or use hyperfine's --show-output option... which, as you said, can be annoying if you don't really care about the output.

> A better work-around might be to have some other flag, maybe --capture-output, that causes Hyperfine to capture stdout but then do nothing with it.

I agree. This would be useful for the grep use case. Do you think we could maybe even make this work for the other use case as well? By attaching STDOUT to something that looks like an interactive terminal (see this related SO question)?
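
For illustration, here is a rough sketch of the "looks like an interactive terminal" idea, assuming the libc crate and a Unix-like system. It is only meant to show the approach, not hyperfine's implementation:

// Rough sketch, for discussion only: run a child with its stdout attached to a
// pseudo-terminal, so isatty(STDOUT_FILENO) is true inside the child, and then
// drain and discard whatever it writes. Assumes the libc crate; Unix only.
use std::fs::File;
use std::io::Read;
use std::os::unix::io::FromRawFd;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Ask the OS for a pseudo-terminal pair.
    let (mut master_fd, mut slave_fd) = (0, 0);
    let ret = unsafe {
        libc::openpty(
            &mut master_fd,
            &mut slave_fd,
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
        )
    };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }

    // The child writes to the slave end and therefore keeps its "interactive"
    // behavior (e.g. colorized output). The temporary Command value is dropped
    // at the end of this statement, which closes the parent's copy of the
    // slave fd; only the child keeps it open.
    let mut child = Command::new("ls")
        .arg("--color=auto")
        .stdout(unsafe { Stdio::from_raw_fd(slave_fd) })
        .spawn()?;

    // Drain and discard everything the child writes. Once the child exits and
    // the last slave fd is closed, reading the master yields EOF (or EIO on
    // Linux), which we treat as end-of-output.
    let mut master = unsafe { File::from_raw_fd(master_fd) };
    let mut buf = [0u8; 64 * 1024];
    loop {
        match master.read(&mut buf) {
            Ok(0) => break,
            Ok(_) => {}
            Err(e) if e.raw_os_error() == Some(libc::EIO) => break,
            Err(e) => return Err(e),
        }
    }

    println!("Exited successfully: {}", child.wait()?.success());
    Ok(())
}

Whether this can be made robust enough (window size, echo and line-discipline settings, portability to Windows) is exactly the open question.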

Another question is whether we want to introduce a new option for this or if it could be enabled by default (which would maybe even enable us to do things like #375). I would actually be rather hesitant to enable it by default, as I could imagine that it might slow down the benchmarked program (if the reader on the other side is not keeping up), compared to the /dev/null case.

@BurntSushi (Author)

With respect to the tty issue, I do think that is a distinct issue. It's "the user sees no output, so we can do something different" vs "the user sees output, and it's on a terminal, so we can emit terminal escape sequences." Or maybe what you meant is that the issues are related, and that perhaps there might be a better UX that tries to unify them somehow. I agree that that might be a good idea, assuming you can reliably fake being a tty. But I'm not sure about that.

As for making it the default... I'm not sure about that either. For me personally, I'd be happy with an opt-in flag. But that doesn't help the folks creating errant benchmarks. On the flip side, this is just one of many issues that can result in benchmarks being in error. Your point about capturing the output causing the programs to run more slowly seems like the crux of the issue to me. If that is indeed a problem, then making it the default probably seems unwise. I would be somewhat surprised if it couldn't keep up though. I think the optimal implementation of this flag would be to read the process's stdout and drop its contents immediately. So, basically, a /dev/null without being a /dev/null.

@sharkdp (Owner) commented Feb 26, 2022

@BurntSushi I tried to implement this today. Unfortunately, I have not found a fast solution so far.

Setting

Let's assume we have a command that writes a lot to stdout. Here, we use cat large as a proxy, where large is a file that shouldn't be too small, so that we get somewhat reliable benchmark results:

dd if=/dev/random of=large bs=1M count=100

We now want to write code to launch a process (here: cat large) without using .stdout(Stdio::null()), but emulating its functionality.

Attempt

My naive attempt looks like this:

use std::io::{self, Result, Write};
use std::process::{Command, ExitStatus, Stdio};

struct DevNull {}

impl Write for DevNull {
    fn write(&mut self, buf: &[u8]) -> Result<usize> {
        Ok(buf.len())
    }

    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}

trait CommandExt {
    fn discard_output_and_get_status(&mut self) -> Result<ExitStatus>;
}

impl CommandExt for Command {
    fn discard_output_and_get_status(&mut self) -> Result<ExitStatus> {
        let mut child = self.spawn()?;
        let mut reader = child.stdout.take().unwrap();

        let mut dev_null = DevNull {};

        io::copy(&mut reader, &mut dev_null)?;

        child.wait()
    }
}

fn main() -> Result<()> {
    let result = Command::new("cat")
        .arg("large")
        .stdout(Stdio::piped())
        .discard_output_and_get_status();

    println!("Exited successfully: {}", result?.success());

    Ok(())
}

Benchmarks

If we compare the runtime of this program (dev_null) to cat large, we can see that it's a factor of 2.6 slower. I also tried various other approaches: hand-written loops calling read_exact, with or without a BufReader, with custom buffers of various sizes, etc. I couldn't get it to run much faster.

But notice that a similar "black hole" program (cat > /dev/null) performs just as badly:

Command                       Mean [ms]    Min [ms]   Max [ms]   Relative
cat large > /dev/null         19.0 ± 0.8   17.7       22.6       1.00
cat large | cat > /dev/null   48.6 ± 2.0   44.0       53.4       2.56 ± 0.15
./dev_null                    49.2 ± 2.1   45.7       56.6       2.59 ± 0.16

If anyone has ideas on how to improve the situation, please let me know. I checked if read(2) can be used without any buffer (by passing nullptr as the second argument), but that is apparently not allowed. Also, the buffer does not seem to be the problem here. Maybe there is a kernel-internal optimization where writes to /dev/null are no-ops?

If there is no way to make this perform any faster, the question is if we want to implement this at all, especially given that there is an easy workaround of basically just piping into cat:

hyperfine "grep foo file | cat"

… which is equivalent to running grep foo file | cat > /dev/null (but without giving grep a chance to detect that its stdout is attached to /dev/null):

Command               Mean [ms]     Min [ms]   Max [ms]   Relative
grep -R foo *         125.2 ± 2.5   122.3      133.8      1.00
grep -R foo * | cat   197.9 ± 4.4   194.0      210.8      1.58 ± 0.05

@tavianator (Contributor)

> Maybe there is a kernel-internal optimization where writes to /dev/null are no-ops?

Yes: https://github.com/torvalds/linux/blob/ec7f49619d8ee13e108740c82f942cd401b989e9/drivers/char/mem.c#L453-L457

I suspect you cannot get writes to a pipe to be anywhere close to as fast as writes to /dev/null since the kernel won't even copy the data for /dev/null. They're basically O(1) while writing to anything else is O(n).

I think that means | cat >/dev/null is a better benchmark than >/dev/null though, since /dev/null is artificially faster than anything else could possibly be.

@tavianator (Contributor)

You could beat | cat >/dev/null slightly with a program that read()s but never write()s. Or even better, splice() right to /dev/null, which will avoid all copies, so all you're benchmarking is the write()s of the part before the pipe. For some reason, GNU cat doesn't do that, but I found a couple of existing implementations that do.

@tavianator (Contributor)

pv uses splice(), so here's a quick estimate:

Command             Mean [ms]      Min [ms]   Max [ms]   Relative
cat large           150.8 ± 2.7    146.7      156.5      1.00
cat large | cat     486.6 ± 15.8   458.3      510.0      3.23 ± 0.12
cat large | pv -q   324.6 ± 11.6   306.6      344.7      2.15 ± 0.09

So using splice() instead of read()/write() can be about 50% faster.
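
To make the splice() idea concrete, here is a hedged sketch (Linux only, assuming the libc crate, and not hyperfine's actual implementation) of draining a child's stdout by splicing the pipe directly into /dev/null:

// Hedged sketch (Linux only, assumes the libc crate): drain a child's stdout
// by splicing the pipe straight into /dev/null, so the data never enters a
// user-space buffer.
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    let mut child = Command::new("cat")
        .arg("large")
        .stdout(Stdio::piped())
        .spawn()?;

    let stdout = child.stdout.take().unwrap();
    let dev_null = OpenOptions::new().write(true).open("/dev/null")?;

    loop {
        // Move up to 1 MiB per call from the pipe to /dev/null. Error handling
        // is kept minimal; a real implementation would retry on EINTR.
        let n = unsafe {
            libc::splice(
                stdout.as_raw_fd(),
                std::ptr::null_mut(),
                dev_null.as_raw_fd(),
                std::ptr::null_mut(),
                1 << 20,
                0,
            )
        };
        match n {
            0 => break, // EOF: the child closed its end of the pipe
            -1 => return Err(std::io::Error::last_os_error()),
            _ => {}
        }
    }

    println!("Exited successfully: {}", child.wait()?.success());
    Ok(())
}

Both the read and the write side stay inside the kernel here, which is what piping through pv -q gets you without writing any code.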

tavianator added a commit to tavianator/hyperfine that referenced this issue May 14, 2022
@tavianator (Contributor)

One more thing: GNU grep (and probably other programs) detect /dev/null by checking if stdout is the exact same inode as /dev/null, so you can fool it by making a new device node:

# mknod -m a+rw ./null c 1 3
Command                                     Mean [ms]     Min [ms]   Max [ms]   Relative
grep -r . ./large_dir/ >/dev/null           6.1 ± 0.7     5.0        8.8        1.00
grep -r . ./large_dir/ | cat >/dev/null     314.7 ± 3.4   310.7      321.8      51.90 ± 5.74
grep -r . ./large_dir/ | pv -q >/dev/null   306.8 ± 3.1   302.0      311.7      50.60 ± 5.60
grep -r . ./large_dir/ >./null              258.2 ± 2.0   255.4      262.5      42.59 ± 4.70

@BurntSushi (Author)

@sharkdp

> If there is no way to make this perform any faster, the question is if we want to implement this at all.

I'm not quite sure I follow. Your table of benchmark results for cat is basically what I would expect. cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization. The main point of the flag I'm requesting is precisely to defeat this optimization and thus cause the program to run more "slowly." The program running more slowly isn't necessarily the point. The point is to time the actual workload. But since hyperfine attaches /dev/null to the stdout of the subprocess, the timing is not for the actual workload but for a shortcut, because the program knows its output isn't used.

Using grep ... | cat as a work-around is a good point, although it probably isn't as nice for Windows users without a Unix-like environment.

@tavianator (Contributor)

> cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization.

AFAIK, GNU cat has no such optimization (confirmed with strace). But the kernel side of write(/dev/null) is much faster than write(pipe).

@BurntSushi (Author)

@tavianator Aye, interesting. Good to know! I think cat is probably a little special. For something like grep, it really needs to know whether it's writing to /dev/null itself, in order to know that it can stop searching. Regardless, whether it's done by the kernel or by the program itself, the result is the same: you wind up changing what you're actually trying to measure.

@sharkdp (Owner) commented May 16, 2022

>> If there is no way to make this perform any faster, the question is if we want to implement this at all.

> I'm not quite sure I follow. Your table of benchmark results for cat is basically what I would expect. cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization. The main point of the flag I'm requesting is precisely to defeat this optimization and thus cause the program to run more "slowly."

It's clear now, but back then I didn't realize that there is probably no way to make it faster. I just didn't expect there to be such a huge difference.

To summarize (mostly for me):

  • The fastest way to get rid of data is by directly writing to /dev/null as in producer > /dev/null, because the kernel optimizes writes to /dev/null.
  • The next best thing is to pipe to a program that uses splice, such as in producer | pv -q > /dev/null.
  • Building a custom reader (like I did above) is more or less equivalent to producer | cat > /dev/null. This is slower than the two options above, since in both of these cases the data has to be read from the pipe into user space.

Given these results, I'm completely in favor of implementing this in #509.

sharkdp pushed a commit that referenced this issue May 17, 2022
@sharkdp (Owner) commented May 17, 2022

We can now repeat the benchmark from the reddit thread:

The wrong way, triggering grep's /dev/null optimization (I am using hyperfine's new -N/--shell=none option here to measure accurate times below ~5 ms):

▶ hyperfine -N -L tool 'rg -N','grep' -w 10 '{tool} Harry bookx1000.txt'

Benchmark 1: rg -N Harry bookx1000.txt
  Time (mean ± σ):     299.2 ms ±   3.9 ms    [User: 264.7 ms, System: 33.3 ms]
  Range (min … max):   295.7 ms … 308.1 ms    10 runs
 
Benchmark 2: grep Harry bookx1000.txt
  Time (mean ± σ):       2.3 ms ±   0.1 ms    [User: 1.8 ms, System: 0.4 ms]
  Range (min … max):     2.1 ms …   3.1 ms    1308 runs
 
Summary
  'grep Harry bookx1000.txt' ran
  129.99 ± 8.43 times faster than 'rg -N Harry bookx1000.txt'

And now with the new --output=pipe option:

▶ hyperfine -N -L tool 'rg -N','grep' -w 10 '{tool} Harry bookx1000.txt' --output=pipe

Benchmark 1: rg -N Harry bookx1000.txt
  Time (mean ± σ):     338.5 ms ±   7.4 ms    [User: 286.0 ms, System: 51.6 ms]
  Range (min … max):   331.2 ms … 351.7 ms    10 runs
 
Benchmark 2: grep Harry bookx1000.txt
  Time (mean ± σ):     682.2 ms ±   8.3 ms    [User: 596.9 ms, System: 84.2 ms]
  Range (min … max):   673.9 ms … 701.4 ms    10 runs
 
Summary
  'rg -N Harry bookx1000.txt' ran
    2.02 ± 0.05 times faster than 'grep Harry bookx1000.txt'

@BurntSushi (Author)

w00t! Thank you @sharkdp and @tavianator!!!
