
consider an option to capture stdout without printing it #377

Closed · BurntSushi opened this issue Apr 3, 2021 · 13 comments · Fixed by #509

@BurntSushi commented Apr 3, 2021

By default, Hyperfine will attach Stdio::null() to the stdout of the command it runs. Some programs, like greps, will detect this, alter how they execute and run more quickly as a result. For example, in the case of GNU grep, if it detects that stdout is /dev/null, then it will behave as if the -q/--quiet flag were present. This means GNU grep can exit as soon as a match is found. This is a perfectly legitimate optimization since its only observable behavior is its exit code, and its exit code is only influenced by whether a match exists or not (assuming no errors or interrupts occur).
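
As an aside, here is a minimal sketch of how such a detection can be done. It is only an illustration, not GNU grep's actual code, and the helper name stdout_is_dev_null is made up; it assumes a Unix system and compares the device and inode numbers of stdout with those of /dev/null:

// Illustration only (not GNU grep's source; helper name is made up): detect
// whether stdout is /dev/null by comparing device and inode numbers. Unix only.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn stdout_is_dev_null() -> std::io::Result<bool> {
    let null = fs::metadata("/dev/null")?;
    // /proc/self/fd/1 is a Linux-specific way to stat whatever stdout points
    // at; a portable implementation would fstat(STDOUT_FILENO) instead.
    let out = fs::metadata("/proc/self/fd/1")?;
    Ok(null.dev() == out.dev() && null.ino() == out.ino())
}

fn main() -> std::io::Result<()> {
    if stdout_is_dev_null()? {
        eprintln!("stdout is /dev/null: a grep-like tool could stop after the first match");
    } else {
        eprintln!("stdout is a real sink: every match has to be written out");
    }
    Ok(())
}

Running this with stdout redirected to /dev/null versus piped into another program exercises the two branches.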

What this means is that, e.g., time grep foo file | wc -l and hyperfine 'grep foo file' aren't quite doing the same thing. That's no fault of Hyperfine; however, it would be nice to have a way to easily work around it. As of today, here are some work-arounds:

  • Use the --show-output flag, which captures stdout. But this also prints stdout, which can be undesirable if there is a lot of output.
  • In the case of grep specifically, use a query that is not expected to match and give Hyperfine the --ignore-failure flag.

A better work-around might be to have some other flag, maybe --capture-output, that causes Hyperfine to capture stdout but then do nothing with it.

I was motivated to write this feature request based on an errant benchmark found in the wild: https://old.reddit.com/r/commandline/comments/mibsw8/alternative_to_grep_less/gt90lmm/

@sharkdp (Owner) commented Apr 3, 2021

Thank you for this request.

> By default, Hyperfine will attach Stdio::null() to the stdout of the command it runs. Some programs, like greps, will detect this, alter how they execute and run more quickly as a result. For example, in the case of GNU grep, if it detects that stdout is /dev/null, then it will behave as if the -q/--quiet flag were present.

Interesting. I haven't really seen any optimizations of this kind (specifically checking whether the output is attached to /dev/null). But I have seen quite a few people who were puzzled by benchmark results, due to changed behavior of programs when their STDOUT is not attached to an interactive terminal (e.g. switching off colorization, which can make the program significantly faster). In such a case, you either have to force the program to keep the old behavior (--color=always), or use hyperfine's --show-output option... which, as you said, can be annoying if you don't really care about the output.

> A better work-around might be to have some other flag, maybe --capture-output, that causes Hyperfine to capture stdout but then do nothing with it.

I agree. This would be useful for the grep use case. Do you think we could maybe even make this work for the other use case as well? By attaching STDOUT to something that looks like an interactive terminal (see this related SO question)?
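
For illustration, here is a rough sketch of the "looks like an interactive terminal" idea, assuming the libc crate and a Unix-like system. It is only meant to show the approach, not hyperfine's implementation:

// Rough sketch, for discussion only: run a child with its stdout attached to a
// pseudo-terminal, so isatty(STDOUT_FILENO) is true inside the child, and then
// drain and discard whatever it writes. Assumes the libc crate; Unix only.
use std::fs::File;
use std::io::Read;
use std::os::unix::io::FromRawFd;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Ask the OS for a pseudo-terminal pair.
    let (mut master_fd, mut slave_fd) = (0, 0);
    let ret = unsafe {
        libc::openpty(
            &mut master_fd,
            &mut slave_fd,
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
        )
    };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }

    // The child writes to the slave end and therefore keeps its "interactive"
    // behavior (e.g. colorized output). The temporary Command value is dropped
    // at the end of this statement, which closes the parent's copy of the
    // slave fd; only the child keeps it open.
    let mut child = Command::new("ls")
        .arg("--color=auto")
        .stdout(unsafe { Stdio::from_raw_fd(slave_fd) })
        .spawn()?;

    // Drain and discard everything the child writes. Once the child exits and
    // the last slave fd is closed, reading the master yields EOF (or EIO on
    // Linux), which we treat as end-of-output.
    let mut master = unsafe { File::from_raw_fd(master_fd) };
    let mut buf = [0u8; 64 * 1024];
    loop {
        match master.read(&mut buf) {
            Ok(0) => break,
            Ok(_) => {}
            Err(e) if e.raw_os_error() == Some(libc::EIO) => break,
            Err(e) => return Err(e),
        }
    }

    println!("Exited successfully: {}", child.wait()?.success());
    Ok(())
}

Whether this can be made robust enough (window size, echo and line-discipline settings, portability to Windows) is exactly the open question.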

Another question is whether we want to introduce a new option for this or if it could be enabled by default (which would maybe even enable us to do things like #375). I would actually be rather hesitant to enable it by default, as I could imagine that it might slow down the benchmarked program (if the reader on the other side is not keeping up), compared to the /dev/null case.

@BurntSushi (Author)

With respect to the tty issue, I do think that is a distinct issue. It's "the user sees no output, so we can do something different" vs "the user sees output, and it's on a terminal, so we can emit terminal escape sequences." Or maybe what you meant is that the issues are related, and that perhaps there might be a better UX that tries to unify them somehow. I agree that that might be a good idea, assuming you can reliably fake being a tty. But I'm not sure about that.

As for making it the default... I'm not sure about that either. For me personally, I'd be happy with an opt-in flag. But that doesn't help the folks creating errant benchmarks. On the flip side, this is just one of many issues that can result in benchmarks being in error. Your point about capturing the output causing the programs to run more slowly seems like the crux of the issue to me. If that is indeed a problem, then making it the default probably seems unwise. I would be somewhat surprised if it couldn't keep up though. I think the optimal implementation of this flag would be to read the process's stdout and drop its contents immediately. So, basically, a /dev/null without being a /dev/null.

@sharkdp (Owner) commented Feb 26, 2022

@BurntSushi I tried to implement this today. Unfortunately, I have not found a fast solution so far.

Setting

Let's assume we have a command that writes a lot to stdout. Here, we use cat large as a proxy, where large is a file that shouldn't be too small, so that we get somewhat reliable benchmark results:

dd if=/dev/random of=large bs=1M count=100

We now want to write code to launch a process (here: cat large) without using .stdout(Stdio::null()), but emulating its functionality.

Attempt

My naive attempt looks like this:

use std::io::{self, Result, Write};
use std::process::{Command, ExitStatus, Stdio};

struct DevNull {}

impl Write for DevNull {
    fn write(&mut self, buf: &[u8]) -> Result<usize> {
        Ok(buf.len())
    }

    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}

trait CommandExt {
    fn discard_output_and_get_status(&mut self) -> Result<ExitStatus>;
}

impl CommandExt for Command {
    fn discard_output_and_get_status(&mut self) -> Result<ExitStatus> {
        let mut child = self.spawn()?;
        let mut reader = child.stdout.take().unwrap();

        let mut dev_null = DevNull {};

        io::copy(&mut reader, &mut dev_null)?;

        child.wait()
    }
}

fn main() -> Result<()> {
    let result = Command::new("cat")
        .arg("large")
        .stdout(Stdio::piped())
        .discard_output_and_get_status();

    println!("Exited successfully: {}", result?.success());

    Ok(())
}

Benchmarks

If we compare the runtime of this program (dev_null) to cat large, we can see that it's a factor of 2.6 slower. I also tried various other approaches: hand-written loops calling read_exact, with or without a BufReader, with custom buffers of various sizes, etc. I couldn't get it to run much faster.

But notice that a similar "black hole" program (cat > /dev/null) performs just as badly:

Command                       Mean [ms]    Min [ms]   Max [ms]   Relative
cat large > /dev/null         19.0 ± 0.8   17.7       22.6       1.00
cat large | cat > /dev/null   48.6 ± 2.0   44.0       53.4       2.56 ± 0.15
./dev_null                    49.2 ± 2.1   45.7       56.6       2.59 ± 0.16

If anyone has ideas on how to improve the situation, please let me know. I checked if read(2) can be used without any buffer (by passing nullptr as the second argument), but that is apparently not allowed. Also, the buffer does not seem to be the problem here. Maybe there is a kernel-internal optimization where writes to /dev/null are no-ops?

If there is no way to make this perform any faster, the question is if we want to implement this at all, especially given that there is an easy workaround of basically just piping into cat:

hyperfine "grep foo file | cat"

… which is equivalent to running grep foo file | cat > /dev/null (but without giving grep a chance to detect that its stdout is attached to /dev/null):

Command               Mean [ms]     Min [ms]   Max [ms]   Relative
grep -R foo *         125.2 ± 2.5   122.3      133.8      1.00
grep -R foo * | cat   197.9 ± 4.4   194.0      210.8      1.58 ± 0.05

@tavianator (Contributor)

> Maybe there is a kernel-internal optimization where writes to /dev/null are no-ops?

Yes: https://github.com/torvalds/linux/blob/ec7f49619d8ee13e108740c82f942cd401b989e9/drivers/char/mem.c#L453-L457

I suspect you cannot get writes to a pipe to be anywhere close to as fast as writes to /dev/null since the kernel won't even copy the data for /dev/null. They're basically O(1) while writing to anything else is O(n).

I think that means | cat >/dev/null is a better benchmark than >/dev/null though, since /dev/null is artificially faster than anything else could possibly be.

@tavianator (Contributor)

You could beat | cat >/dev/null slightly with a program that read()s but never write()s. Or even better, splice() right to /dev/null, which will avoid all copies, so all you're benchmarking is the write()s of the part before the pipe. For some reason, GNU cat doesn't do that, but I found a couple of existing implementations that do.

@tavianator (Contributor)

pv uses splice(), so here's a quick estimate:

Command             Mean [ms]      Min [ms]   Max [ms]   Relative
cat large           150.8 ± 2.7    146.7      156.5      1.00
cat large | cat     486.6 ± 15.8   458.3      510.0      3.23 ± 0.12
cat large | pv -q   324.6 ± 11.6   306.6      344.7      2.15 ± 0.09

So using splice() instead of read()/write() can be about 50% faster.
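
To make the splice() idea concrete, here is a hedged sketch (Linux only, assuming the libc crate, and not hyperfine's actual implementation) of draining a child's stdout by splicing the pipe directly into /dev/null:

// Hedged sketch (Linux only, assumes the libc crate): drain a child's stdout
// by splicing the pipe straight into /dev/null, so the data never enters a
// user-space buffer.
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    let mut child = Command::new("cat")
        .arg("large")
        .stdout(Stdio::piped())
        .spawn()?;

    let stdout = child.stdout.take().unwrap();
    let dev_null = OpenOptions::new().write(true).open("/dev/null")?;

    loop {
        // Move up to 1 MiB per call from the pipe to /dev/null. Error handling
        // is kept minimal; a real implementation would retry on EINTR.
        let n = unsafe {
            libc::splice(
                stdout.as_raw_fd(),
                std::ptr::null_mut(),
                dev_null.as_raw_fd(),
                std::ptr::null_mut(),
                1 << 20,
                0,
            )
        };
        match n {
            0 => break, // EOF: the child closed its end of the pipe
            -1 => return Err(std::io::Error::last_os_error()),
            _ => {}
        }
    }

    println!("Exited successfully: {}", child.wait()?.success());
    Ok(())
}

Both the read and the write side stay inside the kernel here, which is what piping through pv -q gets you without writing any code.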

tavianator added a commit to tavianator/hyperfine that referenced this issue May 14, 2022
@tavianator (Contributor)

One more thing: GNU grep (and probably other programs) detect /dev/null by checking if stdout is the exact same inode as /dev/null, so you can fool it by making a new device node:

# mknod -m a+rw ./null c 1 3
Command                                     Mean [ms]     Min [ms]   Max [ms]   Relative
grep -r . ./large_dir/ >/dev/null           6.1 ± 0.7     5.0        8.8        1.00
grep -r . ./large_dir/ | cat >/dev/null     314.7 ± 3.4   310.7      321.8      51.90 ± 5.74
grep -r . ./large_dir/ | pv -q >/dev/null   306.8 ± 3.1   302.0      311.7      50.60 ± 5.60
grep -r . ./large_dir/ >./null              258.2 ± 2.0   255.4      262.5      42.59 ± 4.70

@BurntSushi (Author)

@sharkdp

> If there is no way to make this perform any faster, the question is if we want to implement this at all.

I'm not quite sure I follow. Your table of benchmark results for cat is basically what I would expect. cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization. The main point of the flag I'm requesting is precisely to defeat this optimization and thus cause the program to run more "slowly." The program running more slowly isn't necessarily the point. The point is to time the actual workload. But since hyperfine attaches /dev/null to the stdout of the subprocess, the timing is not for the actual workload but for a shortcut, because the program knows its output isn't used.

Using grep ... | cat as a work-around is a good point, although it probably isn't as nice for Windows users without a Unix-like environment.

@tavianator (Contributor)

> cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization.

AFAIK, GNU cat has no such optimization (confirmed with strace). But the kernel side of write(/dev/null) is much faster than write(pipe).

@BurntSushi (Author)

@tavianator Aye, interesting. Good to know! I think cat is probably a little special. For something like grep, it really needs to know whether it's writing to /dev/null itself, in order to know that it can stop searching. Regardless, whether it's done by the kernel or by the program itself, the result is the same: you wind up changing what you're actually trying to measure.

@sharkdp (Owner) commented May 16, 2022

>> If there is no way to make this perform any faster, the question is if we want to implement this at all.

> I'm not quite sure I follow. Your table of benchmark results for cat is basically what I would expect. cat likely has an optimization that detects /dev/null and knows it doesn't have to write anything at all. grep has the same optimization. The main point of the flag I'm requesting is precisely to defeat this optimization and thus cause the program to run more "slowly."

It's clear now, but back then I didn't realize that there is probably no way to make it faster. I just didn't expect there to be such a huge difference.

To summarize (mostly for me):

  • The fastest way to get rid of data is by directly writing to /dev/null as in producer > /dev/null, because the kernel optimizes writes to /dev/null.
  • The next best thing is to pipe to a program that uses splice, such as in producer | pv -q > /dev/null.
  • Building a custom reader (like I did above) is more or less equivalent to producer | cat > /dev/null. This is slower than the two options above, since in both of these cases the data has to be read from the pipe into user space.

Given these results, I'm completely in favor of implementing this in #509.

sharkdp pushed a commit that referenced this issue May 17, 2022
@sharkdp (Owner) commented May 17, 2022

We can now repeat the benchmark from the reddit thread:

The wrong way, triggering grep's /dev/null optimization (I am using hyperfine's new -N/--shell=none option here to measure accurate times below ~5 ms):

▶ hyperfine -N -L tool 'rg -N','grep' -w 10 '{tool} Harry bookx1000.txt'

Benchmark 1: rg -N Harry bookx1000.txt
  Time (mean ± σ):     299.2 ms ±   3.9 ms    [User: 264.7 ms, System: 33.3 ms]
  Range (min … max):   295.7 ms … 308.1 ms    10 runs
 
Benchmark 2: grep Harry bookx1000.txt
  Time (mean ± σ):       2.3 ms ±   0.1 ms    [User: 1.8 ms, System: 0.4 ms]
  Range (min … max):     2.1 ms …   3.1 ms    1308 runs
 
Summary
  'grep Harry bookx1000.txt' ran
  129.99 ± 8.43 times faster than 'rg -N Harry bookx1000.txt'

And now with the new --output=pipe option:

▶ hyperfine -N -L tool 'rg -N','grep' -w 10 '{tool} Harry bookx1000.txt' --output=pipe

Benchmark 1: rg -N Harry bookx1000.txt
  Time (mean ± σ):     338.5 ms ±   7.4 ms    [User: 286.0 ms, System: 51.6 ms]
  Range (min … max):   331.2 ms … 351.7 ms    10 runs
 
Benchmark 2: grep Harry bookx1000.txt
  Time (mean ± σ):     682.2 ms ±   8.3 ms    [User: 596.9 ms, System: 84.2 ms]
  Range (min … max):   673.9 ms … 701.4 ms    10 runs
 
Summary
  'rg -N Harry bookx1000.txt' ran
    2.02 ± 0.05 times faster than 'grep Harry bookx1000.txt'

@BurntSushi (Author)

w00t! Thank you @sharkdp and @tavianator!!!
