Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read::bytes() is a performance trap #28073

Closed
eefriedman opened this issue Aug 28, 2015 · 6 comments
Closed

Read::bytes() is a performance trap #28073

eefriedman opened this issue Aug 28, 2015 · 6 comments

Comments

@eefriedman
Copy link
Contributor

use std::fs::File;
use std::io::Read;
use std::io::BufReader;
fn main() {
    let f = File::open("/usr/share/dict/words").unwrap();
    let f = BufReader::new(f);
    let mut count = 0;
    for b in f.bytes() {
        if b.ok() == Some(b'a') { count += 1 }
    }
    println!("{}", count);
}

On my computer, this takes about 0.011s to run. If you comment out the line to create the BufReader, the code still compiles... but it takes roughly 0m0.209s to run.

It seems like Rust should try to prevent people from shooting themselves in the foot, although I'm not exactly sure what could be done short of deprecating it.

@bluss
Copy link
Member

bluss commented Aug 28, 2015

Do you think it's a trap even with the BufReader? It's surprisingly good.

@eefriedman
Copy link
Contributor Author

The performance with BufReader seems fine... the problem is really the performance without BufReader.

Actually, hmm... maybe the implementation of bytes() for File should build a BufReader? That seems a little awkward, but I guess it's potentially workable.

@abonander
Copy link
Contributor

I can see where several Read impls could be performance traps with this method. Since impl<R: Read> Iterator for Bytes<R> essentially reads one byte for each call to next(), it can interact poorly with any Read impl that isn't tolerant of small, extremely frequent reads--basically exactly what BufReader is meant to help with.

The easiest fix would be to document this on the Read::bytes() method to let users know they should use a buffer for some cases. But without any way to warn them directly, a lot of users could still be bitten in the butt by this.

Bytes<R> can inherently use BufReader, and then BufReader can override Read::bytes() to return itself in Bytes instead of constructing a new BufReader. But this will be unnecessary for other Read impls that don't need buffering, like Cursor<Vec<u8>>.

Another solution could be to let Bytes be dynamically buffered or raw:

pub struct Bytes<R> {
    inner: BytesInner<R>,
}

enum BytesInner<R> {
    Raw(R),
    Buffered(BufReader<R>),
}

Then the Read impl can return Raw by default, or be overridden to return Buffered. The drawback here is that the enum would have to be matched on with each call to Bytes::next(), which would be a bit of a performance hit, though likely not nearly as bad as the performance problem it fixes, and should be ironed out by branch prediction after a few iterations.

@bluss
Copy link
Member

bluss commented Aug 31, 2015

@cybergeek94 BytesInner sounds good, but I'm not sure if we can actually do it. The problem is that it's a hidden buffer that the user can't recover -- think of the case reader.by_ref().bytes(), after iterating only a few bytes, then stopping, the rest of the buffered data is unreachable.

It's not just about .bytes(), it's about non-buffered I/O in general.

  • We need to make it clear to users that files and network streams opened are unbuffered by default.
  • Stdin is buffered. But you have less overhead if you .lock() it for multiple operations or for access access to the buffer methods.

@abonander
Copy link
Contributor

The best solution with the least commitment is to educate the user better about buffers and when to use them.

@steveklabnik
Copy link
Member

Filing as a doc bug. We talk about this in BufReader, but not Read. Seems bad.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants