
tokei: eliminate reading files entirely into memory #148

Closed · wants to merge 3 commits

Conversation

@twmb commented Oct 1, 2017

This change drops memory consumption quite a bit (CPU timings are
unaffected).

The old behavior was to read each file entirely into memory and then
transcode it to UTF-8. This was necessary because the encoding crate
doesn't support readers; it only operates on whole string references.

There is discussion about this here:
lifthrasiir/rust-encoding#91

We can't just wrap the new reader in the standard BufReader and use its
lines iterator, though, because that iterator returns a newly allocated
String for each line. That would have put us right back where we
started.

Instead, we can call read_line directly on the BufReader, reusing a
single String buffer across lines (sketched below). We do not need to
worry about stripping trailing newlines.
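
For illustration, a minimal sketch of that reused-buffer loop (a hypothetical example, not the exact code in this diff):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

// Count lines while reusing one String allocation, instead of the
// per-line allocations made by BufReader's lines() iterator.
fn count_lines(path: &str) -> io::Result<usize> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = String::new();
    let mut count = 0usize;
    loop {
        line.clear(); // keep the capacity, drop the contents
        // read_line appends up to and including the next '\n',
        // and returns Ok(0) at EOF.
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        count += 1;
    }
    Ok(count)
}
```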

Using https://gist.github.com/netj/526585 and running the old tokei vs.
the new tokei (after a few initial runs to warm the page cache):

```sh
for i in `seq 1 10` ; do ~/memusg tokei>/dev/null ; ~/memusg ~/rust/tokei/target/release/tokei>/dev/null ; done
```

Results (memusg peak RSS in KB; old tokei first, then the new build, for each iteration):

| Run | old tokei peak | new tokei peak |
| --- | --- | --- |
| 1 | 62468 | 21108 |
| 2 | 67736 | 19952 |
| 3 | 68984 | 20960 |
| 4 | 68208 | 20964 |
| 5 | 57968 | 21492 |
| 6 | 63468 | 20484 |
| 7 | 68512 | 19740 |
| 8 | 68600 | 21984 |
| 9 | 69860 | 22016 |
| 10 | 70932 | 22656 |

As shown, every run of the new build has a much smaller peak memory
footprint (under 50% of the original).

I've run tokei over some arbitrary files on my own system before and after this diff, and the results are identical.

This solution isn't perfect: if a file is large and has no newlines, the buffered line will still consume a lot of RAM. I may look into a solution that does not buffer the whole line into memory.
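
One hypothetical direction for that (my sketch, not part of this diff): read fixed-size chunks and count newlines in the raw bytes, so peak memory stays constant regardless of line length. This only illustrates the memory behavior; tokei's real per-line analysis needs more than newline counting:

```rust
use std::fs::File;
use std::io::{self, Read};

// Count '\n' bytes in fixed 8 KiB chunks; memory use is constant even
// for a huge file that contains no newline at all.
fn count_lines_chunked(path: &str) -> io::Result<usize> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    let mut count = 0usize;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        count += buf[..n].iter().filter(|&&b| b == b'\n').count();
    }
    Ok(count)
}
```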

Travis Bischel added 3 commits September 30, 2017 18:58:

- Cargo.lock is by default ignored with `cargo new`; it is generated from Cargo.toml.
- This was warning on tests.
- This change drops memory consumption quite a bit (commit message identical to the PR description above).
@XAMPPRocky (Owner) commented:

Thanks for your PR! From reading through the threads and seeing that this is also used in ripgrep, I think the decoder should be its own crate and used by tokei. And keeping Cargo.lock is best practice when you're distributing binaries.

@twmb (Author) commented Oct 1, 2017

> From reading through the threads and seeing that this is also used in ripgrep, I think the decoder should be its own crate and used by tokei.

Agreed - I thought the same thing when I saw it. It would need a bit of refactoring to get it more generally usable, though. I'll ping @BurntSushi and potentially get his thoughts.

> And keeping Cargo.lock is best practice when you're distributing binaries.

I did not know that! But `cargo new --bin` indeed does not add a Cargo.lock line to its generated .gitignore. I'll remove that commit.

@BurntSushi commented:
> It would need a bit of refactoring to get it more generally usable, though. I'll ping @BurntSushi and potentially get his thoughts.

Yes, the intent was always to put these transcoders in a separate crate, but I haven't had the time to do that. The transcoder in ripgrep is only correct in the way that it is being used, but not for all possible uses of the Read trait. So there's some refactoring there that needs to be done, as well as the not inconsiderable polishing effort that's part of publishing a new crate.

@twmb (Author) commented Oct 7, 2017

After reading hsivonen/encoding_rs#8 in full, I don't see an easy way to refactor a UTF-8 transcoder into a generic crate. I would be more willing to refactor the already-modified transcoder around its own special use case, to eliminate the second BufReader.

@XAMPPRocky (Owner) commented:

Closing due to conflicts. I think this would need to be a separate crate in some form before it could be merged, as I don't want to bring the maintenance of the module into the project. If that happens, or an alternative becomes available, feel free to make another PR!

@XAMPPRocky closed this Jan 15, 2018
@BurntSushi commented:
ripgrep's transcoder is now documented and polished in an external crate: https://github.com/BurntSushi/encoding_rs_io
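
For anyone finding this later, a minimal sketch of combining that crate with the reused-buffer approach from this PR (my illustration, assuming encoding_rs_io's `DecodeReaderBytes` wrapper; not code from either project):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs_io::DecodeReaderBytes;

fn main() -> std::io::Result<()> {
    // DecodeReaderBytes sniffs a BOM and transcodes the stream to
    // UTF-8, so downstream code can treat the bytes as valid UTF-8.
    let file = File::open("src/lib.rs")?;
    let mut reader = BufReader::new(DecodeReaderBytes::new(file));

    // Reuse one String across lines, as in this PR.
    let mut line = String::new();
    let mut lines = 0usize;
    while reader.read_line(&mut line)? != 0 {
        lines += 1;
        line.clear();
    }
    println!("{} lines", lines);
    Ok(())
}
```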
