
tokei: eliminate reading files entirely into memory #148

Closed · wants to merge 3 commits

Conversation

@twmb commented Oct 1, 2017

This change drops memory consumption quite a bit (CPU timings are
unaffected).

The old behavior was to read each file entirely into memory and then
transcode it to UTF-8. This was necessary because the encoding crate
doesn't support readers; it only operates on whole string references.

There is discussion about this here:
lifthrasiir/rust-encoding#91

We can't just wrap the new reader in the standard BufReader and use its
lines iterator, though, because that iterator returns a newly allocated
String for each line. That would have put us right back where we
started.

Instead, we can call read_line directly on the BufReader, reusing a
single String buffer across lines (sketched below). We do not need to
worry about stripping trailing newlines.
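
For illustration, a minimal sketch of that reused-buffer loop (a hypothetical example, not the exact code in this diff):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

// Count lines while reusing one String allocation, instead of the
// per-line allocations made by BufReader's lines() iterator.
fn count_lines(path: &str) -> io::Result<usize> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = String::new();
    let mut count = 0usize;
    loop {
        line.clear(); // keep the capacity, drop the contents
        // read_line appends up to and including the next '\n',
        // and returns Ok(0) at EOF.
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        count += 1;
    }
    Ok(count)
}
```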

Using https://gist.github.com/netj/526585 and running the old tokei vs.
the new tokei (after a few initial runs to warm the page cache):

```sh
for i in `seq 1 10` ; do ~/memusg tokei>/dev/null ; ~/memusg ~/rust/tokei/target/release/tokei>/dev/null ; done
```

Results (memusg peak RSS in KB; old tokei first, then the new build, for each iteration):

| Run | old tokei peak | new tokei peak |
| --- | --- | --- |
| 1 | 62468 | 21108 |
| 2 | 67736 | 19952 |
| 3 | 68984 | 20960 |
| 4 | 68208 | 20964 |
| 5 | 57968 | 21492 |
| 6 | 63468 | 20484 |
| 7 | 68512 | 19740 |
| 8 | 68600 | 21984 |
| 9 | 69860 | 22016 |
| 10 | 70932 | 22656 |

As shown, every run of the new build has a much smaller peak memory
footprint (under 50% of the original).

I've run tokei over some arbitrary files on my own system before and after this diff, and the results are identical.

This solution isn't perfect: if a file is large and has no newlines, the buffered line will still consume a lot of RAM. I may look into a solution that does not buffer the whole line into memory.
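
One hypothetical direction for that (my sketch, not part of this diff): read fixed-size chunks and count newlines in the raw bytes, so peak memory stays constant regardless of line length. This only illustrates the memory behavior; tokei's real per-line analysis needs more than newline counting:

```rust
use std::fs::File;
use std::io::{self, Read};

// Count '\n' bytes in fixed 8 KiB chunks; memory use is constant even
// for a huge file that contains no newline at all.
fn count_lines_chunked(path: &str) -> io::Result<usize> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    let mut count = 0usize;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        count += buf[..n].iter().filter(|&&b| b == b'\n').count();
    }
    Ok(count)
}
```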

Travis Bischel added 3 commits September 30, 2017 18:58:

- Cargo.lock is by default ignored with `cargo new`; it is generated from Cargo.toml.
- This was warning on tests.
- This change drops memory consumption quite a bit (commit message identical to the PR description above).
@XAMPPRocky (Owner) commented:

Thanks for your PR! From reading through the threads and seeing that this is also used in ripgrep, I think the decoder should be its own crate and used by tokei. And keeping Cargo.lock is best practice when you're distributing binaries.

@twmb (Author) commented Oct 1, 2017

> From reading through the threads and seeing that this is also used in ripgrep, I think the decoder should be its own crate and used by tokei.

Agreed - I thought the same thing when I saw it. It would need a bit of refactoring to get it more generally usable, though. I'll ping @BurntSushi and potentially get his thoughts.

> And keeping Cargo.lock is best practice when you're distributing binaries.

I did not know that! But `cargo new --bin` indeed does not add a Cargo.lock line to its generated .gitignore. I'll remove that commit.

@BurntSushi commented:
> It would need a bit of refactoring to get it more generally usable, though. I'll ping @BurntSushi and potentially get his thoughts.

Yes, the intent was always to put these transcoders in a separate crate, but I haven't had the time to do that. The transcoder in ripgrep is only correct in the way that it is being used, but not for all possible uses of the Read trait. So there's some refactoring there that needs to be done, as well as the not inconsiderable polishing effort that's part of publishing a new crate.

@twmb (Author) commented Oct 7, 2017

After reading hsivonen/encoding_rs#8 in full, I don't see an easy way to refactor a UTF-8 transcoder into a generic crate. I would be more willing to refactor the already-modified transcoder around its own special use case, to eliminate the second BufReader.

@XAMPPRocky (Owner) commented:

Closing due to conflicts. I think this would need to be a separate crate in some form before it could be merged, as I don't want to bring the maintenance of the module into the project. If that happens, or an alternative becomes available, feel free to make another PR!

@XAMPPRocky closed this Jan 15, 2018
@BurntSushi commented:
ripgrep's transcoder is now documented and polished in an external crate: https://github.com/BurntSushi/encoding_rs_io
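
For anyone finding this later, a minimal sketch of combining that crate with the reused-buffer approach from this PR (my illustration, assuming encoding_rs_io's `DecodeReaderBytes` wrapper; not code from either project):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs_io::DecodeReaderBytes;

fn main() -> std::io::Result<()> {
    // DecodeReaderBytes sniffs a BOM and transcodes the stream to
    // UTF-8, so downstream code can treat the bytes as valid UTF-8.
    let file = File::open("src/lib.rs")?;
    let mut reader = BufReader::new(DecodeReaderBytes::new(file));

    // Reuse one String across lines, as in this PR.
    let mut line = String::new();
    let mut lines = 0usize;
    while reader.read_line(&mut line)? != 0 {
        lines += 1;
        line.clear();
    }
    println!("{} lines", lines);
    Ok(())
}
```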
