Significant slow down at around 400k lines #1
Closed
soaxelbrooke opened this issue Sep 2, 2018 · 9 comments

soaxelbrooke commented Sep 2, 2018

First off, thank you for writing this tool! It's been a great help!

I'm building a word frequency list for online reviews, and ran into a significant slowdown at around 400k lines. The combined data to be counted is about 80M lines (~850M words), so as it stands, it would seemingly never finish.

If anyone else runs into this problem and is looking for a band-aid, the following script does a map-reduce-style split-count-combine:

# install python dependencies: `pip install tqdm`
# install gnu parallel: `sudo apt-get install parallel`
mkdir splits wfs
echo 'Splitting file into parts...'
split -a 5 -l 100000 "$1" splits/split
ls splits/ | parallel 'echo "Counting {}..."; cat splits/{} | wf > wfs/{}_wf.txt'
echo 'Combining split counts...'
python -c '
from tqdm import tqdm; from glob import glob; from collections import Counter
totals = Counter()
for fpath in tqdm(glob("wfs/*")):
    for line in open(fpath):
        word, count = line.split()
        totals[word] += int(count)
with open("wfs.txt", "w") as of:
    for word, count in totals.most_common():
        of.write("{} {}\n".format(word, count))
'
rm -rf wfs splits
echo 'Word frequencies written to wfs.txt.'

soaxelbrooke commented Sep 2, 2018

To provide some timing context:

$ time head -n 100000 review_bodies.txt | wf > wf100k.txt

real    0m0.613s
user    0m0.603s
sys     0m0.014s
$ time head -n 500000 review_bodies.txt | wf > wf500k.txt

real    0m26.117s
user    0m7.571s
sys     0m18.557s
$ time head -n 1000000 review_bodies.txt | wf > wf1m.txt

real    2m19.113s
user    0m31.536s
sys     1m47.588s
$ time head -n 2000000 review_bodies.txt | wf > wf2m.txt

real    6m39.479s
user    1m25.665s
sys     5m13.699s
$ time head -n 80000000 review_bodies.txt | wf > wfs_all.txt

real    651m36.666s
user    244m12.682s
sys     328m30.466s

The whole file has about 1.2M distinct words in it.

jarcane commented Sep 2, 2018

Oh wow! I'm glad it's been useful to you, and sorry about the performance.

It's a pretty naive implementation, written more for clarity than raw performance, so I'm not entirely surprised it wouldn't scale to a file that size.

I'm not the best when it comes to optimization, and the code is pretty simple and straightforward, so I'm not sure where to start. I wonder if it's something with the iterators, or with the HashMap, but I'd have to play around with it to really make any kind of sensible guess.

soaxelbrooke commented

PRs welcome? :P I could give it a shot - I've been doing more Rust lately, and this seems like a place where I might actually be able to help.

jarcane commented Sep 3, 2018

Oh for sure. :)

jarcane commented Sep 3, 2018

Attempted to replicate with a 1GB Wikipedia dump (enwik9, 13M lines), and it peaked at almost 12 GB of memory after a few minutes just doing "cat | wf" on Windows. CPU usage, however, remained pretty low (only peaking around 10% on a Ryzen 7), but it ran for over an hour and showed no real sign of stopping.

I can only imagine how it might balloon with nearly eight times that amount of data.

Part of the problem, no doubt, is that it buffers the entire file into memory first and makes no attempt to stream it, so the first point of attack is probably there. Next, I'd guess, is maybe the HashMap that's used for the dictionary.
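
Roughly, the streaming shape I have in mind for that first point of attack (just a sketch, not the current wf code; it still allocates a String per word, which might be the next thing to look at) would be something like:

use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut counts: HashMap<String, u64> = HashMap::new();

    // Read one line at a time instead of slurping the whole input into memory,
    // so memory grows with the vocabulary rather than with the file size.
    for line in stdin.lock().lines() {
        let line = line?;
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }

    // Sort by descending count before printing.
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, count) in pairs {
        println!("{} {}", word, count);
    }
    Ok(())
}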

jarcane added a commit that referenced this issue Sep 3, 2018
This seems to at least improve performance somewhat with regard to #1

jarcane commented Sep 3, 2018

Huh. Well, this is interesting. With Norvig's big.txt, the Lines iterator version runs 4 seconds faster on Windows, but almost 4 times slower on WSL/Ubuntu 18.04, where the old version is almost instant.

I also don't see the explosion of RAM usage on WSL; it just seems to silently do nothing.

jarcane commented Sep 3, 2018

OK, had a knockabout with valgrind and some tests on sub-sections of the Wikipedia data.

@soaxelbrooke is quite right: it's right around 400k lines that it stalls, seemingly forever.

According to valgrind, the vast majority of the runtime at that point is spent on:

32,046,810,487  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/lib/x86_64-linux-gnu/libc-2.27.so]

That's followed by quite a lot of calls to the string coercion and HashMap functions, but at almost a full order of magnitude fewer calls. Given the glibc call is in the string code, I'm guessing these factors are related.

I have no idea what to make of this, and I welcome expert opinions.

jarcane commented Sep 20, 2018

@soaxelbrooke Good news!

Following some of the advice in llogiq's post (buffered IO, a more efficient way of reading lines, and avoiding a ton of unnecessary string allocation), I've fixed the problem.

Not only does it no longer stall, but it will now chew through the entire 1GB Wikipedia English corpus in 1m30secs on my machine, even with sorting!

As this does make a slight breaking change (I borrowed from your fork the idea of not forcing to_lowercase), I've released it as 0.2.0. Just run cargo install again to update, and you're good to go.
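
For the curious, the general shape of the change (a sketch of the technique, not a line-for-line copy of the new code) is to reuse a single line buffer and only allocate a fresh String when a word is seen for the first time:

use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock(); // stdin's lock is already buffered
    let mut counts: HashMap<String, u64> = HashMap::new();
    let mut line = String::new();

    // Reuse one String as the line buffer instead of allocating a new one per line.
    while reader.read_line(&mut line)? > 0 {
        for word in line.split_whitespace() {
            // Only allocate when the word hasn't been seen before.
            if let Some(count) = counts.get_mut(word) {
                *count += 1;
            } else {
                counts.insert(word.to_owned(), 1);
            }
        }
        line.clear();
    }

    // Sort by descending count.
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_unstable_by(|a, b| b.1.cmp(&a.1));

    // Buffer stdout so the sorted output isn't one write syscall per line.
    let stdout = io::stdout();
    let mut out = io::BufWriter::new(stdout.lock());
    for (word, count) in pairs {
        writeln!(out, "{} {}", word, count)?;
    }
    Ok(())
}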

jarcane closed this as completed Sep 20, 2018
soaxelbrooke commented

Wow! Sorry I didn't get a chance to pitch in here - but amazing results!
