Significant slow down at around 400k lines #1
Closed
soaxelbrooke opened this issue Sep 2, 2018 · 9 comments

soaxelbrooke commented Sep 2, 2018

First off, thank you for writing this tool! It's been a great help!

I'm building a word frequency list for online reviews, and ran into a significant slowdown at around 400k lines. The combined data to be counted is about 80M lines (~850M words), so as it stands, it would seemingly never finish.

If anyone else runs into this problem and is looking for a band-aid, the following script does a map-reduce-style split-count-combine:

# install python dependencies: `pip install tqdm`
# install gnu parallel: `sudo apt-get install parallel`
mkdir splits wfs
echo 'Splitting file into parts...'
split -a 5 -l 100000 "$1" splits/split
ls splits/ | parallel 'echo "Counting {}..."; cat splits/{} | wf > wfs/{}_wf.txt'
echo 'Combining split counts...'
python -c '
from tqdm import tqdm; from glob import glob; from collections import Counter
totals = Counter()
for fpath in tqdm(glob("wfs/*")):
    for line in open(fpath):
        word, count = line.split()
        totals[word] += int(count)
with open("wfs.txt", "w") as of:
    for word, count in totals.most_common():
        of.write("{} {}\n".format(word, count))
'
rm -rf wfs splits
echo 'Word frequencies written to wfs.txt.'

soaxelbrooke commented Sep 2, 2018

To provide some timing context:

$ time head -n 100000 review_bodies.txt | wf > wf100k.txt

real    0m0.613s
user    0m0.603s
sys     0m0.014s
$ time head -n 500000 review_bodies.txt | wf > wf500k.txt

real    0m26.117s
user    0m7.571s
sys     0m18.557s
$ time head -n 1000000 review_bodies.txt | wf > wf1m.txt

real    2m19.113s
user    0m31.536s
sys     1m47.588s
$ time head -n 2000000 review_bodies.txt | wf > wf2m.txt

real    6m39.479s
user    1m25.665s
sys     5m13.699s
$ time head -n 80000000 review_bodies.txt | wf > wfs_all.txt

real    651m36.666s
user    244m12.682s
sys     328m30.466s

The whole file has about 1.2M distinct words in it.

jarcane commented Sep 2, 2018

Oh wow! I'm glad it's been useful to you, and sorry about the performance.

It's a pretty naive implementation, written more for clarity than raw performance, so I'm not entirely surprised it wouldn't scale to a file that size.

I'm not the best when it comes to optimization, and the code is pretty simple and straightforward, so I'm not sure where to start. I wonder if it's something with the iterators, or with the HashMap, but I'd have to play around with it to really make any kind of sensible guess.

soaxelbrooke commented

PRs welcome? :P I could give it a shot - I've been doing more Rust lately, and this seems like a place where I might actually be able to help.

jarcane commented Sep 3, 2018

Oh for sure. :)

jarcane commented Sep 3, 2018

Attempted to replicate with a 1GB Wikipedia dump (enwik9, 13M lines), and it peaked at almost 12 GB of memory after a few minutes just doing "cat | wf" on Windows. CPU usage, however, remained pretty low (only peaking around 10% on a Ryzen 7), but it ran for over an hour and showed no real sign of stopping.

I can only imagine how it might balloon with nearly eight times that amount of data.

Part of the problem, no doubt, is that it buffers the entire file into memory first and makes no attempt to stream it, so the first point of attack is probably there. Next, I'd guess, is maybe the HashMap that's used for the dictionary.
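
Roughly, the streaming shape I have in mind for that first point of attack (just a sketch, not the current wf code; it still allocates a String per word, which might be the next thing to look at) would be something like:

use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut counts: HashMap<String, u64> = HashMap::new();

    // Read one line at a time instead of slurping the whole input into memory,
    // so memory grows with the vocabulary rather than with the file size.
    for line in stdin.lock().lines() {
        let line = line?;
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }

    // Sort by descending count before printing.
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, count) in pairs {
        println!("{} {}", word, count);
    }
    Ok(())
}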

jarcane added a commit that referenced this issue Sep 3, 2018
This seems to at least improve performance somewhat with regard to #1

jarcane commented Sep 3, 2018

Huh. Well, this is interesting. With Norvig's big.txt, the Lines iterator version runs 4 seconds faster on Windows, but almost 4 times slower on WSL/Ubuntu 18.04, where the old version is almost instant.

I also don't see the explosion of RAM usage on WSL; it just seems to silently do nothing.

jarcane commented Sep 3, 2018

OK, had a knockabout with valgrind and some tests on sub-sections of the Wikipedia data.

@soaxelbrooke is quite right: it's right around 400k lines that it stalls, seemingly forever.

According to valgrind, the vast majority of the runtime at that point is spent on:

32,046,810,487  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/lib/x86_64-linux-gnu/libc-2.27.so]

That's followed by quite a lot of calls to the string coercion and HashMap functions, but at almost a full order of magnitude fewer calls. Given the glibc call is in the string code, I'm guessing these factors are related.

I have no idea what to make of this, and I welcome expert opinions.

jarcane commented Sep 20, 2018

@soaxelbrooke Good news!

Following some of the advice in llogiq's post (buffered IO, a more efficient way of reading lines, and avoiding a ton of unnecessary string allocation), I've fixed the problem.

Not only does it no longer stall, but it will now chew through the entire 1GB Wikipedia English corpus in 1m30secs on my machine, even with sorting!

As this does make a slight breaking change (I borrowed from your fork the idea of not forcing to_lowercase), I've released it as 0.2.0. Just run cargo install again to update, and you're good to go.
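
For the curious, the general shape of the change (a sketch of the technique, not a line-for-line copy of the new code) is to reuse a single line buffer and only allocate a fresh String when a word is seen for the first time:

use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock(); // stdin's lock is already buffered
    let mut counts: HashMap<String, u64> = HashMap::new();
    let mut line = String::new();

    // Reuse one String as the line buffer instead of allocating a new one per line.
    while reader.read_line(&mut line)? > 0 {
        for word in line.split_whitespace() {
            // Only allocate when the word hasn't been seen before.
            if let Some(count) = counts.get_mut(word) {
                *count += 1;
            } else {
                counts.insert(word.to_owned(), 1);
            }
        }
        line.clear();
    }

    // Sort by descending count.
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_unstable_by(|a, b| b.1.cmp(&a.1));

    // Buffer stdout so the sorted output isn't one write syscall per line.
    let stdout = io::stdout();
    let mut out = io::BufWriter::new(stdout.lock());
    for (word, count) in pairs {
        writeln!(out, "{} {}", word, count)?;
    }
    Ok(())
}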

jarcane closed this as completed Sep 20, 2018
soaxelbrooke commented

Wow! Sorry I didn't get a chance to pitch in here - but amazing results!
