Significant slowdown at around 400k lines #1
To provide some timing context:
$ time head -n 100000 review_bodies.txt | wf > wf100k.txt
real 0m0.613s
user 0m0.603s
sys 0m0.014s
$ time head -n 500000 review_bodies.txt | wf > wf500k.txt
real 0m26.117s
user 0m7.571s
sys 0m18.557s
$ time head -n 1000000 review_bodies.txt | wf > wf1m.txt
real 2m19.113s
user 0m31.536s
sys 1m47.588s
$ time head -n 2000000 review_bodies.txt | wf > wf2m.txt
real 6m39.479s
user 1m25.665s
sys 5m13.699s
$ time head -n 80000000 review_bodies.txt | wf > wfs_all.txt
real 651m36.666s
user 244m12.682s
sys 328m30.466s
The whole file has about 1.2M distinct words in it.
Oh wow! I'm glad it's been useful to you, and sorry about the performance. It's a pretty naive implementation, written more for clarity than raw performance, so I'm not entirely surprised it wouldn't scale to a file that size. I'm not the best when it comes to optimization, and the code is pretty simple and straightforward, so I'm not sure where to start looking. I wonder if it's something with the iterators, or with the HashMap, but I'd have to play around to really make any kind of sensible guess.
PRs welcome? :P I could give it a shot - I've been doing more Rust lately, and this seems like a place where I might actually be able to help.
Oh for sure. :)
Attempted to replicate with a 1GB Wikipedia dump (enwik9, 13M lines), and it peaked at almost 12 gigs of memory after a few minutes just doing "cat | wf" on Windows. CPU usage, however, remained pretty low (only peaking around 10% on a Ryzen 7), but it ran for over an hour with no real sign of stopping. I can only imagine how it might balloon with nearly eight times that amount of data. Part of the problem, no doubt, is that it buffers the entire file into memory first and makes no attempt to stream it, so the first point of attack is probably there. Next, I would guess, is maybe the HashMap that's used for the dictionary.
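For illustration, the kind of pattern being described would look roughly like this; this is a hypothetical sketch of the naive approach, not the actual wf source:

```rust
use std::collections::HashMap;
use std::io::{self, Read};

// Hypothetical sketch of the pattern described above (not the actual wf code):
// the entire input is pulled into memory up front, and every word is lowercased
// into a fresh String before it is counted.
fn main() -> io::Result<()> {
    let mut text = String::new();
    io::stdin().read_to_string(&mut text)?; // buffers the whole file in RAM

    let mut counts: HashMap<String, u64> = HashMap::new();
    for word in text.split_whitespace() {
        // One new String allocation per word, even for words seen many times.
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }

    for (word, count) in &counts {
        println!("{} {}", word, count);
    }
    Ok(())
}
```

The up-front read_to_string is what pins the whole file in memory at once, and the per-word allocations are a plausible source of the churn discussed below.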
This seems to at least improve performance somewhat with regard to #1
Huh. Well, this is interesting. With Norvig's big.txt, the Lines iterator version runs 4 seconds faster on Windows, but almost 4 times slower on WSL/Ubuntu 18.04, where the old version is almost instant. Also, I don't see the explosion of RAM usage on WSL either; it just silently seems to do nothing.
OK, had a knockabout with valgrind and some tests on sub-sections of the Wikipedia data. @soaxelbrooke is quite right: at just about 400k lines it stalls forever. According to valgrind, at that point the vast majority of the runtime is spent in a single glibc string call, followed by quite a lot of calls to the string coercion and HashMap functions, though at almost a full order of magnitude fewer calls. Given that the hot glibc call is string-related, I'm guessing these factors are connected. I have no idea what this should look like, and I welcome expert opinions.
@soaxelbrooke Good news! Following some of the advice in llogiq's post (using buffered IO and a more efficient method for reading lines, plus avoiding a ton of unnecessary string allocation), I've fixed the problem. Not only does it no longer stall, it now chews through the entire 1GB Wikipedia English corpus in 1m30s on my machine, even with sorting! As this makes a slight breaking change (I borrowed the idea from your fork of not forcing to_lowercase), I've released it as 0.2.0. Just run
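As a rough sketch of the kind of fix described here (streaming the input through buffered reads, reusing a single line buffer, and only allocating a new key when a word hasn't been seen before), something like the following works; it assumes whitespace-delimited words on stdin and is illustrative, not the actual 0.2.0 code:

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, BufReader, Write};

// Illustrative sketch (not the actual wf 0.2.0 source): buffered, streaming
// line reads with one reused String buffer, and no forced to_lowercase.
fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = BufReader::new(stdin.lock());
    let mut counts: HashMap<String, u64> = HashMap::new();
    let mut line = String::new();

    // read_line appends into the same buffer each time, so only one line is
    // ever held in memory and no new String is allocated per line.
    while reader.read_line(&mut line)? != 0 {
        for word in line.split_whitespace() {
            if let Some(count) = counts.get_mut(word) {
                *count += 1;
            } else {
                // Allocate a key only the first time a word is seen.
                counts.insert(word.to_string(), 1);
            }
        }
        line.clear();
    }

    // Sort by descending count before printing.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));

    let stdout = io::stdout();
    let mut out = stdout.lock();
    for (word, count) in sorted {
        writeln!(out, "{} {}", word, count)?;
    }
    Ok(())
}
```

The get_mut-then-insert lookup is what avoids allocating a String per word: for a corpus like the one described above (roughly 1.2M distinct words out of ~850M total), almost every lookup hits the map and allocates nothing.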
Wow! Sorry I didn't get a chance to pitch in here - but amazing results! |
First off, thank you for writing this tool! It's been a great help!
I'm building a word frequency list for online reviews, and ran into a significant slowdown at around 400k lines - the combined data to be counted is about 80M lines (~850M words), so as it stands, it would seemingly never finish.
If anyone else runs into this problem and is looking for a band-aid, the following script does a map-reduce-style split-count-combine:
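The original script isn't preserved in this thread, but the idea is to split the input into chunks (e.g. with split -l), run wf on each chunk separately, and then sum the per-chunk counts. Below is a rough sketch of just the combine step in Rust; it assumes each chunk's output has one whitespace-separated word/count pair per line (an assumption about wf's output format) and uses a hypothetical binary name:

```rust
use std::collections::HashMap;
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Rough sketch of the "combine" step only (not the original script).
// Assumes each argument is a per-chunk wf output file containing one
// whitespace-separated "word count" pair per line (an assumed format).
// Hypothetical usage: combine_counts chunk1.txt chunk2.txt ... > combined.txt
fn main() -> std::io::Result<()> {
    let mut totals: HashMap<String, u64> = HashMap::new();

    for path in env::args().skip(1) {
        let reader = BufReader::new(File::open(&path)?);
        for line in reader.lines() {
            let line = line?;
            let mut parts = line.split_whitespace();
            if let (Some(word), Some(count)) = (parts.next(), parts.next()) {
                if let Ok(n) = count.parse::<u64>() {
                    *totals.entry(word.to_string()).or_insert(0) += n;
                }
            }
        }
    }

    // Print the merged counts, highest first.
    let mut sorted: Vec<_> = totals.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, count) in sorted {
        println!("{} {}", word, count);
    }
    Ok(())
}
```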