
File IO slow #8826

Closed
alexalemi opened this issue Oct 27, 2014 · 27 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster

Comments

@alexalemi

I'm trying to count how often each word occurs in a file, and I'm seeing roughly 10x slower speeds for Julia compared to C, Python, or Go.

counts = Dict{String,Int}()

open(ARGS[1],"r") do f
    for line in eachline(f)
        for word in split(line)
            counts[word] = get(counts,word,0) + 1 
        end
    end
end

cut = 10

guys = collect(counts)
sort!( guys, by=(x)->x[2], rev=true )

for x in guys
    if x[2] > cut
        println(x[1], '\t',  x[2])
    end
end

On a 96 MB text file, Julia takes 54.35 seconds on my machine, versus 2.69 s for equivalent C code, 7.35 s for Python, and 5.98 s for Go.

Profiling the code, it seems most of the time is spent reading the file. I've seen other mentions of slow I/O with regard to reading CSV files (e.g. #3350 or this group discussion) where some specific fixes were worked out, but those discussions are a year old, and I was wondering whether there is hope of speeding up I/O more generally.

I've thrown together a git repository of the test code.

@staticfloat
Member

Try reading the file in one go, and then iterating over each line in memory.
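For reference, that suggestion might look like the sketch below, in current Julia syntax (`readall(f)` was the API at the time; `read(fn, String)` is the modern spelling). The counting logic is unchanged from the original script; only the I/O pattern differs.

```julia
# Slurp the whole file with a single read() call, then count words in memory,
# instead of streaming line by line through the I/O layer.
function wordcounts_inmemory(fn::AbstractString)
    counts = Dict{String,Int}()
    for line in split(read(fn, String), '\n')
        for word in split(line)
            counts[word] = get(counts, word, 0) + 1
        end
    end
    return counts
end
```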

@StefanKarpinski
Member

This should work and be fast; we've seen similar issues with our I/O performance, especially line-oriented reading. It should be fixed, and I'm working toward it: string performance at the moment, with general I/O after that.

@StefanKarpinski
Member

@alexalemi, if the data is ok to make public and you wouldn't mind posting it on S3 or Dropbox or something like that, this could make a good benchmark for measuring the progress of I/O and string improvements. If you don't want to make the data public but wouldn't mind sending it to me, that would be ok too.

@staticfloat
Member

We could also just grab every Shakespeare play ever or something, and make that the dataset if we can't use his. ;)

@StefanKarpinski
Member

Or do word count on all of Wikipedia.

@jiahao
Member

jiahao commented Oct 27, 2014

We could also just grab every Shakespeare play ever or something

http://www.gutenberg.org/ebooks/100

@staticfloat
Member

I was expecting more than just 5MB. Amazing.

@johnmyleswhite
Member

All of Gutenberg combined is pretty small. :)

@alexalemi
Author

@StefanKarpinski

I was actually doing my test runs on a publicly available dataset: text8 (text8.zip) from Matt Mahoney, which is itself a processed dump of part of Wikipedia (the first 10^8 bytes; more about the data).

@staticfloat

Eventually I would like to work on a much larger text corpus, where reading the whole thing into memory would be painful, if not impossible. Anyway, doing a readall followed by split still takes 46 s.

@StefanKarpinski
Member

Reading all the data into memory at once is not an official recommendation. Obviously, streaming over a text file should work just fine and the fact that it's not as fast as it should be is a performance issue.

@ivarne ivarne added performance Must go faster io Involving the I/O subsystem: libuv, read, write, etc. labels Oct 27, 2014
@staticfloat
Member

Yes, sorry, I should have expounded: I was merely trying to nail down for certain where the performance degradation was coming from. If the slowdown persists when the file is loaded into memory, it likely has more to do with how we're iterating over the lines (i.e. string manipulation) than with disk I/O. In any case, what you have written definitely should work.

@alexalemi
Author

For what it's worth, I've put the code I used for testing up in a git repository, in the rough style of the micro benchmarks in test/perf/micro, if it helps.

@catawbasam
Contributor

I get linecount == 1 when running the code below with fn = "text8". It looks like the file contains a giant list of terms with no line breaks to iterate over, so split() is getting a serious workout!

Having said that, I certainly agree that text IO should be faster. Looks like we could make EachLine immutable and get a speed bump from that.

function readfile1(fn)
    linecount=0
    wordcount=0
    open(fn,"r") do f
        for line in eachline(f)
            linecount+=1
            for word in split(line)
                wordcount+=1
                #counts[word] = get(counts,word,0) + 1 
            end
        end
    end
    return linecount, wordcount
end

@garborg
Contributor

garborg commented Oct 28, 2014

I ran the code on the first 20th of your file. It looks like getting and setting on counts in the inner loop should be the bottleneck. Reproducible gist here.

A couple notes:

  1. I expected DataStructures.counter to speed things up. It didn't, but I'm not sure whether that's just because I had to use counter(SubString{ASCIIString}) instead of counter(String), since convert(String, word) still returns a SubString{ASCIIString} and I wasn't sure what else to do.

  2. Maybe something like eachword(io, split=Base._default_delims) would be nice, even if the extra allocation isn't a bottleneck here.
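The hypothetical eachword suggested above could be sketched as a flattening generator; the name and the lazy goal come from the suggestion, while this simple version still allocates one array per line via split.

```julia
# Hypothetical eachword iterator: lazily yields whitespace-separated tokens
# from an IO stream, one line at a time, via a nested generator.
eachword(io::IO) = (word for line in eachline(io) for word in split(line))
```

A truly allocation-free version would scan for delimiters directly, as mysplit below does for a single string.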

@alexalemi
Author

@catawbasam True, text8 is only one line. If I split text8 up so that each line has a reasonable number of words, with fmt, the Julia code goes from 54 s on my machine down to 32 s, but that is still noticeably slower than the ~6 s for Go or Python, or 3 s for C. I'll also note that the Python script is more or less identical to the Julia one and still manages 6 seconds even with the massive single line of text; it gives a similar time on the fmted version. Same for the Go/C code.

I added fmted text8 versions of the tests to my repo.

@remusao
Contributor

remusao commented Nov 2, 2014

I did similar benchmarks recently (in Python, Julia, C++, C, Haskell, and Rust; @alexalemi, maybe I can pull-request them into your repo?), and also wondered why Julia was slower. It seems that if you remove the counter update in the main loop, so that the code just iterates over the lines of a file, Julia is still 3 times slower than my Python version.

@alexalemi
Author

@remusao Be my guest.

@nolta
Member

nolta commented Nov 4, 2014

Rewriting split gets me a 4x speedup:

function mysplit(s::ASCIIString)
    a = ASCIIString[]
    i = 1
    was_space = true
    for j = 1:length(s)
        is_space = isspace(s[j])
        if is_space
            was_space || push!(a, s[i:j-1])
        else
            was_space && (i = j)
        end
        was_space = is_space
    end
    was_space || push!(a, s[i:end])
    a
end

function main(fn)
    counts = Dict{ASCIIString,Int}()

    open(fn,"r") do f
        for line in eachline(f)
            for word in mysplit(line)
                counts[word] = get(counts,word,0) + 1 
            end
        end
    end

    cut = 10

    guys = collect(counts)
    sort!( guys, by=(x)->x[2], rev=true )

    for x in guys
        if x[2] > cut
            println(x[1], '\t',  x[2])
        end
    end
end

isinteractive() || main(ARGS[1])

nolta added a commit that referenced this issue Nov 6, 2014
Cuts the #8826 vocab.jl benchmark in half.
@nolta
Member

nolta commented Nov 6, 2014

I took a closer look at why rewriting split sped things up so much, and the results weren't quite what I expected. Here's a breakdown of how we could reach parity with Python. Numbers in bold are how much slower Julia is relative to Python at each stage.

  • 7.4x — starting point.

  • 3.7x — fix slow SubString hashing (4b4565f).

  • 2.1x — fix eachline type instability. Merging ASCIIString & UTF8String will probably fix this, but for now

    for line::ASCIIString in eachline(f)
  • 1.9x — tighter counts type:

    counts = Dict{SubString{ASCIIString},Int}()
  • 1.7x — faster split (see previous comment).

  • 1.2x — reuse hash table index.

    counts[word] = get(counts,word,0) + 1

    looks up word in counts twice, which is inefficient. Replace it with increment!(counts, word) where

    function increment!{K,V}(h::Dict{K,V}, key::K)
        index = Base.ht_keyindex2(h, key)
        if index > 0
            h.vals[index] = h.vals[index] + one(V)
        else
            Base._setindex!(h, one(V), key, -index)
        end
    end

    This same idea would speed up the DataStructures.jl push!(::Accumulator, k, v) method.

  • 1.1x — wrap the code in a function.
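The increment! idea above can be transcribed to current Julia syntax as a sketch: the `{K,V}` method form is now written `where {K,V}`, and the internal helper was renamed `ht_keyindex2!`. Note that both `ht_keyindex2!` and `_setindex!` are unexported Base internals and may change between versions.

```julia
# Single-lookup counter increment: find (or reserve) the hash slot once,
# then write through it, instead of hashing the key for both get and setindex!.
function increment!(h::Dict{K,V}, key::K) where {K,V}
    index = Base.ht_keyindex2!(h, key)   # positive: found; negative: free slot
    if index > 0
        @inbounds h.vals[index] += one(V)
    else
        @inbounds Base._setindex!(h, one(V), key, -index)
    end
    return h
end
```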

@ViralBShah
Member

That is pretty awesome analysis and would make for a nice blog post, I think.

@nalimilan
Member

This same idea would speed up the DataStructures.jl push!(::Accumulator, k, v) method.

+1 I felt the need for a similar function when writing code to compute frequency tables. An additional version increment!(counts, word, value) which would add value to the counts would be useful too.
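A sketch of that variant, using only the public Dict API for safety (so it still pays for two lookups; the single-lookup index trick from the previous comment could be layered on top):

```julia
# increment!(counts, key, value): add `value` to the count for `key`,
# defaulting to an increment of one. Public-API version with two lookups.
function increment!(h::Dict{K,V}, key, value::V = one(V)) where {K,V}
    h[key] = get(h, key, zero(V)) + value
    return h
end
```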

@kmsquire
Member

kmsquire commented Nov 6, 2014


Steve Vavasis has been working on a SortedDict type for DataStructures. It's not quite ready yet, but when ready, it will include a Token interface which allows a user to get a token to a particular location in the dictionary and modify the value later without looking it up again (taking care that the location hasn't changed, etc.). Once finalized, I plan to implement something similar for Dict types.


@nalimilan
Member

@kmsquire Great!

@rtick

rtick commented Nov 13, 2014

Hey all,

I am currently a junior at the University of Notre Dame, and as a final project my team of four is looking to improve the efficiency of some of the issues reported in Julia. We have all done a lot of work with C, C++, Python, etc., and are currently taking a fairly advanced data structures class. I wanted to post here to see if we could be of help and whether you had any ideas. We were thinking about trying to solve this issue, but it seems like you already have a lot of good ideas going around and are making progress. Let us know what we can do to help!

@ViralBShah
Member

@rtick The way to contribute is to jump in and create a pull request. I am sure you have seen the other issues tagged performance. Another good project is the ARM port, or the distributed-array infrastructure in Julia, which needs a lot more work. Feel free to write to me at [email protected] if you want to discuss any of these ideas further.

nolta added a commit that referenced this issue Nov 14, 2014
About 1.5x faster. Helps with #8826.
nolta added a commit that referenced this issue Nov 18, 2014
- don't call utf8sizeof() twice when writing a Char
- faster sizeof(::SubString{ASCIIString})

About 6% faster.
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
Cuts the JuliaLang#8826 vocab.jl benchmark in half.
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
- don't call utf8sizeof() twice when writing a Char
- faster sizeof(::SubString{ASCIIString})

About 6% faster.
@RaviMohan

I just tested this on my machine (quadcore S7, 32 GB RAM, Ubuntu 14.04) and got the following timings:

---Testing C
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

1.56user 0.02system 0:01.59elapsed 99%CPU (0avgtext+0avgdata 33512maxresident)k
0inputs+1080outputs (0major+6039minor)pagefaults 0swaps

---Testing Python 2
Python 2.7.6
5.25user 0.25system 0:05.50elapsed 99%CPU (0avgtext+0avgdata 1012416maxresident)k
0inputs+1024outputs (0major+200543minor)pagefaults 0swaps

---Testing Julia
julia version 0.4.0-dev
8.55user 0.48system 0:08.80elapsed 102%CPU (0avgtext+0avgdata 2681616maxresident)k
0inputs+1024outputs (0major+39741minor)pagefaults 0swaps

Julia takes about 1.6x as long as Python. That could probably still be better, but the almost 8x slowdown seems to be fixed.
I recommend closing this.

@ViralBShah
Member

We really should be as fast as C! But I agree that this is good enough to close the issue. Perhaps the thing to do here is to figure out where we are losing time and file separate issues.
