
File IO slow #8826

Closed
alexalemi opened this issue Oct 27, 2014 · 27 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster

Comments

@alexalemi

I'm trying to count how often each word occurs in a file, and I'm seeing roughly 10x slower speeds for Julia compared to C, Python, or Go.

counts = Dict{String,Int}()

open(ARGS[1],"r") do f
    for line in eachline(f)
        for word in split(line)
            counts[word] = get(counts,word,0) + 1 
        end
    end
end

cut = 10

guys = collect(counts)
sort!( guys, by=(x)->x[2], rev=true )

for x in guys
    if x[2] > cut
        println(x[1], '\t',  x[2])
    end
end

On a 96 MB text file, Julia takes 54.35 seconds on my machine, versus 2.69 s for equivalent C code, 7.35 s for Python, and 5.98 s for Go.

Profiling the code, it seems most of the time is spent reading the file. I've seen other mentions of slow I/O with regard to reading CSV files (e.g. #3350 or this group discussion) where some specific fixes were worked out, but those discussions are a year old, and I was wondering whether there is hope of speeding up I/O more generally.

I've thrown together a git repository of the test code.

@staticfloat
Member

Try reading the file in one go, and then iterating over each line in memory.
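For reference, that suggestion might look like the sketch below, in current Julia syntax (`readall(f)` was the API at the time; `read(fn, String)` is the modern spelling). The counting logic is unchanged from the original script; only the I/O pattern differs.

```julia
# Slurp the whole file with a single read() call, then count words in memory,
# instead of streaming line by line through the I/O layer.
function wordcounts_inmemory(fn::AbstractString)
    counts = Dict{String,Int}()
    for line in split(read(fn, String), '\n')
        for word in split(line)
            counts[word] = get(counts, word, 0) + 1
        end
    end
    return counts
end
```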

@StefanKarpinski
Member

This should work and be fast; we've seen similar issues with our I/O performance, especially line-oriented reading. It should be fixed, and I'm working toward it: string performance at the moment, with general I/O after that.

@StefanKarpinski
Member

@alexalemi, if the data is ok to make public and you wouldn't mind posting it on S3 or Dropbox or something like that, this could make a good benchmark for measuring the progress of I/O and string improvements. If you don't want to make the data public but wouldn't mind sending it to me, that would be ok too.

@staticfloat
Member

We could also just grab every Shakespeare play ever or something, and make that the dataset if we can't use his. ;)

@StefanKarpinski
Member

Or do word count on all of Wikipedia.

@jiahao
Member

jiahao commented Oct 27, 2014

We could also just grab every Shakespeare play ever or something

http://www.gutenberg.org/ebooks/100

@staticfloat
Member

I was expecting more than just 5MB. Amazing.

@johnmyleswhite
Member

All of Gutenberg combined is pretty small. :)

@alexalemi
Author

@StefanKarpinski

I was actually doing my test runs on a publicly available dataset: text8 (text8.zip) from Matt Mahoney, which is itself a processed dump of part of Wikipedia (the first 10^8 bytes; more about the data).

@staticfloat

Eventually I would like to work on a much larger text corpus, where reading the whole thing into memory would be painful, if not impossible. Anyway, doing a readall followed by split still takes 46 s.

@StefanKarpinski
Member

Reading all the data into memory at once is not an official recommendation. Obviously, streaming over a text file should work just fine and the fact that it's not as fast as it should be is a performance issue.

@ivarne ivarne added performance Must go faster io Involving the I/O subsystem: libuv, read, write, etc. labels Oct 27, 2014
@staticfloat
Member

Yes, sorry, I should have expounded: I was merely trying to nail down for certain where the performance degradation was coming from. If the slowdown persists when the file is loaded into memory, it likely has more to do with how we're iterating over the lines (i.e. string manipulation) than with disk I/O. In any case, what you have written definitely should work.

@alexalemi
Author

For what it's worth, I've put the code I used for testing up in a git repository, in the rough style of the micro benchmarks in test/perf/micro, if it helps.

@catawbasam
Contributor

I get linecount == 1 when running the code below with fn = "text8". It looks like the file contains a giant list of terms with no line breaks to iterate over, so split() is getting a serious workout!

Having said that, I certainly agree that text IO should be faster. Looks like we could make EachLine immutable and get a speed bump from that.

function readfile1(fn)
    linecount=0
    wordcount=0
    open(fn,"r") do f
        for line in eachline(f)
            linecount+=1
            for word in split(line)
                wordcount+=1
                #counts[word] = get(counts,word,0) + 1 
            end
        end
    end
    return linecount, wordcount
end

@garborg
Contributor

garborg commented Oct 28, 2014

I ran the code on the first 20th of your file. It looks like getting and setting on counts in the inner loop should be the bottleneck. Reproducible gist here.

A couple notes:

  1. I expected DataStructures.counter to speed things up. It didn't, but I'm not sure whether that's just because I had to use counter(SubString{ASCIIString}) instead of counter(String), since convert(String, word) still returns a SubString{ASCIIString} and I wasn't sure what else to do.

  2. Maybe something like eachword(io, split=Base._default_delims) would be nice, even if the extra allocation isn't a bottleneck here.
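The hypothetical eachword suggested above could be sketched as a flattening generator; the name and the lazy goal come from the suggestion, while this simple version still allocates one array per line via split.

```julia
# Hypothetical eachword iterator: lazily yields whitespace-separated tokens
# from an IO stream, one line at a time, via a nested generator.
eachword(io::IO) = (word for line in eachline(io) for word in split(line))
```

A truly allocation-free version would scan for delimiters directly, as mysplit below does for a single string.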

@alexalemi
Author

@catawbasam True, text8 is only one line. If I split text8 up so that each line has a reasonable number of words, with fmt, the Julia code goes from 54 s on my machine down to 32 s, but that is still noticeably slower than the ~6 s for Go or Python, or 3 s for C. I'll also note that the Python script is more or less identical to the Julia one and still manages 6 seconds even with the massive single line of text; it gives a similar time on the fmted version. Same for the Go/C code.

I added fmted text8 versions of the tests to my repo.

@remusao
Contributor

remusao commented Nov 2, 2014

I did similar benchmarks recently (in Python, Julia, C++, C, Haskell, and Rust; @alexalemi, maybe I can pull-request them into your repo?), and also wondered why Julia was slower. It seems that if you remove the counter update in the main loop, so that the code just iterates over the lines of a file, Julia is still 3 times slower than my Python version.

@alexalemi
Author

@remusao Be my guest.

@nolta
Member

nolta commented Nov 4, 2014

Rewriting split gets me a 4x speedup:

function mysplit(s::ASCIIString)
    a = ASCIIString[]
    i = 1
    was_space = true
    for j = 1:length(s)
        is_space = isspace(s[j])
        if is_space
            was_space || push!(a, s[i:j-1])
        else
            was_space && (i = j)
        end
        was_space = is_space
    end
    was_space || push!(a, s[i:end])
    a
end

function main(fn)
    counts = Dict{ASCIIString,Int}()

    open(fn,"r") do f
        for line in eachline(f)
            for word in mysplit(line)
                counts[word] = get(counts,word,0) + 1 
            end
        end
    end

    cut = 10

    guys = collect(counts)
    sort!( guys, by=(x)->x[2], rev=true )

    for x in guys
        if x[2] > cut
            println(x[1], '\t',  x[2])
        end
    end
end

isinteractive() || main(ARGS[1])

nolta added a commit that referenced this issue Nov 6, 2014
Cuts the #8826 vocab.jl benchmark in half.
@nolta
Member

nolta commented Nov 6, 2014

I took a closer look at why rewriting split sped things up so much, and the results weren't quite what I expected. Here's a breakdown of how we could reach parity with Python. Numbers in bold are how much slower Julia is relative to Python at each stage.

  • 7.4x — starting point.

  • 3.7x — fix slow SubString hashing (4b4565f).

  • 2.1x — fix eachline type instability. Merging ASCIIString & UTF8String will probably fix this, but for now

    for line::ASCIIString in eachline(f)
  • 1.9x — tighter counts type:

    counts = Dict{SubString{ASCIIString},Int}()
  • 1.7x — faster split (see previous comment).

  • 1.2x — reuse hash table index.

    counts[word] = get(counts,word,0) + 1

    looks up word in counts twice, which is inefficient. Replace it with increment!(counts, word) where

    function increment!{K,V}(h::Dict{K,V}, key::K)
        index = Base.ht_keyindex2(h, key)
        if index > 0
            h.vals[index] = h.vals[index] + one(V)
        else
            Base._setindex!(h, one(V), key, -index)
        end
    end

    This same idea would speed up the DataStructures.jl push!(::Accumulator, k, v) method.

  • 1.1x — wrap the code in a function.
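The increment! idea above can be transcribed to current Julia syntax as a sketch: the `{K,V}` method form is now written `where {K,V}`, and the internal helper was renamed `ht_keyindex2!`. Note that both `ht_keyindex2!` and `_setindex!` are unexported Base internals and may change between versions.

```julia
# Single-lookup counter increment: find (or reserve) the hash slot once,
# then write through it, instead of hashing the key for both get and setindex!.
function increment!(h::Dict{K,V}, key::K) where {K,V}
    index = Base.ht_keyindex2!(h, key)   # positive: found; negative: free slot
    if index > 0
        @inbounds h.vals[index] += one(V)
    else
        @inbounds Base._setindex!(h, one(V), key, -index)
    end
    return h
end
```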

@ViralBShah
Member

That is pretty awesome analysis and would make for a nice blog post, I think.

@nalimilan
Member

This same idea would speed up the DataStructures.jl push!(::Accumulator, k, v) method.

+1 I felt the need for a similar function when writing code to compute frequency tables. An additional version increment!(counts, word, value) which would add value to the counts would be useful too.
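A sketch of that variant, using only the public Dict API for safety (so it still pays for two lookups; the single-lookup index trick from the previous comment could be layered on top):

```julia
# increment!(counts, key, value): add `value` to the count for `key`,
# defaulting to an increment of one. Public-API version with two lookups.
function increment!(h::Dict{K,V}, key, value::V = one(V)) where {K,V}
    h[key] = get(h, key, zero(V)) + value
    return h
end
```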

@kmsquire
Member

kmsquire commented Nov 6, 2014


Steve Vavasis has been working on a SortedDict type for DataStructures. It's not quite ready yet, but when ready, it will include a Token interface which allows a user to get a token to a particular location in the dictionary and modify the value later without looking it up again (taking care that the location hasn't changed, etc.). Once finalized, I plan to implement something similar for Dict types.


@nalimilan
Member

@kmsquire Great!

@rtick

rtick commented Nov 13, 2014

Hey all,

I am currently a junior at the University of Notre Dame, and as a final project my team of four is looking to improve the efficiency of some of the issues reported in Julia. We have all done a lot of work with C, C++, Python, etc., and are currently taking a fairly advanced data structures class. I wanted to post here to see if we could be of help and whether you had any ideas. We were thinking about trying to solve this issue, but it seems like you already have a lot of good ideas going around and are making progress. Let us know what we can do to help!

@ViralBShah
Member

@rtick The way to contribute is to jump in and create a pull request. I am sure you have seen the other issues tagged performance. Another good project is the ARM port, or the distributed-array infrastructure in Julia, which needs a lot more work. Feel free to write to me at [email protected] if you want to discuss any of these ideas further.

nolta added a commit that referenced this issue Nov 14, 2014
About 1.5x faster. Helps with #8826.
nolta added a commit that referenced this issue Nov 18, 2014
- don't call utf8sizeof() twice when writing a Char
- faster sizeof(::SubString{ASCIIString})

About 6% faster.
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
Cuts the JuliaLang#8826 vocab.jl benchmark in half.
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
waTeim pushed a commit to waTeim/julia that referenced this issue Nov 23, 2014
- don't call utf8sizeof() twice when writing a Char
- faster sizeof(::SubString{ASCIIString})

About 6% faster.
@RaviMohan

I just tested this on my machine (quadcore S7, 32 GB RAM, Ubuntu 14.04) and got the following timings:

---Testing C
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

1.56user 0.02system 0:01.59elapsed 99%CPU (0avgtext+0avgdata 33512maxresident)k
0inputs+1080outputs (0major+6039minor)pagefaults 0swaps

---Testing Python 2
Python 2.7.6
5.25user 0.25system 0:05.50elapsed 99%CPU (0avgtext+0avgdata 1012416maxresident)k
0inputs+1024outputs (0major+200543minor)pagefaults 0swaps

---Testing Julia
julia version 0.4.0-dev
8.55user 0.48system 0:08.80elapsed 102%CPU (0avgtext+0avgdata 2681616maxresident)k
0inputs+1024outputs (0major+39741minor)pagefaults 0swaps

Julia takes about 1.6x as long as Python. That could probably still be better, but the almost 8x slowdown seems to be fixed.
I recommend closing this.

@ViralBShah
Member

We really should be as fast as C! But I agree that this is good enough to close the issue. Perhaps the thing to do here is to figure out where we are losing time and file separate issues.
