Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BitArray performance #2360

Closed
StefanKarpinski opened this issue Feb 19, 2013 · 24 comments
Closed

BitArray performance #2360

StefanKarpinski opened this issue Feb 19, 2013 · 24 comments
Labels
performance Must go faster priority This should be addressed urgently

Comments

@StefanKarpinski
Copy link
Member

BitArrays are substantially slower for many operations than and Array{Bool} would be (see JuliaData/DataFrames.jl#186). It would be a big win if we could speed them up.

@carlobaldassi
Copy link
Member

I managed to get the same performance as Array{Bool} when indexing with random indices (idx2=rand(1:N,N) in @tshort example), while indexing with sorted vectors (idx1=[1:N]) is still slower, but it's now comparable rather than being an order of magnitude slower.
Note: I specifically addressed the example discussed in JuliaData/DataFrames.jl#186 (and some similar patterns); indexing one item at a time (e.g. in a loop) would still be faster with regular Arrays.

@johnmyleswhite
Copy link
Member

Thanks, Carlo!

@cpcloud
Copy link

cpcloud commented Feb 22, 2013

This still seems to be a problem:

macro timeit(ex, n)
    quote
        t = Inf

        for i = 1:$n
            t = min(t, @elapsed $ex)
        end

        if print_output
            n = $n
            println("best of $n: ", t, " seconds")
        end
    end
end

function gt{T}(x::Matrix{T}, y::T)
    m, n = size(x)
    out = Array(Bool, m, n)
    for i = 1:m, j = 1:n
        out[i, j] = x[i, j] > y
    end
    out
end

x = rand(2_000_000, 16)

# BitArray version
@timeit x .> 0.5 10
# best of 10: best of 10: 1.965562105178833 seconds

# Array{Bool} version
@timeit gt(x, 0.5) 10
# best of 10: best of 10: 0.0854039192199707 seconds

@pao
Copy link
Member

pao commented Feb 22, 2013

@cpcloud FYI, the Benchmark package provides a nice compare function that does exactly this sort of performance comparison and prints out readily copy-and-pasteable results.

Silly question, but you're definitely using a build after 0c6b975 was committed?

@cpcloud
Copy link

cpcloud commented Feb 22, 2013

@pao Thanks! Wasn't aware of that package. I usually pull and build at least once per day.

@pao
Copy link
Member

pao commented Feb 23, 2013

Definitely agree something is wrong here, using the same x and gt defined by @cpcloud. Reopening.

julia> compare([()-> x.>0.5, ()-> gt(x, 0.5)], 10)
2x4 DataFrame:
            Function Elapsed Relative Replications
[1,]    "# function" 36.6735  21.5232           10
[2,]    "# function"  1.7039      1.0           10

@pao pao reopened this Feb 23, 2013
@diegozea
Copy link
Contributor

Some ideas: Looks like in iterations AbstractArray's function are used. In each step, next() is now checking bounds (looks not necessary on this context, because done is checked too) and taking the Uint64 chunk. Maybe define next for BitArray avoiding Bounds check can give some performance improve. Is there a way for not get 64 times the same chunk? I can't figure out a way for this last, and I don't know how important is to performance anyway.

@cpcloud
Copy link

cpcloud commented Mar 25, 2013

Wouldn't bool arrays also call the abstract array methods as well?

@diegozea
Copy link
Contributor

I playing with a uncheck next() for BitArray.
Is ~ 2.6 faster than the actual implementation:

julia> using Benchmark

julia> importall Base

julia> import Base.to_index, Base.getindex_unchecked

julia> function getindex_unchecked(B::BitArray, i::Real)
           i = to_index(i)
           getindex_unchecked(B.chunks, i)
       end
# methods for generic function getindex_unchecked
getindex_unchecked(Bc::Array{Uint64,1},i::Integer) at bitarray.jl:252
getindex_unchecked(B::BitArray{N},i::Real) at none:2

julia> const ar = BitArray(1000);

julia> function check()
         for i in ar
           i == false
         end
       end
# methods for generic function check
check() at none:2

julia> function uncheck()
         ind = start(ar)
         while !done(ar,ind)
           i = getindex_unchecked(ar,ind)
           ind = ind+1
           i == false
         end
       end
# methods for generic function uncheck
uncheck() at none:2

julia> compare([check,uncheck],1000)
2x4 DataFrame:
         Function   Elapsed Relative Replications
[1,]      "check" 0.0367271  2.61999         1000
[2,]    "uncheck" 0.0140181      1.0         1000

@diegozea
Copy link
Contributor

This last can be related to #1392

@diegozea
Copy link
Contributor

I don't fully understand BitArray yet. Tell me if I'm wrong with something here.

Would be faster some kind of double iteration when looping over BitArray ? I imagine something like, maybe is going to be more cache friendly:

for chunk on BitArray
    for bit on chunck
         do_something_with( bit )
    end
end

I read about some BitVector64/32 (a single Int) available on other languages: https://github.com/gingi/fastbit/blob/master/src/bitvector64.h They are only a single Int value and only can store 64/32 bits. But Is faster than a BitArray for this kind of cases as is declared here http://msdn.microsoft.com/en-us/library/system.collections.specialized.bitvector32.aspx

Now Julia have immutable types, I imagine that something like BitVector64 can be relatively easy to add. Maybe ImmutableArrays.jl can be the right place for something like that? A BitVector of 256 bits [ 4 chunks ] can gain speed up using SIMD instructions JuliaGeometry/OldImmutableArrays.jl#6

Maybe defining and abstract type over BitArray can make easy the creation of this kind of type. People who is still learning what happens at bit level (like me) are going to be happy with that :)

Maybe related with performance for BitArray... Have a lot of bits in only one number can give some extra information to compiler for branch prediction ? I can only imagine a perfect branch prediction for the extremes cases 0xffffffffffffffff and 0x0000000000000000

@carlobaldassi
Copy link
Member

So, it took a while, but I just pushed another fix, this time for the element-wise functions (.>, .^, etc). Note that the most efficient implementation of the .> function, returning an Array{Bool}, would be:

function gt(x::AbstractArray, y)
    out = Array(Bool, size(x))
    for i = 1:length(out)
        out[i] = x[i] > y
    end
    out
end

With respect to this, the previous BitArray implementation was between 70 and 90 times slower. The last version is only about 2 times slower in my tests.

Please keep reopening in case more performance issues show up.

@StefanKarpinski
Copy link
Member Author

Might be better to open new performance issues in the future? Reopening the same one repeatedly seems weird.

@JeffBezanson
Copy link
Member

Another idea: BitArray could have a custom immutable iteration state holding the mask and index.

@StefanKarpinski
Copy link
Member Author

I took a brief look at doing this and it seemed like it wasn't really a performance win.

@JeffBezanson
Copy link
Member

Yeah, I realized later that for x in bitarray is not terribly common or useful.

@diegozea
Copy link
Contributor

How can I do that (holding the mask and index) ? The 2 bit DNA sequence are two BitArrays, and I was iterating and indexing both for example on convertion from BitArray to Vector{Nucleotide} or ASCIIString
I guess can be a performance gain on my case.

@carlobaldassi
Copy link
Member

I've been experimenting with "manually" removing the boundary checks on Arrays and I get some substantial performance gains. This could be done with regular Array methods too. Example:

function myfill!{T<:Union(Integer,FloatingPoint)}(a::Array{T}, x)
    # note: preserve -0.0 for floats
    if isbits(T)
        if T<:Integer && convert(T,x) == 0
            ccall(:memset, Ptr{Void}, (Ptr{Void}, Int32, Csize_t), a,0,length(a)*sizeof(T))
        else
            pa = pointer(a)
            for i = 1:length(a)
                unsafe_store!(pa, x, i)
            end
        end
    else
        for i = 1:length(a)
            a[i] = x
        end
    end
    return a
end

performance on a 1 million long Vector{Int}, compared to standard fill!:

julia> FillPerf.dotest()
2x4 DataFrame:
             Function Elapsed Relative Replications
[1,]       "testfill"  2.5809  1.60157         1000
[2,]     "testmyfill"  1.61148     1.0         1000

In some more complex cases, it may be even better to use unsafe_store(pa, x); pa += sizeof(eltype(a)), but I need to check that in detail.
Of course, this is going in the dreaded direction of "writing C code in Julia", but I thought it was worth noticing. If there are no objections, I could push a version of bitarrays which uses this idiom everywhere (in most cases the isbits checks wouldn't even be needed); performance gains can be ~2 fold for many functions.

BTW Jeff, thanks for fixing #3240 so quickly ;)

@carlobaldassi
Copy link
Member

Oh and I forgot to mention: about the fill! function, in the tests above I was filling with 3's to avoid the special case for zero. It turns out that using pointers from Julia is as fast as C's memset, so the special case could be removed.

@JeffBezanson
Copy link
Member

Improvements to scalar getindex and setindex here: f3f6fa2

@carlobaldassi
Copy link
Member

@diegozea : I added a custom BitArray iterator in 2dd1960. It's 4x faster than before in my tests (stil 1.3x slower than indexing a Vector{Bool}).
I tested various things, including holding the mask, chunk etc. but this (which just uses some manual inlining) seemed to be the fastest.

@JeffBezanson
Copy link
Member

Only 1.3x slowdown for bitarrays is excellent.

@diegozea
Copy link
Contributor

diegozea commented Jun 3, 2013

Only 1.3x is great!!! :) I going to test it in the following days.

@timholy
Copy link
Member

timholy commented Jun 3, 2013

Carlo, you've moved some mountains here :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster priority This should be addressed urgently
Projects
None yet
Development

No branches or pull requests

8 participants