BitArray performance #2360

StefanKarpinski · 2013-02-19T16:47:48Z

BitArrays are substantially slower for many operations than an Array{Bool} would be (see JuliaData/DataFrames.jl#186). It would be a big win if we could speed them up.

carlobaldassi · 2013-02-20T00:29:22Z

I managed to get the same performance as Array{Bool} when indexing with random indices (idx2=rand(1:N,N) in @tshort's example). Indexing with sorted vectors (idx1=[1:N]) is still slower, but it's now comparable rather than an order of magnitude slower.
Note: I specifically addressed the example discussed in JuliaData/DataFrames.jl#186 (and some similar patterns); indexing one item at a time (e.g. in a loop) would still be faster with regular Arrays.

johnmyleswhite · 2013-02-20T01:55:38Z

Thanks, Carlo!

cpcloud · 2013-02-22T23:45:11Z

This still seems to be a problem:

macro timeit(ex, n)
    quote
        t = Inf

        for i = 1:$n
            t = min(t, @elapsed $ex)
        end

        if print_output
            n = $n
            println("best of $n: ", t, " seconds")
        end
    end
end

function gt{T}(x::Matrix{T}, y::T)
    m, n = size(x)
    out = Array(Bool, m, n)
    for i = 1:m, j = 1:n
        out[i, j] = x[i, j] > y
    end
    out
end

x = rand(2_000_000, 16)

# BitArray version
@timeit x .> 0.5 10
# best of 10: 1.965562105178833 seconds

# Array{Bool} version
@timeit gt(x, 0.5) 10
# best of 10: 0.0854039192199707 seconds

pao · 2013-02-22T23:54:58Z

@cpcloud FYI, the Benchmark package provides a nice compare function that does exactly this sort of performance comparison and prints out readily copy-and-pasteable results.

Silly question, but you're definitely using a build after 0c6b975 was committed?

cpcloud · 2013-02-22T23:56:14Z

@pao Thanks! Wasn't aware of that package. I usually pull and build at least once per day.

pao · 2013-02-23T00:00:53Z

Definitely agree something is wrong here, using the same x and gt defined by @cpcloud. Reopening.

julia> compare([()-> x.>0.5, ()-> gt(x, 0.5)], 10)
2x4 DataFrame:
            Function Elapsed Relative Replications
[1,]    "# function" 36.6735  21.5232           10
[2,]    "# function"  1.7039      1.0           10

diegozea · 2013-03-11T09:06:50Z

Some ideas: it looks like iteration uses the generic AbstractArray functions. At each step, next() checks bounds (which seems unnecessary in this context, because done() is checked too) and fetches the Uint64 chunk. Defining next for BitArray without the bounds check might improve performance. Is there a way to avoid fetching the same chunk 64 times? I can't figure out how to do that, and I don't know how much it matters for performance anyway.

cpcloud · 2013-03-25T19:34:02Z

Wouldn't Bool arrays also call the AbstractArray methods?

diegozea · 2013-03-26T05:17:01Z

I'm playing with an unchecked next() for BitArray.
It is ~2.6x faster than the current implementation:

julia> using Benchmark

julia> importall Base

julia> import Base.to_index, Base.getindex_unchecked

julia> function getindex_unchecked(B::BitArray, i::Real)
           i = to_index(i)
           getindex_unchecked(B.chunks, i)
       end
# methods for generic function getindex_unchecked
getindex_unchecked(Bc::Array{Uint64,1},i::Integer) at bitarray.jl:252
getindex_unchecked(B::BitArray{N},i::Real) at none:2

julia> const ar = BitArray(1000);

julia> function check()
         for i in ar
           i == false
         end
       end
# methods for generic function check
check() at none:2

julia> function uncheck()
         ind = start(ar)
         while !done(ar,ind)
           i = getindex_unchecked(ar,ind)
           ind = ind+1
           i == false
         end
       end
# methods for generic function uncheck
uncheck() at none:2

julia> compare([check,uncheck],1000)
2x4 DataFrame:
         Function   Elapsed Relative Replications
[1,]      "check" 0.0367271  2.61999         1000
[2,]    "uncheck" 0.0140181      1.0         1000

diegozea · 2013-03-27T02:43:48Z

This last point may be related to #1392

diegozea · 2013-03-28T05:21:11Z

I don't fully understand BitArray yet. Tell me if I'm wrong about something here.

Would some kind of double iteration be faster when looping over a BitArray? I imagine something like the following, which might be more cache friendly:

for chunk in BitArray
    for bit in chunk
        do_something_with(bit)
    end
end
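A minimal sketch of that chunk-then-bit loop, written in current Julia syntax rather than the 2013 syntax used elsewhere in this thread. The helper names `pack_bits` and `count_true_chunked` are hypothetical, and the bits are packed by hand into a `Vector{UInt64}` so the example does not depend on BitArray internals:

```julia
# Pack a Bool vector into 64-bit chunks.
# Element i goes to bit (i-1) % 64 of chunk (i-1) ÷ 64 + 1 (an assumed layout).
function pack_bits(v::Vector{Bool})
    chunks = zeros(UInt64, cld(length(v), 64))
    for (i, b) in enumerate(v)
        if b
            chunks[((i - 1) >> 6) + 1] |= UInt64(1) << ((i - 1) & 63)
        end
    end
    return chunks
end

# Outer loop over chunks, inner loop over the bits of one chunk.
# Each chunk is loaded once and then examined bit by bit.
function count_true_chunked(chunks::Vector{UInt64}, len::Integer)
    n = 0
    for (c, chunk) in enumerate(chunks)
        nbits = min(64, len - 64 * (c - 1))   # the last chunk may be partial
        for b in 0:(nbits - 1)
            n += Int((chunk >> b) & 1)
        end
    end
    return n
end

v = [isodd(i) for i in 1:130]
chunks = pack_bits(v)
total = count_true_chunked(chunks, length(v))
```

Here the body of the inner loop plays the role of do_something_with(bit). Whether this beats per-element indexing depends on how well the compiler already optimizes the simple loop, so it would need to be benchmarked.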

I read about BitVector64/32 types available in other languages: https://github.com/gingi/fastbit/blob/master/src/bitvector64.h They are a single Int value and can only store 64/32 bits, but they are faster than a BitArray for this kind of case, as stated here: http://msdn.microsoft.com/en-us/library/system.collections.specialized.bitvector32.aspx

Now that Julia has immutable types, I imagine something like BitVector64 would be relatively easy to add. Maybe ImmutableArrays.jl is the right place for something like that? A BitVector of 256 bits (4 chunks) could gain speed from SIMD instructions: JuliaGeometry/OldImmutableArrays.jl#6

Maybe defining an abstract type over BitArray would make it easy to create this kind of type. People who are still learning what happens at the bit level (like me) would be happy with that :)

Possibly related to BitArray performance: could having a lot of bits in a single number give the compiler extra information for branch prediction? I can only imagine perfect branch prediction for the extreme cases 0xffffffffffffffff and 0x0000000000000000.

carlobaldassi · 2013-05-28T18:58:39Z

So, it took a while, but I just pushed another fix, this time for the element-wise functions (.>, .^, etc). Note that the most efficient implementation of the .> function, returning an Array{Bool}, would be:

function gt(x::AbstractArray, y)
    out = Array(Bool, size(x))
    for i = 1:length(out)
        out[i] = x[i] > y
    end
    out
end

With respect to this, the previous BitArray implementation was between 70 and 90 times slower. The last version is only about 2 times slower in my tests.
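For contrast, here is a sketch of how an element-wise comparison could write into packed storage: build each 64-bit chunk in a local accumulator and store it once, instead of doing a read-modify-write per element. This is only an illustration in current Julia syntax. It is not taken from the commit, and the name `gt_packed` is hypothetical:

```julia
# Compare each element of x with y and return the results packed into
# UInt64 chunks: element i lands in bit (i-1) % 64 of chunk (i-1) ÷ 64 + 1.
function gt_packed(x::Vector, y)
    n = length(x)
    chunks = Vector{UInt64}(undef, cld(n, 64))
    for c in 1:length(chunks)
        acc = UInt64(0)                      # accumulate one chunk locally
        base = 64 * (c - 1)
        for b in 0:min(63, n - base - 1)     # the last chunk may be partial
            acc |= UInt64(x[base + b + 1] > y) << b
        end
        chunks[c] = acc                      # a single store per 64 results
    end
    return chunks
end

packed = gt_packed(collect(1:70), 3)
```

With this input, elements 4 through 70 exceed 3, so the first chunk has bits 3 to 63 set and the second has bits 0 to 5 set.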

Please keep reopening in case more performance issues show up.

StefanKarpinski · 2013-05-28T19:18:49Z

Might be better to open new performance issues in the future? Reopening the same one repeatedly seems weird.

JeffBezanson · 2013-05-30T06:21:29Z

Another idea: BitArray could have a custom immutable iteration state holding the mask and index.

StefanKarpinski · 2013-05-30T06:47:12Z

I took a brief look at doing this and it seemed like it wasn't really a performance win.

JeffBezanson · 2013-05-30T06:50:12Z

Yeah, I realized later that for x in bitarray is not terribly common or useful.

diegozea · 2013-05-30T07:05:47Z

How can I do that (holding the mask and index)? The 2-bit DNA sequence is stored as two BitArrays, and I was iterating over and indexing both, for example when converting from BitArray to Vector{Nucleotide} or ASCIIString.
I guess it could be a performance gain in my case.
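A minimal sketch of an iteration state that holds the index and the mask, using the current `Base.iterate` protocol instead of the start/next/done functions of 2013. The `PackedBits` type is hypothetical and stands in for a BitArray so that the example is self-contained:

```julia
# Hypothetical stand-in for a bit array: packed chunks plus a logical length.
struct PackedBits
    chunks::Vector{UInt64}
    len::Int
end

Base.length(p::PackedBits) = p.len
Base.eltype(::Type{PackedBits}) = Bool

# The state is an immutable tuple (index, mask). The mask has exactly one
# bit set and selects the current element within its chunk.
function Base.iterate(p::PackedBits, state=(1, UInt64(1)))
    i, mask = state
    i > p.len && return nothing
    chunk = p.chunks[((i - 1) >> 6) + 1]
    bit = (chunk & mask) != 0
    nextmask = mask << 1
    if nextmask == 0          # shifted past bit 63: start the next chunk
        nextmask = UInt64(1)
    end
    return bit, (i + 1, nextmask)
end

p = PackedBits(UInt64[0x05, 0x01], 66)
bits = collect(p)
```

Carrying the mask avoids recomputing a shift from the index at every step, but this version still reloads the chunk each time. Adding the current chunk to the state tuple would be a further variation to measure.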

carlobaldassi · 2013-05-30T11:19:36Z

I've been experimenting with "manually" removing the boundary checks on Arrays and I get some substantial performance gains. This could be done with regular Array methods too. Example:

function myfill!{T<:Union(Integer,FloatingPoint)}(a::Array{T}, x)
    # note: preserve -0.0 for floats
    if isbits(T)
        if T<:Integer && convert(T,x) == 0
            ccall(:memset, Ptr{Void}, (Ptr{Void}, Int32, Csize_t), a,0,length(a)*sizeof(T))
        else
            pa = pointer(a)
            for i = 1:length(a)
                unsafe_store!(pa, x, i)
            end
        end
    else
        for i = 1:length(a)
            a[i] = x
        end
    end
    return a
end

Performance on a Vector{Int} with 1 million elements, compared to the standard fill!:

julia> FillPerf.dotest()
2x4 DataFrame:
             Function Elapsed Relative Replications
[1,]       "testfill"  2.5809  1.60157         1000
[2,]     "testmyfill"  1.61148     1.0         1000

In some more complex cases, it may be even better to use unsafe_store!(pa, x); pa += sizeof(eltype(a)), but I need to check that in detail.
Of course, this is going in the dreaded direction of "writing C code in Julia", but I thought it was worth noting. If there are no objections, I could push a version of bitarrays which uses this idiom everywhere (in most cases the isbits checks wouldn't even be needed); performance gains can be ~2-fold for many functions.

BTW Jeff, thanks for fixing #3240 so quickly ;)

carlobaldassi · 2013-05-30T11:46:45Z

Oh and I forgot to mention: about the fill! function, in the tests above I was filling with 3's to avoid the special case for zero. It turns out that using pointers from Julia is as fast as C's memset, so the special case could be removed.

JeffBezanson · 2013-06-01T18:14:16Z

Improvements to scalar getindex and setindex here: f3f6fa2

carlobaldassi · 2013-06-02T22:40:03Z

@diegozea: I added a custom BitArray iterator in 2dd1960. It's 4x faster than before in my tests (still 1.3x slower than indexing a Vector{Bool}).
I tested various things, including holding the mask, chunk etc. but this (which just uses some manual inlining) seemed to be the fastest.

JeffBezanson · 2013-06-02T23:12:37Z

Only 1.3x slowdown for bitarrays is excellent.

diegozea · 2013-06-03T13:01:03Z

Only 1.3x is great!!! :) I'm going to test it in the following days.

timholy · 2013-06-03T14:17:49Z

Carlo, you've moved some mountains here :-).

carlobaldassi closed this as completed in 0c6b975 Feb 20, 2013

pao reopened this Feb 23, 2013

diegozea mentioned this issue Mar 18, 2013: laplace equation benchmark performance #1168 (Closed)

carlobaldassi closed this as completed in d056ae0 May 28, 2013

carlobaldassi mentioned this issue Jun 1, 2013: BitArray performance enhancements via introduction of UnsafeArray and UnsafeBitArray #3265 (Closed)

vtjnash mentioned this issue Jun 16, 2013: Missing BoundsError on BitArray #2495 (Closed)

cstjean mentioned this issue Nov 7, 2015: BitArray is much slower than Array{Bool} #13914 (Closed)
