Allocations when using dualcache #31

Open · dbstein opened this issue Aug 5, 2022 · 19 comments

dbstein commented Aug 5, 2022

Perhaps I'm doing something wrong in how I set things up, but dualcache seems to always generate allocations when requesting the dual portion of the cache. Here's the most minimal example I could think of; I posted a slightly less minimal example here. Writing my own (simpler, less flexible) version of get_tmp eliminates the issue. The following code:

using ForwardDiff, PreallocationTools

function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: ForwardDiff.Dual}
    return reshape(reinterpret(T, dc.dual_du), size(dc.du))
end
function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: Float64}
    return dc.du
end

function test()
    ChunkSize = 12
    for n in (10, 20)
        println("\nTesting for n=", n)
        A = zeros(n);
        DA = zeros(ForwardDiff.Dual{ForwardDiff.Tag{nothing, Float64}, Float64, ChunkSize}, n);
        cache = dualcache(A, ChunkSize);
        print("... get_tmp on Vector{Float64}\n")
        a = @allocated get_tmp(cache, A); println(a)
        print("... get_tmp on Vector{Dual}\n")
        a = @allocated get_tmp(cache, DA); println(a)
        print("... my_get_tmp on Vector{Float64}\n")
        a = @allocated my_get_tmp(cache, A); println(a)
        print("... my_get_tmp on Vector{Dual}\n")
        a = @allocated my_get_tmp(cache, DA); println(a)
    end
end

test()

gives:

julia> test()
Testing for n=10
... get_tmp on Vector{Float64}
0
... get_tmp on Vector{Dual}
1168
... my_get_tmp on Vector{Float64}
0
... my_get_tmp on Vector{Dual}
0

Testing for n=20
... get_tmp on Vector{Float64}
0
... get_tmp on Vector{Dual}
2240
... my_get_tmp on Vector{Float64}
0
... my_get_tmp on Vector{Dual}
0

That is, get_tmp results in allocations that scale with the vector size and are slightly larger than what I would think is needed (8*ChunkSize*n = 960 bytes for n=10 and 1920 bytes for n=20). This is with Julia v1.7.3 and PreallocationTools v0.4.0 on macOS 12.5.
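
For reference, a quick size check (a sketch): each Dual here stores its value plus ChunkSize partials, so the raw payload is 8*(ChunkSize+1)*n bytes, and the remainder of the observed figures is presumably the freshly allocated Array object itself.

using ForwardDiff

# One Dual with chunk size 12 holds 1 value + 12 partials, all Float64:
D = ForwardDiff.Dual{ForwardDiff.Tag{nothing, Float64}, Float64, 12}
sizeof(D)       # 104 bytes
10 * sizeof(D)  # 1040-byte payload for n = 10 (observed: 1168)
20 * sizeof(D)  # 2080-byte payload for n = 20 (observed: 2240)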

thomvet (Contributor) commented Aug 5, 2022

Using a simple reshape() instead of ArrayInterfaceCore.restructure() makes the allocations disappear. Copy-paste version below:

using ForwardDiff, PreallocationTools

function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: ForwardDiff.Dual}
    return reshape(reinterpret(T, dc.dual_du), size(dc.du))
end
function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: Float64}
    return dc.du
end
function get_tmp2(dc::PreallocationTools.DiffCache, u::AbstractArray{T}) where T<:ForwardDiff.Dual
    nelem = div(sizeof(T), sizeof(eltype(dc.dual_du))) * length(dc.du)
    if nelem > length(dc.dual_du)
        # internal helper in PreallocationTools; not exported, so qualify it
        PreallocationTools.enlargedualcache!(dc, nelem)
    end
    reshape(reinterpret(T, view(dc.dual_du, 1:nelem)), size(dc.du))
end

function test()
    ChunkSize = 12
    for n in (10, 20)
        println("\nTesting for n=", n)
        A = zeros(n);
        DA = zeros(ForwardDiff.Dual{ForwardDiff.Tag{nothing, Float64}, Float64, ChunkSize}, n);
        cache = dualcache(A, ChunkSize);
        print("... get_tmp on Vector{Float64}\n")
        a = @allocated get_tmp(cache, A); println(a)
        print("... get_tmp on Vector{Dual}\n")
        a = @allocated get_tmp(cache, DA); println(a)
        print("... my_get_tmp on Vector{Float64}\n")
        a = @allocated my_get_tmp(cache, A); println(a)
        print("... my_get_tmp on Vector{Dual}\n")
        a = @allocated my_get_tmp(cache, DA); println(a)
        print("... get_tmp2 on Vector{Dual}\n")
        a = @allocated get_tmp2(cache, DA); println(a)
    end
end

test()

dbstein (Author) commented Aug 5, 2022

Removing the view was the first thing I tried when writing my own; it didn't help:

using ForwardDiff, PreallocationTools, ArrayInterfaceCore

function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: ForwardDiff.Dual}
    return reshape(reinterpret(T, dc.dual_du), size(dc.du))
end
function get_tmp2(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: ForwardDiff.Dual}
    ArrayInterfaceCore.restructure(dc.du, reinterpret(T, dc.dual_du))
end

function test()
    ChunkSize = 12
    n = 10
    A = zeros(n);
    DA = zeros(ForwardDiff.Dual{ForwardDiff.Tag{nothing, Float64}, Float64, ChunkSize}, n);
    cache = dualcache(A, ChunkSize);
    print("... get_tmp on Vector{Dual}\n")
    a = @allocated get_tmp(cache, DA); println(a)
    print("... get_tmp2 on Vector{Dual}\n")
    a = @allocated get_tmp2(cache, DA); println(a)
    print("... my_get_tmp on Vector{Dual}\n")
    a = @allocated my_get_tmp(cache, DA); println(a)
end

test()

gives:

... get_tmp on Vector{Dual}
1168
... get_tmp2 on Vector{Dual}
1168
... my_get_tmp on Vector{Dual}
0

Since I didn't know everything that's going on under the hood in restructure, I switched it to reshape and voilà.
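
A quick illustration of why the reshape route is free (a sketch): both reinterpret and reshape return lazy wrappers over the original storage rather than materializing a new array.

a = zeros(10)                # backing Float64 storage
r = reinterpret(Float32, a)  # Base.ReinterpretArray: shares a's memory
m = reshape(r, 4, 5)         # Base.ReshapedArray: still no copy
typeof(m)                    # Base.ReshapedArray{Float32, 2, ...}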

ChrisRackauckas (Member) commented

Is restructure changing the view to an array?

thomvet (Contributor) commented Aug 5, 2022

I checked with earlier versions of PreallocationTools.jl (down to 0.2.4), which still used ArrayInterface 5.y.z; the allocations happen there as well.

ChrisRackauckas (Member) commented

And is it because of changing a view to an Array?

thomvet (Contributor) commented Aug 5, 2022

Is this the right way to test it?

using ArrayInterfaceCore
a = zeros(10)
ArrayInterfaceCore.restructure(a, reinterpret(Float32, view(a, 1:5)))

Then yes:

10-element Vector{Float32}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

ChrisRackauckas (Member) commented

There's the allocation then. Does Adapt.jl's adapt allocate here? Maybe that's the primitive that should be used. Otherwise we could specialize restructure to keep the view here.
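
A way to probe that question directly (a sketch, assuming Adapt.jl is available; Adapt is designed to rebuild wrappers around adapted storage, so the interesting part is whether the result stays a lazy wrapper or gets materialized):

using Adapt
a = zeros(10)
r = reinterpret(Float32, view(a, 1:5))
typeof(adapt(Array, r))     # lazy wrapper preserved, or a plain Vector{Float32}?
@allocated adapt(Array, r)  # and does the call itself allocate?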

thomvet (Contributor) commented Aug 6, 2022

I'll have to look into this some more, but it looks like restructure() always allocates for anything that isn't a Base Array. Maybe it needs a general fix?

Regarding adapt(), I'll have to play around with it more and understand it better, but at the moment I don't see how it could deliver a "reshape" (it just takes care of the wrappers?).

My proposal right now would be to fix the normal-Array case (simply doing a reshape, as I did above; same as dbstein's version, but keeping the view) and take care of the more advanced AbstractArrays later (ArrayPartition and friends); a sketch follows below.
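
A minimal sketch of that proposal (the helper name and exact dispatch are illustrative; the eventual PR may differ):

# Plain Arrays take the lazy, non-allocating reshape path;
# everything else keeps the general (allocating) restructure.
_restructure(normal_cache::Array, duals) = reshape(duals, size(normal_cache))
_restructure(normal_cache, duals) = ArrayInterfaceCore.restructure(normal_cache, duals)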

CuArray would still need adapt, though, right? (I don't have a readily accessible CUDA GPU at hand to test.)

ChrisRackauckas (Member) commented

Yeah, we can create a specialized dispatch on Array for now. I think in general it cannot just be a reshape because, yes, it may need to be type-matching, in which case the view would get converted.

thomvet (Contributor) commented Aug 9, 2022

Just a note here to say that I am working on this. I hope to have a PR with an (at least partial) fix today or tomorrow.

dbstein (Author) commented Aug 9, 2022

Thanks for the fast work on it. Writing non-allocating Jacobians for complicated functions (or even allocating ones, for that matter) is a huge pain, and getting them for free is nothing short of magical.

thomvet (Contributor) commented Aug 10, 2022

#32 fixes the allocations for Base Arrays now. However, I couldn't get rid of the reshape(reinterpret(..., view(...))) wrappers and stay allocation-free.

That said, I don't know whether the wrappers interfere negatively with anything downstream. At least in the cases I use PreallocationTools for, they didn't cause a performance loss (actually, a small gain, since the allocations are gone).

I tested with Adapt, but also with unsafe_wrap (as a last resort):

function _restructure(normal_cache::Array, duals) 
    unsafe_wrap(Array, pointer(duals), size(normal_cache))
end

That gets rid of the wrappers and returns a correctly sized Array with duals, but it also causes allocations (because the view() is also removed?).
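
One likely source of those allocations is the Array object that unsafe_wrap creates on every call, independent of the view (a quick check, as a sketch):

x = zeros(20)
# The returned Array shares x's memory, but the Array object itself is
# freshly heap-allocated each time (real code must also keep x alive,
# e.g. with GC.@preserve):
@allocated unsafe_wrap(Array, pointer(x), (10,))  # nonzero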

Finally, the more complex AbstractArray cases still allocate.

ChrisRackauckas (Member) commented

@chriselrod do you know if there's a more general solution here?

chriselrod commented

I got:

julia> test()
... get_tmp on Vector{Dual}
0
... get_tmp2 on Vector{Dual}
1168
... my_get_tmp on Vector{Dual}
0

get_tmp2 calls ArrayInterface.restructure(::Array, ::ReinterpretArray), which in turn calls convert(Array, ::ReinterpretArray); that is why it allocates.
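
That chain is easy to confirm in isolation (a sketch):

a = zeros(10)
r = reinterpret(Float32, a)    # lazy wrapper, no copy yet
typeof(convert(Array, r))      # Vector{Float32}: the data got materialized
@allocated convert(Array, r)   # nonzero: a fresh array is allocated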

ChrisRackauckas (Member) commented

Yes we know. But is it possible to solve?

chriselrod commented Oct 23, 2022

> Yes we know. But is it possible to solve?

Don't make a copy?
Why the need for convert? What's wrong with the reinterpret array?
Where is restructure being called?

Like, obviously we can avoid these allocations. SimpleChains uses just a single UInt8 buffer vector for all its temporaries, whether Bool, Float32, Float64, or ForwardDiff.Dual.

I'd just need more context on the design and why things are the way they are now to suggest an alternative approach.
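
For illustration, a minimal sketch of that single-buffer idea (the view_as helper is hypothetical, not SimpleChains' actual internals):

# One UInt8 buffer serves every element type; each request is just a
# reinterpret of a view, so nothing new is allocated once the buffer
# is large enough (note: resizing would invalidate earlier views).
buf = Vector{UInt8}(undef, 0)
function view_as(buf::Vector{UInt8}, ::Type{T}, n::Integer) where {T}
    nbytes = sizeof(T) * n
    nbytes > length(buf) && resize!(buf, nbytes)
    return reinterpret(T, view(buf, 1:nbytes))
end
xf64 = view_as(buf, Float64, 10)  # 10 Float64s backed by buf
xf32 = view_as(buf, Float32, 10)  # the same bytes, viewed as Float32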

thomvet (Contributor) commented Dec 2, 2022

I am intrigued by this comment; it might be simpler to adopt that approach here as well. I don't think it will end up as a general solution for arbitrary AbstractArray types, though, but maybe it's a start towards one?

ChrisRackauckas (Member) commented

The main issue is that this is a missing primitive, so it needs something in ArrayInterfaceCore to capture it. There is no primitive for "restructure, but non-allocating". restructure is supposed to be type-matching: if you started with an ArrayPartition of Arrays, then restructure should turn a Vector into an ArrayPartition of Arrays. @chriselrod's point is of course that for this to be non-allocating it needs to be an ArrayPartition of SubArrays, which is correct, but we don't have a primitive operation for "create a type that's like X but not exactly matching X, where the allowed differences are shortcuts that improve performance".

I tend to think the more fundamental issue is that SubArray -> Array should just be a pointer operation and shouldn't actually allocate a new array; I'm not sure why it does in Julia, so it would probably be best to just get that fixed. But this notion of restructuring is actually quite important to PreallocationTools (in fact, it's how the current DiffCache works), so having a vocabulary for requesting this kind of reshaping functionality is likely useful in many places as long as the SubArray -> Array allocation exists.
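
For completeness, the SubArray -> Array conversion in question is indeed a copy today (a quick check, as a sketch):

a = zeros(100)
v = view(a, 1:50)
c = convert(Array, v)        # materializes a fresh Vector{Float64}
pointer(c) == pointer(a)     # false: the data was copied, not aliased
@allocated convert(Array, v) # nonzero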
