Improve caching and dispatch of LinearizingSavingCallback

This adds a new type, `LinearizingSavingCallbackCache`, and some sub-types to allow efficient re-use of memory as the callback executes over the course of a solve, as well as re-use of that memory in future solves when operating on a large ensemble simulation.

The top-level `LinearizingSavingCallbackCache` creates thread-safe cache pool objects that are then used to acquire thread-unsafe cache pool objects for use within a single solve. Those thread-unsafe cache pool objects can then be released and acquired anew by the next solve. The thread-unsafe pool objects allow for acquisition of pieces of memory such as temporary `u` vectors (the recursive nature of the `LinearizingSavingCallback` means that we must allocate an unknown number of temporary `u` vectors) and chunks of `u` blocks that are then compacted into a single large matrix in the finalize method of the callback. All these pieces of memory are stored within that set of thread-unsafe caches, which are released back to the top-level thread-safe cache pool so that the next solve can acquire and re-use them.

Using these techniques, the solve time of a large ensemble simulation with low per-simulation computation has dropped dramatically. The simulation solves a 3rd-order Butterworth filter circuit over a certain timespan, swept across different stimulus frequencies and circuit parameters. The parameter sweep results in a 13500-element ensemble simulation that, when run with 8 threads on an M1 Pro, previously took:

```
48.364827 seconds (625.86 M allocations: 19.472 GiB, 41.81% gc time, 0.17% compilation time)
```

Now, after these caching optimizations, we solve the same ensemble in:

```
13.208123 seconds (166.76 M allocations: 7.621 GiB, 22.21% gc time, 0.61% compilation time)
```

That is roughly 3.7x faster. As a side note, the size requirement of the raw linearized solution data itself is `1.04 GB`. In general, we expect to allocate somewhere between 2x and 3x the final output data to account for temporaries and inefficient sharing, so while there is still some more work to be done, this gets us significantly closer to minimal overhead.

This also adds a package extension on `Sundials`, as `IDA` requires that state vectors are `NVector` types, rather than `Vector{S}` types, in order to not allocate.
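
As a rough illustration of the acquire/release layering described above, here is a minimal standalone sketch. `SimplePool`, `SolveCache`, and `fake_solve!` are hypothetical names invented for this example; they are simplified stand-ins, not the `CachePool` or `LinearizingSavingCallbackCache` types added in this commit, and the "solves" run sequentially rather than across threads.

```julia
# Hypothetical, simplified pool: hands out cached values, allocating only when empty.
mutable struct SimplePool{T}
    pool::Vector{T}
    alloc::Function
    lk::Union{Nothing,ReentrantLock}   # `nothing` marks a thread-unsafe pool
    num_allocated::Int
end
SimplePool(::Type{T}, alloc; thread_safe::Bool = true) where {T} =
    SimplePool{T}(T[], alloc, thread_safe ? ReentrantLock() : nothing, 0)

function _acquire!(p::SimplePool{T}) where {T}
    if isempty(p.pool)
        p.num_allocated += 1
        return p.alloc()::T
    end
    return pop!(p.pool)
end

# Lock only when the pool was constructed as thread-safe.
acquire!(p::SimplePool) =
    p.lk === nothing ? _acquire!(p) : lock(() -> _acquire!(p), p.lk)
release!(p::SimplePool, val) =
    p.lk === nothing ? push!(p.pool, val) : lock(() -> push!(p.pool, val), p.lk)

# Hypothetical per-solve cache: a thread-unsafe pool of temporary `u` vectors.
struct SolveCache
    us::SimplePool{Vector{Float64}}
end
SolveCache(num_us::Int) = SolveCache(
    SimplePool(Vector{Float64}, () -> Vector{Float64}(undef, num_us); thread_safe = false))

num_us = 3
# Top-level thread-safe pool that hands out whole per-solve caches.
top = SimplePool(SolveCache, () -> SolveCache(num_us); thread_safe = true)

function fake_solve!(top)
    cache = acquire!(top)              # one locked acquire per solve
    try
        u_prev = acquire!(cache.us)    # lock-free acquires inside the solve
        u_next = acquire!(cache.us)
        fill!(u_prev, 1.0)
        fill!(u_next, 2.0)
        result = sum(u_prev) + sum(u_next)
        release!(cache.us, u_next)
        release!(cache.us, u_prev)
        return result
    finally
        release!(top, cache)           # hand the whole cache to the next solve
    end
end

results = [fake_solve!(top) for _ in 1:4]
```

In this sketch the four sequential solves share a single `SolveCache` and two `u` vectors, so `top.num_allocated` stays at 1 and the inner pool allocates only twice. With multiple threads, the top-level pool would instead grow to about one cache per concurrently running solve.
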
SciML · Feb 9, 2024 · 36743a9
1 parent 08933b5
Showing 6 changed files with 485 additions and 115 deletions.
diff --git a/Project.toml b/Project.toml
@@ -21,6 +21,9 @@ StaticArraysCore = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
 OrdinaryDiffEq = "1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"
 Sundials = "c3572dad-4567-51f8-b174-8c6c989267f4"
 
+[extensions]
+DiffEqCallbacksSundialsExt = "Sundials"
+
 [compat]
 Aqua = "0.8"
 DataInterpolations = "4"

diff --git a/ext/DiffEqCallbacksSundialsExt.jl b/ext/DiffEqCallbacksSundialsExt.jl
@@ -0,0 +1,13 @@
+module DiffEqCallbacksSundialsExt
+
+using Sundials: NVector, IDA
+import DiffEqCallbacks: solver_state_alloc, solver_state_type
+
+
+# Allocator; `U` is typically something like `Vector{Float64}`
+solver_state_alloc(solver::IDA, U::DataType, num_us::Int) = () -> NVector(U(undef, num_us))
+
+# Type of `solver_state_alloc`, which is just `NVector`
+solver_state_type(solver::IDA, U::DataType) = NVector
+
+end # module
diff --git a/src/independentlylinearizedutils.jl b/src/independentlylinearizedutils.jl
@@ -2,30 +2,144 @@ using SciMLBase
 
 export IndependentlyLinearizedSolution
 
+
+"""
+    CachePool(T, alloc; thread_safe = true)
+
+Simple memory-reusing cache that allows us to grow a cache and keep
+re-using those pieces of memory (in our case, typically `u` vectors)
+until the solve is finished.  By default, this datastructure is made
+to be thread-safe by locking on every acquire and release, but it
+can be made thread-unsafe (and correspondingly faster) by passing
+`thread_safe = false` to the constructor.
+
+While manual usage with `acquire!()` and `release!()` is possible,
+most users will want to use `@with_cache`, which provides lexically-
+scoped `acquire!()` and `release!()` usage automatically.  Example:
+
+```julia
+us = CachePool(Vector{S}, () -> Vector{S}(undef, num_us); thread_safe=false)
+@with_cache us u_prev begin
+    @with_cache us u_next begin
+        # perform tasks with these two `u` vectors
+    end
+end
+```
+
+!!! warning "Escaping values"
+    You must not use an acquired value after you have released it;
+    the memory may be immediately re-used by some other consumer of
+    your cache pool.  Do not allow the acquired value to escape
+    outside of the `@with_cache` block, or past a `release!()`.
+"""
+mutable struct CachePool{T, THREAD_SAFE}
+    const pool::Vector{T}
+    const alloc::Function
+    lock::ReentrantLock
+    num_allocated::Int
+    num_acquired::Int
+
+    function CachePool(T, alloc::F; thread_safe::Bool = true) where {F}
+        return new{T,Val{thread_safe}}(T[], alloc, ReentrantLock(), 0, 0)
+    end
+end
+const ThreadSafeCachePool{T} = CachePool{T,Val{true}}
+const ThreadUnsafeCachePool{T} = CachePool{T,Val{false}}
+
+"""
+    acquire!(cache::CachePool)
+
+Returns a cached element of the cache pool, calling `cache.alloc()` if none
+are available.
+"""
+Base.@inline function acquire!(cache::CachePool{T}, _dummy = nothing) where {T}
+    cache.num_acquired += 1
+    if isempty(cache.pool)
+        cache.num_allocated += 1
+        return cache.alloc()::T
+    end
+    return pop!(cache.pool)
+end
+
+"""
+    release!(cache::CachePool, val)
+
+Returns the value `val` to the cache pool.
+"""
+Base.@inline function release!(cache::CachePool, val, _dummy = nothing)
+    push!(cache.pool, val)
+    cache.num_acquired -= 1
+end
+
+function is_fully_released(cache::CachePool, _dummy = nothing)
+    return cache.num_acquired == 0
+end
+
+# Thread-safe versions just sub out to the other methods, using `_dummy` to force correct dispatch
+acquire!(cache::ThreadSafeCachePool) = @lock cache.lock acquire!(cache, nothing)
+release!(cache::ThreadSafeCachePool, val) = @lock cache.lock release!(cache, val, nothing)
+is_fully_released(cache::ThreadSafeCachePool) = @lock cache.lock is_fully_released(cache, nothing)
+
+macro with_cache(cache, name, body)
+    return quote
+        $(esc(name)) = acquire!($(esc(cache)))
+        try
+            $(esc(body))
+        finally
+            release!($(esc(cache)), $(esc(name)))
+        end
+    end
+end
+
+
+struct IndependentlyLinearizedSolutionChunksCache{T,S}
+    t_chunks::ThreadUnsafeCachePool{Vector{T}}
+    u_chunks::ThreadUnsafeCachePool{Matrix{S}}
+    time_masks::ThreadUnsafeCachePool{BitMatrix}
+
+    function IndependentlyLinearizedSolutionChunksCache{T,S}(num_us::Int, num_derivatives::Int, chunk_size::Int) where {T,S}
+        t_chunks_alloc = () -> Vector{T}(undef, chunk_size)
+        u_chunks_alloc = () -> Matrix{S}(undef, num_derivatives+1, chunk_size)
+        time_masks_alloc = () -> BitMatrix(undef, num_us, chunk_size)
+        return new(
+            CachePool(Vector{T}, t_chunks_alloc; thread_safe=false),
+            CachePool(Matrix{S}, u_chunks_alloc; thread_safe=false),
+            CachePool(BitMatrix, time_masks_alloc; thread_safe=false),
+        )
+    end
+end
+
 """
     IndependentlyLinearizedSolutionChunks
 
 When constructing an `IndependentlyLinearizedSolution` via the `IndependentlyLinearizingCallback`,
 we use this intermediate structure to reduce allocations and collect the unknown number of timesteps
 that the solve will generate.
 """
-mutable struct IndependentlyLinearizedSolutionChunks{T, S}
+mutable struct IndependentlyLinearizedSolutionChunks{T, S, N}
     t_chunks::Vector{Vector{T}}
     u_chunks::Vector{Vector{Matrix{S}}}
     time_masks::Vector{BitMatrix}
 
+    # Temporary array that gets used by `get_chunks`
+    last_chunks::Vector{Matrix{S}}
+
     # Index of next write into the last chunk
     u_offsets::Vector{Int}
     t_offset::Int
 
+    cache::IndependentlyLinearizedSolutionChunksCache
+
     function IndependentlyLinearizedSolutionChunks{T, S}(num_us::Int, num_derivatives::Int = 0,
-            chunk_size::Int = 100) where {T, S}
-        return new([Vector{T}(undef, chunk_size)],
-            [[Matrix{S}(undef, num_derivatives+1, chunk_size)] for _ in 1:num_us],
-            [BitMatrix(undef, num_us, chunk_size)],
-            [1 for _ in 1:num_us],
-            1,
-        )
+            chunk_size::Int = 512,
+            cache::IndependentlyLinearizedSolutionChunksCache = IndependentlyLinearizedSolutionChunksCache{T,S}(num_us, num_derivatives, chunk_size)) where {T, S}
+        t_chunks = [acquire!(cache.t_chunks)]
+        u_chunks = [[acquire!(cache.u_chunks)] for _ in 1:num_us]
+        time_masks = [acquire!(cache.time_masks)]
+        last_chunks = [u_chunks[u_idx][1] for u_idx in 1:num_us]
+        u_offsets = [1 for _ in 1:num_us]
+        t_offset = 1
+        return new{T,S,num_derivatives}(t_chunks, u_chunks, time_masks, last_chunks, u_offsets, t_offset, cache)
     end
 end
 
@@ -44,14 +158,8 @@ function num_us(ilsc::IndependentlyLinearizedSolutionChunks)
     end
     return length(ilsc.u_chunks)
 end
+num_derivatives(ilsc::IndependentlyLinearizedSolutionChunks{T,S,N}) where {T,S,N} = N
 
-function num_derivatives(ilsc::IndependentlyLinearizedSolutionChunks)
-    # If we've been finalized, just return `0` (which means only the primal)
-    if isempty(ilsc.t_chunks)
-        return 0
-    end
-    return size(first(first(ilsc.u_chunks)), 1) - 1
-end
 
 function Base.isempty(ilsc::IndependentlyLinearizedSolutionChunks)
     return length(ilsc.t_chunks) == 1 && ilsc.t_offset == 1
@@ -61,24 +169,25 @@ function get_chunks(ilsc::IndependentlyLinearizedSolutionChunks{T, S}) where {T,
     # Check if we need to allocate new `t` chunk
     chunksize = chunk_size(ilsc)
     if ilsc.t_offset > chunksize
-        push!(ilsc.t_chunks, Vector{T}(undef, chunksize))
-        push!(ilsc.time_masks, BitMatrix(undef, length(ilsc.u_offsets), chunksize))
+        push!(ilsc.t_chunks, acquire!(ilsc.cache.t_chunks))
+        push!(ilsc.time_masks, acquire!(ilsc.cache.time_masks))
         ilsc.t_offset = 1
     end
 
     # Check if we need to allocate any new `u` chunks (but only for those with `u_mask`)
     for (u_idx, u_chunks) in enumerate(ilsc.u_chunks)
         if ilsc.u_offsets[u_idx] > chunksize
-            push!(u_chunks, Matrix{S}(undef, num_derivatives(ilsc)+1, chunksize))
+            push!(u_chunks, acquire!(ilsc.cache.u_chunks))
             ilsc.u_offsets[u_idx] = 1
         end
+        ilsc.last_chunks[u_idx] = u_chunks[end]
     end
 
     # return the last chunk for each
     return (
         ilsc.t_chunks[end],
         ilsc.time_masks[end],
-        [u_chunks[end] for u_chunks in ilsc.u_chunks],
+        ilsc.last_chunks,
     )
 end
 
@@ -135,16 +244,18 @@ function store!(ilsc::IndependentlyLinearizedSolutionChunks{T, S},
     ts, time_mask, us = get_chunks(ilsc)
 
     # Store into the chunks, gated by `u_mask`
-    for u_idx in 1:size(u, 2)
+    @inbounds for u_idx in 1:size(u, 2)
         if u_mask[u_idx]
             for deriv_idx in 1:size(u, 1)
                 us[u_idx][deriv_idx, ilsc.u_offsets[u_idx]] = u[deriv_idx, u_idx]
             end
             ilsc.u_offsets[u_idx] += 1
         end
+
+        # Update our `time_mask` while we're at it
+        time_mask[u_idx, ilsc.t_offset] = u_mask[u_idx]
     end
     ts[ilsc.t_offset] = t
-    time_mask[:, ilsc.t_offset] .= u_mask
     ilsc.t_offset += 1
 end
 
@@ -161,7 +272,7 @@ efficient `iterate()` method that can be used to reconstruct coherent views
 of the state variables at all timepoints, as well as an efficient `sample!()`
 method that can sample at arbitrary timesteps.
 """
-mutable struct IndependentlyLinearizedSolution{T, S}
+mutable struct IndependentlyLinearizedSolution{T, S, N}
     # All timepoints, shared by all `us`
     ts::Vector{T}
 
@@ -173,33 +284,42 @@ mutable struct IndependentlyLinearizedSolution{T, S}
     time_mask::BitMatrix
 
     # Temporary object used during construction, will be set to `nothing` at the end.
-    ilsc::Union{Nothing,IndependentlyLinearizedSolutionChunks{T,S}}
+    ilsc::Union{Nothing,IndependentlyLinearizedSolutionChunks{T,S,N}}
+    ilsc_cache_pool::Union{Nothing,ThreadSafeCachePool{IndependentlyLinearizedSolutionChunksCache{T,S}}}
 end
 # Helper function to create an ILS wrapped around an in-progress ILSC
-function IndependentlyLinearizedSolution(ilsc::IndependentlyLinearizedSolutionChunks{T,S}) where {T,S}
-    ils = IndependentlyLinearizedSolution(
+function IndependentlyLinearizedSolution(ilsc::IndependentlyLinearizedSolutionChunks{T,S,N}, cache_pool = nothing) where {T,S,N}
+    return IndependentlyLinearizedSolution{T,S,N}(
         T[],
         Matrix{S}[],
         BitMatrix(undef, 0,0),
         ilsc,
+        cache_pool,
     )
-    return ils
 end
 # Automatically create an ILS wrapped around an ILSC from a `prob`
-function IndependentlyLinearizedSolution(prob::SciMLBase.AbstractDEProblem, num_derivatives = 0)
+function IndependentlyLinearizedSolution(prob::SciMLBase.AbstractDEProblem, num_derivatives = 0;
+                                         cache_pool = nothing,
+                                         chunk_size::Int = 512)
     T = eltype(prob.tspan)
+    S = isnothing(prob.u0) ? Float64 : eltype(prob.u0)
     U = isnothing(prob.u0) ? Float64 : eltype(prob.u0)
-    N = isnothing(prob.u0) ? 0 : length(prob.u0)
-    chunks = IndependentlyLinearizedSolutionChunks{T,U}(N, num_derivatives)
-    return IndependentlyLinearizedSolution(chunks)
+    num_us = isnothing(prob.u0) ? 0 : length(prob.u0)
+    if cache_pool === nothing
+        cache = IndependentlyLinearizedSolutionChunksCache{T,S}(num_us, num_derivatives, chunk_size)
+    else
+        cache = acquire!(cache_pool)
+    end
+    chunks = IndependentlyLinearizedSolutionChunks{T,U}(num_us, num_derivatives, chunk_size, cache)
+    return IndependentlyLinearizedSolution(chunks, cache_pool)
 end
 
-num_derivatives(ils::IndependentlyLinearizedSolution) = !isempty(ils.us) ? size(first(ils.us), 1) : 0
+num_derivatives(::IndependentlyLinearizedSolution{T,S,N}) where {T,S,N} = N
 num_us(ils::IndependentlyLinearizedSolution) = length(ils.us)
 Base.size(ils::IndependentlyLinearizedSolution) = size(ils.time_mask)
 Base.length(ils::IndependentlyLinearizedSolution) = length(ils.ts)
 
-function finish!(ils::IndependentlyLinearizedSolution)
+function finish!(ils::IndependentlyLinearizedSolution{T,S}) where {T,S}
     function trim_chunk(chunks::Vector, offset)
         chunks = [chunk for chunk in chunks]
         if eltype(chunks) <: AbstractVector
@@ -216,10 +336,52 @@ function finish!(ils::IndependentlyLinearizedSolution)
     end
 
     ilsc = ils.ilsc::IndependentlyLinearizedSolutionChunks
-    ts = vcat(trim_chunk(ilsc.t_chunks, ilsc.t_offset)...)
-    time_mask = hcat(trim_chunk(ilsc.time_masks, ilsc.t_offset)...)
-    us = [hcat(trim_chunk(ilsc.u_chunks[u_idx], ilsc.u_offsets[u_idx])...)
-          for u_idx in 1:length(ilsc.u_chunks)]
+
+    chunk_len(chunk) = size(chunk, ndims(chunk))
+    function chunks_len(chunks::Vector, offset)
+        len = 0
+        for chunk_idx in 1:length(chunks)-1
+            len += chunk_len(chunks[chunk_idx])
+        end
+        return len + offset - 1
+    end
+
+    function copy_chunk!(out::Vector, in::Vector, out_offset::Int, len=chunk_len(in))
+        for idx in 1:len
+            out[idx+out_offset] = in[idx]
+        end
+    end
+    function copy_chunk!(out::AbstractMatrix, in::AbstractMatrix, out_offset::Int, len=chunk_len(in))
+        for zdx in 1:size(in, 1)
+            for idx in 1:len
+                out[zdx, idx+out_offset] = in[zdx, idx]
+            end
+        end
+    end
+
+    function collapse_chunks!(out, chunks, offset::Int)
+        write_offset = 0
+        for chunk_idx in 1:(length(chunks)-1)
+            chunk = chunks[chunk_idx]
+            copy_chunk!(out, chunk, write_offset)
+            write_offset += chunk_len(chunk)
+        end
+        copy_chunk!(out, chunks[end], write_offset, offset-1)
+    end
+
+    # Collapse t_chunks
+    ts = Vector{T}(undef, chunks_len(ilsc.t_chunks, ilsc.t_offset))
+    collapse_chunks!(ts, ilsc.t_chunks, ilsc.t_offset)
+
+    # Collapse u_chunks
+    us = Vector{Matrix{S}}(undef, length(ilsc.u_chunks))
+    for u_idx in 1:length(ilsc.u_chunks)
+        us[u_idx] = Matrix{S}(undef, size(ilsc.u_chunks[u_idx][1],1), chunks_len(ilsc.u_chunks[u_idx], ilsc.u_offsets[u_idx]))
+        collapse_chunks!(us[u_idx], ilsc.u_chunks[u_idx], ilsc.u_offsets[u_idx])
+    end
+
+    time_mask = BitMatrix(undef, size(ilsc.time_masks[1], 1), chunks_len(ilsc.time_masks, ilsc.t_offset))
+    collapse_chunks!(time_mask, ilsc.time_masks, ilsc.t_offset)
 
     # Sanity-check lengths
     if length(ts) != size(time_mask, 2)
@@ -238,7 +400,24 @@ function finish!(ils::IndependentlyLinearizedSolution)
         throw(ArgumentError("Time mask must indicate same length as `us` ($(time_mask_lens) != $(us_lens))"))
     end
 
-    # Update our struct, release the `ilsc`
+    # Update our struct, release the `ilsc` and its caches
+    for t_chunk in ilsc.t_chunks
+        release!(ilsc.cache.t_chunks, t_chunk)
+    end
+    @assert is_fully_released(ilsc.cache.t_chunks)
+    for u_idx in 1:length(ilsc.u_chunks)
+        for u_chunk in ilsc.u_chunks[u_idx]
+            release!(ilsc.cache.u_chunks, u_chunk)
+        end
+    end
+    @assert is_fully_released(ilsc.cache.u_chunks)
+    for time_mask in ilsc.time_masks
+        release!(ilsc.cache.time_masks, time_mask)
+    end
+    @assert is_fully_released(ilsc.cache.time_masks)
+    if ils.ilsc_cache_pool !== nothing
+        release!(ils.ilsc_cache_pool, ilsc.cache)
+    end
     ils.ilsc = nothing
     ils.ts = ts
     ils.us = us