Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for N_Vector segfaults #380

Merged
merged 7 commits into from
Jan 13, 2023
Merged

Workaround for N_Vector segfaults #380

merged 7 commits into from
Jan 13, 2023

Conversation

sjdaines
Copy link
Contributor

Workaround for issue #367

This fixes a problem seen with Julia >= 1.8, where a temporary NVector was created and the N_Vector pointer it contains passed to ccall, which is unsafe as the temporary NVector may be garbage collected.

The fix is to remove convert(N_Vector, x) and replace with the combination of cconvert and unsafe_convert. An NVector can now be supplied as an argument to ccall with memory management then handled correctly ( and it is now impossible to explicitly create an N_Vector).

Changes:

  • change to cconvert / unsafe_convert is in src/nvector_wrapper.jl .
  • A global edit (~400 changes) to the autogenerated code in lib/libsundials_api.jl.
  • small number of changes elsewhere, replacing N_Vector with NVector

NB: The correct fix would be to update the code generator in gen/generate.jl, and regenerate the wrappers.

NB: As the comments in src/nvector_wrapper.jl show this was an intentional and (now) incorrect use of temporary variables to avoid garbage collection, it is possible there are other similar problems (pointers from other types of temporary variables passed to ccall).

Issue #367

This fixes a problem seen with Julia >= 1.8, where a temporary NVector was
created and the N_Vector pointer it contains passed to ccall, which is unsafe as
the temporary NVector may be garbage collected.

The fix is to remove convert(N_Vector, x) and replace with the combination of
cconvert and unsafe_convert. An NVector can new be supplied as an
argument to ccall with memory management then handled correctly (
and it is now impossible to explicitly create an N_Vector).

Changes:
 - change to cconvert / unsafe_convert is in src/nvector_wrapper.jl .
 - A global edit (~400 changes) to the autogenerated code in
lib/libsundials_api.jl.
 - small number of changes elsewhere, replacing N_Vector with NVector

NB: The correct fix would be to update the code generator in gen/generate.jl,
and regenerate the wrappers.

NB: As the comments in src/nvector_wrapper.jl show this was an intentional
and (now) incorrect use of temporary variables to avoid garbage collection,
it is possible there are other similar problems (pointers from other types of
temporary variables passed to ccall).
Comment on lines +135 to +142
# TODO replace with use of cconvert / unsafe_convert
# The code and comment below is incorrect and leads to memory errors with Julia >= 1.8,
# as storing in a local variable DOES NOT protect against garbage collection.
# Current workaround is to edit the generated files and apply a fix.
#
# Incorrect comment: 'first convert argument to NVector() and store in local var,
# this guarantees that the wrapper and associated Sundials object (e.g. N_Vector)
# is not removed by GC'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be good to handle this in full., but I think for now this comment is sufficient.

@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

Merging #380 (6a465ec) into master (64c98c8) will decrease coverage by 0.04%.
The diff coverage is 58.33%.

@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
- Coverage   78.09%   78.05%   -0.05%     
==========================================
  Files          11       11              
  Lines        1484     1481       -3     
==========================================
- Hits         1159     1156       -3     
  Misses        325      325              
Impacted Files Coverage Δ
src/simple.jl 73.17% <0.00%> (ø)
src/nvector_wrapper.jl 48.64% <50.00%> (-1.36%) ⬇️
src/common_interface/solve.jl 81.76% <100.00%> (ø)
src/types_and_consts_additions.jl 79.62% <0.00%> (-1.86%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Change identity(x) -> x
(this was just a result of being lazy with the global edit)
@sjdaines
Copy link
Contributor Author

sjdaines commented Jan 10, 2023

CI failure with Julia 1.8.5 looks like the test runner is shut down after ~30mins, while somewhere in the middle of the preconditioner tests ? Perhaps hiiting some resource limit?

Tests pass locally on 1.8.3 and 1.9.0-beta2, but do take ~60 mins.

@ChrisRackauckas
Copy link
Member

I set it to run again and see. It shouldn't be that big of a test to cause a resource failure, so I'd assume something is wrong with the memory still if it's still failing (it's been our smoking gun MWE)

@sjdaines
Copy link
Contributor Author

sjdaines commented Jan 11, 2023

Maybe more than one problem here ? The earlier CI failures (Julia 1.8 with preconditioner tests) all ended with 'Error: Process completed with exit code 1', the failures with this fix have different exit codes as if the test runner was killed.

Perhaps worth releasing this, with the preconditioners tests temporarily removed as before? This definitely fixes one problem seen in application code that otherwise makes CVODE / IDA unusable here with 1.9.0-beta2,

@sjdaines
Copy link
Contributor Author

sjdaines commented Jan 12, 2023

Added some diagnostics to precs.jl tests and commented out the 'solve' calls at the end of the file to diagnose CI failure

#=
# Remove for testing
println("sol1:")
@time sol1 = solve(prob_ode_brusselator_2d, CVODE_BDF(; linear_solver = :GMRES);
save_everystep = false);
println("sol2:")
@time sol2 = solve(prob_ode_brusselator_2d,
CVODE_BDF(; linear_solver = :GMRES, prec = precilu, psetup = psetupilu,
prec_side = 1); save_everystep = false);
println("sol3:")
@time sol3 = solve(prob_ode_brusselator_2d,
CVODE_BDF(; linear_solver = :GMRES, prec = precamg, psetup = psetupamg,
prec_side = 1); save_everystep = false);
@test sol1.destats.nf > sol2.destats.nf
@test sol1.destats.nf > sol3.destats.nf
=#

Failure with Julia 1.8.5 still happens https://github.com/SciML/Sundials.jl/actions/runs/3903341325/jobs/6667576487
so confirms this is (now) before Sundials is even involved...

@ChrisRackauckas
Copy link
Member

Add a flush call after the printing so that we're very sure it's printing out. If this is really the case, that's... weird...

precs.jl tests fail on github CI with Julia >= 1.8, before Sundials is even
involved, ie failure looks to be unrelated to Sundials.
@sjdaines
Copy link
Contributor Author

To be totally sure, I had commented out the solve calls (which is the first place where Sundials is even involved) and the failure still happened !!

So I figured that was independent of any diagnostic output...

@sjdaines
Copy link
Contributor Author

I've just pushed a commit which omits the precs.jl tests for Julia >= 1.8 (as before), thinking it would be easier to release this and pursue the precs.jl issue in a different PR ?

NVector is now a mutable struct with finalizer.

(cf previous implementation with an immutable struct containing a
Ref{NVector} to which a finalizer was attached)
@sjdaines
Copy link
Contributor Author

I've opened new issues for:

@ChrisRackauckas ChrisRackauckas merged commit f38f804 into SciML:master Jan 13, 2023
ChrisRackauckas pushed a commit that referenced this pull request Jan 13, 2023
Addresses #381

This is theoretically a breaking change, however any application code
that was using the low-level API in this way (passing a Julia function into
eg CVodeInit and relying on CVRhsFn_wrapper) would likely segfault randomly
due to garbage collection of the wrapper.

The problem with CVRhsFn_wrapper and similar is that a temporary
Base.CFunction closure is returned by  @cfunction($f, ...),
which may be garbage collected hence it is not safe
to return the enclosed pointer.
This is somewhat similar to the problem with
NVector addressed by #380,
but more fundamental as the RHS function
needs to persist for the whole lifetime of the solver use.

Fix here is to remove these wrappers (which were only used by a few tests
of the low-level interface), and require that the caller either uses
the @cfunction(f, ...) form to generate a C pointer (ie f known at
compile time and the runtime Julia RHS function passed in to
Sundials C code as part of the opaque user data, this is what the
high-level interface does), or manages the lifetime of the @cfunction explicitly.
@ufechner7
Copy link

This commit causes KiteModels.jl to fail with:

julia> include("./examples/reel_out.jl")
ERROR: LoadError: MethodError: no method matching Sundials.NVector(::StaticArraysCore.MVector{62, Float64})
Closest candidates are:
  Sundials.NVector(::Ptr{Sundials._generic_N_Vector}) at ~/.julia/packages/Sundials/639C2/src/nvector_wrapper.jl:24
  Sundials.NVector(::Vector{Float64}) at ~/.julia/packages/Sundials/639C2/src/nvector_wrapper.jl:16
Stacktrace:
 [1] __init(prob::SciMLBase.DAEProblem{StaticArraysCore.MVector{62, Float64}, StaticArraysCore.MVector{62, Float64}, Tuple{Float64, Float64}, true, KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Float64}}, SciMLBase.DAEFunction{true, SciMLBase.FullSpecialize, typeof(residual!), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Vector{Bool}}, alg::Sundials.IDA{:GMRES, Nothing, Nothing}, timeseries::Vector{Any}, ts::Vector{Any}, ks::Vector{Any}; verbose::Bool, dt::Nothing, dtmax::Float64, save_on::Bool, save_start::Bool, callback::Nothing, abstol::Float64, reltol::Float64, saveat::Vector{Float64}, tstops::Vector{Float64}, maxiters::Int64, timeseries_errors::Bool, dense_errors::Bool, save_everystep::Bool, save_idxs::Nothing, dense::Bool, save_timeseries::Nothing, save_end::Bool, progress::Bool, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), advance_to_tstop::Bool, stop_at_next_tstop::Bool, userdata::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Sundials ~/.julia/packages/Sundials/639C2/src/common_interface/solve.jl:1114
 [2] init_call(_prob::SciMLBase.DAEProblem{StaticArraysCore.MVector{62, Float64}, StaticArraysCore.MVector{62, Float64}, Tuple{Float64, Float64}, true, KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Float64}}, SciMLBase.DAEFunction{true, SciMLBase.FullSpecialize, typeof(residual!), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Vector{Bool}}, args::Sundials.IDA{:GMRES, Nothing, Nothing}; merge_callbacks::Bool, kwargshandle::DiffEqBase.KeywordArgError, kwargs::Base.Pairs{Symbol, Float64, Tuple{Symbol, Symbol}, NamedTuple{(:abstol, :reltol), Tuple{Float64, Float64}}})
   @ DiffEqBase ~/.julia/packages/DiffEqBase/WXn2i/src/solve.jl:390
 [3] init_up(prob::SciMLBase.DAEProblem{StaticArraysCore.MVector{62, Float64}, StaticArraysCore.MVector{62, Float64}, Tuple{Float64, Float64}, true, KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Float64}}, SciMLBase.DAEFunction{true, SciMLBase.FullSpecialize, typeof(residual!), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Vector{Bool}}, sensealg::Nothing, u0::StaticArraysCore.MVector{62, Float64}, p::KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Float64}}, args::Sundials.IDA{:GMRES, Nothing, Nothing}; kwargs::Base.Pairs{Symbol, Float64, Tuple{Symbol, Symbol}, NamedTuple{(:abstol, :reltol), Tuple{Float64, Float64}}})
   @ DiffEqBase ~/.julia/packages/DiffEqBase/WXn2i/src/solve.jl:436
 [4] #init#23
   @ ~/.julia/packages/DiffEqBase/WXn2i/src/solve.jl:403 [inlined]
 [5] init_sim!(s::KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Float64}}; t_end::Float64, stiffness_factor::Float64, prn::Bool)
   @ KiteModels ~/repos/KiteModels/src/KiteModels.jl:375
 [6] top-level scope
   @ ~/repos/KiteModels/examples/reel_out.jl:58
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [8] top-level scope
   @ REPL[1]:1
in expression starting at /home/ufechner/repos/KiteModels/examples/reel_out.jl:58

@ChrisRackauckas
Copy link
Member

I'm not sure MVectors ever worked with Sundials? If it did that was unexpected and untested. Sundials is documented to only work with Array{Float64} specifically IIRC. We can add a conversion method for it though. Open an issue.

@ufechner7
Copy link

Here it is: #392

sjdaines added a commit to sjdaines/Sundials.jl that referenced this pull request Jan 27, 2024
Addresses SciML#435 and
also fixes a potential problem with memory safety:

1) Allocations in convert(Ptr{T}, Handle{T})
   Fix is as suggested in SciML#435
   (Handle is now a mutable struct with a ptr field)

2) Memory safety robustness fix
   Remove convert and use paired Base.cconvert / Base.unsafe_convert to get the ptr field from
   a Handle object h, so that the h is preserved from GC across the ccall
   This fix is analogous to that for the NVector wrapper in PR SciML#380
   (NB: there is no actual problem at least when Sundials is used with the SciML interface,
   as the Handle{CVODEMem} and similar objects are held by a persistent solver data structure, but this
   change should reduce the risk of something going wrong in the future, or for eg test harnesses that don't use
   the SciML interface)
ChrisRackauckas pushed a commit that referenced this pull request Jan 27, 2024
Addresses #435 and
also fixes a potential problem with memory safety:

1) Allocations in convert(Ptr{T}, Handle{T})
   Fix is as suggested in #435
   (Handle is now a mutable struct with a ptr field)

2) Memory safety robustness fix
   Remove convert and use paired Base.cconvert / Base.unsafe_convert to get the ptr field from
   a Handle object h, so that the h is preserved from GC across the ccall
   This fix is analogous to that for the NVector wrapper in PR #380
   (NB: there is no actual problem at least when Sundials is used with the SciML interface,
   as the Handle{CVODEMem} and similar objects are held by a persistent solver data structure, but this
   change should reduce the risk of something going wrong in the future, or for eg test harnesses that don't use
   the SciML interface)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants