solve_for_H in CarbonChemistry throws GPU exceptions #228

Closed
ali-ramadhan opened this issue Nov 6, 2024 · 3 comments

Comments

@ali-ramadhan
Collaborator

I noticed that a few Oceananigans + LOBSTER simulations crashed with a GPU exception. Re-running Julia with the -g2 flag produced a stack trace pointing to solve_for_H as the source of the exception. I'm guessing the root solver could not converge for some reason?

It happens after hundreds of thousands of Oceananigans iterations, so I'm not sure if it's just bad numbers going in. I tried reproducing it by calling CarbonChemistry() directly with random numbers, but so far I have not been able to.

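using Distributions
using OceanBioME

cc = CarbonChemistry()

# Call the carbon chemistry model with random inputs, indefinitely,
# to try to trigger the same failure outside of a full simulation.
while true
    DIC = rand(Uniform(1000, 2500))
    T = rand(Uniform(-5, 30))
    S = rand(Uniform(25, 40))
    Alk = rand(Uniform(1000, 2500))

    cc(; DIC, T, S, Alk)
end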

Was wondering if anyone has encountered a similar issue.

The only thing I could see is that the upper and lower pH bounds for the solve are super wide (0 <= pH <= 14). Not sure if this could be contributing, but would it make sense to narrow the range to more reasonable values, say 6-12?

https://github.com/OceanBioME/OceanBioME.jl/blob/main/src/Models/CarbonChemistry/carbon_chemistry.jl#L133-L134
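
To make the bounds question concrete, here is a minimal sketch of bisection over a pH bracket. Everything in it is an assumption for illustration: `toy_residual` is a made-up monotonic function with a root at pH = 8, not OceanBioME's residual, and `bisect` is a hand-rolled routine, not the Roots.jl implementation that appears in the stack trace.

```julia
# Toy stand-in for the carbonate-system residual (NOT OceanBioME's actual
# function): it is monotonic in pH and crosses zero at pH = 8.
toy_residual(pH) = 10.0^(-pH) - 1e-8

# A bracket is usable for bisection only if the residual changes sign across it.
is_bracket(f, lo, hi) = sign(f(lo)) * sign(f(hi)) < 0

# Hand-rolled bisection, for illustration only (not Roots.jl's implementation).
function bisect(f, lo, hi; atol = 1e-10, maxiter = 200)
    is_bracket(f, lo, hi) || throw(ArgumentError("[$lo, $hi] does not bracket a root"))
    flo = f(lo)
    for _ in 1:maxiter
        mid = (lo + hi) / 2
        fmid = f(mid)
        if fmid == 0 || (hi - lo) / 2 < atol
            return mid
        end
        # Keep the half-interval that still contains the sign change.
        if sign(fmid) == sign(flo)
            lo, flo = mid, fmid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

wide   = bisect(toy_residual, 0.0, 14.0)   # bounds 0 <= pH <= 14
narrow = bisect(toy_residual, 6.0, 12.0)   # bounds 6 <= pH <= 12
```

For a monotonic residual like this toy one, both brackets are valid and lead to the same root; the wider bracket only costs a few extra halvings. Narrowing the bounds would matter only if the residual could have the same sign at both wide bounds.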


Here's the GPU exception stack trace:

ERROR: a exception was thrown during kernel execution on thread (97, 1, 1) in block (37, 1, 1).
Stacktrace:
 [1] assert_bracket at /home/alir/.julia/packages/Roots/KNVCY/src/Bracketing/bracketing.jl:52
 [2] #init_state#52 at /home/alir/.julia/packages/Roots/KNVCY/src/Bracketing/bisection.jl:50
 [3] init_state at /home/alir/.julia/packages/Roots/KNVCY/src/Bracketing/bisection.jl:34
 [4] init_state at /home/alir/.julia/packages/Roots/KNVCY/src/Bracketing/bracketing.jl:6
 [5] #init#43 at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:299
 [6] init at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:289
 [7] #solve#47 at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:491
 [8] solve at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:484
 [9] #find_zero#40 at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:220
 [10] find_zero at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:210
 [11] find_zero at /home/alir/.julia/packages/Roots/KNVCY/src/find_zero.jl:210
 [12] solve_for_H at /home/alir/.julia/packages/OceanBioME/fTQIg/src/Models/CarbonChemistry/carbon_chemistry.jl:186
 [13] #_#38 at /home/alir/.julia/packages/OceanBioME/fTQIg/src/Models/CarbonChemistry/carbon_chemistry.jl:168
 [14] CarbonChemistry at /home/alir/.julia/packages/OceanBioME/fTQIg/src/Models/CarbonChemistry/carbon_chemistry.jl:123
 [15] surface_value at /home/alir/.julia/packages/OceanBioME/fTQIg/src/Models/GasExchange/carbon_dioxide_concentration.jl:30
 [16] GasExchange at /home/alir/.julia/packages/OceanBioME/fTQIg/src/Models/GasExchange/gas_exchange.jl:35
 [17] getbc at /home/alir/.julia/packages/Oceananigans/s1DfC/src/BoundaryConditions/discrete_boundary_function.jl:38
 [18] apply_z_top_bc! at /home/alir/.julia/packages/Oceananigans/s1DfC/src/BoundaryConditions/apply_flux_bcs.jl:158
 [19] macro expansion at /home/alir/.julia/packages/Oceananigans/s1DfC/src/BoundaryConditions/apply_flux_bcs.jl:82
 [20] gpu__apply_z_bcs! at /home/alir/.julia/packages/KernelAbstractions/iW1Rw/src/macros.jl:97
 [21] gpu__apply_z_bcs! at ./none:0

ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA GeForce RTX 4090
Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/exceptions.jl:39
  [2] synchronize(stream::CUDA.CuStream; blocking::Bool, spin::Bool)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/synchronization.jl:207
  [3] synchronize (repeats 2 times)
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/synchronization.jl:194 [inlined]
  [4] (::CUDA.var"#1127#1128"{Float64, Vector{Float64}, Int64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/array.jl:554
  [5] #context!#990
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/state.jl:168 [inlined]
  [6] context!
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/state.jl:163 [inlined]
  [7] unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/array.jl:550
  [8] copyto!
    @ ~/.julia/packages/CUDA/2kjXI/src/array.jl:503 [inlined]
  [9] getindex
    @ ~/.julia/packages/GPUArrays/qt4ax/src/host/indexing.jl:52 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:210 [inlined]
 [11] norm(a::Field{Center, Center, Center, Nothing, ImmersedBoundaryGrid{…}, Tuple{…}, OffsetArrays.OffsetArray{…}, Float64, FieldBoundaryConditions{…}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{…}}; condition::Nothing)
    @ Oceananigans.Fields ~/.julia/packages/Oceananigans/s1DfC/src/Fields/field.jl:742
 [12] norm
    @ ~/.julia/packages/Oceananigans/s1DfC/src/Fields/field.jl:739 [inlined]
 [13] solve!(::Field{…}, ::Oceananigans.Solvers.ConjugateGradientSolver{…}, ::Field{…})
    @ Oceananigans.Solvers ~/.julia/packages/Oceananigans/s1DfC/src/Solvers/conjugate_gradient_solver.jl:170
 [14] solve_for_pressure!(pressure::Field{…}, solver::ConjugateGradientPoissonSolver{…}, Δt::Float64, Ũ::@NamedTuple{…})
    @ Oceananigans.Models.NonhydrostaticModels ~/.julia/packages/Oceananigans/s1DfC/src/Models/NonhydrostaticModels/solve_for_pressure.jl:89
 [15] calculate_pressure_correction!(model::NonhydrostaticModel{…}, Δt::Float64)
    @ Oceananigans.Models.NonhydrostaticModels ~/.julia/packages/Oceananigans/s1DfC/src/Models/NonhydrostaticModels/pressure_correction.jl:15
 [16] time_step!(model::NonhydrostaticModel{…}, Δt::Float64; callbacks::Tuple{})
    @ Oceananigans.TimeSteppers ~/.julia/packages/Oceananigans/s1DfC/src/TimeSteppers/runge_kutta_3.jl:148
 [17] time_step!(sim::Simulation{NonhydrostaticModel{…}, Float64, Float64, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}})
    @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/s1DfC/src/Simulations/run.jl:134
 [18] run!(sim::Simulation{NonhydrostaticModel{…}, Float64, Float64, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}}; pickup::Bool)
    @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/s1DfC/src/Simulations/run.jl:97
 [19] run!(sim::Simulation{NonhydrostaticModel{…}, Float64, Float64, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}, OrderedCollections.OrderedDict{…}})
    @ Oceananigans.Simulations ~/.julia/packages/Oceananigans/s1DfC/src/Simulations/run.jl:85
 [20] top-level scope
    @ ~/atdepth/Test.jl/test.jl:261
 [21] include(fname::String)
    @ Base.MainInclude ./client.jl:489
 [22] top-level scope
    @ REPL[1]:1
in expression starting at /home/alir/atdepth/Test.jl/test.jl:261
Some type information was truncated. Use `show(err)` to see complete types.
@jagoosw
Collaborator

jagoosw commented Nov 7, 2024

I tried to reproduce this on CPU and didn't get an error for ~100 million iterations. Is that consistent with what you've seen?

It's strange that it errors when checking that the initial guesses bracket a root, but only on GPU, since it shouldn't be doing anything different there.

> The only thing I could see is that the upper and lower pH bounds for the solve are super wide (0 <= pH <= 14). Not sure if this could be contributing, but would it make sense to narrow the range to more reasonable values, say 6-12?

Maybe the problem with a super wide range is that if the residual it solves for is not monotonic, then the wide initial guesses might not bracket a root. I went with Bisection because the solve failed to converge when it was using an initial value method, but maybe it would be better to find an initial value method that we can impose bounds on, and then start it from e.g. pH = 7.
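
The non-monotonic scenario can be sketched with a made-up residual. `bumpy_residual` below is purely hypothetical (it is not the carbonate residual and says nothing about whether the real one is monotonic), and `first_bracket` is an illustrative helper, not part of Roots.jl or OceanBioME.

```julia
# Hypothetical non-monotonic residual (NOT the real carbonate residual):
# it has roots at pH = 5.2 and pH = 9.3 and is positive at pH = 0 and pH = 14.
bumpy_residual(pH) = (pH - 5.2) * (pH - 9.3)

# A bracket is usable for bisection only if the residual changes sign across it.
is_bracket(f, lo, hi) = sign(f(lo)) * sign(f(hi)) < 0

# Scan [lo, hi] in n equal steps and return the first sub-interval across
# which the residual changes sign, or `nothing` if no sign change is found.
function first_bracket(f, lo, hi; n = 20)
    edges = range(lo, hi; length = n + 1)
    for i in 1:n
        is_bracket(f, edges[i], edges[i + 1]) && return (edges[i], edges[i + 1])
    end
    return nothing
end

wide_ok   = is_bracket(bumpy_residual, 0.0, 14.0)   # same sign at both ends
narrow_ok = is_bracket(bumpy_residual, 6.0, 12.0)   # only one root inside
sub       = first_bracket(bumpy_residual, 0.0, 14.0)
```

Here the wide bounds fail the bracketing check even though two roots lie inside them, while the narrower bounds pass because they contain a single sign change. A coarse scan like `first_bracket` is one way to recover a valid bracket, at the cost of extra residual evaluations and with the caveat that a root landing exactly on a scan point is skipped by the strict sign test.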

@ali-ramadhan
Collaborator Author

I think I went up to ~50 million iterations and also couldn't get an error.

I wonder if it's just a GPU thing, but I'll also try narrowing the pH range and seeing if that helps.

@ali-ramadhan
Collaborator Author

Narrowing the pH range did not help, and after more investigating I realized I actually had NaNs in the simulation 🤦

I guess solve_for_H was catching them before the NaNChecker haha.
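
A small sketch of why NaN inputs would trip a bracketing check no matter how the pH bounds are chosen. The residual here is a toy stand-in and `inputs_ok` is a hypothetical guard; neither is taken from OceanBioME or Roots.jl. The only fact relied on is that any comparison involving NaN evaluates to false.

```julia
# Toy residual standing in for the real one (an assumption for illustration):
# it depends on an input `DIC`, so a NaN input makes every evaluation NaN.
toy_residual(pH, DIC) = DIC * 10.0^(-pH) - 1e-5

# Sign-change test on two residual values; false whenever either value is NaN.
is_bracket(fa, fb) = sign(fa) * sign(fb) < 0

# Hypothetical guard: detect non-finite inputs before attempting any solve.
inputs_ok(inputs...) = all(isfinite, inputs)

good       = is_bracket(toy_residual(0.0, 2000.0), toy_residual(14.0, 2000.0))
bad_wide   = is_bracket(toy_residual(0.0, NaN),    toy_residual(14.0, NaN))
bad_narrow = is_bracket(toy_residual(6.0, NaN),    toy_residual(12.0, NaN))
```

With finite inputs the wide bounds bracket a root; with a NaN input both the wide and the narrow bounds fail the check, which matches the observation that narrowing the range made no difference. A guard such as `inputs_ok(DIC, T, S, Alk)` would flag the bad values before the solver is reached.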

Sorry for the false alarm. Will close this issue as resolved!
