Moving petsc nonlinear solver state to cache #53
Conversation
Codecov Report
@@               Coverage Diff                            @@
##           partitioned_arrays_support      #53      +/-  ##
==============================================================
- Coverage                       87.96%   70.21%   -17.76%
==============================================================
  Files                               9        9
  Lines                             748      752        +4
==============================================================
- Hits                              658      528      -130
- Misses                             90      224      +134
Continue to review full report at Codecov.
Hi @amartinhuertas!
Please find some comments.
…yout in the nonlinear solver cache so that we avoid extra copies in residual! and jacobian!
The previous variant was causing a nasty error message with PETSc built in debug mode:

[0]PETSC ERROR: #3 VecSetErrorIfLocked() at /home/amartin/software_installers/petsc-3.15.4-build-dbg-gnu9-mpi/include/petscvec.h:574
[0]PETSC ERROR: #4 VecGetArray() at /home/amartin/software_installers/petsc-3.15.4-build-dbg-gnu9-mpi/src/vec/vec/interface/rvector.c:1795
PETScNonLinearSolvers: Error During Test at /home/amartin/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:13
Got exception outside of a @test
LoadError: Petsc returned with error code: 73
Stacktrace:
 [1] macro expansion @ ~/git-repos/GridapPETSc.jl/src/Config.jl:88 [inlined]
 [2] get_local_vector(a::GridapPETSc.PETScVector) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScArrays.jl:176
 [3] copy!(vec::Vector{Float64}, petscvec::GridapPETSc.PETScVector) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScArrays.jl:124
 [4] copy!(a::Vector{Float64}, petsc_vec::GridapPETSc.PETSC.Vec) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScArrays.jl:112
 [5] snes_residual(csnes::Ptr{Nothing}, cx::Ptr{Nothing}, cfx::Ptr{Nothing}, ctx::Ptr{Nothing}) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScNonlinearSolvers.jl:42
 [6] SNESSolve @ ~/git-repos/GridapPETSc.jl/src/PETSC.jl:62 [inlined]
 [7] macro expansion @ ~/git-repos/GridapPETSc.jl/src/Config.jl:86 [inlined]
 [8] solve!(x::Vector{Float64}, nls::GridapPETSc.PETScNonlinearSolver, op::Gridap.Algebra.NonlinearOperatorMock) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScNonlinearSolvers.jl:183
 [9] test_nonlinear_solver(nls::GridapPETSc.PETScNonlinearSolver, op::Gridap.Algebra.NonlinearOperatorMock, x0::Vector{Float64}, x::Vector{Float64}, pred::typeof(isapprox)) @ Gridap.Algebra ~/.julia/packages/Gridap/AfDIn/src/Algebra/NonlinearSolvers.jl:62
 [10] test_nonlinear_solver(nls::GridapPETSc.PETScNonlinearSolver, op::Gridap.Algebra.NonlinearOperatorMock, x0::Vector{Float64}, x::Vector{Float64}) @ Gridap.Algebra ~/.julia/packages/Gridap/AfDIn/src/Algebra/NonlinearSolvers.jl:61
 [11] top-level scope @ ~/git-repos/GridapPETSc.jl/test/sequential/PETScNonlinearSolversTests.jl:15
 [12] include(mod::Module, _path::String) @ Base ./Base.jl:386
 [13] include @ ~/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:1 [inlined]
 [14] macro expansion @ ~/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:13 [inlined]
 [15] macro expansion @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
 [16] macro expansion @ ~/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:13 [inlined]
 [17] top-level scope @ ./timing.jl:210 [inlined]
 [18] top-level scope @ ~/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:0
 [19] include(fname::String) @ Base.MainInclude ./client.jl:444
 [20] top-level scope @ REPL[3]:1
 [21] eval @ ./boot.jl:360 [inlined]
 [22] eval @ ./Base.jl:39 [inlined]
 [23] repleval(m::Module, code::Expr, #unused#::String) @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/repl.jl:157
 [24] (::VSCodeServer.var"#69#71"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})() @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/repl.jl:123
 [25] with_logstate(f::Function, logstate::Any) @ Base.CoreLogging ./logging.jl:491
 [26] with_logger @ ./logging.jl:603 [inlined]
 [27] (::VSCodeServer.var"#68#70"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})() @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/repl.jl:124
 [28] #invokelatest#2 @ ./essentials.jl:708 [inlined]
 [29] invokelatest(::Any) @ Base ./essentials.jl:706
 [30] macro expansion @ ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
 [31] (::VSCodeServer.var"#53#54")() @ VSCodeServer ./task.jl:411
in expression starting at /home/amartin/git-repos/GridapPETSc.jl/test/sequential/PETScNonlinearSolversTests.jl:1
Test Summary:         | Error  Total
PETScNonLinearSolvers |     1      1
ERROR: LoadError: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/amartin/git-repos/GridapPETSc.jl/test/sequential/runtests.jl:1
After never-ending debugging sessions, I could work around the bug in … I still do not understand what the cause of the problem was, but I came up with a workaround in that commit.
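For context, here is a minimal sketch of the idea behind the workaround (all names below are illustrative, not GridapPETSc's actual implementation): keep Julia-side work vectors with the layout expected by residual!/jacobian! inside the solver cache, so that the SNES callbacks reuse them instead of requesting write access to a PETSc Vec that may be locked (which is what VecSetErrorIfLocked complains about).

# Hypothetical cache layout; op, x_work and res_work are made-up names.
struct NonlinearSolverCacheSketch{O,V}
  op::O        # the Gridap nonlinear operator
  x_work::V    # cached solution vector in the layout residual! expects
  res_work::V  # cached residual vector reused across SNES callbacks
end

# Hypothetical callback flow: copy the (read-only) SNES input into x_work,
# call residual!(res_work, op, x_work), and copy res_work into the output Vec,
# never taking a writable array view of the locked input vector.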
I think I performed all the pending work in this PR. The MPI parallel tests pass on my local machine. However, the serial tests are still giving me trouble! (I am starting to think there is something wrong with my local installation, as the code seems to be robust in GitHub Actions.) @fverdugo, can you reproduce the issues that I describe below on your machine?
(base) amartin@sistemas-ThinkPad-X1-Carbon-6th:~/git-repos/GridapPETSc.jl$ julia --project=. test/sequential/PETScAssemblyTests.jl
[0] PetscDetermineInitialFPTrap(): Floating point trapping is off by default 0
[0] PetscInitialize(): PETSc successfully started: number of processors = 1
[0] PetscGetHostName(): Rejecting domainname, likely is NIS sistemas-ThinkPad-X1-Carbon-6th.(none)
[0] PetscInitialize(): Running on machine: sistemas-ThinkPad-X1-Carbon-6th
[0] PetscCommDuplicate(): Duplicating a communicator 140230383909504 25901472 max tags = 2147483647
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: nnz cannot be greater than row length: local row 0 value 5 rowlength 3
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.15.4, Sep 01, 2021
[0]PETSC ERROR: GridapPETSc on a x86_64 named sistemas-ThinkPad-X1-Carbon-6th by amartin Fri Nov 12 23:58:38 2021
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 -with-blaslapack-dir=/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl --download-mumps --download-scalapack --download-parmetis --download-metis --download-ptscotch --with-debugging --with-x=0 --with-shared-libraries=1 --with-mpi=1 --with-64-bit-indices
[0]PETSC ERROR: #1 MatSeqAIJSetPreallocation_SeqAIJ() at /home/amartin/software_installers/petsc-3.15.4-build-dbg-gnu9-mpi/src/mat/impls/aij/seq/aij.c:4101
[0]PETSC ERROR: #2 MatCreateSeqAIJ() at /home/amartin/software_installers/petsc-3.15.4-build-dbg-gnu9-mpi/src/mat/impls/aij/seq/aij.c:4015
ERROR: LoadError: Petsc returned with error code: 63
Stacktrace:
 [1] macro expansion @ ~/git-repos/GridapPETSc.jl/src/Config.jl:88 [inlined]
 [2] nz_allocation(a::GridapPETSc.MatCounter{Gridap.Algebra.Loop}) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/PETScAssembly.jl:178
 [3] (::Main.PETScAssemblyTests.var"#1#2")() @ Main.PETScAssemblyTests ~/git-repos/GridapPETSc.jl/test/sequential/PETScAssemblyTests.jl:28
 [4] with(f::Main.PETScAssemblyTests.var"#1#2"; kwargs::Base.Iterators.Pairs{Symbol, Vector{SubString{String}}, Tuple{Symbol}, NamedTuple{(:args,), Tuple{Vector{SubString{String}}}}}) @ GridapPETSc ~/git-repos/GridapPETSc.jl/src/Environment.jl:38
 [5] top-level scope @ ~/git-repos/GridapPETSc.jl/test/sequential/PETScAssemblyTests.jl:14
in expression starting at /home/amartin/git-repos/GridapPETSc.jl/test/sequential/PETScAssemblyTests.jl:1
[0] PetscFinalize(): PetscFinalize() called
I think I finally understand what's going on! On my machine, PETSc is compiled in debug mode. This means that all debug checks on argument preconditions are enabled (they are disabled in release mode). I think that in GitHub Actions we might be using PETSc_jll compiled in release mode. If this is the case, I think this is VERY dangerous and should be avoided. I can try to modify the GitHub Actions workflow so that we use our own version of PETSc compiled in debug mode for the tests. I did this already for GridapDistributedPETScWrappers; it should be easy to replicate here.
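If we end up testing against a custom PETSc build, something along these lines could work on the Julia side (a sketch only; the JULIA_PETSC_LIBRARY variable and the path below are assumptions, please check the GridapPETSc README for the exact configuration mechanism):

# Point GridapPETSc at a locally built, debug-mode PETSc and rebuild the package.
# The library path below is hypothetical.
ENV["JULIA_PETSC_LIBRARY"] = "/path/to/petsc-dbg/lib/libpetsc.so"

using Pkg
Pkg.build("GridapPETSc")  # rebuild so the package picks up the custom library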
I found a BUG related to this. See 1500d63. The BUG was introduced a long time ago. I wonder why it did not manifest before. Nevermind ... it is solved now.
This is now solved as well. See 5a7b90a. This error has to do with my comment above: we were not fulfilling a precondition check in one of the PETSc subroutines.
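For reference, the precondition that the debug-mode build enforces in MatSeqAIJSetPreallocation is that the per-row preallocation count never exceeds the number of columns. The helper below is hypothetical (it is not the fix in 5a7b90a); it only makes the invariant explicit:

# Check the PETSc preallocation precondition: nnz[i] <= number of columns.
# A release-mode PETSc skips this check, which is why the bug stayed hidden.
function check_preallocation(nnz::AbstractVector{<:Integer}, ncols::Integer)
  for (i, n) in enumerate(nnz)
    n <= ncols || error("nnz cannot be greater than row length: row $i value $n rowlength $ncols")
  end
  nnz
end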
I can confirm this is the case. See the error output in GitHub Actions.
@fverdugo ... FYI ... another "surprise": in 758a12d I reverted 9b21f58. It seems that the latter leads to deadlocks in GitHub Actions (at least the GitHub Actions job got stuck and was cancelled after hitting the maximum wall-clock time).
function copy!(a::AbstractVector, b::PETScVector)
  @check length(a) == length(b)
  _copy!(a, b.vec[])
end

# Backend-specific methods (only the signatures are listed here):
_copy!(a::Vector, b::Vec)                      # plain Julia vector
_copy!(a::PVector{<:SequentialData}, b::Vec) = @notimplemented
_copy!(a::PVector{<:MPIData}, b::Vec)          # MPI-distributed vector
It turns out that the deadlock in GitHub Actions does not seem to be related to removing …
…PETSc.jl into moving_petsc_nonlinear_solver_state_to_cache
Not ready to merge ...
I have found an error on my local machine (see below) when running the PETSc nonlinear solvers sequential tests. At first glance, I thought it could be related to the changes in this PR, but I have checked out previous commits and the error persists. To investigate.