Add pm-gpu #835

Closed
wants to merge 7 commits into from

xylar
Collaborator

@xylar xylar commented Jul 1, 2024

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes

@xylar xylar added the "dependencies and deployment" label (Changes relate to creating conda and Spack environments, and creating a load script) Jul 1, 2024
@xylar xylar requested a review from matthewhoffman July 1, 2024 11:36
@xylar xylar self-assigned this Jul 1, 2024
@xylar
Collaborator Author

xylar commented Jul 1, 2024

@mcarlson801 and @jewatkins, this is the starting point for adding pm-gpu support (with gnugpu for now, and maybe nvidiagpu to follow).

I have been able to run the full_integration suite from MALI on pm-gpu, but I suspect it is just running on the CPU portion of each node for now. Along with @matthewhoffman, we should discuss what modifications are needed to build MALI and/or run the job to make sure Compass takes advantage of GPUs. We will likely also need a way to detect the GPU resources available and to specify the resource needs of each test. This isn't something I have thought about very much, but it is also needed very, very soon for the Omega ocean model in Polaris, the successor to Compass.
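As a starting point for that discussion, here is a rough sketch of how GPU detection and a per-step resource check might look. It is not existing Compass or Polaris code: the function names and the `gpus_per_task` idea are hypothetical, and it assumes NVIDIA GPUs with `nvidia-smi` on the path, which prints one `GPU <n>: ...` line per device when called with `-L`.

```python
import shutil
import subprocess


def parse_gpu_list(listing):
    """Count the GPUs in the text printed by ``nvidia-smi -L``."""
    return sum(1 for line in listing.splitlines()
               if line.strip().startswith("GPU "))


def detect_gpu_count():
    """Best-effort count of NVIDIA GPUs on this node (0 if none are found)."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                            text=True, check=False)
    return parse_gpu_list(result.stdout) if result.returncode == 0 else 0


def step_fits(gpus_per_task, ntasks, gpus_available):
    """Hypothetical check that a step's GPU request fits on one node."""
    return gpus_per_task * ntasks <= gpus_available


if __name__ == "__main__":
    available = detect_gpu_count()
    print(f"GPUs detected: {available}")
    print(f"1 GPU per task, 4 tasks fits: {step_fits(1, 4, available)}")
```

A real implementation would probably also need to respect what the job scheduler has allocated rather than what is physically on the node; that is left out of this sketch.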

@xylar
Collaborator Author

xylar commented Jul 1, 2024

A note to say that I tried to build Albany and Trilinos with nvidiagpu and got a bunch of errors. So I'm not listing that as a supported config.

@xylar
Collaborator Author

xylar commented Aug 13, 2024

@mcarlson801 and @jewatkins, I was able to build the trilinos and albany spack libraries (and the rest of the compass spack environment) using this branch, https://github.com/xylar/mache/tree/add-cuda-to-pm-gpu and @mcarlson801's E3SM-Project/spack#31.

I was also able to build MALI from the MALI-Dev submodule.

I ran the full_integration test suite and it was much slower than usual -- a 1-hour job timed out. This may be related to issues with the $SCRATCH drive, because I was also having trouble with basic operations there. So it might be worth testing again later. But I also saw errors in several tests, mostly restart tests but also a decomp test, which I will report once I get the job to run again.

@xylar
Collaborator Author

xylar commented Aug 13, 2024

The following restart tests failed with a validation error:

landice/dome/2000m/fo_restart_test
landice/dome/variable_resolution/fo_restart_test
landice/greenland/fo_restart_test

The errors are all quantitatively similar to the following:

thickness            Time index: 0, 1, 2
1:  l1: 1.03182219712838e-12  l2: 2.85855338539842e-13  linf: 1.13686837721616e-13
2:  l1: 1.31200952879773e-12  l2: 3.29681925916544e-13  linf: 1.13686837721616e-13
  FAIL /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240813/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/full_run/output.nc
       /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240813/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/restart_run/output.nc
normalVelocity       Time index: 0, 1, 2
0:  l1: 2.42945560944798e-18  l2: 3.23431732743427e-20  linf: 2.11758236813575e-21
1:  l1: 2.66684010607556e-18  l2: 3.63297450051263e-20  linf: 1.90582413132218e-21
2:  l1: 2.68145536927484e-18  l2: 3.58541371140577e-20  linf: 2.11758236813575e-21
  FAIL /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240813/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/full_run/output.nc
       /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240813/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/restart_run/output.nc
Internal test case validation failed
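
For scale: the linf value 1.13686837721616e-13 reported for thickness equals 2^-43 to the printed precision, which is the spacing between adjacent double-precision numbers in the range [512, 1024). If the thickness values fall in that range (an assumption, not checked against the output files), the full and restart runs would differ only in the last bit. The sketch below uses made-up values and the usual definitions of the three norms; it is not the Compass validation code.

```python
import math


def diff_norms(full, restart):
    """Return (l1, l2, linf) norms of the element-wise difference."""
    diff = [a - b for a, b in zip(full, restart)]
    l1 = sum(abs(d) for d in diff)
    l2 = math.sqrt(sum(d * d for d in diff))
    linf = max(abs(d) for d in diff)
    return l1, l2, linf


# Made-up thickness values (not taken from the output files above); the
# second entry differs from its partner by one unit in the last place.
full_run = [700.0, 650.0, 300.0]
restart_run = [700.0, math.nextafter(650.0, math.inf), 300.0]

l1, l2, linf = diff_norms(full_run, restart_run)

# One-bit difference at this magnitude is exactly 2**-43
print(linf == math.ulp(650.0))
# Matches the reported linf to the 15 digits that were printed
print(math.isclose(linf, 1.13686837721616e-13, rel_tol=1e-14))
```

`math.nextafter` and `math.ulp` need Python 3.9 or newer.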

@xylar
Collaborator Author

xylar commented Aug 13, 2024

The landice/circular_shelf/decomposition_test fails in the 1proc_run step with the following stack trace from Albany:

:0: : block: [10,0,0], thread: [0,92,0] Assertion `Allocation failed.` failed.
:0: : block: [10,0,0], thread: [0,93,0] Assertion `Allocation failed.` failed.
:0: : block: [26,0,0], thread: [0,29,0] Assertion `Allocation failed.` failed.
:0: : block: [6,0,0], thread: [0,125,0] Assertion `Allocation failed.` failed.
:0: : block: [82,0,0], thread: [0,29,0] Assertion `Allocation failed.` failed.
:0: : block: [79,0,0], thread: [0,125,0] Assertion `Allocation failed.` failed.
:0: : block: [72,0,0], thread: [0,125,0] Assertion `Allocation failed.` failed.
:0: : block: [62,0,0], thread: [0,92,0] Assertion `Allocation failed.` failed.
:0: : block: [75,0,0], thread: [0,29,0] Assertion `Allocation failed.` failed.
:0: : block: [7,0,0], thread: [0,61,0] Assertion `Allocation failed.` failed.
:0: : block: [41,0,0], thread: [0,93,0] Assertion `Allocation failed.` failed.
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorAssert): device-side assert triggered /pscratch/sd/x/xylar/spack_gpu_tmp/spack-stage/spack-stage-trilinos-for-albany-compass-2024-03-13-wlto53yjmwkx6vak3n7ssh2tho4n2n6f/spack-src/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:166
Backtrace:
[0x7f97ad196e35] Kokkos::Impl::save_stacktrace()
[0x7f97ad16a58c] Kokkos::Impl::host_abort(char const*)
[0x7f97ad19e54e] Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int)
[0x7f97ad19e80a] Kokkos::Impl::cuda_stream_synchronize(CUstream_st*, Kokkos::Impl::CudaInternal const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
[0x7f97f1f5bba8] PHX::DagManager<PHAL::AlbanyTraits>::evaluateFields(PHAL::Workset&)
[0x7f97e2c260de] Albany::Application::computeGlobalJacobianImpl(double, double, double, double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Teuchos::RCP<Thyra::VectorBase<double> > const&, Teuchos::RCP<Thyra::LinearOpBase<double> > const&, double)
[0x7f97e2c276e3] Albany::Application::computeGlobalJacobian(double, double, double, double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Teuchos::RCP<Thyra::VectorBase<double> > const&, Teuchos::RCP<Thyra::LinearOpBase<double> > const&, double)
[0x7f97e2e2632a] Albany::ModelEvaluator::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97e2b4c878] Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97e2bcd96d] Thyra::DefaultModelEvaluatorWithSolveFactory<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97e2b4c878] Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97dbb90f25] NOX::Thyra::Group::computeJacobian()
[0x7f97dbaed262] NOX::Direction::Newton::compute(NOX::Abstract::Vector&, NOX::Abstract::Group&, NOX::Solver::Generic const&)
[0x7f97dbb09c48] NOX::Solver::LineSearchBased::step()
[0x7f97dbb0bb39] NOX::Solver::LineSearchBased::solve()
[0x7f97dbba66a6] Thyra::NOXNonlinearSolver::solve(Thyra::VectorBase<double>*, Thyra::SolveCriteria<double> const*, Thyra::VectorBase<double>*)
[0x7f97e198a14f] Piro::NOXSolver<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97e2b4c878] Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x7f97f1b3d385] void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >)
[0x7f97f1b3dd9d] void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&)
[0x7f97f1b0dc63] velocity_solver_solve_fo__(int, int, int, bool, bool, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, double, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, int&, double const&)

@xylar
Collaborator Author

xylar commented Aug 13, 2024

Here's the timing I'm seeing:

Test Runtimes:
00:47 PASS landice_dome_2000m_sia_restart_test
00:11 PASS landice_dome_2000m_sia_decomposition_test
00:19 PASS landice_dome_variable_resolution_sia_restart_test
00:07 PASS landice_dome_variable_resolution_sia_decomposition_test
00:14 PASS landice_enthalpy_benchmark_A
00:17 PASS landice_eismint2_decomposition_test
00:14 PASS landice_eismint2_enthalpy_decomposition_test
00:31 PASS landice_eismint2_restart_test
01:30 PASS landice_eismint2_enthalpy_restart_test
01:13 PASS landice_greenland_sia_restart_test
00:45 PASS landice_greenland_sia_decomposition_test
01:12 PASS landice_hydro_radial_restart_test
01:27 PASS landice_hydro_radial_decomposition_test
01:47 PASS landice_humboldt_mesh-3km_decomposition_test_velo-none_calving-none_subglacialhydro
01:17 PASS landice_humboldt_mesh-3km_restart_test_velo-none_calving-none_subglacialhydro
00:53 PASS landice_dome_2000m_fo_decomposition_test
01:00 FAIL landice_dome_2000m_fo_restart_test
00:44 PASS landice_dome_variable_resolution_fo_decomposition_test
01:24 FAIL landice_dome_variable_resolution_fo_restart_test
00:08 FAIL landice_circular_shelf_decomposition_test
06:48 PASS landice_greenland_fo_decomposition_test
09:44 FAIL landice_greenland_fo_restart_test
05:10 PASS landice_thwaites_fo_decomposition_test
08:46 FAIL landice_thwaites_fo_restart_test
03:19 PASS landice_thwaites_fo-depthInt_decomposition_test
06:28 FAIL landice_thwaites_fo-depthInt_restart_test
12:00 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo_calving-von_mises_stress_damage-threshold_faceMelting
07:57 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo-depthInt_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 76:36
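
A summary like this is easy to tally by script when comparing runs. The following is a minimal sketch that assumes each result line has the form `MM:SS STATUS name` as shown above; the helper name `parse_runtimes` is made up for illustration and is not part of Compass.

```python
def parse_runtimes(log_text):
    """Parse 'MM:SS STATUS name' lines into (seconds, status, name) tuples."""
    results = []
    for line in log_text.splitlines():
        parts = line.split()
        # Skip headers and the "Total runtime" line
        if len(parts) != 3 or parts[1] not in ("PASS", "FAIL"):
            continue
        minutes, seconds = parts[0].split(":")
        results.append((60 * int(minutes) + int(seconds), parts[1], parts[2]))
    return results


# Three lines copied from the summary above
sample = """\
00:53 PASS landice_dome_2000m_fo_decomposition_test
01:00 FAIL landice_dome_2000m_fo_restart_test
00:08 FAIL landice_circular_shelf_decomposition_test
"""

results = parse_runtimes(sample)
failed = [name for _, status, name in results if status == "FAIL"]
total = sum(seconds for seconds, _, _ in results)
print(f"{len(failed)} of {len(results)} failed, "
      f"{total // 60:02d}:{total % 60:02d} total")
# prints "2 of 3 failed, 02:01 total"
```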

@mcarlson801
Contributor

> The landice/circular_shelf/decomposition_test fails in the 1proc_run step with the following stack trace from Albany:

Did you build Albany with the +slfad variant? I think this is the same error I ran into when I was building with DFad (although other tests would be failing with it too, hmmm). I'll take a look and see what's up.

@xylar
Collaborator Author

xylar commented Aug 13, 2024

> Did you build Albany with the +slfad variant?

No, I missed that. Is that for Trilinos? Albany? both?

@mcarlson801
Contributor

Actually, scratch that, for MALI we would use +sfad12. You only need it for Albany.

@xylar
Collaborator Author

xylar commented Aug 13, 2024

Okay, thanks, I'll add that and try again.

@mcarlson801
Contributor

mcarlson801 commented Aug 13, 2024

Actually, to specify sfad 12, I think it's probably +sfad but I'm not sure how to set the size.

This is the line where the sfadsize gets used: https://github.com/E3SM-Project/spack/blob/develop/var/spack/repos/builtin/packages/albany/package.py#L130

And this is the line where the sfadsize is obtained: https://github.com/E3SM-Project/spack/blob/develop/var/spack/repos/builtin/packages/albany/package.py#L46

@ikalash Do you know what we need to add to the variants to install with +sfad with sfadsize = 12?

@xylar
Collaborator Author

xylar commented Aug 13, 2024

@mcarlson801 please keep me posted, then.

@mcarlson801
Contributor

I looked up how to set multi-valued variants and it looks like the way to do this would be to add +sfad sfadsize=12.
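
For anyone unfamiliar with the spec syntax: in a Spack spec, `+name` or `~name` switches a boolean variant on or off, while `name=value` sets a variant that takes a value. The toy parser below only illustrates how `+sfad sfadsize=12` breaks down into those two kinds of token. It is a simplified sketch with a made-up function name, not Spack's own parser.

```python
def parse_variants(spec_suffix):
    """Split a variant string such as '+sfad sfadsize=12' into a dict.

    Simplified illustration only; Spack's own parser handles far more syntax.
    """
    variants = {}
    for token in spec_suffix.split():
        if token.startswith("+"):
            variants[token[1:]] = True      # boolean variant switched on
        elif token.startswith("~"):
            variants[token[1:]] = False     # boolean variant switched off
        elif "=" in token:
            name, value = token.split("=", 1)
            variants[name] = value          # valued variant, kept as text
        else:
            raise ValueError(f"unrecognized variant token: {token!r}")
    return variants


print(parse_variants("+sfad sfadsize=12"))
# prints {'sfad': True, 'sfadsize': '12'}
```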

@mcarlson801
Contributor

@xylar Did you get a chance to run the tests again with +sfad sfadsize=12? I noticed that my run with +slfad doesn't have failing tests due to validation so I'm wondering if that fixed it.

@xylar
Collaborator Author

xylar commented Aug 26, 2024

@mcarlson801, I was away on vacation last week but I'm looking at this now.

@xylar
Collaborator Author

xylar commented Sep 9, 2024

@mcarlson801, I'm very sorry for the additional delay on this. I ran the full integration suite after rebuilding the trilinos and albany spack packages as you suggested (+sfad sfadsize=12). The results can be found here:

/pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240826/full_integ_gnugpu

All non-restart tests are now passing. Restart tests all seem to have small diffs (as reported above):

01:42 PASS landice_dome_2000m_sia_restart_test
00:15 PASS landice_dome_2000m_sia_decomposition_test
00:24 PASS landice_dome_variable_resolution_sia_restart_test
00:08 PASS landice_dome_variable_resolution_sia_decomposition_test
00:14 PASS landice_enthalpy_benchmark_A
00:17 PASS landice_eismint2_decomposition_test
00:17 PASS landice_eismint2_enthalpy_decomposition_test
00:37 PASS landice_eismint2_restart_test
00:49 PASS landice_eismint2_enthalpy_restart_test
01:08 PASS landice_greenland_sia_restart_test
00:49 PASS landice_greenland_sia_decomposition_test
01:35 PASS landice_hydro_radial_restart_test
00:49 PASS landice_hydro_radial_decomposition_test
00:25 PASS landice_humboldt_mesh-3km_decomposition_test_velo-none_calving-none_subglacialhydro
01:04 PASS landice_humboldt_mesh-3km_restart_test_velo-none_calving-none_subglacialhydro
00:41 PASS landice_dome_2000m_fo_decomposition_test
00:40 FAIL landice_dome_2000m_fo_restart_test
00:24 PASS landice_dome_variable_resolution_fo_decomposition_test
00:44 FAIL landice_dome_variable_resolution_fo_restart_test
00:51 PASS landice_circular_shelf_decomposition_test
03:42 PASS landice_greenland_fo_decomposition_test
05:19 FAIL landice_greenland_fo_restart_test
02:26 PASS landice_thwaites_fo_decomposition_test
05:26 FAIL landice_thwaites_fo_restart_test
03:38 PASS landice_thwaites_fo-depthInt_decomposition_test
05:05 FAIL landice_thwaites_fo-depthInt_restart_test
07:02 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo_calving-von_mises_stress_damage-threshold_faceMelting
06:01 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo-depthInt_calving-von_mises_stress_damage-threshold_faceMelting

As before, a typical diff looks like:

 tail -n 13 case_outputs/landice_dome_2000m_fo_restart_test.log 

thickness            Time index: 0, 1, 2
1:  l1: 9.45155412268583e-13  l2: 2.75775945926937e-13  linf: 1.13686837721616e-13
2:  l1: 1.08531933440403e-12  l2: 3.08181524945176e-13  linf: 1.70530256582424e-13
  FAIL /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240826/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/full_run/output.nc
       /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240826/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/restart_run/output.nc
normalVelocity       Time index: 0, 1, 2
0:  l1: 2.39186964580000e-18  l2: 3.29358121341076e-20  linf: 2.11758236813575e-21
1:  l1: 2.54985234083843e-18  l2: 3.39896234813705e-20  linf: 1.69406589450860e-21
2:  l1: 2.44841417612747e-18  l2: 3.28587467314536e-20  linf: 1.69406589450860e-21
  FAIL /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240826/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/full_run/output.nc
       /pscratch/sd/x/xylar/compass_1.4/pm-cpu/test_20240826/full_integ_gnugpu/landice/dome/2000m/fo_restart_test/restart_run/output.nc
Internal test case validation failed

@mcarlson801
Contributor

> All non-restart tests are now passing. Restart tests all seem to have small diffs (as reported above):

@xylar Awesome that everything is running now, thanks! The restart failures will take some digging on our end and probably won't happen right away. Do we need those tests passing before this can be merged? If so, we can have our automated pm-gpu testing use this branch temporarily for tracking purposes until we get it fixed.

@xylar
Collaborator Author

xylar commented Sep 9, 2024

@mcarlson801, what we need for this to be merged is a new spack build not just on Perlmutter but on all machines where Compass is supported (we can't update one location without updating all). That will require having E3SM-Project/spack#31 merged, and making sure those changes are also included in the corresponding branch for the latest release of the mache package. I can start working on this once that PR is merged into the spack develop branch.

@xylar
Collaborator Author

xylar commented Sep 10, 2024

Since we need to rebuild all spack environments to bring in this feature, #857 will replace this PR.

@xylar xylar mentioned this pull request Sep 10, 2024
40 tasks
@xylar xylar closed this in #857 Sep 20, 2024