Intrepid2: MonolithicExecutable unit test fails in Cuda builds on Power9+Volta70 arch #12037

Closed
ndellingwood opened this issue Jul 12, 2023 · 3 comments
Labels: pkg: Intrepid2, type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

ndellingwood commented Jul 12, 2023

Bug Report

@trilinos/intrepid2

The Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1 test fails with a segmentation fault in Cuda builds (with and without UVM enabled), in the StructuredIntegration_HcurlFormulation_UniformAlgorithm_D2_P1_QuadratureUniformMesh_UnitTest subtest:

220. StructuredIntegration_HcurlFormulation_UniformAlgorithm_D2_P1_QuadratureUniformMesh_UnitTest ... [weaver4:3206220:0:3206220] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3206220) ====
 0  /home/projects/ppc64le-pwr9-nvidia/spack-installs/ucx/1.12.1/gcc/8.3.1/base/z5nsbpl/lib/libucs.so.0(ucs_handle_error+0x374) [0x20002dd56ea4]
 1  /home/projects/ppc64le-pwr9-nvidia/spack-installs/ucx/1.12.1/gcc/8.3.1/base/z5nsbpl/lib/libucs.so.0(+0x37070) [0x20002dd57070]
 2  /home/projects/ppc64le-pwr9-nvidia/spack-installs/ucx/1.12.1/gcc/8.3.1/base/z5nsbpl/lib/libucs.so.0(+0x374a0) [0x20002dd574a0]
 3  linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x2000000604d8]
 4  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x11e21b74]
 5  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x116e887c]
 6  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x11708ff0]
 7  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x1170b064]
 8  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x118c6670]
 9  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x118c71ac]
10  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x12018284]
11  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x1201da54]
12  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x1201f06c]
13  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x12020340]
14  /home/ndellin/trilinos/Trilinos-pristine/Build/Weaver-Cuda11-Gcc830-nightly-nouvm-select-packages/packages/intrepid2/unit-test/MonolithicExecutable/Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests.exe() [0x10100f70]
15  /lib64/glibc-hwcaps/power9/libc-2.28.so(+0x29de8) [0x20002d439de8]
16  /lib64/glibc-hwcaps/power9/libc-2.28.so(__libc_start_main+0xb4) [0x20002d439fd4]
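
The frames inside the test executable above appear only as raw addresses. Since the build is configured with `-g`, they can probably be mapped back to source lines. The following is a minimal sketch, not part of Trilinos: `extract_frame_addrs` is a hypothetical helper, `backtrace.txt` and `EXE` are assumed names, and it assumes GNU binutils `addr2line` is available and that the executable is not position-independent (otherwise the absolute addresses would need adjusting).

```shell
# Hypothetical helper (not part of Trilinos): print the bracketed hex address
# of every backtrace line in file $1 that mentions the executable name $2.
extract_frame_addrs() {
  grep -F -- "$2" "$1" | sed -n 's/.*\[\(0x[0-9a-fA-F]*\)]$/\1/p'
}

# Example use, assuming the trace was saved to backtrace.txt and EXE points at
# the test binary; addr2line then prints a function name and file:line per frame.
# extract_frame_addrs backtrace.txt Intrepid2_Tests.exe |
#   xargs addr2line -C -f -p -e "$EXE"
```

Frames from shared libraries (libucs, libc) are skipped on purpose, because their addresses are relative to a load base that the trace does not resolve.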

Steps to Reproduce

  1. SHA1: 3493405
  2. Configure script (for the Weaver testbed, rhel8 queue):
# Get an interactive compute node
bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

# Load env
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_PATH/bin/nvcc_wrapper"

# Configuration - Cuda build, no UVM
cmake \
      -DCMAKE_CXX_FLAGS='-g' \
      -DCMAKE_CXX_STANDARD="17" \
      -DCMAKE_INSTALL_PREFIX=$PWD/install \
      -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
      -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
      -DTrilinos_ENABLE_TESTS=OFF \
      -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
      -DTPL_ENABLE_CUSPARSE:BOOL=ON \
      -DFC_FN_UNDERSCORE=UNDER \
      \
      -D Trilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ARCH_POWER9=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
      -D Kokkos_ENABLE_CUDA_UVM=OFF \
      -DTrilinos_ENABLE_Sacado=ON \
      -DTrilinos_ENABLE_Intrepid2=ON \
      -DIntrepid2_ENABLE_TESTS=ON \
      \
$TRILINOS_DIR
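
The script above stops at the configure step. A possible continuation, run from the same build directory, is sketched below. `build_and_run_intrepid2_test` is a hypothetical wrapper and the default job count is arbitrary; the test name passed to `ctest -R` is the one reported in this issue.

```shell
# Hypothetical wrapper (not from the original report): build the configured
# tree, then run only the failing Intrepid2 test through CTest.
build_and_run_intrepid2_test() {
  njobs="${1:-8}"   # parallel build jobs; 8 is an arbitrary default
  # Generator-agnostic build of the current build directory.
  cmake --build . --parallel "$njobs" || return 1
  # -R selects tests whose registered name matches the regex;
  # --output-on-failure prints the test's output if it fails.
  ctest -R 'Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1' \
        --output-on-failure
}

# Example use from the build directory:
# build_and_run_intrepid2_test 16
```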
ndellingwood added the type: bug and pkg: Intrepid2 labels Jul 12, 2023
CamelliaDPG self-assigned this Jul 12, 2023

CamelliaDPG commented

Thanks, @ndellingwood, for the report and for including reproduction instructions! I'll take a look.

ndellingwood commented

Thanks @CamelliaDPG !

CamelliaDPG added a commit to CamelliaDPG/Trilinos that referenced this issue Jul 12, 2023
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jul 14, 2023
…s:develop' (4539a08).

* trilinos-develop: (61 commits)
  EXODUS: Add a missed ifdef
  Sacado:  Remove a few instances of use of deprecated Rank
  Sacado fix subview for LayoutContiguous<LayoutLeft>
  Intrepid2: fix for trilinos#12037; resolves a test failure on certain CUDA platforms. (PR trilinos#12047)
  EXODUS: Need to check whether nc_def_var_fill is defined in netCDF
  Use OpenMPI 1.10.1 for CXX20 build
  APREPRO: Fix "if" instead of "else if" in arg parsing
  IOSS: Remove fmt dependency for Trilinos users
  STK: Snapshot 07-12-23 07:43 from Sierra simon_2023-07-10-63-g4c25d07a
  Update logic for enabling TrilinosInstallTests in CI testing (trilinos#12024)
  Tempus: Example Problem to Use SolutionState
  Tempus: Example Problem to Use SolutionState
  Phalanx: only use partition_space if enough concurrency available
  Fix accidental debuggery
  Chomp usage whitespace before assertion
  Panzer : Remove deprecated STK code.
  Fix some error message testing
  Remove test for removed capability
  More fixes for tests after source_branch removal
  Phalanx: add some diagnostics for checking fencing with exec spaces
  ...
jmgate pushed six further commits to tcad-charon/Trilinos that referenced this issue between Jul 15 and Jul 18, 2023, each carrying a trilinos-develop commit list (64 to 66 commits) that includes the same Intrepid2 fix entry (PR trilinos#12047).
JacobDomagala pushed a commit to NexGenAnalytics/Trilinos that referenced this issue Aug 4, 2023
ndellingwood commented

Resolved by #12047
