Kokkos error reporting failures with CUDA GPUs in exclusive mode #2471

Closed
streichler opened this issue Oct 16, 2019 · 8 comments

Labels
Enhancement Improve existing capability; will potentially require voting

Comments

@streichler

CUDA allows GPUs to be placed in an "exclusive process" mode that permits at most one process to use a GPU at a time. This appears to cause problems with how Kokkos reports errors, as seen in the Kokkos test suite:

$ ../generate_makefile.bash --arch=Volta70 --with-cuda=/usr/local/cuda
$ make -j8 build-test
$ sudo nvidia-smi -i 0 -c 1
$ make test

yields this:

...
[ RUN      ] cuda.view_layoutstride_right_to_layoutright_assignment
view_layoutstride_right_to_layoutright_assignment: srand(1571255899)
[       OK ] cuda.view_layoutstride_right_to_layoutright_assignment (74 ms)
[ RUN      ] cuda.view_layoutstride_right_to_layoutleft_assignment
view_layoutstride_right_to_layoutleft_assignment: srand(1571255900)
/local/home/sean/kokkos/core/unit_test/TestViewLayoutStrideAssignment.hpp:539: Failure
Death test: {dst=src;}
    Result: died but not with expected error.
  Expected: View assignment must have compatible layouts
Actual msg:
[  DEATH   ] terminate called after throwing an instance of 'std::runtime_error'
[  DEATH   ]   what():  cudaDeviceSynchronize() error( cudaErrorDevicesUnavailable): all CUDA-capable devices are busy or unavailable /local/home/sean/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
[  DEATH   ] Traceback functionality not available
[  DEATH   ] 
[  DEATH   ] 
[  FAILED  ] cuda.view_layoutstride_right_to_layoutleft_assignment (195 ms)
[ RUN      ] cuda.view_layoutstride_left_to_layoutright_assignment
view_layoutstride_left_to_layoutright_assignment: srand(1571255900)
/local/home/sean/kokkos/core/unit_test/TestViewLayoutStrideAssignment.hpp:662: Failure
Death test: {dst=src;}
    Result: died but not with expected error.
  Expected: View assignment must have compatible layouts
Actual msg:
[  DEATH   ] terminate called after throwing an instance of 'std::runtime_error'
[  DEATH   ]   what():  cudaDeviceSynchronize() error( cudaErrorDevicesUnavailable): all CUDA-capable devices are busy or unavailable /local/home/sean/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
[  DEATH   ] Traceback functionality not available
[  DEATH   ] 
[  DEATH   ] 
[  FAILED  ] cuda.view_layoutstride_left_to_layoutright_assignment (168 ms)
[ RUN      ] cuda.view_nested_view
[       OK ] cuda.view_nested_view (0 ms)
...
[----------] Global test environment tear-down
[==========] 158 tests from 3 test cases ran. (31854 ms total)
[  PASSED  ] 156 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] cuda.view_layoutstride_right_to_layoutleft_assignment
[  FAILED  ] cuda.view_layoutstride_left_to_layoutright_assignment

 2 FAILED TESTS
/local/home/sean/kokkos/core/unit_test/Makefile:444: recipe for target 'test-cuda' failed
make[2]: *** [test-cuda] Error 1
make[2]: Leaving directory '/local/home/sean/kokkos/build/core/unit_test'
Makefile:7: recipe for target 'test' failed
make[1]: *** [test] Error 2
make[1]: Leaving directory '/local/home/sean/kokkos/build/core/unit_test'
Makefile:32: recipe for target 'test' failed
make: *** [test] Error 2

The entire test suite runs fine if I switch the GPU back to the default compute mode with:

nvidia-smi -i 0 -c 0
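
To check which compute mode a GPU is in before running the tests, something like the following may help. This is a minimal sketch: `is_exclusive_mode` is a hypothetical helper (not part of Kokkos or the CUDA tools), and it assumes `nvidia-smi` reports the mode through the `compute_mode` query field with values such as `Default` and `Exclusive_Process`.

```shell
# Hypothetical helper: succeeds if the given compute-mode string names an
# exclusive mode. The accepted strings are assumed nvidia-smi output values.
is_exclusive_mode() {
  case "$1" in
    Exclusive_Process|Exclusive_Thread) return 0 ;;
    *) return 1 ;;
  esac
}

# Query GPU 0 only when nvidia-smi is available; otherwise do nothing.
if command -v nvidia-smi >/dev/null 2>&1; then
  mode=$(nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader 2>/dev/null) || mode=""
  if is_exclusive_mode "$mode"; then
    echo "GPU 0 is in exclusive mode; death tests may fail"
  fi
fi
```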

@dhollman

@crtrott Do we have access to a testbed where this is enabled or where we can enable this? The only machine I have root on is my laptop, and CUDA doesn't work on Macs...

@crtrott
Member

crtrott commented Oct 17, 2019

I can test this on Apollo or Kokkos-Dev-2

@crtrott
Member

crtrott commented Oct 17, 2019

Actually, I can set the second GPU on kokkos-dev-2 to exclusive mode for testing purposes. You then need to launch with CUDA_VISIBLE_DEVICES=1.
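
A minimal sketch of launching a command so that CUDA sees only one device. `run_on_device` is a hypothetical wrapper introduced here for illustration; the test executable you pass to it is up to you.

```shell
# Run a command with only one GPU visible to CUDA.
# Usage: run_on_device <device-index> <command> [args...]
run_on_device() {
  device=$1
  shift
  # The assignment applies only to the environment of the launched command.
  CUDA_VISIBLE_DEVICES="$device" "$@"
}

# Example: show what the launched process sees.
run_on_device 1 sh -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```

In practice the last line would be replaced by the unit test executable, or by `make test`, since the variable is inherited by child processes.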

@crtrott
Member

crtrott commented Oct 17, 2019

This is set up and confirmed. The same executable passes when run on device 0 (which is in default mode) and fails when run on device 1 (which is now in exclusive mode).

crtrott added the Bug (Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)) label Oct 17, 2019
crtrott added this to the Tentative 3.1 Release milestone Oct 17, 2019

@crtrott
Member

crtrott commented Oct 17, 2019

I believe this comes from our use of gtest "death tests", not from how Kokkos reports failures. I bet a gtest death test spawns a child process, and the original process then checks that the child died. The test fails because the child dies with the wrong error message (i.e., it couldn't get a GPU, presumably because the original process already holds it, instead of hitting the expected assert). My guess is that in exclusive mode we just need to disable death tests ...
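
The suspected mechanism can be modeled without a GPU. The sketch below is only an illustration of the guess above, not how gtest or CUDA are implemented: a lock directory stands in for a GPU in exclusive mode, and all function names are hypothetical.

```shell
# A lock directory stands in for an exclusive-mode GPU (modeling assumption).
tmp=$(mktemp -d)
lock=$tmp/gpu.lock

# In "default" mode any process may use the device; in "exclusive" mode only
# the first process to create the lock directory gets it.
acquire_device() {
  if [ "$mode" = default ]; then return 0; fi
  mkdir "$lock" 2>/dev/null
}

# Child process: needs the device before it can reach the expected abort.
child() {
  if ! acquire_device; then
    echo "all CUDA-capable devices are busy or unavailable" >&2
    exit 1
  fi
  echo "View assignment must have compatible layouts" >&2
  exit 1
}

# Parent: run the child in a subshell and match its stderr, like a death test.
death_test() {
  msg=$( (child) 2>&1 >/dev/null ) || true
  case "$msg" in
    *"View assignment must have compatible layouts"*) echo "OK" ;;
    *) echo "FAILED: died but not with expected error" ;;
  esac
}

# The parent acquires the device first, then runs the death test.
scenario() {
  mode=$1
  rmdir "$lock" 2>/dev/null || true
  acquire_device
  death_test
}

result_default=$(scenario default)
result_exclusive=$(scenario exclusive)
echo "default:   $result_default"
echo "exclusive: $result_exclusive"
rm -rf "$tmp"
```

In the exclusive scenario the child never reaches the expected message, which matches the "died but not with expected error" result in the log above.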

crtrott added Enhancement (Improve existing capability; will potentially require voting) and removed Bug (Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)) labels Oct 17, 2019

@crtrott
Member

crtrott commented Oct 17, 2019

I am tentatively marking this as an enhancement rather than a bug, since I am pretty convinced that it only affects testing, and by fiat I declare that running our unit tests on a GPU in exclusive mode is currently not supported.

@crtrott
Member

crtrott commented Oct 17, 2019

Alternatively, we could just prefix the names of all death tests, which would give systems in exclusive mode an easy way to exclude those tests.

@crtrott
Member

crtrott commented Oct 26, 2019

We merged a change that gives the death tests a name suffix, following the recommendations from gtest. Thus in exclusive mode you can now simply exclude those tests. We may come back to this and try to disable death tests internally when we detect that GPUs are in exclusive mode.
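
A minimal sketch of excluding the death tests with a negative `--gtest_filter` pattern. It assumes the suffix is `DeathTest`, which is what gtest recommends for suite names; the exact suffix used in Kokkos should be checked against the test names. `death_test_filter` is a hypothetical helper and the executable name in the comment is a placeholder.

```shell
# Build a gtest filter that excludes every test whose full name contains the
# given suffix. The default "DeathTest" is an assumption about the naming.
death_test_filter() {
  printf '%s\n' "--gtest_filter=-*${1:-DeathTest}*"
}

# Hypothetical usage (placeholder executable name):
#   ./your_cuda_unit_test "$(death_test_filter)"
death_test_filter
```

The same pattern can also be supplied through the `GTEST_FILTER` environment variable, which is convenient when the tests are launched indirectly.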

3 participants