Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow KokkosNode = OpenMP tests #103

Closed
ikalash opened this issue Mar 31, 2017 · 7 comments
Closed

Slow KokkosNode = OpenMP tests #103

ikalash opened this issue Mar 31, 2017 · 7 comments
Labels

Comments

@ikalash
Copy link
Collaborator

ikalash commented Mar 31, 2017

Another issue, stemming form my OS upgrade: KokkosNode = OpenMP tests are now incredibly slow, b/c of the following oversubscription of CPU cores:

Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores.
Detected: 40 cores per node.
Detected: 4 MPI_ranks per node.
Requested: 20 threads per process.

I think there must be some configuration or runtime option to fix this. @jewatkins , I think you may have experience this in the past and found a fix?

@ibaned
Copy link
Contributor

ibaned commented Mar 31, 2017

@ikalash I updated your comment from not incredibly slow to now incredibly slow, which I think is what you mean. I also recommend setting

export OMP_NUM_THREADS=1

somewhere in your test scripts for starters. That is assuming no other part of the test system is setting thread counts.

@ikalash
Copy link
Collaborator Author

ikalash commented Mar 31, 2017

Oh, yes, that was an unfortunate typo - it should be "now". Unfortunately, setting export OMP_NUM_THREADS=1 does not fix the issue.

@bartgol
Copy link
Collaborator

bartgol commented Apr 3, 2017

@ikalash I opened an issue kokkos/kokkos#630 a while ago on Kokkos, cause I was experiencing the same thing. In my case the problem was that I installed kokkos with hwloc support, which basically makes kokkos ignore the OMP_NUM_THREADS env variable, and makes kokkos use hwloc to determine the number of threads, which hwloc puts equal to the number of PUs. The only way to overwrite this is to pass --kokkos-threads=N to the command line, but Albany does not allow this in the testsuite (although it does allow this when launching albany manually).

Unfortunately, there seem to be some complications in making Kokkos honor the OMP variables when hwloc is installed. From what I understand, the solution is to use hwloc only to investigate the hardware topology the first time, then never use it, and use just OMP env variables.

Are you using an installation of trilinos which uses hwloc?

lxmota added a commit that referenced this issue Apr 3, 2017
@ikalash
Copy link
Collaborator Author

ikalash commented Apr 3, 2017

@bartgol . I did have hwloc, and you are right - turning it off fixes the issue. I actually remembered having observed this earlier with @jewatkins after seeing your message.

Regarding passing --kokkos-threads=N to the code: I think this is actually possible to do in ctest. We do something similar for the CUDA runs - see the following code in Albany/CMakeLists.txt:

SET(KOKKOS_NDEVICES "--kokkos-ndevices=${NUM_GPUS_PER_NODE}")

It will generate a ctest line as follows:

add_test(Helmholtz2D_Tpetra "/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpirun" "-n" "4" "/home/projects/albany/build/AlbBuild/src/AlbanyT" "--kokkos-ndevices=4" "inputT.xml")

I think one could do something similar to pass --kokkos-threads=N to each ctest test when the code is compiled with OpenMP.

Closing the issue since turning off HWLOC resolved it.

@ikalash ikalash closed this as completed Apr 3, 2017
@bartgol
Copy link
Collaborator

bartgol commented Apr 3, 2017

Yeah, having a cmake config variable for the number of threads is useful. We could add one such option without default (so that kokkos does its magic under the hood if one does not want to bother setting it), then add it to all the variables that define executables,. e.g., turning

set(Albany.exe ${MPIEX} ${MPIPRE} ${MPINPF} ${MPIMNP} ${MPIPOST} ${AlbanyPath})
into

set(Albany.exe ${MPIEX} ${MPIPRE} ${MPINPF} ${MPIMNP} ${MPIPOST} ${AlbanyPath} ${KOKKOS_THREADS})

Perhaps it could also be useful if we could specify the number of threads directly when invoking ctest, like "ctest --kokkos-threads=4". But I guess the first option (cmake variable) would already be an improvement.

@ibaned
Copy link
Contributor

ibaned commented Apr 3, 2017

Maybe create a new AlbanyKokkos.exe which is the only one that gets a thread count bigger than one, and the other exes get set to 1 thread ? Depends on if we want to test everything with threads on regardless of how well it uses Kokkos, versus only using threads for the few tests we are focused on.

@bartgol
Copy link
Collaborator

bartgol commented Apr 3, 2017

Uhm, I would prefer to not create AlbanyKokkos, since we then have to create it for all the executables (for coherence). Albany already has quite a few executables, and multiplying everything by 2 may make things less clear and/or easy to maintain.

I think setting the number of threads equal to 1 by default may end up not checking that everything is thread safe. On the other hand, Albany rarely (?) uses Kokkos directly, and delegates parallelization to trilinos interfaces (intrepid, tpetra...). So perhaps it could be ok to just set it equal to 1 as default, letting the user change it through cmake. But I would argue that the nightly builds should always use --kokkos-threads>=2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants