Make Kokkos::OpenMP respect OMP environment even if hwloc is available #630
Are you exporting OMP_NUM_THREADS to the compute nodes? I haven't built Kokkos standalone in a while, but I seem to recall it respecting the environment variables. I suggest always setting OMP_PLACES as well. Within the scope of Trilinos, you can achieve this through the CMake test-launch settings, i.e. the cache variables whose docstrings read "The maximum number of processes to use when running MPI programs" (mpirun/aprun) and "Extra command-line args to the MPI exec after num-procs args".
Ok, I did not mention that I'm running the code on my workstation (I'm testing a library, Albany, before pushing some commits). I did set OMP_DISPLAY_ENV=verbose and I did try to set OMP_NUM_THREADS to a smaller number, and Kokkos still ignores it. I believe Kokkos can ignore OMP_NUM_THREADS when hwloc is enabled and there is more than one NUMA region or more than one thread per core. In particular, if one initializes Kokkos without passing a positive number of threads, then Kokkos uses hwloc instead of asking OpenMP for the number of threads. In fact, the OpenMP initialization method has these lines:
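(The quoted code was lost in this copy; what follows is a paraphrased sketch of the logic using the public Kokkos::hwloc queries, not verbatim source.)

```cpp
// Paraphrased sketch (not verbatim Kokkos source): hwloc is trusted
// whenever the detected topology has multiple NUMA regions or
// multiple hardware threads per core.
Impl::s_using_hwloc =
    1 < Kokkos::hwloc::get_available_numa_count() ||
    1 < Kokkos::hwloc::get_available_threads_per_core();
```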
If you have at least 2 NUMA regions or at least 2 threads per core, then Impl::s_using_hwloc is true. Therefore, thread_count is set from hwloc, via lines like the following (again paraphrased, since the quoted code was lost in this copy):
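```cpp
// Paraphrased sketch: the hwloc-derived default thread count.
if ( Impl::s_using_hwloc ) {
  thread_count = Kokkos::hwloc::get_available_numa_count()
               * Kokkos::hwloc::get_available_cores_per_numa()
               * Kokkos::hwloc::get_available_threads_per_core();
}
```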
That is, thread_count becomes whatever hwloc says, regardless of the value one set in OMP_NUM_THREADS. Am I right, or am I missing something? If I'm right, is there a way to make the OpenMP backend check OMP_NUM_THREADS before checking hwloc availability? Something like the sketch below.
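A hypothetical sketch of what I have in mind (the getenv check and its placement are my assumption, not existing Kokkos code):

```cpp
#include <cstdlib>  // std::getenv, std::atoi

// Hypothetical: consult OMP_NUM_THREADS before falling back to hwloc.
if ( thread_count <= 0 ) {
  if ( const char* omp_threads = std::getenv("OMP_NUM_THREADS") ) {
    thread_count = std::atoi(omp_threads);
  }
  if ( thread_count <= 0 && Impl::s_using_hwloc ) {
    thread_count = Kokkos::hwloc::get_available_numa_count()
                 * Kokkos::hwloc::get_available_cores_per_numa()
                 * Kokkos::hwloc::get_available_threads_per_core();
  }
}
```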
No, you are right, hwloc is the default. The reason is that most people forget to set OMP_NUM_THREADS manually, which on quite a few machines leads to it being set automatically to "all the cores". That is why, in our experience, hwloc gives the better experience. You only run into trouble with that if you don't set correct process masks for MPI, but that causes issues either way, since thread binding with hwloc is done with respect to the process mask. Say you have 8 cores and run 4 MPI ranks with 2 threads each: you would end up with 4 threads on core 0 and 4 threads on core 1, with cores 2-7 idle, because thread binding is with respect to the process mask. That means your oversubscription rate is exactly the same as it is now. The correct way to solve this is to specify MPI binding masks for your ctest execution. If you want to use thread binding with hwloc, you must set correct process masks to avoid oversubscription, no matter how many threads you want to use. As a consequence, I don't believe the proposed change would solve your issue.
I'm confused about why we can't check whether the user has remembered to set the thread count and, if not, drop back to hwloc. I'd much rather use the OpenMP configuration variables.
I think we should check if OMP_NUM_THREADS exists and default to it if it does. Otherwise we get unexpected behavior.
That's what I'm wondering. If the user remembered to set the OMP variables, then why not use those? A simple getenv in the initialization should be enough. Also, for thread binding (I'm venturing outside my area of expertise here, so I may be way off), Kokkos::OpenMP could check the OMP_PROC_BIND and OMP_PLACES environment variables and, if they are not set, rely on hwloc. Or is hwloc something that must be used in its entirety, i.e. one cannot use OMP_NUM_THREADS to determine the number of threads and rely on hwloc for their binding?
Reply to SI: Two issues:
and there is one other: |
Reply to Dan: If we want to make environment variables the priority option, then we should go with something like KOKKOS_NUM_THREADS etc., which would also work for a threads backend. Furthermore, what happens in the first scenario if OMP_NUM_THREADS and hwloc disagree on the right thing to do? And would you still use hwloc for binding? The whole issue in this case is that the user didn't set the MPI rank masks correctly. In that case ANY use of hwloc leads to bad behavior, i.e. oversubscription of cores, no matter how you set the total number of threads.
Reply to Bartgol: My advice: don't use hwloc. I don't for the vast majority of what I do, because I use the OpenMP backend and want to use OpenMP environment variables to control the number of threads and their binding.
Ok. But isn't hwloc a "must" if one wants Kokkos::AUTO to produce the "best" team organization in team policies? I thought that without information on the hardware (which is what I assume hwloc provides), Kokkos would not be able to create optimal(-ish) teams through Kokkos::AUTO...
This is good to know. Over the summer I used Kokkos directly, but always used --kokkos-threads. FYI, Trilinos builds Kokkos without hwloc by default, so I haven't had to worry about this.
You are right about AUTO. Basically, without hwloc you get a team size of 1 on CPUs. On the other hand, this is usually the best team size anyway. Even on KNL the benefit of team size 4 is limited to a few algorithms (though it usually doesn't hurt to have team size 4 either).
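For context, a minimal team-policy sketch where Kokkos::AUTO picks the team size (the league size and function name are illustrative):

```cpp
#include <Kokkos_Core.hpp>

// Let Kokkos pick the team size; without hwloc the CPU choice is
// typically a team size of 1, as noted above.
void run_teams() {
  using policy_t = Kokkos::TeamPolicy<Kokkos::OpenMP>;
  Kokkos::parallel_for(policy_t(100, Kokkos::AUTO),
    KOKKOS_LAMBDA(const policy_t::member_type& team) {
      // ... per-team work ...
      (void) team;
    });
}
```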
For the OpenMP back-end: if OpenMP environment variables such as the number of threads and binding are set, then honor them even when hwloc is enabled. Initialization must report that these settings were detected and honored, and report how OpenMP threads are mapped to Kokkos thread ranks.
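A hedged sketch of the detection step (the helper name and the exact set of variables checked are my assumptions):

```cpp
#include <cstdlib>  // std::getenv

// Hypothetical helper: treat any of the common OpenMP control variables
// as a signal that the user wants OpenMP, not hwloc, in charge.
inline bool omp_env_is_set() {
  return std::getenv("OMP_NUM_THREADS") != nullptr ||
         std::getenv("OMP_PROC_BIND")   != nullptr ||
         std::getenv("OMP_PLACES")      != nullptr;
}
```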
Initialization can't print this out by default, because if I have 20k MPI ranks on Trinity I don't want to see that from every MPI rank. |
Two options:
Discovering all OpenMP variables and what they are set to is problematic: too many variables, many of which may be vendor-specific.
How about: rely on hwloc if available, but also scan for an environment variable such as KOKKOS_BIND_THREADS; if it is set to 0, skip hwloc-driven binding?
If I understand you correctly, you want Kokkos to rely on hwloc (if available), but also scan for the env variable KOKKOS_BIND_THREADS to see if the user knows how to use OpenMP. If this variable is set to 0, would Kokkos proceed as if hwloc were not installed?
How about something like a Kokkos bind policy with one of three settings: ALWAYS, NONE, AUTO? With ALWAYS, Kokkos will override and force binding; NONE means it will get out of the way; and AUTO means check whether the OpenMP standard variables are set, use those if so, and otherwise perform some kind of binding.
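A hypothetical sketch of such a policy (the variable name KOKKOS_BIND_POLICY and the parsing are assumptions, not an existing Kokkos interface):

```cpp
#include <cstdlib>
#include <cstring>

enum class BindPolicy { ALWAYS, NONE, AUTO };

// Hypothetical: read the bind policy from the environment, defaulting to AUTO.
inline BindPolicy bind_policy_from_env() {
  const char* v = std::getenv("KOKKOS_BIND_POLICY");  // hypothetical variable
  if (v == nullptr)                  return BindPolicy::AUTO;
  if (std::strcmp(v, "ALWAYS") == 0) return BindPolicy::ALWAYS;
  if (std::strcmp(v, "NONE") == 0)   return BindPolicy::NONE;
  return BindPolicy::AUTO;
}
```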
@nmhamster I don't think Kokkos can properly set bindings. Kokkos is mainly used inside an MPI environment, and in that case I want bindings that make sense given the distribution of processes on the node. Perhaps one way to allow this to work is to have a --kokkos-procs-per-node=n setting. But my point still stands: you need to know how many other processes are contending for those cores if you want to do binding. Otherwise, how will you know the proper cpusets to create?

I've been inspecting bindings using the pthreads API (a GNU extension) or a Linux syscall. One thing you can't do is tell the difference between a logical core and a physical one. You can infer a little, based on the number of threads that share a cpuset: e.g., on KNL with bind-to-core, each thread's cpuset contains the hardware thread ids of a single core. But to get that right, the user needs to have queued and launched the job correctly.

If Kokkos starts messing with bindings, this could get ugly very fast. OMP_PLACES (gomp) and KMP_AFFINITY (iomp5) already fight for binding rights, and ALPS explicitly states to leave KMP_AFFINITY alone if you want Cray to handle bindings.

All of that said: perhaps the best route is to provide basic tests to see if the threads 'look' bound correctly. That is, on KNL, threads may share at most 4 cpuids; on Haswell, at most 2. An obvious error is when your threads have more than 4 (or 2) cpuids in their cpusets. On Power8 with CUDA, it could be acceptable to have threads bound to ~32 cpuids, but it is clearly bad if multiple processes have overlapping cpusets.

CUDA brings up another issue: if you want to handle binding, then you must also ensure that processes are bound to CUDA devices correctly. I just wrote code for this this week. The key element in all of this is knowledge of how many procs_per_node there are, so again MPI keeps coming into play. Perhaps this warrants careful thought about the relationship between Kokkos and Teuchos. What you don't want is for Kokkos to blindly write a linear cpuset partitioned by procs_per_node, e.g., KNL proc1: tid1 {1,2,3,4} ... you would basically smush MPI procs into sharing cores!
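For reference, a sketch of the kind of inspection described above, using the GNU pthread extension (Linux/glibc only; compile with -pthread):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Print the CPU ids in the calling thread's cpuset. With bind-to-core
// you would expect at most 4 ids on KNL (one core's hardware threads)
// and at most 2 on Haswell.
void print_thread_cpuset() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) == 0) {
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
      if (CPU_ISSET(cpu, &set)) std::printf(" %d", cpu);
    std::printf("\n");
  }
}
```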
A major use-case concern is automated testing environments where test executables run at the same time on the same node and oversubscribe it. These short-lived runs need unbound host threads so that the executables can make progress.
When the user initializes Kokkos without specifying a valid number of threads, Kokkos figures it out for them. However, when hwloc is available, Kokkos always uses hwloc to determine the number of threads instead of respecting the OMP environment variables.
When launching a program, one can append '--kokkos-threads=...' to set the number of threads. However, in my situation I can't do that, since the executable is launched by a (bunch of) CMake test(s), and the tests are run with N>=2 MPI processes; with the current Kokkos behavior my CPU gets oversubscribed, since every MPI process tries to use all the CPUs for its OpenMP initialization.
Would it be possible to make Kokkos respect the OMP environment variables when they are set, rather than relying on hwloc information? Specifically, I'm interested in Kokkos respecting the value of OMP_NUM_THREADS (if set). A minimal reproduction sketch follows.
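A minimal sketch of the situation (assuming a recent Kokkos API for the concurrency query):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  // No --kokkos-threads given: Kokkos picks the thread count itself,
  // and with hwloc enabled it ignores OMP_NUM_THREADS.
  Kokkos::initialize(argc, argv);
  std::printf("threads in use: %d\n", Kokkos::OpenMP().concurrency());
  Kokkos::finalize();
  return 0;
}
```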