Make Kokkos::OpenMP respect OMP environment even if hwloc is available #630
Are you exporting OMP_NUM_THREADS to the compute nodes? I haven't built Kokkos standalone in a while, but I seem to recall it respecting the environment variables. I suggest always setting OMP_PLACES as well. Within the scope of Trilinos, you can achieve this through the CMake test-launch settings, i.e. the cache variables whose docstrings read "The maximum number of processes to use when running MPI programs" (mpirun/aprun) and "Extra command-line args to the MPI exec after num-procs args".
Ok, I did not mention that I'm running the code on my workstation (I'm testing a library, Albany, before pushing some commits). I did set OMP_DISPLAY_ENV=verbose and I did try to set OMP_NUM_THREADS to a smaller number, and Kokkos still ignores it. I believe Kokkos can ignore OMP_NUM_THREADS when hwloc is enabled and there is more than one NUMA region or more than one thread per core. In particular, if one initializes Kokkos without passing a positive number of threads, then Kokkos uses hwloc instead of asking OpenMP for the number of threads. In fact, the OpenMP initialization method has these lines:
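(The quoted code was lost in this copy; what follows is a paraphrased sketch of the logic using the public Kokkos::hwloc queries, not verbatim source.)

```cpp
// Paraphrased sketch (not verbatim Kokkos source): hwloc is trusted
// whenever the detected topology has multiple NUMA regions or
// multiple hardware threads per core.
Impl::s_using_hwloc =
    1 < Kokkos::hwloc::get_available_numa_count() ||
    1 < Kokkos::hwloc::get_available_threads_per_core();
```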
If you have at least 2 NUMA regions or at least 2 threads per core, then Impl::s_using_hwloc is true. Therefore, thread_count is set from hwloc, via lines like the following (again paraphrased, since the quoted code was lost in this copy):
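```cpp
// Paraphrased sketch: the hwloc-derived default thread count.
if ( Impl::s_using_hwloc ) {
  thread_count = Kokkos::hwloc::get_available_numa_count()
               * Kokkos::hwloc::get_available_cores_per_numa()
               * Kokkos::hwloc::get_available_threads_per_core();
}
```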
That is, thread_count becomes whatever hwloc says, regardless of the value one set in OMP_NUM_THREADS. Am I right, or am I missing something? If I'm right, is there a way to make the OpenMP backend check OMP_NUM_THREADS before checking hwloc availability? Something like the sketch below.
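A hypothetical sketch of what I have in mind (the getenv check and its placement are my assumption, not existing Kokkos code):

```cpp
#include <cstdlib>  // std::getenv, std::atoi

// Hypothetical: consult OMP_NUM_THREADS before falling back to hwloc.
if ( thread_count <= 0 ) {
  if ( const char* omp_threads = std::getenv("OMP_NUM_THREADS") ) {
    thread_count = std::atoi(omp_threads);
  }
  if ( thread_count <= 0 && Impl::s_using_hwloc ) {
    thread_count = Kokkos::hwloc::get_available_numa_count()
                 * Kokkos::hwloc::get_available_cores_per_numa()
                 * Kokkos::hwloc::get_available_threads_per_core();
  }
}
```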
No, you are right, hwloc is the default. The reason is that most people forget to set OMP_NUM_THREADS manually, which on quite a few machines leads to it being set automatically to "all the cores". That is why, in our experience, hwloc gives the better experience. You only run into trouble with that if you don't set correct process masks for MPI, but that causes issues either way, since thread binding with hwloc is done with respect to the process mask. Say you have 8 cores and run 4 MPI ranks with 2 threads each: you would end up with 4 threads on core 0 and 4 threads on core 1, with cores 2-7 idle, because thread binding is with respect to the process mask. That means your oversubscription rate is exactly the same as it is now. The correct way to solve this is to specify MPI binding masks for your ctest execution. If you want to use thread binding with hwloc, you must set correct process masks to avoid oversubscription, no matter how many threads you want to use. As a consequence, I don't believe the proposed change would solve your issue.
I'm confused about why we can't check whether the user has remembered to set the thread count and, if not, drop back to hwloc. I'd much rather use the OpenMP configuration variables.
I think we should check if OMP_NUM_THREADS exists and default to it if it does. Otherwise we get unexpected behavior.
That's what I'm wondering. If the user remembered to set the OMP variables, then why not use those? A simple getenv in the initialization should be enough. Also, for thread binding (I'm venturing outside my area of expertise here, so I may be way off), Kokkos::OpenMP could check the OMP_PROC_BIND and OMP_PLACES environment variables and, if they are not set, rely on hwloc. Or is hwloc something that must be used in its entirety, i.e. one cannot use OMP_NUM_THREADS to determine the number of threads and rely on hwloc for their binding?
Reply to SI: Two issues:
and there is one other: |
Reply to Dan: If we want to make environment variables the priority option, then we should go with something like KOKKOS_NUM_THREADS etc., which would also work for a threads backend. Furthermore, what happens in the first scenario if OMP_NUM_THREADS and hwloc disagree on the right thing to do? And would you still use hwloc for binding? The whole issue in this case is that the user didn't set the MPI rank masks correctly. In that case ANY use of hwloc leads to bad behavior, i.e. oversubscription of cores, no matter how you set the total number of threads.
Reply to Bartgol: My advice: don't use hwloc. I don't for the vast majority of what I do, because I use the OpenMP backend and want to use OpenMP environment variables to control the number of threads and their binding.
Ok. But isn't hwloc a "must" if one wants Kokkos::AUTO to produce the "best" team organization in team policies? I thought that without information on the hardware (which is what I assume hwloc provides), Kokkos would not be able to create optimal(-ish) teams through Kokkos::AUTO...
This is good to know. Over the summer I used Kokkos directly, but always used --kokkos-threads. FYI, Trilinos builds Kokkos without hwloc by default, so I haven't had to worry about this.
You are right about AUTO. Basically, without hwloc you get a team size of 1 on CPUs. On the other hand, this is usually the best team size anyway. Even on KNL the benefit of team size 4 is limited to a few algorithms (though it usually doesn't hurt to have team size 4 either).
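For context, a minimal team-policy sketch where Kokkos::AUTO picks the team size (the league size and function name are illustrative):

```cpp
#include <Kokkos_Core.hpp>

// Let Kokkos pick the team size; without hwloc the CPU choice is
// typically a team size of 1, as noted above.
void run_teams() {
  using policy_t = Kokkos::TeamPolicy<Kokkos::OpenMP>;
  Kokkos::parallel_for(policy_t(100, Kokkos::AUTO),
    KOKKOS_LAMBDA(const policy_t::member_type& team) {
      // ... per-team work ...
      (void) team;
    });
}
```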
For the OpenMP back-end: if OpenMP environment variables such as the number of threads and binding are set, then honor them even when hwloc is enabled. Initialization must report that these settings were detected and honored, and report how OpenMP threads are mapped to Kokkos thread ranks.
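A hedged sketch of the detection step (the helper name and the exact set of variables checked are my assumptions):

```cpp
#include <cstdlib>  // std::getenv

// Hypothetical helper: treat any of the common OpenMP control variables
// as a signal that the user wants OpenMP, not hwloc, in charge.
inline bool omp_env_is_set() {
  return std::getenv("OMP_NUM_THREADS") != nullptr ||
         std::getenv("OMP_PROC_BIND")   != nullptr ||
         std::getenv("OMP_PLACES")      != nullptr;
}
```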
Initialization can't print this out by default, because if I have 20k MPI ranks on Trinity I don't want to see that from every MPI rank. |
Two options:
Discovering all OpenMP variables and what they are set to is problematic: too many variables, many of which may be vendor-specific.
How about: rely on hwloc if available, but also scan for an environment variable such as KOKKOS_BIND_THREADS; if it is set to 0, skip hwloc-driven binding?
If I understand you correctly, you want Kokkos to rely on hwloc (if available), but also scan for the env variable KOKKOS_BIND_THREADS to see if the user knows how to use OpenMP. If this variable is set to 0, would Kokkos proceed as if hwloc were not installed?
How about something like a Kokkos bind policy with one of three settings: ALWAYS, NONE, AUTO? With ALWAYS, Kokkos will override and force binding; NONE means it will get out of the way; and AUTO means check whether the OpenMP standard variables are set, use those if so, and otherwise perform some kind of binding.
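A hypothetical sketch of such a policy (the variable name KOKKOS_BIND_POLICY and the parsing are assumptions, not an existing Kokkos interface):

```cpp
#include <cstdlib>
#include <cstring>

enum class BindPolicy { ALWAYS, NONE, AUTO };

// Hypothetical: read the bind policy from the environment, defaulting to AUTO.
inline BindPolicy bind_policy_from_env() {
  const char* v = std::getenv("KOKKOS_BIND_POLICY");  // hypothetical variable
  if (v == nullptr)                  return BindPolicy::AUTO;
  if (std::strcmp(v, "ALWAYS") == 0) return BindPolicy::ALWAYS;
  if (std::strcmp(v, "NONE") == 0)   return BindPolicy::NONE;
  return BindPolicy::AUTO;
}
```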
@nmhamster I don't think Kokkos can properly set bindings. Kokkos is mainly used inside an MPI environment, and in that case I want bindings that make sense given the distribution of processes on the node. Perhaps one way to allow this to work is to have a --kokkos-procs-per-node=n setting. But my point still stands: you need to know how many other processes are contending for those cores if you want to do binding. Otherwise, how will you know the proper cpusets to create?

I've been inspecting bindings using the pthreads API (a GNU extension) or a Linux syscall. One thing you can't do is tell the difference between a logical core and a physical one. You can infer a little, based on the number of threads that share a cpuset: e.g., on KNL with bind-to-core, each thread's cpuset contains the hardware thread ids of a single core. But to get that right, the user needs to have queued and launched the job correctly.

If Kokkos starts messing with bindings, this could get ugly very fast. OMP_PLACES (gomp) and KMP_AFFINITY (iomp5) already fight for binding rights, and ALPS explicitly states to leave KMP_AFFINITY alone if you want Cray to handle bindings.

All of that said: perhaps the best route is to provide basic tests to see if the threads 'look' bound correctly. That is, on KNL, threads may share at most 4 cpuids; on Haswell, at most 2. An obvious error is when your threads have more than 4 (or 2) cpuids in their cpusets. On Power8 with CUDA, it could be acceptable to have threads bound to ~32 cpuids, but it is clearly bad if multiple processes have overlapping cpusets.

CUDA brings up another issue: if you want to handle binding, then you must also ensure that processes are bound to CUDA devices correctly. I just wrote code for this this week. The key element in all of this is knowledge of how many procs_per_node there are, so again MPI keeps coming into play. Perhaps this warrants careful thought about the relationship between Kokkos and Teuchos. What you don't want is for Kokkos to blindly write a linear cpuset partitioned by procs_per_node, e.g., KNL proc1: tid1 {1,2,3,4} ... you would basically smush MPI procs into sharing cores!
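For reference, a sketch of the kind of inspection described above, using the GNU pthread extension (Linux/glibc only; compile with -pthread):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Print the CPU ids in the calling thread's cpuset. With bind-to-core
// you would expect at most 4 ids on KNL (one core's hardware threads)
// and at most 2 on Haswell.
void print_thread_cpuset() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) == 0) {
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
      if (CPU_ISSET(cpu, &set)) std::printf(" %d", cpu);
    std::printf("\n");
  }
}
```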
A major use-case concern is automated testing environments where test executables run at the same time on the same node and oversubscribe it. These short-lived runs need unbound host threads so that the executables can make progress.
When the user initializes Kokkos without specifying a valid number of threads, Kokkos figures it out for them. However, when hwloc is available, Kokkos always uses hwloc to determine the number of threads instead of respecting the OMP environment variables.
When launching a program, one can append '--kokkos-threads=...' to set the number of threads. However, in my situation I can't do that, since the executable is launched by a (bunch of) CMake test(s), and the tests are run with N>=2 MPI processes; with the current Kokkos behavior my CPU gets oversubscribed, since every MPI process tries to use all the CPUs for its OpenMP initialization.
Would it be possible to make Kokkos respect the OMP environment variables when they are set, rather than relying on hwloc information? Specifically, I'm interested in Kokkos respecting the value of OMP_NUM_THREADS (if set). A minimal reproduction sketch follows.
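A minimal sketch of the situation (assuming a recent Kokkos API for the concurrency query):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  // No --kokkos-threads given: Kokkos picks the thread count itself,
  // and with hwloc enabled it ignores OMP_NUM_THREADS.
  Kokkos::initialize(argc, argv);
  std::printf("threads in use: %d\n", Kokkos::OpenMP().concurrency());
  Kokkos::finalize();
  return 0;
}
```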