
Implement an MCA framework for threads #6578

Merged
merged 5 commits on Mar 27, 2020

Conversation

@hppritcha (Member)

Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobots:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads. Alternate thread models are
specified at configure time using the --with-threads=X option.

The framework is static. The threading model to use is selected at
Open MPI configure/build time.

@rhc54 (Contributor)

rhc54 commented Apr 8, 2019

Just curious - I see a change in libevent itself. Does that mean we will no longer be able to support an external libevent? The change seems minor - are you planning to remove it before committing?

@hppritcha (Member Author)

@npe9 did you come up with a way to avoid the libevent patch?

@npe9 (Contributor)

npe9 commented Apr 8, 2019

@hppritcha @rhc54 External libevents should still be fine. The msec change is for argobots progress I believe (@shintaro-iwasaki?). We probably need to look at better integrating libevent with tasking runtimes but that's orthogonal to this pull request.

The other change is the Open MPI-specific header (opal_event_use_threads() becomes a function rather than a #define).

With @shintaro-iwasaki's permission, you can probably back out the msec change.

@rhc54 (Contributor)

rhc54 commented Apr 8, 2019

What happens if OMPI is using argobots (or whatever) and PMIx is using libevent? I don't offhand see why there would be an issue, but we are interweaving our way in/out of threads between the two libraries so I figure it might be worth asking.

@npe9 (Contributor)

npe9 commented Apr 8, 2019

@rhc54 The idea is to have them both using the same scheduling substrate, integrating PMIx more tightly is on my todo list. Once this code is in, I intend to be more active on the PMIx slack.

I'm also interested in caching mechanisms for PMIx that would potentially obviate the need for TLS entirely.

@jsquyres (Member)

jsquyres commented Apr 9, 2019

There were some timeouts in the CI testing -- might be worth checking out to see if they are due to the code changes in this PR.

@jsquyres (Member) left a comment


I started this review by looking at the individual commits, not realizing that there's a sequence of WIP commits (vs. a final set of polished / complete commits). E.g., some things I commented on in early commits are fixed in later commits. I trust that these types of things will be squashed down appropriately before this is merged. For such things, you can disregard my comments.

I then started reading the entire PR as a whole (vs. the individual commits). But I stopped because I'm somewhat confused. It seems like this PR only converts halfway to a threads framework -- there seem to be parts still left in the base and parts in a framework. There are parts that do not obey the framework / component symbol prefix rule. The component configure.m4 scripts do not seem to actually do functional tests. There's no explanation of what the component API is, or what the intent is. It also seems that this is a compile-time framework (vs. a run-time decision framework). An explanation for that would be appreciated, because this will tend to lead to multiple Open MPI installations (vs. a single installation where the threading package can be chosen at run time).

Overall, there's no top-level description that explains what precisely this new "threads" framework encompasses (and why). It seems like threads, mutexes, condition variables, and at least some of the periphery surrounding them were moved into a framework. A cohesive explanation of the new view of the world would be greatly appreciated, either in a README[.md] or in a .h file somewhere (i.e., it should be part of the code base, not something that exists only in the GitHub UI). E.g., it would be good to explain exactly what a component needs to implement in order to be usable in this framework.

opal/mca/threads/base/base.h (outdated)
return 0;
}
int opal_uses_threads = 0;
int opal_tsd_keys_destruct()
Member

Need a (void) declaration here.
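
For reference, a minimal sketch of the style being requested: in C, an empty parameter list in a declaration means "unspecified arguments", so the prototype should spell out (void).

```c
int opal_tsd_keys_destruct();        /* old-style: argument list left unspecified */
int opal_tsd_keys_destruct(void);    /* proper prototype: explicitly takes no arguments */
```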


int opal_condition_t_class;
int opal_mutex_t_class;
int opal_recursive_mutex_t_class;
Member

Don't these need to be initialized to avoid common symbols?
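
A minimal sketch of the fix being asked for, assuming these really are plain ints as the excerpt shows: an explicit initializer turns each tentative definition (a "common" symbol) into a single strong definition.

```c
/* Explicit initializers avoid common symbols: */
int opal_condition_t_class = 0;
int opal_mutex_t_class = 0;
int opal_recursive_mutex_t_class = 0;
```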


static int mca_threads_base_register(mca_base_register_flag_t flags)
{
// Do I need to register anything here?
Member

Do you? Please answer this question / update the code rather than leave a "to do"-like comment.

Contributor

@jsquyres I don't believe I do, but I'd like to make sure with someone that knows more about proper mca instantiation before I get rid of the comment. Who's a good person to review this with?

Member

Probably @hppritcha. If you have nothing to register, you should delete this function.

}

void opal_thread_set_main() {
return 0;
Member

Void functions can't return a value. Also need a (void) in the declaration.
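
A minimal sketch of the corrected signature (the empty body is an assumption about what this stub should do):

```c
void opal_thread_set_main(void)
{
    /* a void function must not return a value; do any per-component
     * "main thread" bookkeeping here and simply fall off the end */
}
```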

Contributor

threads_base_null was an artifact of the initial template (the timer mca), but looking at your message I think these went away after the squash.

@@ -281,11 +281,11 @@ if test "$opal_pthread_c_success" = "0"; then
opal_pthread_c_success=0)
AC_LANG_POP(C)
if test "$opal_pthread_c_success" = "1"; then
PTHREAD_CFLAGS="$pf"
TPKG_CFLAGS="$pf"
Member

What is TPKG? Why is it better than a PTHREAD prefix? The tests here are specific to pthreads, after all...?

Contributor

@hppritcha can explain the motivation better than I can, but I believe that the TPKG_CFLAGS get used for every possible thread package.

Member Author

@npe9 is correct. TPKG_CFLAGS is used in opal_config_threads.m4. Right now the argobots and qthreads configure functions don't use it, but the name was chosen to indicate it's more general and may be set by other thread package configure functions.

Member Author

Well, we moved all of the threads-related configury into the framework, so this comment probably doesn't apply anymore?


OPAL_SUMMARY_ADD([[Miscellaneous]],[[Threading Package]],[opal_threads], [$thread_type_found])
Member

Now that threads is a framework, shouldn't all of this be in opal/mca/threads/configure.m4 and/or the various component configure.m4 files? Seems like an abstraction violation to have all this stuff at the top level now that there is a framework and components for each of these individual things.

Member Author

done

Member

I'm not sure that this has been done -- I still see lots of logic and hard-coding for argobots in config/opal_config_threads.m4. Shouldn't config/opal_config_threads.m4 go away and be fully replaced by framework-level thread m4 stuff?

@@ -162,6 +162,9 @@ poll_dispatch(struct event_base *base, struct timeval *tv)

EVBASE_RELEASE_LOCK(base, th_base_lock);

if (msec > 0) {
msec = 0;
}
Member

As noted in the comments on this PR, this change to libevent is... tricky. What does it mean for external libevents?

Member Author

reverted this

AC_DEFUN([MCA_opal_threads_argobots_CONFIG],[
AC_CONFIG_FILES([opal/mca/threads/argobots/Makefile])

AC_MSG_CHECKING([HAVE_THREAD_PKG_TYPE = $HAVE_THREAD_PKG_TYPE])
Member

This output message seems weird.

Don't you need some actual functional tests here for argobots?

Member Author

removed - unnecessary messages

evthread_argobots_lock_alloc(unsigned locktype)
{
ABT_mutex lock;
if (locktype & EVTHREAD_LOCKTYPE_RECURSIVE) {
Member

Indenting level is wrong.

4 space indenting levels, please -- and no tabs.
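
For illustration only, a sketch of the allocator written with the requested 4-space indentation. The recursive-attribute handling shown here is an assumption about how such a lock would be built with the Argobots API, not a quote of the PR's actual code.

```c
#include <abt.h>
#include <event2/thread.h>

static void *
evthread_argobots_lock_alloc(unsigned locktype)
{
    ABT_mutex lock;

    if (locktype & EVTHREAD_LOCKTYPE_RECURSIVE) {
        /* Assumed handling: create a recursive Argobots mutex via an attribute. */
        ABT_mutex_attr attr;
        ABT_mutex_attr_create(&attr);
        ABT_mutex_attr_set_recursive(attr, ABT_TRUE);
        ABT_mutex_create_with_attr(attr, &lock);
        ABT_mutex_attr_free(&attr);
    } else {
        ABT_mutex_create(&lock);
    }
    return (void *) lock;
}
```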

@npe9 (Contributor)

npe9 commented Apr 10, 2019

@jsquyres Thanks for the feedback regarding documentation. I explained our concerns about making the threading library shared initially (performance) inline to your comments. If that explanation is sufficient I can add it to a README in the mca root.

I'll also provide a writeup of the motivation for the component when I get a chance.

Also, who's the proper person to act as a shepherd for documentation issues?

@jsquyres (Member)

@npe9 Yea, please add a README or a large comment somewhere in this PR. I ask because this framework seems to be a bit different than our other frameworks (i.e., provide a component struct and provide a module struct, and all the APIs are available through there). Some explanation of exactly what component authors are supposed to provide would be greatly appreciated.

@hppritcha (Member Author)

I think the AWS failures are legit, but just in case:
bot:ompi:retest

@npe9 (Contributor)

npe9 commented Apr 16, 2019

FYI The README is coming, I'm on paternity leave and cramming for an April 22nd paper deadline. I'll do the README when things slow down.

@hppritcha (Member Author)

bot:ompi:retest

1 similar comment
@hppritcha (Member Author)

bot:ompi:retest

@gpaulsen (Member)

Is this PR ready to do some performance regression testing on ppc64le?

@hppritcha (Member Author)

@gpaulsen yes. I'm trying to figure out the Ubuntu problem we're hitting with AWS CI.

@bwbarrett (Member)

@hppritcha I don't think Ubuntu was the problem (it looks like Ubuntu was killed because another job failed). If you follow the red balls of failure, you see that the --disable-oshmem build failed. It failed during execution of ring_c, during shutdown, because of a segmentation fault:

--> Running example: ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
[ip-172-31-7-102:11136] *** Process received signal ***
[ip-172-31-7-102:11136] Signal: Segmentation fault (11)
[ip-172-31-7-102:11136] Signal code: Address not mapped (1)
[ip-172-31-7-102:11136] Failing at address: 0x7f89cf3d98c7
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node ip-172-31-7-102 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Example failed: 139

My guess is that something after the MCA framework is shut down is making a call into the threading framework. Not sure why the --disable-oshmem would change that, so it might just be racey.

@hppritcha (Member Author)

@bwbarrett so far it's only been on the Ubuntu image that we're seeing this segfault. I can't reproduce on Red Hat or SUSE, so I'm trying it in a VM running Ubuntu.

@hppritcha (Member Author)

Hmm... let's see if it fails in just the Fortran test again.
bot:ompi:retest

@hppritcha (Member Author)

UH was having issues, so restarting OMPI CI.
bot:ompi:retest

@hppritcha (Member Author)

@rhc54 I excised the libevent change from this PR.

@hppritcha (Member Author)

@jsquyres I think we've cleaned up the issues you found with the PR? Could you review again? We'll squash down the commits - maybe not to one. Also, we plan to add some perf data soon.

@hppritcha (Member Author)

(Screenshot: plot of MPI put bandwidth results, 2019-08-03)

MPI put bibw rate using the fence variant of the test from the rma-mt repo: https://github.com/hpc/rma-mt

@hjelmn (Member)

hjelmn commented Aug 4, 2019

So no noticeable impact from the PR. That's good.

@rhc54 (Contributor)

rhc54 commented Aug 5, 2019

Errr...if there is no identifiable benefit, then what is the point of this change?

@hjelmn (Member)

hjelmn commented Aug 5, 2019

@rhc54 The benefit is adding support for different threading models. The first step is adding a framework. After that I believe we intend to add support for QThreads.

@npe9 (Contributor)

npe9 commented Aug 5, 2019

@rhc54 The potential benefit is better integration of progress threads and threading runtimes (especially OpenMP via Bolt).

The first goal though (this PR) is to get the infrastructure in place and "do no harm". The results showing no performance hit on the pthreads code are the start.

Once integrated, the next step is to go through and add an opal_yield call instead of usleeps (the pthread mca will still use usleeps though) and work to get libevent working non-preemptively.
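
A hypothetical sketch of that direction (the guard macro and the yield entry point below are placeholder names for illustration, not the PR's actual API): progress/backoff loops would ask the threads framework to yield instead of sleeping, so a cooperative ULT scheduler can run other work on the same pthread.

```c
#include <unistd.h>

/* All names here are placeholders, not real OPAL symbols. */
static inline void progress_backoff(void)
{
#if defined(OPAL_THREADS_HAVE_COOPERATIVE_YIELD)   /* hypothetical configure-time guard */
    opal_threads_yield();                          /* hypothetical framework yield hook */
#else
    usleep(100);                                   /* the pthreads component keeps the usleep */
#endif
}
```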

@pavanbalaji

pavanbalaji commented Mar 22, 2020

@pavanbalaji I'm not sure I fully understand how a standard-compliant MPI application may deadlock and what private stacks have to do with that?

@devreal Consider the case where there are two qthreads or argobots threads that are both mapped to a single pthread. Now, if one of the threads tries to do a blocking send to the other thread, which in turn is trying to do a blocking receive, this program can deadlock unless the MPI library can yield from one thread to another. Note that pthread yield is not sufficient here because both the threads are mapped to the same pthread. This is required by the MPI standard. MPI-3.1 page 485, lines 6-7:

Blocking MPI calls will block the calling thread only, allowing another thread to execute,
if available.

The reason I pointed out that they have independent stacks is to indicate that they really are "threads" (as opposed to OpenMP or Argobots tasks, for example).

FWIW, the same problem will occur with Solaris threads, which use a similar model, I believe. Moreover, the pthread specification doesn't actually say that each pthread is mapped to a kernel thread. It's just that most (all?) implementations of pthreads do that.
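
To make the scenario above concrete, here is a hypothetical self-contained illustration (not code from this PR): two Argobots ULTs share one execution stream (one pthread). The sender uses a synchronous blocking send to its own rank, so unless the MPI library yields to the ULT scheduler while blocked, the matching receive in the second ULT never gets a chance to run.

```c
#include <abt.h>
#include <mpi.h>

static void sender(void *arg)
{
    int buf = 42;
    (void) arg;
    /* MPI_Ssend cannot complete until the matching receive is posted by the
     * other ULT; if this call blocks the whole pthread, that never happens. */
    MPI_Ssend(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

static void receiver(void *arg)
{
    int buf;
    (void) arg;
    MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    ABT_init(argc, argv);

    ABT_xstream xstream;   /* the single execution stream == one pthread */
    ABT_pool pool;
    ABT_thread ults[2];

    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Both ULTs are pushed to the same pool on the same execution stream. */
    ABT_thread_create(pool, sender, NULL, ABT_THREAD_ATTR_NULL, &ults[0]);
    ABT_thread_create(pool, receiver, NULL, ABT_THREAD_ATTR_NULL, &ults[1]);

    ABT_thread_join(ults[0]);   /* hangs if the blocked send never yields to the ULT scheduler */
    ABT_thread_join(ults[1]);
    ABT_thread_free(&ults[0]);
    ABT_thread_free(&ults[1]);

    ABT_finalize();
    MPI_Finalize();
    return 0;
}
```

Run on a single rank with an MPI library that only knows about pthreads, the join on the sender can hang; that is the case the quoted MPI-3.1 text is being read to cover (under the interpretation that ULTs count as threads).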

ompi/runtime/ompi_rte.c (outdated)
opal/mca/threads/argobots/threads_argobots_module.c (outdated)
opal/mca/threads/argobots/threads_argobots_module.c (outdated)
opal/mca/threads/pthreads/threads_pthreads_mutex.h (outdated)
opal/mca/threads/wait_sync.h (outdated)
opal/runtime/opal_params.c (outdated)
test/class/opal_fifo.c (outdated)
@bosilca (Member) left a comment

Somehow my comments got registered but not my review. I have to resubmit it.

@devreal (Contributor)

devreal commented Mar 23, 2020

Sorry about this long post. This may not be the right forum but I think it's important to bring up.

tl;dr: IMO, ULTs are not threads in the sense of the MPI standard and should not be treated that way.

@pavanbalaji I've been grappling with this for some time now. I believe that the terminology we're using is quite convoluted and unclear. Here is what my understanding is right now, and I'm happy to stand corrected: lightweight processes (LWPs) are a kernel construct that allow multiple threads inside a single process to share resources. Threads are scheduled by a scheduler and are preemptable. There is no need for a 1:1 mapping between LWPs and threads, but some scheduler is responsible for some degree of fairness between threads (it is my understanding that pthreads are always preemptable since they carry their own scheduling policy and priority). This is what the MPI standard refers to as "a thread package similar to POSIX threads". It is my understanding that even Solaris threads are preemptable [1] (the MxN mapping between LWPs and Solaris threads was apparently abandoned in 2002 [2]). In that scheme, LWPs are a way to control the degree of parallelism by limiting the number of parallel active threads. So, thread support in MPI means protecting shared data structures against concurrent accesses. This is the guarantee that MPI_THREAD_MULTIPLE gives me. The correctness issue you mentioned is not relevant in the context of threads because eventually all threads will be scheduled for execution.

IMO, the term User Level Thread is misleading. What you are describing is a fiber: a lightweight thread that is executed by a regular thread in its own context (stack) but scheduled cooperatively [3]. The boost.fiber package is an example, not to be confused with boost.thread, which is essentially a wrapper around POSIX/Windows threads. We may disagree on this and I'm happy to adjust my definitions. It is important, however, that ULTs/fibers have different scheduling semantics than POSIX threads.

The point here is that by using blocking MPI calls inside different fibers potentially running on the same thread, your program is ill-formed. Essentially, it is the same as creating a pthread and issuing both a blocking send and receive in sequence. You may do so switching stacks in between. Yes, MPI will block that thread and allow all other threads in the process to happily go about their lives. But that one thread will remain blocked. That is in accordance with the requirement in the MPI standard you cited. And MPI provides non-blocking communication to avoid this.

What this PR does is shift the burden of correct scheduling (as opposed to correct mutual exclusion) into the MPI layer. The consequences are significant: an application will still see MPI_THREAD_MULTIPLE, but a configure switch decides whether it is safe to use blocking operations inside fibers or not. And there is no way of determining that property from inside the application. We would be fostering the development of (afaics non-compliant) non-portable MPI applications, creating a debugging and portability nightmare for the foreseeable future. What's more, we're starting with Qthreads/Argobots, but there are many other fiber implementations out there (boost, I believe Clang's OpenMP uses fibers, OmpSs, ...), so where do we stop?

I fully agree that mixing ULT/fibers and MPI is an issue that needs to be addressed and there have been other approaches to that end (TAMPI comes to mind, there may be others). I am not convinced that MPI implementations are the right point for tackling this and we should be very cautious opening that can of worms.

As I said at the top, this may not be the right forum but I think it's important to have that discussion. If the community decides that broadening the definition of what a thread is in the sense of MPI that's fine. I just wanted to throw in my 2 cents here :)

[1] https://docs.oracle.com/cd/E19620-01/805-5080/6j4q7emic/index.html
[2] Multithreading in the Solaris TM Operating Environment
[3] https://en.wikipedia.org/wiki/Fiber_(computer_science)

@rhc54 (Contributor)

rhc54 commented Mar 23, 2020

I won't weigh in on the question of whether or not to do this - but I am a tad concerned about it being a configure-time decision. It sounds like we are going to now require that organizations not only build and maintain OMPI against various types of compilers, but also multiply that by the number of different threading models?? Did we just triple the number of installations an organization has to maintain? Is there no way this could be done as a runtime decision?

@bwbarrett (Member)

I won't weigh in on the question of whether or not to do this - but I am a tad concerned about it being a configure-time decision. It sounds like we are going to now require that organizations not only build and maintain OMPI against various types of compilers, but also multiply that by the number of different threading models?? Did we just triple the number of installations an organization has to maintain? Is there no way this could be done as a runtime decision?

I share your concern, but threading constructs are so fundamental to performance and so low level that I'm concerned a run-time selection would be catastrophic for performance. From a technical perspective, I'm also not sure how we could implement static initializers (which we sprinkled throughout the code) with a run-time selection, since many of those initializers fire before we even get to MPI_INIT.
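
For example (a sketch, with the header path assumed to follow this PR's relocation of the mutex code): OPAL code frequently declares locks with a compile-time initializer, which has to expand to something the selected threading package understands long before any run-time component selection could happen.

```c
#include "opal/mca/threads/mutex.h"   /* path assumed for this PR's layout */

/* Resolved at compile time: with pthreads underneath this is roughly a
 * PTHREAD_MUTEX_INITIALIZER-based initializer, while an Argobots mutex
 * generally needs a runtime ABT_mutex_create(), so the choice is baked
 * in when this translation unit is compiled. */
static opal_mutex_t big_lock = OPAL_MUTEX_STATIC_INIT;

static void touch_shared_state(void)
{
    opal_mutex_lock(&big_lock);
    /* ... mutate state shared across threads ... */
    opal_mutex_unlock(&big_lock);
}
```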

@rhc54 (Contributor)

rhc54 commented Mar 23, 2020

Threading constructs are so fundamental to performance and so low level that I'm concerned a run-time selection would be catastrophic for performance

Yeah, I also feared that would be the case - just hoped that maybe I was wrong and we could avoid pissing off all the installers and downstream packagers.

@pavanbalaji

ULTs are not threads in the sense of the MPI standard and should not be treated that way.

@devreal I think this discussion is getting out of the context of this PR and we should probably move it outside. I feel that saying that only entities that have associated kernel threads would be considered "threads" is quite restrictive (even pthreads do not have that requirement in the POSIX standard). But at this point we have gotten into opinions, rather than facts, so I'll let the OMPI developers decide on this.

@devreal (Contributor)

devreal commented Mar 23, 2020

I feel that saying that only entities that have associated kernel threads would be considered "threads" is quite restrictive (even pthreads do not have that requirement in the POSIX standard).

That was not my argument. You can come up with a preempting user-level thread scheduler that meets the POSIX requirements. The point is the difference in scheduling behavior, which is not just an opinion.

Anyway, I agree that this is beyond the scope of the PR. I'm happy to discuss this in a different forum.

@bwbarrett @rhc54 I think Cray's MPI implementation does runtime selection of implementations based on MPICH_MAX_THREAD_SAFETY, I assume by dlopening the respective library. It adds a level of indirection at the entry points into the library but Open MPI already makes heavy use of function pointers.

@hppritcha (Member Author)

@devreal the Cray MPI isn't that fancy. I forget the details, but I think they have some single-threaded optimizations that are on by default. MPI_Init_thread by default will return a max provided level of MPI_THREAD_SERIALIZED, I think. This allowed some of the OSU/IMB results to look better than if MPI_THREAD_MULTIPLE was supported by default. Adding this env. variable was a way to support MPI_THREAD_MULTIPLE - but not by default.

@devreal (Contributor)

devreal commented Mar 24, 2020

@hppritcha Darn! So much pain just for some OSU benchmark results? Oh well, and I thought there really was some fanciness at work there...

@shintaro-iwasaki (Contributor)

I will check failures of opal_lifo and opal_fifo in Mellanox CI.

@gpaulsen (Member)

bot:ibm:scale:retest

npe9 and others added 5 commits March 27, 2020 10:15
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobots:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads.  Alternate thread models are
specified at configure time using the --with-threads=X option.

The framework is static.  The threading model to use is selected at
Open MPI configure/build time.

mca/threads: implement Argobots threading layer

config: fix thread configury

- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check

If the poll time is too long, MPI hangs.

This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.

Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.

rework threads MCA framework configury

It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.

qthreads module
some argobots cleanup

Signed-off-by: Noah Evans <[email protected]>
Signed-off-by: Shintaro Iwasaki <[email protected]>
Signed-off-by: Howard Pritchard <[email protected]>
To suppress Valgrind warnings, opal_tsd_keys_destruct() needs to explicitly
release TSD values of the main thread.  However, they were not freed if keys are
created by non-main threads.  This patch fixes it.

This patch also optimizes allocation of opal_tsd_key_values by doubling its size
when count >= length instead of increasing the size by one.

Signed-off-by: Shintaro Iwasaki <[email protected]>
Argobots/Qthreads-aware libevent should be used instead.

Signed-off-by: Shintaro Iwasaki <[email protected]>
@awlauria (Contributor)

Is this ok to merge, or is there still discussion/changes needed? We would like to get this in well before the 5.0 branch date.

@jladd-mlnx (Member)

@hppritcha, I say we merge it this weekend. Any objection?

@hppritcha (Member Author)

hppritcha commented Mar 27, 2020

Sorry, wrong PR about merging back to 4.0.x. Of course this doesn't go back to 4.

@awlauria (Contributor)

@hppritcha sounds good. Merge when you are ready.
