-
Notifications
You must be signed in to change notification settings - Fork 866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Replace libltdl with OPAL "dl" framework #410
Conversation
805cf3a
to
abfa7a6
Compare
Refer to this link for build results (access rights to CI server needed): Build Log
Test FAILed. |
Refer to this link for build results (access rights to CI server needed): Build Log
Test FAILed. |
@miked-mellanox There's something happening here with MXM that I don't understand. Is this a false failure? |
i suspect some direct commit to the master did break modex for mxm. the jenkins setup did not change for a long time. |
I'm not aware of any "direct commit" that affected the modex. I did some some mxm and mtl related changes go by, but it's hard for me to keep track of everything |
Hmmm...scanning back, I see that Elena did commit a modex change the other day. I don't know why that would have affected mxm particularly, but it might be a place to look. |
abfa7a6
to
d037da3
Compare
Refer to this link for build results (access rights to CI server needed): Build Log
Test FAILed. |
@miked-mellanox I made a minor tweak and checked Jenkins again, and am still getting the MXM failure. Can someone have a look at master and/or this PR to see what is going on? I do not have the ability to test MXM myself. |
@yosefe - could you please comment on failure root-cause? |
it seems like a data corruption in modex send / modex recv, the value being printed (uuid) is passed over modex; it should be random value, but it looks like a pointer |
@yosefe So is this happening on master, and unrelated to this PR? |
we will know shortly, submitting PR to the master now for coverity fixes in yalla. |
fails as well :( |
FWIW: looks like Igor touched mxm when he committed a bunch of Coverity fixes - see 010dce3 |
yep, you are right- thanks 010dce3 (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-merge/9) breaks the mxm. @igor-ivanov - please review. 426d1ce (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-merge/8) is fine btw, now jenkins runs for "merge commits" as well, and has "coverity html report" per PR attached from the left for specific PR. here is a report example: |
bot:retest |
Coverity errors for all_260: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_260/c/output/errors/index.html |
Coverity errors for oshmem_260: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_260/c/output/errors/index.html |
Coverity errors for yalla_260: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/yalla_260/c/output/errors/index.html |
Coverity errors for mxm_260: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/mxm_260/c/output/errors/index.html |
Refer to this link for build results (access rights to CI server needed): |
bot:retest |
Coverity 768 errors for all: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_263/c/output/errors/index.html |
Coverity 20 errors for oshmem: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_263/c/output/errors/index.html |
Coverity 2 errors for mxm: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/mxm_263/c/output/errors/index.html |
Refer to this link for build results (access rights to CI server needed): |
d037da3
to
0915ae2
Compare
Coverity 774 errors for all: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_265/c/output/errors/index.html |
I have tested a tarball provided by Jeff, based on his commit 5a3fcf2. My tests include running ring_c (usually with 2 ranks on a single host). |
5a3fcf2
to
33987d5
Compare
Refer to this link for build results (access rights to CI server needed): Build Log
Test FAILed. |
bot:retest |
Refer to this link for build results (access rights to CI server needed): |
There's one more known issue on this branch that is being exposed by the change to OPAL_CHECK_PACKAGE: @PHHargrove noticed that when he runs this:
During the config for mtl_psm, it does not find his infinipath library, and rules that PSM is unavailable. The problem seems to be this:
My educated guess at why this is happening is that the switch from AC_CHECK_LIB to AC_SEARCH_LIBS means that the AC var we're resetting under the covers to enable us to run AC_SEARCH_LIBS twice (once with -L.../lib and again with -L.../lib64) is now wrong. I.e., it was the correct variable for AC_CHECK_LIB, but it's a different variable for AC_SEARCH_LIBS. |
"Use the source, Luke" A quick look at
So, it looks like the cache variable is |
Thanks! I am busy preparing for voting on MPI-3.1 this week; I just hadn't gotten to look in the autoconf source yet... |
33987d5
to
ad39b69
Compare
|
Refer to this link for build results (access rights to CI server needed): |
Embedding libltdl without the use of Libtool bootstrapping has proven... difficult. Instead, create a new simple "dl" framework. It only provides 4 functions: - open a DSO (very similar to lt_dlopenadvise()) - lookup a symbol in a previously-opened DSO (very similar to lt_dlsym()) - close a previously-opened DSO (very similar to lt_dlclose()) - iterate over all files in a directory (very similar to ld_dlforeachfile()) There will be follow-on commits with a simple dlopen-based component (nowhere near as complete/functional as libltdl, but good enough for Linux and OS X), and a libltdl-based component for all other platforms. The intent is that the dlopen-based component can be built by default in almost all cases. But if libltdl is available, that component will be built. End result: we still get DSO-based functionality by default in (almost?) all cases. Without embedding libltdl. Which is what we want.
Works on systems with dlopen (e.g., Linux and OS X). It requires dlfcn.h and libdl, which many systems have installed by default.
Works on any system that libltdl supports and has ltdl.h and libltdl available.
Noe that this commit removes option:lt_dladvise from the various "info" tools output. This technically breaks our CLI "ABI" because we're not deprecating it / replacing it with an alias to some other "into" tool output. Although the dl/libltdl component contains an "have_lt_dladvise" MCA var that contains the same information, the "option:lt_dladvise" output from the various "info" tools is *not* an MCA var, and therefore we can't alias it. So it just has to die.
The libltdl interface has been completely replaced by the OPAL DL framework (i.e., the opal_dl interface). Fixes open-mpi#311
ad39b69
to
914880a
Compare
|
Refer to this link for build results (access rights to CI server needed): |
RFC: Replace libltdl with OPAL "dl" framework
…indings More one sided java bindings.
Per #311, we've tried two approaches to getting rid of the embedded libltdl from OMPI. Neither worked.☹️
Here's a new approach: make dynamic library functionality (i.e., dlopen/dlsym-like functionality) be an OPAL framework. Have (at least) two components:
This idea is based on the premise that Open MPI's main two platforms are (modern) Linux and OS X, both of which support dlopen(2). Therefore, combined with the fact that dlfcn.h and libdl are typically available by default, the dlopen-based component can (usually) be built by default. For non-dlopen-lovin' platforms, libltdl support is still available and will function the same as ever -- just not embedded in the Open MPI tree (and therefore you must have libltdl devel support installed).
Additionally, plugins can be written for other platforms to support their native dlopen/dlsym-like functionality, if desired (e.g., if libltdl doesn't support that platform and/or if a developer doesn't want to force a user to have libltdl+devel support installed).
This PR contains a series of commits that incorporates the entirety of this functionality in logical steps:
(NOTE: there is currently a "zero" commit at the head of this patch set that is a bug fix for the MCA framework; this is getting reviewed independently by @hjelmn right now, and will likely be committed separately. It is included here because the fix is required to get this DL framework to function properly)