Fix --with-hwloc=external #2955

Merged: jsquyres merged 3 commits into open-mpi:master from pr/hwloc-external-fixes on Feb 28, 2017

jsquyres
Member

This is an alternate approach to #2954 that I came up with while thinking about this problem overnight. It should fix #2616.

FYI: @ggouaillardet @artpol84 @opoplawski @dannyauble

@jsquyres
Member Author

I neglected to mention in the PR description: there is a lengthy explanation of this approach in the commit messages.

The short version: allow frameworks to have an autogen.options file that lets the framework tell autogen things that it needs to know (e.g., "hey, my framework header is not framework.h, instead, it's abc.h").

Contributor

@ggouaillardet ggouaillardet left a comment

at first glance, this is a more elegant approach.

we can likely get rid of the MCA_hwloc_external_header macro (including some configury logic), since we should now be able to simply

#include <hwloc.h>

i will do some more thorough testing on Monday

@@ -10,7 +10,7 @@
#

# We do not want -I$(srcdir) in AM_CPPFLAGS, or there can be a
-# conflict between system hwloc.h and opal/mca/hwloc/hwloc.h. So just
+# conflict between system hwloc.h and opal/mca/hwloc/hwloc-internal.h. So just
Contributor

do we still need to set AM_CPPFLAGS here?

Member Author

You're exactly right -- we don't. I just noticed this when one of the CI bots failed "make dist". I have a fix (where we don't set AM_CPPFLAGS at all), but I'm doing a little more testing before I push it up here.

@jsquyres
Member Author

After some thought, I think we still need MCA_hwloc_external_header -- that allows us to differentiate between --with-hwloc[=external] and --with-hwloc=/path/to/some/hwloc/install (especially when you want to point to a custom hwloc install but also have an hwloc installed in default compiler/linker search paths).

@jsquyres jsquyres force-pushed the pr/hwloc-external-fixes branch from 19451da to 5023c92 Compare February 10, 2017 13:45
@ggouaillardet
Contributor

makes sense.
but don't we need to review the logic here?
iirc, --with-hwloc=external will end up with

#include "/include/hwloc.h"

@jsquyres
Member Author

I don't think so:

  1. I tested --with-hwloc=external before I pushed, and I see:
/* Location of external hwloc header */
#define MCA_hwloc_external_header "hwloc.h"
  2. The logic in https://github.com/open-mpi/ompi/blob/master/opal/mca/hwloc/external/configure.m4#L64-L71 only sets MCA_hwloc_external_header to include the directory if the user specified --with-hwloc=/some/path. Otherwise -- if the user just did --with-hwloc or --with-hwloc=external -- it sets it to hwloc.h.

Am I missing something?
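
The selection rule described in this comment can be modelled in a few lines of C. This is only an illustrative sketch, not the real configure.m4 logic (which is m4/shell): the function name `demo_external_header`, the treatment of an empty string or "yes" as a bare `--with-hwloc`, and the `<prefix>/include/hwloc.h` layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical model of how the external header name is chosen.
 *
 * with_hwloc: the value passed to --with-hwloc. An empty string or "yes"
 *             stands in for a bare --with-hwloc (assumption).
 * out/outlen: buffer that receives the header string.
 *
 * Returns 0 on success, -1 if the buffer is too small.
 */
int demo_external_header(const char *with_hwloc, char *out, size_t outlen)
{
    int n;

    assert(with_hwloc != NULL && out != NULL && outlen > 0);

    if (with_hwloc[0] == '\0' ||
        strcmp(with_hwloc, "yes") == 0 ||
        strcmp(with_hwloc, "external") == 0) {
        /* No directory given: rely on the default compiler search paths. */
        n = snprintf(out, outlen, "hwloc.h");
    } else {
        /* A directory was given: pin the header to that installation. */
        n = snprintf(out, outlen, "%s/include/hwloc.h", with_hwloc);
    }
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Under these assumptions, "external" yields a plain `hwloc.h`, while a prefix such as the made-up `/opt/hwloc` yields a full path, which is what lets the two cases be told apart.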

@ggouaillardet
Contributor

@jsquyres you are good, and since git blame points to me, i guess i forgot i fixed this part ...

@rhc54
Contributor

rhc54 commented Feb 12, 2017

I think you guys are swatting mosquitos with sledgehammers. Revamping the MCA system and modifying the headers used across the code base to work around a self-created problem seems extreme. This isn't an inherent issue in the config code - this is a problem caused by a design decision we made with respect to how the user specifies a path to an external hwloc. We chose, for whatever reason, to do this in an atypical way that is now biting us.

It therefore seems to me that we have two simpler solutions to these problems:

  1. the cleanest and probably best solution is to simply remove the embedded version of hwloc. Our packagers would applaud such a move, and it would certainly simplify our lives. Recall that the only reason we embedded in the beginning was the lack of general availability of adequate versions of this package. That problem has long since been resolved. There really is no longer a good reason for us to be embedding a copy of hwloc.

  2. if we continue to insist on embedding, then what is wrong with simply requiring the user to specify where to find their external hwloc? This eliminates the problem in the simplest, cleanest fashion - so why overcomplicate what we are inherently declaring (by having an embedded version) to be a non-default configuration?

Whatever we do, let's be fully aware of the extent of the problem caused by this external vs internal dependency. I think you have underestimated the extent of the problem - e.g., you don't appear to remember that specifying --with-hwloc=<foo> will automatically cause us to also slurp in an external libevent if it was also installed with <foo> as its prefix. The libevent situation may resolve itself: as @hjelmn has already been discussing elsewhere, we are quite likely going to replace that dependency with a simpler, customized solution for a variety of reasons. However, in the meantime, any proposed modification of the external hwloc code is going to impact libevent as well.

So why not get rid of this embedded code and save ourselves some pain?

@ggouaillardet
Contributor

@rhc54 are you suggesting we simply remove the embedded hwloc component hwloc/hwloc1113 and the hwloc framework, and hence directly invoke hwloc_ functions provided by the external libhwloc?
let's also keep in mind that hwloc version 2 cannot be used with Open MPI due to some incompatible API changes.

i guess one of the main issues with --with-hwloc=/usr is that we likely end up adding -I/usr/include -L/usr/lib64 to the search paths. and though that can obviously be fixed with a simple hack, how do we handle /usr/local, for example (searched by GNU compilers, but not by Oracle compilers ...)?

@rhc54
Contributor

rhc54 commented Feb 13, 2017

Yes - remove the hwloc components (we still need the base functions), have a config check that rejects hwloc 2.x, and make the user install what we need. Why treat hwloc differently from any other required support?

As multiple users have pointed out, there is no issue with setting --with-hwloc=/usr. I don't know why we keep raising that red herring. Is there some known issue that our users don't see?

@ggouaillardet
Contributor

this is system dependent, and you described the (potential) issue earlier.
if you have both hwloc and PMIx in /usr, and you want to use external hwloc but embedded PMIx, you might silently end up using external PMIx because of -I/usr/include -L/usr/lib64

@rhc54
Contributor

rhc54 commented Feb 13, 2017

If pmix is available on the system, then we should use the system one too - again, there is no reason for us to continue treating these packages differently from any other ones. We can have "glue" components for the different versions - just drop the embedding.

@artpol84
Contributor

@rhc54 the default assumption of --with-hwloc option without parameters is to search in the default paths. And it is known to be broken now. Is this considered by the proposed solution?

@rhc54
Copy link
Contributor

rhc54 commented Feb 13, 2017

If it is broken, then it needs to be fixed regardless of the path forward - that's a standard use-case we always support.

@artpol84
Contributor

@rhc54, see issue #2946 for reference.

@jsquyres
Member Author

@rhc54 and I talked about this at length this morning on the phone.

The proposed fix on this PR does, indeed, address the issue. It touches a bunch of files, but in trivial ways (hwloc.h -> hwloc-internal.h); the actual fix in autogen.pl is only about 20 lines of perl. It is definitely better than the workaround of --with-hwloc=/usr (which disrupts default compiler/linker search paths).

But it feels like a band-aid.

Maybe we should stop embedding hwloc. Here are the issues:

  1. OS X doesn't have hwloc at all.
    1. But it's only a port or brew installation away; the bar is quite low.
    2. But even so, hwloc doesn't give you too much on OS X, anyway, since OS X (even Sierra) doesn't support binding.
  2. Looking at RHEL / CentOS:
    1. RHEL 6.2 and 6.3 have hwloc 1.1.
    2. RHEL 6.4-6.7 have hwloc 1.5.
    3. RHEL 7.0-7.2 have hwloc 1.7.
    4. RHEL 7.3... will have a very new version of hwloc (it's not out yet).
  3. The hwloc/external component currently requires hwloc >= v1.8 because we have an hwloc base function that calls hwloc_topology_dup, which was introduced in hwloc v1.8.0. Need to do some testing, but we're pretty sure this is not actually used anywhere in active code paths.
    1. If we remove the hwloc >= v1.8 requirement, how far back in hwloc can we support? @jsquyres will do a little testing.
  4. If we remove embedded hwloc:
    1. We could require hwloc >= vX.Y to be installed (where X.Y TBD)
      • On OS X, ports/brew makes it easy.
      • On Solaris or older Linux distros, manual installation would be required.
    2. We could no longer require hwloc, and instead put #if HAVE_HWLOC blocks back throughout the code base. This was a little problematic before -- we kept unintentionally breaking the case where hwloc was not present, which is what led us to making hwloc mandatory. But this case is still possible, if we want to do it.
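
To make the last option concrete, here is a minimal sketch of the `#if HAVE_HWLOC` style of optional dependency. It is not taken from the Open MPI code base: the macro `DEMO_HAVE_HWLOC` and both function names are hypothetical, and the "with hwloc" branch deliberately omits any real hwloc call.

```c
#include <assert.h>

/* Hypothetical stand-in for a configure-generated HAVE_HWLOC macro. */
#ifndef DEMO_HAVE_HWLOC
#define DEMO_HAVE_HWLOC 0
#endif

/* Report whether binding support was compiled in. */
int demo_binding_available(void)
{
    return DEMO_HAVE_HWLOC ? 1 : 0;
}

/*
 * Try to bind to a core. Returns 0 on success, or -1 when support was
 * compiled out. Both branches must keep compiling and behaving sensibly.
 */
int demo_bind_to_core(int core)
{
    assert(core >= 0);
#if DEMO_HAVE_HWLOC
    /* A real build would call into hwloc here; omitted in this sketch. */
    (void)core;
    return 0;
#else
    (void)core;
    return -1;
#endif
}
```

Every such guard doubles the number of code paths that need testing, which is one way to read the earlier experience of the hwloc-absent case being broken by accident.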

@jsquyres
Member Author

With some testing, I have determined:

  1. We can reasonably support hwloc 1.5.x without much change to the OMPI code base.
    1. hwloc earlier than v1.8 does not have hwloc_topology_dup(), which we can either remove or put inside #if statements (it is only used in exactly one place in the OMPI code base).
    2. hwloc v1.5.x does not have HWLOC_OBJ_OSDEV_COPROC. We use this in opal_hwloc_base_find_coprocessors(), which can easily be hard-wired to return NULL if we have an hwloc that does not have HWLOC_OBJ_OSDEV_COPROC.
  2. Older than hwloc v1.5.x becomes problematic. E.g., hwloc_get_cache_type_depth() does not exist; we could probably code around that with some #if statements, but it would be a PITA.
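
The version boundaries from this comment and the earlier ones (hwloc_topology_dup() from v1.8, a practical floor of v1.5, and hwloc 2.x rejected) can be written down as simple predicates. This is a sketch under assumptions: the function names are hypothetical, and the packed major/minor/release encoding is chosen for the example. hwloc's headers expose their own version macro (HWLOC_API_VERSION), whose exact encoding should be checked against the hwloc documentation before relying on it.

```c
#include <assert.h>

/* Pack a version triple into one comparable integer (encoding is an assumption). */
unsigned demo_hwloc_version(unsigned major, unsigned minor, unsigned release)
{
    assert(major < 256 && minor < 256 && release < 256);
    return (major << 16) | (minor << 8) | release;
}

/* hwloc_topology_dup() was introduced in hwloc v1.8.0, per the discussion above. */
int demo_has_topology_dup(unsigned version)
{
    return version >= demo_hwloc_version(1, 8, 0);
}

/* Range discussed in this thread: at least v1.5, and not hwloc 2.x. */
int demo_version_supported(unsigned version)
{
    return version >= demo_hwloc_version(1, 5, 0) &&
           version <  demo_hwloc_version(2, 0, 0);
}
```

In a real build, the same comparisons would more likely live in configure checks or preprocessor guards than in runtime functions.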

@rhc54 rhc54 mentioned this pull request Feb 25, 2017
Instead, include "opal/mca/hwloc/hwloc.h"

Signed-off-by: Jeff Squyres <[email protected]>
Frameworks are usually required to have a framework/framework.h file.
However, this is sometimes problematic (see the hwloc use case/problem
description, below).

This commit allows frameworks to have an "autogen.options" file (i.e.,
project/mca/framework/autogen.options) that specifies things that
autogen needs to know about the framework.  Currently, the only option
recognized in autogen.options is "framework_header", which allows a
framework to specify that its header file is named something other
than "framework.h" (the framework header file must still be in the
project/mca/framework directory; it simply may be named something
other than framework.h).  More options may be introduced over time.
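
The naming rule in the paragraph above can be sketched as a small helper. The real logic lives in autogen.pl (Perl); this C version is only an illustration, and the function name `demo_framework_header_path` and its error convention are hypothetical. It defaults to "framework.h", accepts an override name, and rejects overrides that contain a directory separator, since the header must stay in the framework directory.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "project/mca/framework/<header>".
 *
 * override_name: value of a framework_header-style option, or NULL/"" to
 *                fall back to "<framework>.h".
 *
 * Returns 0 on success, -1 on an invalid override or a too-small buffer.
 */
int demo_framework_header_path(const char *project, const char *framework,
                               const char *override_name,
                               char *out, size_t outlen)
{
    const char *header = override_name;
    char default_name[128];
    int n;

    assert(project != NULL && framework != NULL && out != NULL && outlen > 0);

    if (header == NULL || header[0] == '\0') {
        /* Default rule: the header is named after the framework. */
        n = snprintf(default_name, sizeof(default_name), "%s.h", framework);
        if (n < 0 || (size_t)n >= sizeof(default_name)) {
            return -1;
        }
        header = default_name;
    } else if (strchr(header, '/') != NULL) {
        /* The header may be renamed, but not moved to another directory. */
        return -1;
    }

    n = snprintf(out, outlen, "%s/mca/%s/%s", project, framework, header);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

With project "opal" and framework "hwloc", no override gives opal/mca/hwloc/hwloc.h, and an override of "hwloc-internal.h" gives opal/mca/hwloc/hwloc-internal.h.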

The use case that motivated this is the hwloc framework
(open-mpi#2616).

Per MCA framework rules, the hwloc framework is required to have an
opal/mca/hwloc/hwloc.h file.  However, the hwloc library itself *also*
has an hwloc.h file.  This causes a problem when configuring Open MPI
with --with-hwloc=external (meaning: do not use the hwloc embedded
within the Open MPI source code tree -- instead, use an hwloc
installation from outside the Open MPI source code tree).
Specifically, when in the opal/mca/hwloc directory, the presence of
"-I." in DEFAULT_INCLUDES (put there by Automake) causes a confusion
between the hwloc.h in opal/mca/hwloc/hwloc.h and the system-installed
hwloc.h.  Chaos ensues (see the GitHub issue for more detail).
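
The shadowing problem can be illustrated with a toy model of header lookup, where search directories are tried in order and the first match wins. This is a simplification of what a real preprocessor does (it ignores, for example, the special handling of the including file's own directory for quoted includes), and the function name and the in-memory "filesystem" list are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the index of the first search directory that "contains" header,
 * or -1 if none does. existing[] is a hypothetical list of file paths that
 * stands in for the filesystem.
 */
int demo_resolve_include(const char *const *dirs, size_t ndirs,
                         const char *const *existing, size_t nexisting,
                         const char *header)
{
    char candidate[256];
    size_t i, j;

    assert(dirs != NULL && existing != NULL && header != NULL);

    for (i = 0; i < ndirs; ++i) {
        int n = snprintf(candidate, sizeof(candidate), "%s/%s", dirs[i], header);
        if (n < 0 || (size_t)n >= sizeof(candidate)) {
            continue; /* path too long for this toy buffer; skip it */
        }
        for (j = 0; j < nexisting; ++j) {
            if (strcmp(candidate, existing[j]) == 0) {
                return (int)i; /* first match wins */
            }
        }
    }
    return -1;
}
```

With search directories "." then "/usr/include", a framework file "./hwloc.h" resolves first and hides "/usr/include/hwloc.h". Once the framework file is named "./hwloc-internal.h", the same lookup for "hwloc.h" falls through to the system copy.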

The solution is to rename the opal/mca/hwloc/hwloc.h to something else
(e.g., hwloc-internal.h), and extend autogen.pl to allow frameworks to
have an alternate name for their framework header file.

This commit introduces the autogen.pl mechanism to allow the alternate
header file name.  A follow-on commit will effect this change in the
hwloc framework (and update all the places in the code base to use the
new filename).

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres jsquyres force-pushed the pr/hwloc-external-fixes branch from 5023c92 to 30a3f68 Compare February 28, 2017 15:46
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres jsquyres force-pushed the pr/hwloc-external-fixes branch from 30a3f68 to fec519a Compare February 28, 2017 15:49
@rhc54
Contributor

rhc54 commented Feb 28, 2017

go with it 👍

@jsquyres
Member Author

@rhc54 and I discussed -- we agree on the long term:

  1. This PR is basically a band aid.
  2. Eventually, we'll need to remove the embedded copy of hwloc (i.e., something like "Remove hwloc framework", #3029).

@jsquyres jsquyres merged commit d5266ab into open-mpi:master Feb 28, 2017
@jsquyres jsquyres deleted the pr/hwloc-external-fixes branch February 28, 2017 19:57
Successfully merging this pull request may close these issues.

Build failure with --with-hwloc=external