Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign the command line system to provide isolation #1176

Merged
merged 8 commits into from
Feb 5, 2022

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented Jan 10, 2022

Redesign the command line system to provide isolation

As the number of organizations using PRRTE grows, we are
beginning to see conflicts surrounding the command line
interface. This is the primary interaction point to the
user, so it is natural that organizations want to customize
it - i.e., to "rebrand" PRRTE to match their needs.

The original code utilized a custom command line parser
that was rigidly tied to the needs of OpenMPI. Modifying
this to create the desired independence would be challenging.
Instead, we chose to replace that code with use of the
standard "getopt_long" function. This reduced the amount
of custom code, though it still requires one to properly
deal with the input/output to/from that parser.

This represents an initial working prototype of the revised
system. Note that it does change the schizo interface. Some
minor configuration changes have also crept into this PR
and will be separated out prior to commit.

Main purpose of this PR is to provide the community an
opportunity to look over the changes and provide feedback.
Updates to the code will continue. In particular, we need
to separate out the map/rank/bind policy assignments so
that the individual organizations can each choose their
own defaults.

Signed-off-by: Ralph Castain [email protected]

@jjhursey
Copy link
Member

bot:ibm:retest

As the number of organizations using PRRTE grows, we are
beginning to see conflicts surrounding the command line
interface. This is the primary interaction point to the
user, so it is natural that organizations want to customize
it - i.e., to "rebrand" PRRTE to match their needs.

The original code utilized a custom command line parser
that was rigidly tied to the needs of OpenMPI. Modifying
this to create the desired independence would be challenging.
Instead, we chose to replace that code with use of the
standard "getopt_long" function. This reduced the amount
of custom code, though it still requires one to properly
deal with the input/output to/from that parser.

This represents an initial working prototype of the revised
system. Note that it does change the schizo interface. Some
minor configuration changes have also crept into this PR
and will be separated out prior to commit.

Main purpose of this PR is to provide the community an
opportunity to look over the changes and provide feedback.
Updates to the code will continue. In particular, we need
to seperate out the map/rank/bind policy assignments so
that the individual organizations can each choose their
own defaults.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54 rhc54 changed the title Test PR just to see the state of the change - DNM Redesign the command line system to provide isolation Jan 13, 2022
Allow each schizo personality to define its own
default mapping, ranking, and binding policies.
Any they choose not to define will fallback to
the PRRTE defaults.

Add an MCA param to each schizo component allowing
it to silence deprecation warnings. Default the
PRRTE component to output them. Default the OMPI
component to silence them.

Update the OMPI component to set its own ranking
policy for the PPR mapping option - leave all
else to the defaults.

Signed-off-by: Ralph Castain <[email protected]>
Only output them once, when we first parse the
command line. Applications submitted via PMIx_Spawn
should not generate warnings

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
application name

Don't look at the app's arguments

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Copy link
Contributor Author

rhc54 commented Jan 27, 2022

weird - getting error about some nvidia library not loading. let's try again

bot:ibm:retest

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 27, 2022

@jjhursey Looks like the containers for IBM's CI are missing something? Could you take a look?

@jjhursey
Copy link
Member

Ah I think this is the same hwloc issue we hit on the OMPi side. Let me try to change the configure arguments. 1 min

@jjhursey
Copy link
Member

bot:ibm:retest

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 27, 2022

@jjhursey I think the problem is that you needed to add those configure options to the HWLOC build when constructing the container. I'm not sure where you have that, or if you can simply have it go back to pre-2.5 HWLOC to avoid the problem.

@jjhursey
Copy link
Member

The container is rebuilding now. Should be an hour or so until it is ready. I'll check back on it tonight.

@jjhursey
Copy link
Member

bot:ibm:retest

@jjhursey
Copy link
Member

The hwloc problem seems to be solved, but it looks like there is a problem with the mpir-shim now.

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 28, 2022

Yeah, I tried on the other PRs and they all hit the same problem. I'm going to try and locally reproduce, but it may be another symptom of something in the container. This PR was passing last time it was modified - all I did was resolve a minor conflict in the copyrights.

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 28, 2022

@jjhursey I'm afraid it is still something in those containers - I tested locally with my containers and the mpir-shim tests pass. Sorry to bother, but can you see what else might have changed? I would suggest backing down the HWLOC version as a starting point as that might have been what triggered the problem.

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 28, 2022

bot:ibm:xl:retest

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 28, 2022

I turned off the mpir-shim test to see how much further we would get - we failed when the debug tests tried to do an indirect version. It therefore appears that we fail whenever PMIx does a fork/exec of the launcher. I'll try with HWLOC 2.7 on my container - it was using v2.2 before - just in case that change remains the source of the trouble, even with the modified configure line.

@jjhursey
Copy link
Member

Let's follow up on Slack with the Container issue since it's not directly related to this ticket. I can backlevel it, but I'd also like to know what's going on here. The only thing that should have changed is the PGI compiler and HWLOC levels.

@jjhursey
Copy link
Member

bot:ibm:retest

Copy link
Contributor

@awlauria awlauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments/questions. Overall looks good to me.

src/mca/plm/alps/configure.m4 Show resolved Hide resolved
src/mca/rmaps/base/rmaps_base_map_job.c Outdated Show resolved Hide resolved
src/mca/rmaps/base/rmaps_base_map_job.c Show resolved Hide resolved
src/mca/routed/routed.h Show resolved Hide resolved
src/mca/odls/base/odls_base_default_fns.c Show resolved Hide resolved
src/mca/schizo/ompi/schizo_ompi.c Show resolved Hide resolved
src/tools/prte/prte.c Outdated Show resolved Hide resolved
src/tools/prte_info/prte_info.c Outdated Show resolved Hide resolved
src/util/cmd_line.c Show resolved Hide resolved
src/util/cmd_line.c Show resolved Hide resolved
Signed-off-by: Ralph Castain <[email protected]>
@awlauria
Copy link
Contributor

awlauria commented Feb 3, 2022

Thanks - did you see my other comments? I had a few Q's on the ompi command line options.

@naughtont3
Copy link
Contributor

@rhc54 minor item that may have always been in this way, but i noticed that some of the --help options are unrelated to the tool (e.g., prte --show-progress or prterun --daemonize). If this is new behaviour, maybe something to adjust (likely just through sectioning of the help output).

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 3, 2022

@rhc54 minor item that may have always been in this way, but i noticed that some of the --help options are unrelated to the tool (e.g., prte --show-progress or prterun --daemonize). If this is new behaviour, maybe something to adjust (likely just through sectioning of the help output).

Yeah, there is some scrub work to be done on the help output. I just copied it as-is for now and figured we would iterate from there.

@naughtont3
Copy link
Contributor

Made a quick pass through things adding a new dummy tool/personality and worked as expected (minus a few goofs on my part). Overall seems to look good.

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 5, 2022

Okay, we'll go with this for now - any issues can be dealt with as they surface.

@rhc54 rhc54 merged commit 9d06d37 into openpmix:master Feb 5, 2022
@rhc54 rhc54 deleted the topic/cliredo branch February 5, 2022 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants