
WeeklyTelcon_20230606

Geoffrey Paulsen edited this page Jun 6, 2023 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Webex)

  • Geoff Paulsen
  • Jeff Squyres (Cisco)
  • Brendan Cunningham
  • Christoph Niethammer HLRS
  • David E. Bernholdt
  • Edgar Gabriel (AMD)
  • Joseph Schuchart
  • Luke Robison (Amazon)
  • Quincey (AWS)
  • Thomas Huber
  • Thomas Naughton
  • Todd Kordenbrok

New Issues:

  • Main needs submodule pointer updates

  • #11532 - No progress yet

  • #11730 - the author has a large set of changes to mpirun and a few changes to schizo

    • Split between prterun and mpirun: propagating command-line args is hard to describe, and putting the descriptions into both means cloning the text. No way to make this common.
    • Clone the text?
      • Probably best/easiest solution
      • Include a comment from where it came
    • Another idea might be to add this common text into prrte and #include via rst
      • Could just pull this from the submodule.
  • Concerned about all of these open issues

    • If we could get a few tickets closed, we could be done soon.
    • If you know the area, a ticket might be easy to close.
    • The issues have been broken down into smaller tasks.
    • Pick one and assign it to yourself.
  • Doc Issues: https://github.com/open-mpi/ompi/projects/3

    • For some of the smaller issues, we might be able to mark the item as ToBeDone, with a note: "Sorry, we haven't documented this yet."
  • #11726 -N bind ppr:X:node, map by package (socket), or core

    • What we've confirmed is that there is a change to the way that binding works if you just specify -N
    • Seems like we could try changing the schizo component so that behavior is maintained from v4 to v5.
    • With this, we can decide what to do.
    • Luke

Previous Issues:

  • #11722 - Cannot build+install with out-of-source builds (VPATH)
    • Possible blocker, need to update submodule pointers.
      • only on main
      • main needs submodule update - Austen

v4.1

  • No updates

v5.0

Current issues:

  • PMIX v4.2 async modex issue: https://github.com/openpmix/openpmix/issues/3077

    • Workaround: -x PMIX_MCA_gds=hash or enable opal_pmix_collect_all_data
    • Need to raise the timeout: fix in OMPI before PMIx_Get, increasing the timeout as a function of scale, with a user override.
    • Likely the original issue is that an additional variable for async modex is missing from ompi_pml_base_check_pml.
    • A new parameter exists for v5.0.x; it MUST be documented.
  • MCA param issues are the biggest issues now; no new updates.

  • Need to cherry-pick NIC selection (distances PR fixes) to v5.0.x

    • Several PRs will go into main, including coverity fixes.
    • Amir to open up a v5.0.x PR to track all main commits and cherry-pick to v5.0.x when finished.
    • Pending review -
    • Will create initial v5.0.x PR as a pre-PR for the NIC selection: needs review
  • UCX and --enable-mca-dso do not mix: https://github.com/open-mpi/ompi/issues/11632

  • Issue #11532: mca_base_param_files option is no longer read

    • PMIX command-line parsing issue: the first stage of the fix is complete; the next stage is expected over the next few days.
  • PR 11681 - Propagate the error from callback

    • Legit bug fixed by George, but it introduced a behavior change; needs community review.
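
For the PMIx async modex item above, the notes ask for a timeout that grows with job scale and can be overridden by the user. A minimal sketch of that policy follows. It is not Open MPI code: the function name, the constants, and the choice of logarithmic growth are all assumptions made for illustration, since the notes do not say how the timeout should scale.

```python
import math
from typing import Optional

# Hypothetical constants for illustration only; not Open MPI defaults.
BASE_TIMEOUT_S = 30.0    # assumed floor for a single-process job
PER_DOUBLING_S = 10.0    # assumed extra seconds per doubling of job size


def modex_timeout(num_procs: int, user_override: Optional[float] = None) -> float:
    """Return a timeout in seconds that grows with job size.

    A user-supplied value always wins over the computed one.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    if user_override is not None:
        if user_override <= 0:
            raise ValueError("user_override must be positive")
        return float(user_override)
    # Assumed policy: grow logarithmically with the number of processes.
    return BASE_TIMEOUT_S + PER_DOUBLING_S * math.log2(num_procs)
```

With these assumed constants, a 1024-process job gets a longer default than a single process, while an explicit override is returned unchanged regardless of scale.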
