WeeklyTelcon_20220517

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (AMD)
  • Howard Pritchard (LANL)
  • Joseph Schuchart (UTK)
  • Josh Fisher (Cornelis Networks)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

Not there today (I keep this list for easy cut-n-paste into future notes)

  • Geoffrey Paulsen (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Hessam Mirsadeghi (UCX/NVIDIA)
  • Tommy Janjusic (NVIDIA)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Sam Gutierrez (LLNL)
  • Xin Zhao (NVIDIA)

v4.1.x

  • 4.1.4
    • Merged in all the pending PRs
    • Overnight MTT results look good
    • Probably release this Thursday
    • Includes UCC collectives support and a number of small bug fixes
  • Josh H asks: is there predictability on when the nightly tarballs are available?
    • Probably not.
    • We could probably stand to update the nightly tarball infrastructure to run as a Jenkins job (so that it runs on the build hosts, not on www.open-mpi.org). Need to find some time to do that...
  • Last night's MTT picked up an old tarball, so it still shows some old failures.

v5.0.x

  • RC7 went out late Fri

    • There were some compile failures; those should all be fixed now.
    • A few more fixes in flight.
  • We talked yesterday about setting the minimum PRTE version to 2.0.2+fixes (i.e., the latest public PRTE release plus fixes). This may or may not be useful.

    • But recall that PRTE 2.1 had a large command line refactor.
    • Timeline for PRTE 2.1 isn't until end of summer (estimate).
    • PRTE 2.1 is what we have been targeting for OMPI 5 for a while. It would be awkward to pair PRTE 2.0 with OMPI 5 (e.g., we would have to back-port the entire command line refactor, among other things).
    • There are 7 critical 2.1.x issues on PRTE.
    • Q: Isn't this a stop-ship for OMPI 5?
      • Yes. :-(
    • We should help with at least some of the PRTE 2.1 issues; some of them are OMPI-related.
    • We (OMPI) need a public PRTE 2.1.x release so that packagers can ship an OMPI package plus a separate PRTE package. The embedded PRTE is "not enough" for packagers.
    • Sidenote: PRTE 2.0.x (including the publicly released PRTE 2.0.2) would be confusing for OMPI users, because it doesn't have all of the CLI fixes/updates.
    • Need OMPI community help here (for PRTE 2.1.x): https://github.com/openpmix/prrte/issues?q=is%3Aissue+is%3Aopen+label%3A%22Target+2.1%22
    • Options:
      • Figure out how to be happy with PRTE 2.0.2
      • Wait for PRTE 2.1.0
        • OPTIONALLY: Pour resources into PRTE 2.1.0 (which could make it release faster)
    • It feels like the only reasonable path forward is to wait for 2.1.0, and we should all contribute as many resources as we can because we want it as fast as possible.
      • Resource availability is slim right now :-(
    • We should also set the minimum PRTE version to 2.1.x.
  • We also talked about setting a minimum/floor version of PMIx for OMPI 5

    • If OMPI 5 keeps PMIx 3 as the minimum, we lose (at least):
      • Debugger support
      • show_help aggregation
      • sessions
      • ULFM/fault tolerance
    • This is quite undesirable.
    • Does anyone have a need for PMIx 3 support?
    • Last PMIx 4.0.x release was Dec 2020.
    • We should probably target PMIx 4.1.x (see the version-check sketch below).
      • PMIx 4.1.2 was released Feb 2022; note that it does not include show_help aggregation.
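
For context, a minimal sketch of what a build-time floor check against an external PMIx install could look like, assuming the PMIX_VERSION_MAJOR / PMIX_VERSION_MINOR macros shipped in pmix_version.h. This is illustrative only, not OMPI's actual configure logic, and the 4.1 floor is simply the one discussed above.

```c
/* Hypothetical compile-time floor check (not OMPI's real configure test).
 * Assumes the external PMIx install provides pmix_version.h with the
 * PMIX_VERSION_MAJOR / PMIX_VERSION_MINOR macros. */
#include <pmix_version.h>

#if (PMIX_VERSION_MAJOR < 4) || \
    (PMIX_VERSION_MAJOR == 4 && PMIX_VERSION_MINOR < 1)
#error "This build assumes a PMIx >= 4.1 floor (the floor discussed above)"
#endif
```
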
  • Joseph brings up https://github.com/open-mpi/ompi/pull/10349 -- need to make sure this doesn't fall off the table.

  • Old issue that has resurfaced: intercommunicators (when using more than 1 node) are hanging on main/v5.0.x. Josh Hursey thinks it might involve PMIx_Connect. (See the sketch below.)
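
As a point of reference for the intercomm hang, one way to exercise an intercommunicator through the PMIx spawn/connect machinery is via MPI_Comm_spawn, sketched below. This is a hypothetical illustration, not the actual failing reproducer; the binary name spawn_test is made up.

```c
/* Hypothetical sketch, not the actual reproducer: create an
 * intercommunicator via MPI_Comm_spawn and synchronize over it.
 * Run across more than one node, e.g.:
 *   mpirun -np 2 --map-by node ./spawn_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent processes: spawn 2 copies of this binary; the result is
         * an intercommunicator to the children. */
        MPI_Comm children;
        MPI_Comm_spawn("./spawn_test", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        MPI_Barrier(children);          /* collective over the intercomm */
        MPI_Comm_disconnect(&children);
        printf("parent: intercomm barrier completed\n");
    } else {
        /* Child processes: synchronize with the parents and disconnect. */
        MPI_Barrier(parent);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```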

  • New: Lisandro has hit a segv with partitioned sends. (See the sketch below.)
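
Lisandro's actual test case is not captured in these notes; for reference, a minimal partitioned point-to-point exercise (MPI 4.0 MPI_Psend_init / MPI_Pready / MPI_Precv_init) might look like the following sketch. The partition counts and binary name are arbitrary.

```c
/* Hypothetical sketch of MPI 4.0 partitioned point-to-point, not the
 * actual failing test case. Run with: mpirun -np 2 ./part_test */
#include <mpi.h>
#include <stdio.h>

#define PARTITIONS 4
#define PER_PART   8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[PARTITIONS * PER_PART];
    MPI_Request req;

    if (0 == rank) {
        MPI_Psend_init(buf, PARTITIONS, PER_PART, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; p++) {
            for (int i = 0; i < PER_PART; i++) {
                buf[p * PER_PART + i] = (double) p;   /* fill partition p */
            }
            MPI_Pready(p, req);     /* mark partition p ready to send */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (1 == rank) {
        MPI_Precv_init(buf, PARTITIONS, PER_PART, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
        printf("receiver: all %d partitions arrived\n", PARTITIONS);
    }

    MPI_Finalize();
    return 0;
}
```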

  • New: ULFM issue: https://github.com/open-mpi/ompi/issues/10398

v4.0.x

  • Merging in small fixes.
  • No plan for an update at this time

Main branch

  • Did not get to discuss this. See notes from last meeting.

MTT

  • Did not get to discuss this. See notes from last meeting.

Face-to-face

  • Did not get to discuss this. See notes from last meeting.