
WeeklyTelcon_20230509

Geoffrey Paulsen edited this page Jul 25, 2023 · 2 revisions

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen (IBM)
  • Howard Pritchard (LANL)
  • Edgar Gabriel (AMD)
  • Luke Robison (Amazon)
  • Joseph Schuchart
  • Thomas Huber
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tommy Janjusic (nVidia)

New issues

  • https://github.com/open-mpi/ompi/pull/11649 - an OFI callback scoped incorrectly.
    • This probably affects main, v5.0.x, and v4.x
    • Could craft a test case. Is this only seen with MT apps? No, not only MT.
      • It is reentrant even with a single thread.
      • Got more than one completion because an array was overwritten.
      • Luke will search for a test case.
    • Blocker for v5.0.0
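
To make the failure mode noted above concrete (a reentrant callback overwriting an array, so some completions are seen more than once), here is a minimal single-threaded sketch. It is not the OFI code from the PR; every name in it (`poll_queue`, `handle`, `progress_shared`, `progress_local`, `run_demo`) is hypothetical, and the queue is a plain integer array standing in for a completion queue.

```c
#define BATCH 4
#define NEVENTS 8

/* Hypothetical stand-in for a completion queue: ids 1..NEVENTS. */
static int queue[NEVENTS];
static int head;
static int seen[NEVENTS + 1]; /* seen[id] = times completion id was handled */

typedef void (*progress_fn)(int depth);

/* Copy up to BATCH pending completions into buf; return how many. */
static int poll_queue(int *buf) {
    int n = 0;
    while (n < BATCH && head < NEVENTS) {
        buf[n++] = queue[head++];
    }
    return n;
}

/* Completion handler. For id 1 at the top level it calls back into
 * progress, which models a callback re-entering the progress loop. */
static void handle(int id, progress_fn progress, int depth) {
    seen[id]++;
    if (depth == 0 && id == 1) {
        progress(depth + 1);
    }
}

/* Wrong scope: one buffer shared by every (possibly nested) call. */
static int shared_buf[BATCH];

static void progress_shared(int depth) {
    int n = poll_queue(shared_buf);
    for (int i = 0; i < n; i++) {
        /* A nested call may have overwritten shared_buf[i] by now. */
        handle(shared_buf[i], progress_shared, depth);
    }
}

/* Narrower scope: each call owns its buffer, so re-entry is harmless. */
static void progress_local(int depth) {
    int buf[BATCH];
    int n = poll_queue(buf);
    for (int i = 0; i < n; i++) {
        handle(buf[i], progress_local, depth);
    }
}

/* Reset state, run one top-level progress call, and return the number
 * of extra (duplicate) completions that were handled. */
static int run_demo(progress_fn progress) {
    head = 0;
    for (int i = 0; i < NEVENTS; i++) {
        queue[i] = i + 1;
    }
    for (int i = 0; i <= NEVENTS; i++) {
        seen[i] = 0;
    }
    progress(0);
    int dups = 0;
    for (int id = 1; id <= NEVENTS; id++) {
        if (seen[id] > 1) {
            dups += seen[id] - 1;
        }
    }
    return dups;
}
```

In this sketch `run_demo(progress_shared)` reports 3 duplicate completions (and ids 2 to 4 are never handled), while `run_demo(progress_local)` reports 0. No threads are involved, which matches the note that the problem is not limited to MT applications.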

v4.1

  • No new updates

v5.0

  • MCA params issues are the biggest issues now
  • Might be in PMIx base / framework
  • Need to cherry-pick NIC selection to v5.0.x
    • commit that went into main broke some AWS configurations
    • Caused some Coverity issues, but a fix has already been PRed against main
  • 2 MTT issues
    • UCX and DSO - a fix may need to be cherry-picked back to v5.0.x
      • Issue 11632 - Fix provided in re-review
  • Good to retest ABI with v4.1.x before v5.0.0
    • Geoff will do this this week or next week
  • SMCuda should disqualify itself if no CUDA hardware is available.
    • Want this for v5.0
    • one rank or singleton closes itself early.
    • Edge case in SMCuda and attempts to clean up and tie into framework.
      • When it gets unloaded, there are dangling pointers.
      • Fix: don't set up callback functions unless Cuda_Init succeeds.
    • Edgar's PR is still trying to compile Cuda collective always (PR 11617)
      • Waiting for review
      • Summary, we want both
  • Doc work still remaining, will enumerate next week any remaining issues
    • A fix went in 20 minutes ago; other than that, there is a PMIx cross-version issue (11658)
    • These same doc fixes will trickle through pmix/prrte and
  • New issue: nVidia's internal MTT found an async-modex issue
    • The global dstore has an issue.
    • Happens if you set async-modex or set dstore-hash.
    • Issue of scale: the minimum needed to reproduce is 4 nodes x 4 ppn.
    • UCX and ob1 are both affected.
    • Just Init+Finalize can trigger it.
    • v5.0 blocker.
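
The SMCuda item above describes a fix where callback functions are not set up unless initialization succeeds, so a component that disqualifies itself leaves no dangling pointers behind when it is unloaded. A minimal sketch of that ordering follows. It does not use Open MPI's real component types or the CUDA API; `callbacks_t`, `fake_cuda_init`, `component_init`, and `callbacks_registered` are all hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table; not Open MPI's real framework types. */
typedef struct {
    void (*on_event)(void);
    void (*on_close)(void);
} callbacks_t;

static callbacks_t registered = { NULL, NULL };

static void event_cb(void) {}
static void close_cb(void) {}

/* Stand-in for device initialization: 0 on success, -1 when no
 * hardware is present. The real check is an assumption here. */
static int fake_cuda_init(bool hw_present) {
    return hw_present ? 0 : -1;
}

/* Returns true if the component is usable, false if it disqualifies
 * itself. Callbacks are registered only after init has succeeded. */
static bool component_init(bool hw_present) {
    if (fake_cuda_init(hw_present) != 0) {
        /* Disqualify: nothing was registered, so nothing can dangle
         * once the component is unloaded. */
        return false;
    }
    registered.on_event = event_cb;
    registered.on_close = close_cb;
    return true;
}

/* Helper for inspecting whether any callback has been registered. */
static bool callbacks_registered(void) {
    return registered.on_event != NULL || registered.on_close != NULL;
}
```

With `component_init(false)` the component disqualifies itself and `callbacks_registered()` stays false; only `component_init(true)` registers anything.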

Anything interesting from MPI Forums?

  • Behavior of MPI_Comm_disconnect - a lot of discussion with George
  • MPI_Finalize - what happens to persistent communication handles that the user didn't explicitly free?
  • Option number 3 is under
  • C const in header files: Open MPI and MPICH are both doing what's acceptable for ABI definitions, but this was not discussed at this forum