
WeeklyTelcon_20200616

Jeff Squyres edited this page Jun 25, 2020 · 2 revisions

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Barrett, Brian (AWS)
  • Brendan Cunningham (Intel)
  • Christoph Niethammer (HL
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Mark Allen (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Matias Cabral (Intel)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • William Zhang (AWS)
  • Xin Zhao (nVidia/Mellanox)
  • mohan (AWS)

Release Branches

Review v4.1.x Milestones v4.1.0

  • Schedule: Want to release mid-July
    • RC1 planned for Monday, 8 July, 2020
  • Release Engineers: Brian (AWS) and Jeff Squyres (Cisco)
  • We've come to consensus for a v4.1.0 release
    • Not breaking ABI or backwards compatibility.
    • Blocker moving forward is to start from the v4.0.4 tag (Tomorrow)
    • NOT touching runtime!!!
    • Not going to be pulling in a new PMIx version.
  • Next Steps: MTT testing needs to come online
    • Cisco's came online last night.
    • IBM will get it online tonight.
    • AWS will get it online tonight as well.
    • Mellanox - will come online tomorrow night

Review v4.0.x Milestones v4.0.4

  • v4.0.4 Released last week.
  • Where do we stand with the memory hooks for Open MPI Memory patcher?
    • Save/restore of r2 on ppc64le only.
    • Not sure which components use the memory patcher.
      • OpenIB uses Open MPI memory patcher. Not in master, only in v4.0.x
      • 7799 is still open against master.
  • v4.0.5 - No schedule yet.

Review v5.0.0 Milestones v5.0.0

  • Need to put OSC pt2pt

    • OSC RDMA requires a single BTL that can contact every single process.
      • This didn't use to be the case. (Comment in the code)
  • We can't use the OSC pt2pt.

    • It is not thread safe and doesn't conform to the MPI-4 standard. Not safe.
    • This is just a testing fallacy. We could add tests to show this, but we would still be in the same boat.
    • Either product A or B is broken and we need to fix it.
  • RDMA one-sided should fall back to "my atomics" because TCP will never have RDMA atomics.

    • The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
  • Jeff will close the PR, and

  • Jeff will Nathan will fetching, get, compare and swap.

  • Does UCX support iWarp?

    • Does libFabric support iWarp via verbs provider?
    • https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
    • Brian thinks that libFabric
    • OFI can support iWarp, just need to specify the provider in the include list.
    • The person asking is a partner, not a customer.
  • PMIX

    • Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
    • PPN scaling issue - simple algorithmic issue in this function
      • PMIX talked about it. Artem might know someone who might be interested in working on it.
      • Algorithm behind one of the interfaces doesn't scale well.
      • Not a regression. Above ~ 4K nodes, becomes quadratic.
  • PRRTE

  • Two new PRs for MPI 4.0 error handling from Aurelien Bouteiller.
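
The "my atomics" fallback discussed above (one-sided atomics done in software when the transport, such as TCP, has no RDMA atomics) can be sketched roughly as follows. This is only an illustration under stated assumptions: `emu_op_t`, `emu_atomic`, and `emu_lock` are hypothetical names, not Open MPI's BTL interface, and the sketch assumes the process that owns the target memory applies the operation itself (for example from an active-message handler) while holding a lock.

```c
#include <stdint.h>
#include <pthread.h>

/* Hypothetical names throughout; this is not Open MPI's BTL interface. */
typedef enum { EMU_FETCH_ADD, EMU_SWAP, EMU_CSWAP } emu_op_t;

/* One lock serializes every emulated atomic in this process. */
static pthread_mutex_t emu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed to run on the process that owns *target.
 * Returns the value *target held before the operation was applied. */
int64_t emu_atomic(emu_op_t op, int64_t *target, int64_t operand, int64_t compare)
{
    pthread_mutex_lock(&emu_lock);
    int64_t old = *target;
    switch (op) {
    case EMU_FETCH_ADD:          /* fetching add */
        *target = old + operand;
        break;
    case EMU_SWAP:               /* unconditional swap */
        *target = operand;
        break;
    case EMU_CSWAP:              /* compare and swap */
        if (old == compare) {
            *target = operand;
        }
        break;
    }
    pthread_mutex_unlock(&emu_lock);
    return old;
}
```

A single process-wide lock keeps the sketch short; a real fallback would also have to ship the request to the target and the old value back to the origin, which is omitted here.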

master

  • UCX is failing with a SEGV in certain test cases.
    • Austen will open an issue.
  • PRRTE is hitting an assert in some cases.
    • Austen will open an issue.
  • Remaining Cisco failures look like connectivity issues.
    • Jeff hasn't had a chance to look deeper yet.
    • Looks like usNIC is either not being picked or is disqualifying itself internally.
  • Clang - added float16
    • Need to add a special compiler flag for software emulation of float16.
    • We should not magically add that flag.

Face to face

  • Many companies are not allowing a face to face travel until 2021 due to COVID19.
    • Instead, let's do a series of virtual face-to-face meetings?
  • Yes, this summer, to discuss v5.0.
    • Maybe we can do it by topic?
    • Maybe not 4 or 8 hour things.
  • Different topics on different days.
  • Do a Doodle poll of least-worst days in late July/August.
  • Start a list of topics.

Super Computing Birds-of-a-feather

  • May not have Super Computing conference at ALL this year.
  • Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
  • If this works pretty well, we could do this a couple of times a year.
  • Not constrained to Super Computing

Infrastructure

  • Scale testing: PRs have to opt into it.

Review Master Master Pull Requests

CI status


Dependencies

PMIx Update

ORTE/PRRTE

MTT


