WeeklyTelcon_20210713
- DID NOT RECORD?
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (NVIDIA)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart (HLRS)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Naughton III, Thomas (ORNL)
- Sam Gutierrez (LANL)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske (HPE)
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Josh Hursey (IBM)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic (NVIDIA)
- William Zhang (AWS)
- Xin Zhao (NVIDIA)
- No schedule for v4.0.7
- Cisco would like v4.0.7 someday.
- PR9094 - external32 - Do we want it in v4.0?
- No
- PR9088 - long long - Do we want it in v4.1?
- Yes
- We need both 9094 and 9088 on v4.0.x to fix the bug reported.
- Quality of what this is and what's needed.
- v4.0.6 shipped last week. Looking good.
- Mpool PR, waiting for review and to go into master first.
- Howard is testing.
- 8919 NVIDIA cannot link. Some users may have already hit this.
- Tomislav will try to find someone to look at it.
- Schedule: Planning on late August (no reason for August) for accumulated bugfixes.
- Fix huge page allocator waiting on Howard's testing.
- Long Long one
- 8867 - show help if libz is missing. Jeff is looking at it.
- Documentation
  - Issue 7668 - lots of things need to change here.
  - Can use help.
  - Jeff is done with a first pass of the docs, and is slowly folding in docs.
  - Still stuff that needs to be revamped or written.
  - Still all one docs.
  - Harumi - Even if others can't write well
  - Docs that should go into PRRTE
  - Some infrastructure with Sphinx - can be started as well.
  - Decent handle on infrastructure.
  - Doc could also start in PMIx/PRRTE so we can slurp in.
- PMIx / PRRTE plan to release in next few weeks
- Need to do a v5.0 rc as soon as PRRTE v2 ships.
  - Need feedback if we've missed an important one.
- PMIx Tools support is still not functional. Opened tickets in PRRTE.
  - Not a common case for most users.
  - This also impacts the MPIR shim.
  - PRRTE v2 will probably ship with broken tool support.
- Is the driving force for PRRTE v2.0 OMPI?
  - So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
  - Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
  - Or just fix it in PRRTE v2.0?
  - Is broken tool support a blocker for PRRTE v2.0?
  - Don't ship OMPI v5.0 with broken Tools support.
- Are there any objections to delaying
  - Either we resource this
- https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665
  - Current state of PMIx tool support.
  - We'd like to get Tool support in CI, but need it to be working to enable the CI.
- https://github.com/openpmix/prrte/issues/978#issuecomment-856205950
  - Blocking issue for Open MPI
  - Brian
- PR 9014 - new blocker.
  - Fix should just be a couple of lines of code... hard to decide what we want.
  - Ralph, Jeff and Brian started talking.
  - Simplest solution was to have our own
- Need people working on v5.0 stuff.
- Need some configury changes in before we RC.
- Issue 8850, 8990 and more
- Brian will file 3-ish issues
  - One is configure pmix
- Dynamic Windows fix in for UCX.
- Any update on debugger support?
- Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if
- UCC coll component is being updated to simply be set as the default when UCX is selected. PR 8969
  - Intent is that this will eventually replace hcoll.
- Quality
- Solid progress happening on Read the Docs.
- These docs would be on the readthedocs.io site, or on our site?
- Haven't thought either way yet.
- No strong opinion yet.
- Geoff is going to help
- Issue 8884 - ROMIO detects CUDA differently.
  - Gilles proposed a quick fix for now.
- https://github.com/open-mpi/ompi/wiki/Meeting-2021-07
  - Find link to Webex HERE.
  - July 22nd (2pm Central)
  - July 29th (10-12 Central)
- Now released.
- Virtual Face to face.
- Persistent Collectives
  - So nice to get MPIX_ rename into v5.0
  - Don't think this was planned for v5.0
  - Don't know if anyone asked them this. - Might not matter to them
- Virtual face to face -
- A bunch of stuff in pipeline. Then details.
- Plan to open Sessions pull request.
  - Big, almost all in OMPI.
  - Some of it is more impacted by clang-format changes.
  - New functions.
  - Considerably more functions can be called before MPI_Init/Finalize
  - Don't want to do sessions in v5.0
  - Hessam Mirsadeghi is interested in trying MPI_Sessions.
  - Interested in a timeline of a release that will contain MPI_Sessions.
  - Sessions working group meets every Monday at noon Central time.
  - https://github.com/mpiwg-sessions/sessions-issues/wiki
  - Several of the tools tests are busted on master.
  - Sessions branch fixes some of these.
  - Initialize tools after finalize MPI
  - Update:
  - Did some cleanup of refactoring.
  - Topology might NOT change with Sessions relative to what's currently in master
  - Extra topology work that wasn't accepted by MPI v4.0 standard.
  - Question on how we do MCA versioning
- We don't KNOW that OMPI v6.0 will not be an ABI break
- Would be NICE to get MPIX symbols into a separate library.
  - What's left in MPIX after persistent collectives?
    - Short Float
    - Pcall_req - persistent collective
    - Affinity
  - If they're NOT built by default, it's not too high of a priority.
  - Should just be some code-shuffling.
  - On the surface shouldn't be too much.
  - If they use wrapper compilers, or official mechanism
  - Top level library, since app -> MPI and app -> MPIX lib.
  - libmpi_x library can then be versioned differently.
- Don't change to build MPIX by default.
- Open an issue to track all of our MPI 4.0 items
  - MPI Forum will want, certainly before Supercomputing.
- Do we want an MPI 4.0 design meeting in place of a Tuesday meeting?
  - An in-person meeting is off the table for many of us. We might want an out-of-sequence meeting.
  - Let's Doodle something a couple of weeks out.
  - Doodle and send it out
  - Trivial wiki page in style of other in-person wiki.
- Two days of 2-hour blocks - wiki *
- Who owns our open-SQL?
  - No one?
  - What value is the viewer using to generate the ORG data?
  - Looking for field in the Perl client
  - It's just the username. It's nothing simple.
  - Something about how the CherryPy server is stuffing stuff into the database.
  - Thought it was in the ini file, but isn't.
  - Concerned that we don't have an owner.
  - Back in the day, we used MTT because there was nothing else.
  - But perhaps there's something else now?
- A lot of segfaults in UCX one-sided in IBM
- Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPU?
  - Can ask around.
  - The only tests are the OSU MPI tests, which have support for CUDA and ROCm.
  - Good enough for sanity.
  - No support for Intel low level stuff now.
  - PyTorch - machine learning framework - resembles an actual application.
  - Has different backends: the collectives reduction tool NCCL, but also a CUDA backend for single/multiple nodes.
- ECP - worried we're going to get so far behind MPICH because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU offloaded code with MPICH. Just a heads up.
  - A thread on the GPU can trigger something to happen in MPI.
  - CUDA_Async Not sure of
- Jeff will send out committer list to remove people from list.
  - Trivial to re-add someone, so err on the side of kicking folks out.
- No discussion
- No update
- No discussion.