
WeeklyTelcon_20170131

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brian Barrett
  • Howard
  • Josh Hursey
  • Josh Ladd
  • Ralph
  • Sorry - I may have missed a few this week.

Agenda

  • Put out RC yesterday.
  • ireduce fix - could and should release today.
  • No objections, releasing today, assuming MTT looks okay.
  • Close the door on PRs for v2.0.x for a week or so.
  • What's going on with the --disable-dlopen failure? Recent PRs are getting red Xs due to an undefined reference with --disable-dlopen on the yoohoo cluster.
    • Gilles posted on a couple of them that it looks like a dirty tree. The failure is that it can't copy bits to the m4 directory.
    • Howard sees that when it links hello_world, it fails with an unsatisfied symbol.
    • Don't see this issue in v2.0.
    • The symbol is opal_declspeced
    • Perhaps there are several things going on.
    • PR2885 - PMIx 2885 - red X on --disable-dlopen.
    • No one can hit this by hand (Jeff and Ralph both tried). GCC 4.8.5 and 4.8.1.
  • Ralph is working on update to PMIx Master that will have pthread_locking fix.
    • If that's okay, Ralph will add new PR to roll PMIx Master into OMPI 2.1.1 (within a few days)
  • Mellanox has some PRs for v2.1.
    • Want to resolve the issues with CI before we pull in more PRs.
  • Estimated Schedule: Middle of March.

v3.0 over the horizon.

  • Amazon has approval to do Release Engineering.
  • IBM is still seeking approval.
  • PR2861 - Stack traces - ready to go. Josh Hursey
    • Custom one for v2.x branch - Ralph signed off on both.
  • Support Datatypes - most for master look like they've gone through CI. Waiting for someone to click merge.
  • Nathan's thread perf fix was cleared.
  • PR2838 - removal of the heterogeneous configure option.
    • If someone can fix it, we should keep it... depends on how risky the fix is.
    • Is it being tested? @ggouaillardet is testing nightly, just not uploading data to MTT.
    • If @ggouaillardet agrees to fix it and upload test results to MTT nightly from a heterogeneous cluster, then there is no reason for us to remove the support.

  • Got some fixes committed upstream to OSHMEM
    • some new tests are supposed to HANG / Timeout.
    • Jeff submitted request to them to help MTT hook into this new test use-case.
  • MPI_ONE_SIDED + MPI_THREAD_MULTIPLE
    • Abort rather than give a wrong answer when a component doesn't want to run due to lack of component.
    • osc_pt2pt - Nathan has a fix and wants IBM to test it.

MTT Dev status:

  • Been going kinda slow lately. Intel committed some code to be able to report to the database.
  • Ralph is going to switch to Python client to post to MTT database.
  • Ralph's been asked to provide 3 other components:
    • Watchdog timer to kill off hung jobs.
    • A harasser that launches side processes to do nasty things.
    • Nightly regression on the DBM - starts the DBM and launches tests against it. Will run MTT in half the time.
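
For illustration only, the watchdog idea above can be sketched in a few lines of Python. This is not MTT code: the function name `run_with_watchdog` and the status strings are hypothetical, and it assumes a test can be launched as a single child process.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd and kill it if it runs longer than timeout_s seconds.

    Hypothetical helper, not part of MTT. Returns (status, returncode),
    where status is "finished" or "timed_out".
    """
    try:
        # subprocess.run kills the child and waits for it if the timeout expires.
        proc = subprocess.run(cmd, timeout=timeout_s)
        return ("finished", proc.returncode)
    except subprocess.TimeoutExpired:
        return ("timed_out", None)


if __name__ == "__main__":
    # A stand-in for a hung test: sleeps far longer than the watchdog allows.
    hung_test = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_watchdog(hung_test, timeout_s=1))
```

A harness built this way could also record "timed_out" as the expected result for tests that are supposed to hang, rather than treating every timeout as a failure.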

Exceptional topics

  • SPI - got a reply on how to initiate, Cisco has the ball.
    • Added SPI to website.
  • GitHub transitioned us to a free account.

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

Back to 2017 WeeklyTelcon-2017
