
WeeklyTelcon_20160517

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1161 - Open IB Error Path - Gilles asked Mike to review; in 2nd iteration.
    • Joshua Ladd tagged on 2.x version.
  • 1150 - There are 2 places in Init and 1 in Finalize where we do an RTE barrier.
    • If launched with mpirun, it works just fine.
    • But direct launch will hang under Cray or Slurm PMIx because those have blocking RTE barriers, and those don't progress.
    • Patched it in master with an MPI barrier so that other things progress.
    • This fix will also need to block 2.0.x. Ralph will create a PR.
  • Once these get in, do another RC and move this out.
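The hang described under 1150 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ProgressEngine`, `Barrier`, `blocking_wait`, and `progressing_wait` are hypothetical names invented here, not Open MPI or PMIx APIs. It shows why a wait that never drives progress cannot complete when the release event is itself queued behind the progress engine.

```python
from collections import deque


class ProgressEngine:
    """Toy stand-in for a communication progress engine (hypothetical)."""

    def __init__(self):
        self.pending = deque()

    def post(self, callback):
        # Queue work that only runs when someone calls progress().
        self.pending.append(callback)

    def progress(self):
        """Run one pending callback, if any; return how many were run."""
        if not self.pending:
            return 0
        self.pending.popleft()()
        return 1


class Barrier:
    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


def blocking_wait(barrier, max_spins=1000):
    """Models a blocking barrier: waits without driving progress."""
    for _ in range(max_spins):
        if barrier.released:
            return True
    return False  # a real blocking call would hang here instead of giving up


def progressing_wait(barrier, engine, max_spins=1000):
    """Models a barrier that keeps the progress engine turning while waiting."""
    for _ in range(max_spins):
        if barrier.released:
            return True
        engine.progress()
    return False


def demo():
    engine = ProgressEngine()
    barrier = Barrier()
    engine.post(barrier.release)  # the release is stuck behind progress
    stuck = blocking_wait(barrier)
    done = progressing_wait(barrier, engine)
    return stuck, done
```

In this model `demo()` reports that the non-progressing wait gives up while the progressing wait completes, which is the same shape as replacing the RTE barrier with one that lets other things progress.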

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • PMIx barrier
    • Nathan will review 1164.
    • PR 1673: the multi-threaded issue that George ran into is a doozy.
      • Free path in C++: one thread was in the dereg hooks in delete.
      • Another thread was trying to allocate space, triggering internal garbage collection.
      • Classic deadlock.
      • Nathan reworked the rcache / mpool code to not hold the lock while doing deletes.
      • All locks are always on in RDMA because there is no way around it.
      • The last rcache bug: if you had > 100 registrations associated with a memory registration being munmapped, you ran into an infinite loop.
      • Nathan and George testing.
      • IBM will do some multi-threaded testing as well.
    • PowerPC issues as well. Nathan had to revise the table a bit.
      • ppc64le: if you do a dlsym, the pointer is into the table of contents (TOC): 1 is the real address.
        • The problem is that the TOC is getting patched.
      • When patching, need to patch the real function, not the other.
      • ppc64BE - may still
    • 1162 - Multiple threads create the same endpoint simultaneously.
      • Nathan thought he handled that case.
    • One thing we forgot to do for 2.0.0rc2: send it to the users alias. Will do for rc3.
      • Put an announcement about the migration guide on the announcement list.
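The rcache / mpool rework noted under PR 1673 (not holding the lock while doing deletes) follows a common two-phase pattern. The sketch below is a generic illustration, not Nathan's actual change: `RegistrationCache`, `evict`, and `on_free` are hypothetical names. For determinism the demo uses a free callback that re-enters the cache on a single thread rather than two real threads; with a non-reentrant lock held during the callback, that re-entry would deadlock.

```python
import threading


class RegistrationCache:
    """Toy registration cache; names are hypothetical, not Open MPI's rcache API."""

    def __init__(self, on_free):
        self._lock = threading.Lock()  # deliberately non-reentrant
        self._entries = {}
        self._on_free = on_free

    def insert(self, key, value):
        with self._lock:
            self._entries[key] = value

    def size(self):
        with self._lock:
            return len(self._entries)

    def evict(self, keys):
        # Phase 1: unlink the victims while holding the lock.
        with self._lock:
            victims = [self._entries.pop(k) for k in keys if k in self._entries]
        # Phase 2: run the free callbacks with the lock released, so a
        # callback that re-enters the cache cannot deadlock on it.
        for victim in victims:
            self._on_free(victim)
        return len(victims)


def demo():
    log = []
    cache = None

    def on_free(value):
        # Re-enters the cache; safe only because evict() dropped the lock.
        log.append((value, cache.size()))

    cache = RegistrationCache(on_free)
    cache.insert("a", 1)
    cache.insert("b", 2)
    cache.insert("c", 3)
    count = cache.evict(["a", "b"])
    return log, count
```

The trade-off is that victims are briefly unlinked but not yet freed, so the callbacks must tolerate running after the entry has left the cache.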

Review Master MTT testing (https://mtt.open-mpi.org/)

  • IBM trying to ramp up MTT testing. Hopefully will have Power8 XL compiler testing soon.
    • Some issues passing certain flags to XL compilers. Josh Hursey is working on it.
  • Cisco / Intercomm create failures.
  • Getbyte offset test requires v2.0.0 or greater and spins until timeout on 1.10.
  • 2nd month of RED. Can't seem to break out of it.
  • IBM wants to get Jenkins on Power8LE enabled this week. Looks like they got the correct permissions, using the polling method.
    • If people push quickly, with multiple pushes within one polling interval, it will just pick up the last one.
  • Jenkins servers have been hanging / restarting lately.
    • Howard saw that there was a cron job doing automated updates of Jenkins. Last Wednesday Jenkins was updated with a security fix, but that broke a lot of the GitHub integration.
  • Pull Request 1650 still causing red X on Mellanox Jenkins.
    • Red X on master because of an issue that hasn't been resolved.
    • Need Nathan or Josh Hursey or someone to follow up. Who knows the AMCA code best? We could move AMCA out.
    • MCA variable system
    • envlist being available in an aggregate.
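The polling behaviour mentioned for the Power8LE Jenkins setup (several pushes inside one polling interval collapse into one build of the last commit) can be sketched as follows. This is a simplified model with hypothetical function names and commit ids; it does not reproduce Jenkins' actual SCM polling implementation.

```python
def poll(branch_head, last_built):
    """Return the commit to build, or None if the head has not moved."""
    return branch_head if branch_head != last_built else None


def simulate(pushes_per_interval):
    """Each inner list holds the commits pushed during one polling interval."""
    head = None
    last_built = None
    built = []
    for pushes in pushes_per_interval:
        for commit in pushes:
            head = commit  # every push moves the branch head
        target = poll(head, last_built)
        if target is not None:
            built.append(target)
            last_built = target
    return built


# Commits "b" and "c" are never built on their own: only "d" is seen.
builds = simulate([["a"], ["b", "c", "d"], [], ["e"]])
```

Intermediate commits are still covered indirectly by the build of the last one, but a failure cannot be attributed to a single push without bisecting.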

MTT Dev status:

  • Jenkins is still the best of the worst for running in non-cloud
    • Hudson is an enterprise pay-for solution, but we want free.
  • Josh posted documentation on the wiki, but not the scripts yet.
  • MTT: some new development to clean out the MTT GitHub, discussed on the MTT devel list.
    • Clear out some issues and set a new milestone, etc.
  • There is an alternative for Travis, but that hasn't been an issue.
    • AppVeyor
  • What is the combinatorial executor for MTT?
    • Ralph explains: suppose you have two different OMPI builds (different configure lines)
    • and a big list of tests.
    • The existing sequential executor would build both OMPI builds sequentially,
      • but when building tests, it wouldn't automatically build them for both; you have to tell it.
    • The combinatorial executor would do that: build the list of tests for EACH configured OMPI build.
  • Chelsio getting some resources to possibly do MTT nightly testing.
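Ralph's description of the combinatorial executor amounts to taking the cross product of configured builds and tests. A minimal sketch, assuming hypothetical build and suite names and a hypothetical `combinatorial_plan` helper (this is not MTT's actual executor code):

```python
from itertools import product


def combinatorial_plan(builds, tests):
    """Pair every configured build with every test suite."""
    return list(product(builds, tests))


# Hypothetical names, for illustration only.
builds = ["build-debug", "build-opt"]
tests = ["suite-a", "suite-b"]

plan = combinatorial_plan(builds, tests)
for build, suite in plan:
    print(f"build {suite} against {build}")
```

With two builds and two suites the plan has four entries, whereas the sequential behaviour described above would need each build-and-suite pairing spelled out by hand.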

Status Updates:


Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

