PSM2 MTL doesn't work for single node jobs #1559
Comments
Okay, the problem is that the PSM2 MTL is setting the PSM2_DEVICES env variable to just self,shm.
Logfile for when PSM2_DEVICES is not set by PSM2 MTL.
The psm2 related RPMs on our cluster are: infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64 |
Per request, modinfo hfi1 output for the cluster. The rpms that appear to be relevant to hfi1 installed on the nodes of the cluster are: |
Here's a description of the changes made for the rpm list above for this DOE cluster. RPMs with "ch6" in the name have been repackaged for these clusters: opa-hfi1-0-201603141940.ch6.x86_64 and kmod-opa-hfi1. The code is from https://github.com/01org/opa-hfi1 as of 201603141940. We run make on it and package it up as a weak module, with no changes from GitHub. hfi1-psm is built from hfi1-psm-0.7-221.src.rpm, which Intel provided. The only change to that rpm is in the spec file: I added %{?dist} to the Release and added opa-hfi1 to the BuildRequires. |
@matcabral in case you didn't get email about these updates. |
I was informed that the customer wants this problem resolved for 1.10.3. The clusters will be put into production soon and a functional Open MPI 1.10.X is necessary. |
Just scratching my head here for a moment - if you are concerned about the PSM2 MTL, then shouldn't you need the psm2 library rpm? You are only showing the "psm" rpms above. |
Just FWIW: here's a list of the missing rpm's from your system (based on my CentOS7 system): libpsm2.x86_64 0.7-4.el7 @base
libpsm2-compat.x86_64 0.7-4.el7 base
libpsm2-compat-devel.x86_64 0.7-4.el7 base
libpsm2-devel.x86_64 0.7-4.el7 base |
Ralph, that information was provided near the beginning of this email thread. Howard 2016-04-21 10:31 GMT-06:00 rhc54 [email protected]:
|
@rhc54 it seems the RPMs have different names depending on who creates them. The names that @hppritcha provided above (hfi-psm*.rpm) are valid ones and are the ones distributed with the Intel IFS package. @hppritcha merging the ompi-devel email thread here, I see you are reproducing two different issues: Thanks, |
Correct about the two issues. It's not clear that they are related at this point. |
If I simply #ifdef out the PSM2_DEVICES setting code in the PSM2 MTL, all problems with single node jobs vanish, both the hang for multi-process cases and the assert for the single-process case. |
The PSM2 MTL was setting the PSM2_DEVICES environment variable to self,shm for single node jobs. This is causing single node/single process jobs to fail on some omnipath systems. Not setting this environment variable fixes these issues. This fix is needed as part of bringup of omnipath clusters at several customer sites. Fixes issue open-mpi#1559 Signed-off-by: Howard Pritchard <[email protected]>
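To make the behavior under discussion concrete, here is a minimal sketch of the kind of logic involved. It is not the actual Open MPI source: the function name and both parameters are hypothetical, and the only details taken from this thread are the variable name PSM2_DEVICES and the value self,shm. It shows one cautious way to apply the restriction: only for single node jobs, only when explicitly allowed, and never overriding a value the user has already exported.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv()/unsetenv() */
#endif
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper, for illustration only (not the real PSM2 MTL code).
 *
 * all_procs_on_one_node: true when every process of the job is local.
 * allow_restriction:     hypothetical switch; false reproduces the effect
 *                        of ifdef'ing out the PSM2_DEVICES setting code.
 *
 * Returns 0 on success (including "nothing to do"), -1 if setenv() fails.
 */
static int restrict_psm2_devices_if_local(bool all_procs_on_one_node,
                                          bool allow_restriction)
{
    if (!all_procs_on_one_node || !allow_restriction) {
        return 0;  /* leave PSM2_DEVICES untouched */
    }
    /* overwrite = 0: keep any value already present in the environment */
    return setenv("PSM2_DEVICES", "self,shm", 0);
}
```

Calling restrict_psm2_devices_if_local(true, false) leaves the environment alone, which corresponds to the fix described in the commit message above; a user can still export PSM2_DEVICES manually before launching the job.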
@hppritcha, Couple comments: |
Actually ii isn't exactly accurate. Using mpirun one also observes the hang. |
@hppritcha Could you please set up a call between our folks, you, and the LLNL folks to discuss this? We're having a little trouble reproducing some of the hang issues and would like to ensure we are accurately replicating the setup. |
I'll try to set up something for tomorrow (Wednesday). |
@matcabral thanks for the info. Our clusters have Broadwell E5-2695 v4 processors - 36 HTs total/node. It wasn't quite clear from your comment whether there is an upper limit on the number of MPI processes/node that can be run if PSM2_DEVICES is not set to self,shm. I don't seem to hit that limitation on our cluster. |
@hppritcha, did you see my answer to the ranks limit on #1578 ? |
customer has a satisfactory workaround for now. |
The config options I used for ompi for this issue are
The config options used by the admins for the system install of ompi are
|
If we do a 1.10.4 this can be set to that milestone. Setting to 2.0.1 milestone for now. Per discussions on a telecon today, there will be an ompi fix required as well as a fix in PSM2. |
@hppritcha as mentioned above, @rhc54 submitted fixes for both of the issues you reported in this bug: |
This was fixed months ago |
The PSM2 MTL has problems with initialization, in particular in the call to psm2_ep_open when running on a single node. The failure signature is
This is with v1.10.1 but also shows up in v2.x.
The issue appeared on the devel mailing list:
https://www.open-mpi.org/community/lists/devel/2016/04/18762.php