
PSM2 MTL doesn't work for single node jobs #1559

Closed
hppritcha opened this issue Apr 19, 2016 · 22 comments

@hppritcha
Member

The PSM2 MTL has problems with initialization, in particular in the call to psm2_ep_open, when running on a single node.

The failure signature is

srun -N 1 -n 1 a.out          
opal13.0Assertion failure at psm_ep.c:832: ep->epid != 0

miniFE.x:52223 terminated with signal 6 at PC=2aaaabd835f7 SP=7fffffffb7d8. Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x2aaaabd835f7]
/lib64/libc.so.6(abort+0x148)[0x2aaaabd84ce8]
/lib64/libpsm2.so.2(+0x11dea)[0x2aaab97c7dea]
/lib64/libpsm2.so.2(+0x10782)[0x2aaab97c6782]
/lib64/libpsm2.so.2(psm2_ep_open+0x3c5)[0x2aaab97c5165]
/opt/openmpi/1.10/intel/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_module_init+0x196)[0x2aaab95b25e6]
/opt/openmpi/1.10/intel/lib/openmpi/mca_mtl_psm2.so(+0x2a00)[0x2aaab95b2a00]
/opt/openmpi/1.10/intel/lib/libmpi.so.12(ompi_mtl_base_select+0x9d)[0x2aaaaad6148d]
/opt/openmpi/1.10/intel/lib/openmpi/mca_pml_cm.so(+0x3d9b)[0x2aaab8916d9b]
/opt/openmpi/1.10/intel/lib/libmpi.so.12(mca_pml_base_select+0x433)[0x2aaaaad68643]
/opt/openmpi/1.10/intel/lib/libmpi.so.12(ompi_mpi_init+0x6ac)[0x2aaaaad1a06c]
/opt/openmpi/1.10/intel/lib/libmpi.so.12(MPI_Init+0xe4)[0x2aaaaad3b134]
miniFE.x[0x432da2]
miniFE.x[0x404e1a]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaabd6fb15]
miniFE.x[0x404c69]

This is with v1.10.1 but also shows up in v2.x.

The issue appeared on the devel mailing list:
https://www.open-mpi.org/community/lists/devel/2016/04/18762.php

@hppritcha hppritcha self-assigned this Apr 19, 2016
@hppritcha hppritcha added the bug label Apr 19, 2016
@hppritcha
Member Author

Okay, the problem is that the PSM2 MTL is setting the PSM2_DEVICES env variable to just self,shm
when running on a single node. There seems to be a bug in the PSM2 installed on our systems that causes psm2_ep_connect to time out for process 0 with PSM2_DEVICES set this way. If I comment out the lines in the MTL which set this env variable, the single-node jobs run to completion.
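
For reference, the behavior described above amounts to roughly the following sketch; the function and variable names here are illustrative, not the exact ones used in the PSM2 MTL source:

#include <stdlib.h>

/* Illustrative sketch only: when every process in the job is on the same
 * node, the MTL restricts PSM2 to the self and shm devices so the hfi
 * device is not used.  On the affected systems this setting is what leads
 * to the psm2_ep_open assert (single process) and the psm2_ep_connect
 * timeout (multiple processes). */
static void setup_psm2_devices(int num_local_procs, int num_total_procs)
{
    if (num_local_procs == num_total_procs) {
        /* overwrite = 0: a PSM2_DEVICES value already in the environment
         * is left untouched */
        setenv("PSM2_DEVICES", "self,shm", 0);
    }
}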

Logfile for when PSM2_DEVICES is not set by PSM2 MTL
logfile.works.txt
Logfile for when PSM2_DEVICES is set by PSM2 MTL
logfile.busted.txt

@matcabral

The PSM2-related RPMs on our cluster are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64

@hppritcha
Member Author

Per request, modinfo hfi1 output for the cluster.
modinfo.txt

The RPMs relevant to hfi1 that are installed on the nodes of the cluster are:
opa-hfi1-0-201603141940.ch6.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
libhfi1-0.2-2.el7.x86_64
kmod-opa-hfi1-0-201603141940.ch6.x86_64
libhfi1-static-0.2-2.el7.x86_64
hfi1-firmware-0.9-31.noarch

@hppritcha
Member Author

Here's a description of the changes made to the RPMs listed above for this DOE cluster. RPMs with "ch6" in the name have been repackaged for these clusters:

opa-hfi1-0-201603141940.ch6.x86_64 and kmod-opa-hfi1: the code is from https://github.com/01org/opa-hfi1 at 201603141940. We run make on it and package it up as a weak module; no changes from GitHub.

hfi1-psm is built from hfi1-psm-0.7-221.src.rpm that Intel provided

{noformat}
rpm -qpi hfi1-psm-0.7-221.src.rpm
Name : hfi1-psm
Version : 0.7
Release : 221
Architecture: x86_64
Install Date: (not installed)
Group : System Environment/Libraries
Size : 516926
License : GPL
Signature : (none)
Source RPM : (none)
Build Date : Mon Feb 8 14:14:06 2016
Build Host : phbldprivrhel7-2.ph.intel.com
Relocations : (not relocatable)
URL : http://www.intel.com/
Summary : Intel PSM Libraries
Description :
The PSM Messaging API, or PSM API, is Intel's low-level
user-level communications interface for the Truescale
family of products. PSM users are enabled with mechanisms
necessary to implement higher level communications
interfaces in parallel environments.
{noformat}

The only change to that RPM is in the spec file: I added %{?dist} to the Release and added opa-hfi1 to the BuildRequires.

@hppritcha
Member Author

@matcabral, in case you didn't get email about these updates.

@hppritcha hppritcha added this to the v1.10.3 milestone Apr 21, 2016
@hppritcha
Member Author

I was informed that the customer wants this problem resolved for 1.10.3. The clusters will be put into production soon, and a functional Open MPI 1.10.x is necessary.

@rhc54
Contributor

rhc54 commented Apr 21, 2016

Just scratching my head here for a moment: if you are concerned about the PSM2 MTL, shouldn't you need the psm2 library RPM? You are only showing the "psm" RPMs above.

@rhc54
Contributor

rhc54 commented Apr 21, 2016

Just FWIW: here's a list of the RPMs missing from your system (based on my CentOS 7 system):

libpsm2.x86_64                          0.7-4.el7                      @base    
libpsm2-compat.x86_64                   0.7-4.el7                      base     
libpsm2-compat-devel.x86_64             0.7-4.el7                      base     
libpsm2-devel.x86_64                    0.7-4.el7                      base     

@hppritcha
Member Author

Ralph,

that information was provided near the beginning of this email thread.
It is also included in this issue (#1559). LLNL did some repackaging.

Howard


@matcabral
Contributor

matcabral commented Apr 21, 2016

@rhc54 it seems the RPMs have different names depending on who creates them. The names that @hppritcha provided above (hfi1-psm*.rpm) are valid and are the ones distributed with the Intel IFS package.

@hppritcha merging the ompi-devel email thread here, I see you are reproducing two different issues:
a) the epid = 0 assert, which I can reproduce; @rhc54 started trying a patch, so we have a root cause and need to polish the fix.
b) the hang in psm2_ep_connect(), which might be related but which I still cannot reproduce. I'm now looking to have the same driver version. I'll share updates as I have them.

Thanks,

@hppritcha
Member Author

Correct about the two issues. It's not clear at this point that they are related.

@hppritcha
Member Author

If I simply #ifdef out the PSM2_DEVICES-setting code in the PSM2 MTL, all problems with single-node jobs vanish: both the hang for the multi-process case and the assert for the single-process case.
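
As a rough illustration of that experiment (again with made-up names, building on the sketch earlier in this thread), the guard would look something like this; OMPI_MTL_PSM2_SKIP_DEVICES_ENV is a hypothetical macro, not an existing Open MPI option:

#include <stdlib.h>

static void setup_psm2_devices(int num_local_procs, int num_total_procs)
{
#ifndef OMPI_MTL_PSM2_SKIP_DEVICES_ENV
    /* original behavior: restrict PSM2 to self,shm for single-node jobs */
    if (num_local_procs == num_total_procs) {
        setenv("PSM2_DEVICES", "self,shm", 0);
    }
#else
    /* experiment: leave PSM2_DEVICES alone so PSM2 uses its own defaults */
    (void)num_local_procs;
    (void)num_total_procs;
#endif
}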

hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 25, 2016
The PSM2 MTL was setting the PSM2_DEVICES environment
variable to self,shm for single node jobs.  This is
causing single node/single process jobs to fail on
some omnipath systems.  Not setting this environment
variable fixes these issues.

This fix is needed as part of bringup of omnipath
clusters at several customer sites.

Fixes issue open-mpi#1559

Signed-off-by: Howard Pritchard <[email protected]>
@matcabral
Contributor

@hppritcha, a couple of comments:
i) The patch you shared to disable setting the env var has a side effect that limits the number of ranks that can run over PSM2 shm. In other words, initializing the hfi device imposes certain limits from the hardware. I don't have the numbers at hand, but the point is that such a limit shouldn't be imposed for shm communication.
ii) I made some progress on Friday on issue b) (stated above): the hang only occurs when using an OMPI built --with-pmi and running under srun -n x hello_c. I will continue looking at it this week.

@hppritcha
Member Author

Actually, ii) isn't exactly accurate. The hang is also observed when using mpirun.

@rhc54
Contributor

rhc54 commented Apr 26, 2016

@hppritcha Could you please set up a call between our folks, you, and the LLNL folks to discuss this? We're having a little trouble reproducing some of the hang issues and would like to ensure we are accurately replicating the setup.

@hppritcha
Member Author

I'll try to set up something for tomorrow (Wednesday).

@hppritcha
Member Author

@matcabral thanks for the info. Our clusters have Broadwell E5-2695 v4 processors, 36 hardware threads total per node. It wasn't quite clear from your comment whether there is an upper limit on the number of MPI processes per node that can be run if PSM2_DEVICES is not set to self,shm. I don't seem to hit that limitation on our cluster.

@matcabral
Contributor

@hppritcha, did you see my answer about the rank limit on #1578?

@hppritcha
Member Author

The customer has a satisfactory workaround for now.

@hppritcha
Member Author

The configure options I used for Open MPI for these issues are:

./configure --prefix=/users/hpp/ompi_install --with-pmi --with-slurm --with-libfabric=/users/hpp/libfabric_install

The configure options used by the admins for the system install of Open MPI are:

--with-io-romio-flags=--with-file-system=ufs+nfs+lustre --with-cuda=/opt/cudatoolkit/7.5/include --with-pmi --enable-mpi-thread-multiple

@matcabral

@hppritcha hppritcha modified the milestones: Future, v1.10.3, v2.0.1 Apr 28, 2016
@hppritcha
Member Author

If we do a 1.10.4, this can be set to that milestone; setting it to the 2.0.1 milestone for now. Per discussions on a telecon today, an OMPI fix will be required as well as a fix in PSM2.

@matcabral
Contributor

matcabral commented May 10, 2016

@hppritcha as mentioned above, @rhc54 submitted fixes for both of the issues you reported in this bug:
open-mpi/ompi-release#1138 solved the epid=0 issue
open-mpi/ompi-release#1150 solved the hang.
thanks,

@rhc54
Contributor

rhc54 commented Oct 21, 2016

This was fixed months ago
