Problem with multinode running with iimpi-2020a #10899
Comments
@jhein32 I have similar hardware. I'll see if I can repeat this when I'm back from vacation (during July).
@jhein32 We added UCX because it was recommended by Intel, see #10280 and https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html (where UCX is even listed as required). Have you reported this to Intel support?
@boegel Thanks for getting involved. As written above, the performance of Intel MPI 2019.7 without UCX is poor, so the decision to include UCX is correct. If you don't want UCX, in my current view Intel MPI 18.5 is the choice (which can't be your choice forever). From the error messages I am wondering whether the issue sits in UCX rather than in Intel MPI. OFI is also mentioned; who provides that? Intel MPI, UCX, CentOS? Does anyone here have any clues?
We haven't engaged with Intel yet.
Hi, we (LUNAC team members) had a virtual six-hands-one-keyboard session (via Zoom, thanks to COVID-19) and went over the error messages and the information available in the docs shared here and in the easyconfigs. Our hardware is a bit old (2015 or 2016), so we get: …
Intel writes that the output should include the dc, rc and ud transports. Our hardware lacks dc, which according to Intel is a common issue with older hardware. They recommend setting: …
Intel calls it a workaround. Assuming that goes well, here are two questions/tasks for EB: …
Yes, "dc" is a little complex. It used to work only if you use MOFED, but dc support is now upstream and backported in newer CentOS (7.7 has it, not sure about 7.6; 7.5 and older definitely do not). We have a cluster without dc as well, running CentOS 7.8; I will check it there later today (lspci | grep Mellanox reports … there). Setting UCX_TLS would be appropriate in the UCX module (as Open MPI uses UCX too and can't use dc either); ideally it should auto-detect the lack of dc, though, and not need that env var.
Setting … AFAIK disabling … For instance, we have nodes with ConnectX-4 and ConnectX-5. We only disable …
Hi, the performance test was OK. We noticed the Intel MPI library is happy with … When it runs, we get a warning: … I also looked at the foss/2020a Linpack, using the same UCX for OpenMPI; foss runs with … Based on this, I feel UCX_TLS should be set in the Intel MPI module.
@jhein32 The warning message regarding …
I can confirm that we see the same issues on our older cluster. Putting something like this … into the UCX config sounds about right to me.
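The exact snippet proposed here is not preserved in the thread. As a rough illustration only, a change of this kind in the UCX 1.8.0 easyconfig would typically use EasyBuild's `modextravars`; the transport list below is an assumption modelled on Intel's advice for hardware without dc support, not the value actually posted:

```python
# Hypothetical fragment for UCX-1.8.0-GCCcore-9.3.0.eb (illustrative sketch only).
# Restrict UCX to transports that exist on hardware lacking dc; the exact list is
# an assumption and was not preserved in the comment above.
modextravars = {
    'UCX_TLS': 'rc,ud,sm,self',
}
```

With this in the UCX module, every MPI stack that loads UCX (Open MPI and Intel MPI alike) would inherit the restriction.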
This is still open; I didn't get round to finishing this off before the summer. Based on my current understanding of the issue, I would like to add the setting proposed by @Micket to a relevant config. However, I feel the UCX module is not the correct place; to me this looks like an Intel MPI issue. With the standard UCX module, as reported, the OpenMPI in foss seems to work well; it is only Intel MPI that needs this kind of help. My proposal would be to amend the Intel MPI module. In addition, when issues are encountered with Intel MPI, a user would look there first for hints, whereas a setting in the UCX config would take some poking around to find. Any opinions on the above?
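A sketch of that alternative, under the same assumptions as above: the variable would instead be exported from the Intel MPI module, leaving the UCX module untouched for Open MPI.

```python
# Hypothetical fragment for impi-2019.7.217-iccifort-2020.1.217.eb instead of the
# UCX easyconfig; same assumed transport list as in the previous sketch.
modextravars = {
    'UCX_TLS': 'rc,ud,sm,self',
}
```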
@jhein32 How should we proceed with this?
I correct my previous statement. After further investigation, the …
Hi, I installed HPL and prerequisites from PR #11337. I have massaged the … So I am happy to proceed.
@jhein32 The …
As mentioned in Slack already, we have issues with MPI executables built against iimpi-2020a starting multinode. Within a node I am not aware of issues.
The problem seems to be associated with the UCX/1.8.0 dependency. Executables using iimpi/2020.00, which uses Intel MPI 19.6 without a UCX dependency, work multinode. Also, if I "massage" the easyconfig impi-2019.7.217-iccifort-2020.1.217.eb and comment out the UCX line in the dependencies list, basic hello-world codes or the HPL for intel/2020a will run, though performance, compared to an HPL built with intel/2017b, is 10% poorer. Using the HPL from PR #10864, the performance is within 1% of the one from intel/2017b.
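For illustration, the "massage" described above amounts to something like the following edit in impi-2019.7.217-iccifort-2020.1.217.eb (a sketch; any other entries in the dependencies list are omitted here):

```python
# Sketch of the workaround described above: comment out the UCX dependency so the
# resulting Intel MPI module no longer pulls in UCX/1.8.0.
dependencies = [
    # ('UCX', '1.8.0'),  # disabled to test multinode startup without UCX
]
```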
A few details on our cluster: the system uses Intel Xeon E5-2650 v3 (Haswell) CPUs and 4x FDR InfiniBand. We run CentOS 7 (currently 7.6 or 7.8), Linux kernel 3.10, and the InfiniBand stack from CentOS. Slurm is set up with cgroups for process control and accounting (TaskPlugin=task/cgroup, ProctrackType=proctrack/cgroup). The Slurm installation is quite old (17.02).
To get Intel MPI started I add … (in an editor) to the impi modules (we have versions as far back as iimpi/7.3.5, predating iimpi/2016b). I tested multiple times, but libpmi2.so does not work for us. Of the methods to start an Intel MPI job described in the Slurm guide, only srun works for us; we never got Hydra or MPD to work. I also tested setting 'I_MPI_HYDRA_TOPOLIB': 'ipl', which does not help at all.
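The exact line added to the impi modules is not quoted above. A common way to make srun the launcher for Intel MPI under Slurm is to point it at Slurm's PMI-1 library via `modextravars` in the impi easyconfig; the path below is an assumption and depends on the local Slurm installation:

```python
# Hypothetical addition to the impi easyconfig (illustrative only); the reporter's
# actual edit is not preserved in this issue.
modextravars = {
    # PMI-1 library shipped with Slurm; libpmi2.so reportedly does not work here.
    'I_MPI_PMI_LIBRARY': '/usr/lib64/libpmi.so',
}
```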
When running I load: …
The modules are built with unmodified easyconfigs from EB 4.2.1. When compiling and running a simple MPI hello-world code, I get the following in stdout:
and this in stderr:
Ok, that went long. Any suggestions would be highly appreciated.