{lib,mpi}[GCCcore/13.3.0,NVHPC/24.9] Add NCCL 2.22.3, UCC-CUDA 1.3.0, OpenMPI 5.0.3 w/CUDA 12.6.0 #21546

Open
wants to merge 3 commits into
base: develop
Thyre
Contributor

@Thyre Thyre commented Oct 4, 2024

Add NCCL 2.22.3 & UCC-CUDA 1.3.0 for GCCcore 13.3.0.
Add OpenMPI 5.0.3 for NVHPC 24.9.

NVHPC 24.9 requires some patches to work correctly with OpenMPI 5.0.3.

@Thyre
Contributor Author

Thyre commented Oct 4, 2024

Test report by @Thyre
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
Linux - Linux EndeavourOS UNKNOWN, x86_64, AMD Ryzen 7 7800X3D 8-Core Processor, 1 x NVIDIA NVIDIA GeForce RTX 3070, 560.35.03, Python 3.12.7
See https://gist.github.com/Thyre/455444cb24a87d6904c430ee1332c464 for a full test report.

Signed-off-by: Jan André Reuter <[email protected]>
@SebastianAchilles added the "update" and "2024a" (issues & PRs related to 2024a common toolchains) labels on Oct 5, 2024
@SebastianAchilles added this to the "release after 4.9.4" milestone on Oct 5, 2024
@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
skl-rockylinux-810 - Linux Rocky Linux 8.10, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 555.42.06, Python 3.6.8
See https://gist.github.com/SebastianAchilles/b85abb42f2523431c1e1acdb99a8c2f0 for a full test report.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21546 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21546 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5009

Test results coming soon (I hope)...

- notification for comment with ID 2394983210 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.18
See https://gist.github.com/boegelbot/b264a6ea5d8fe659647a51a1668af921 for a full test report.

@Thyre
Contributor Author

Thyre commented Oct 6, 2024

> Test report by @boegelbot
> FAILED
> Build succeeded for 2 out of 3 (3 easyconfigs in total)
> jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.18
> See https://gist.github.com/boegelbot/b264a6ea5d8fe659647a51a1668af921 for a full test report.

Unfortunately, it's hard to say why that particular test, opal_path_nfs, failed. Searching for it online turns up several reports where this test fails although it shouldn't [1][2][3][4].
It might be interesting to have the full log, since that might include the exit code. Trying a second run might also be interesting, just to see if this was a one-time failure or is related to something specific to that system.

I'm also trying to build this on a second system of mine to see if it fails there. This will take some time, as EasyBuild is not set up there.

[1] open-mpi/ompi#10152
[2] open-mpi/ompi#628 (comment)
[3] https://www.mail-archive.com/[email protected]/msg33810.html
[4] https://www.mail-archive.com/[email protected]/msg35301.html

@Thyre
Contributor Author

Thyre commented Oct 6, 2024

Test report by @Thyre
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
datenlager - Linux Ubuntu 24.04, x86_64, AMD Ryzen 7 3700X 8-Core Processor, Python 3.12.3
See https://gist.github.com/Thyre/0f30df76f9467fd9a84c608721c2614f for a full test report.


Edit (2024-10-07): I suspect the issue is related to NFS mounts. This system (datenlager) only provides SMB shares, while my main system doesn't mount any network shares by default. I'll check whether anything changes when an NFS share is mounted.

@sassy-crick
Collaborator

Test report by @sassy-crick
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
Full report for OpenMPI-5.0.3-NVHPC-24.9-CUDA-12.6.0.eb can be found here
I am building with EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.9 for our L40s, if that helps.

@Thyre
Contributor Author

Thyre commented Oct 7, 2024

With NFS share & mount:

Test report by @Thyre
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
Linux - Linux EndeavourOS UNKNOWN, x86_64, AMD Ryzen 7 7800X3D 8-Core Processor, 1 x NVIDIA NVIDIA GeForce RTX 3070, 560.35.03, Python 3.12.7
See https://gist.github.com/Thyre/9305d3751fab2f7cee7c0d436a1dbf1f for a full test report.


Edit: I can certainly imagine that NFS shares are the reason for the observed failure. If the NFS server no longer exists but its share is still mounted, building OpenMPI simply hangs indefinitely in the test step. So this test seems to be fragile when it comes to NFS shares.
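
To check whether a build host has NFS mounts at all before running the test step, something like the following might help. It is only a rough sketch: the helper name `find_nfs_mounts` is made up for illustration, it assumes a Linux system exposing `/proc/mounts`, and it only lists NFS entries. It does not detect whether a mount is stale, since touching a stale mount can itself hang.

```python
from pathlib import Path

# Filesystem types treated as NFS (assumption: nfs and nfs4 cover the usual cases).
NFS_TYPES = {"nfs", "nfs4"}


def find_nfs_mounts(mounts_text):
    """Return (mount_point, source) pairs for NFS entries in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each entry reads: source mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[2] in NFS_TYPES:
            found.append((fields[1], fields[0]))
    return found


if __name__ == "__main__":
    mounts_file = Path("/proc/mounts")  # Linux only
    if mounts_file.exists():
        for mount_point, source in find_nfs_mounts(mounts_file.read_text()):
            print(f"NFS mount: {mount_point} (from {source})")
```

Comparing the output on a failing and a passing host could at least confirm whether NFS mounts are the distinguishing factor.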
