For pm-cpu, increase compiler version for gcc,nvidia,amd (and other modules to be consistent across all NERSC machines) #6702

Merged

@ndkeen ndkeen commented Oct 21, 2024

On pm-cpu we were already using an updated Intel compiler with compatible module versions, but had not yet updated the other compilers, due in part to #6516. After #6687, I think those issues are resolved and we can now update.

The main change here is updating the gcc compiler, but other module versions are updated at the same time.
This PR also tries to clean up the machine config settings so they are more consistent across all NERSC machines.

module                      current        machine defaults (in this PR)
gcc                         12.2.0         12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0
cray-libsci                 23.02.1.1      23.12.5
cray-mpich                  8.1.25         8.1.28
craype                      2.7.20         2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.9
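
As a rough illustration only, the "machine defaults" column above could be turned into load entries of the `<command name="load">` form used in the machine config. The sketch below is not the actual config change in this PR. The names `MODULE_UPDATES`, `RENAMED`, and `load_commands` are hypothetical, and it assumes that "12.3 (gcc-native)" means the module is loaded as `gcc-native/12.3` and that every other module is loaded as `name/version`.

```python
# Versions copied from the table above as (current, new) pairs.
MODULE_UPDATES = {
    "gcc": ("12.2.0", "12.3"),
    "PrgEnv-gnu": ("8.3.3", "8.5.0"),
    "cray-libsci": ("23.02.1.1", "23.12.5"),
    "cray-mpich": ("8.1.25", "8.1.28"),
    "craype": ("2.7.20", "2.7.30"),
    "cray-hdf5-parallel": ("1.12.2.3", "1.12.2.9"),
    "cray-netcdf-hdf5parallel": ("4.9.0.3", "4.9.0.9"),
    "cray-parallel-netcdf": ("1.12.3.3", "1.12.3.9"),
}

# Assumption: the table's "12.3 (gcc-native)" means gcc is loaded as gcc-native.
RENAMED = {"gcc": "gcc-native"}


def load_commands(updates, renamed):
    """Return one <command name="load"> line per module, using the new version."""
    lines = []
    for name, (_old, new) in updates.items():
        module = renamed.get(name, name)
        lines.append(f'<command name="load">{module}/{new}</command>')
    return lines


if __name__ == "__main__":
    for line in load_commands(MODULE_UPDATES, RENAMED):
        print(line)
```

Keeping the old version in each pair is not needed to generate the entries; it is kept so the same table can be used to list what changed.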

On muller-cpu, I had already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, that work-around can be removed.

This PR does not change the default configuration on pm-cpu, which uses the Intel compiler (the modules for Intel were already updated).

We might see some cases that are not BFB with the GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass the baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.
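
For readers unfamiliar with the term, BFB (bit-for-bit) means results match the baseline exactly, not merely within a tolerance. The following is a minimal Python sketch of that idea only. It is not how the E3SM baseline compare is implemented (that works on model output files), and the helper names `bits` and `is_bfb` are hypothetical.

```python
import struct


def bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_bfb(baseline, candidate):
    """True only if both sequences have the same length and identical bit patterns."""
    return len(baseline) == len(candidate) and all(
        bits(a) == bits(b) for a, b in zip(baseline, candidate)
    )


if __name__ == "__main__":
    # A different operation order or compiler optimization can change the last bit.
    print(is_bfb([0.3], [0.1 + 0.2]))
```

Comparing bit patterns rather than using `==` also distinguishes `0.0` from `-0.0`, which is the stricter reading of "bit-for-bit".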

For those compilers, update other module versions to match what Intel uses.
Various updates to muller-cpu/muller-gpu/alvarez.

Changes to make machine entries of the NERSC machines more consistent.
@ndkeen ndkeen self-assigned this Oct 21, 2024
@ndkeen ndkeen added labels Oct 21, 2024: Machine Files; GNU (GNU compiler related issues); pm-gpu (Perlmutter machine at NERSC, GPU nodes); pm-cpu (Perlmutter at NERSC, CPU-only nodes); AMD-compiler (Issues related to AMD Compiler); nvidia compiler (formerly PGI)
@ndkeen ndkeen requested a review from rljacob October 21, 2024 14:56

ndkeen commented Oct 21, 2024

Many months ago, NERSC made software updates that required several changes to be made at the same time. To update the compiler, I also needed to update several other module versions. I have been trying for quite a while to make this change, but kept getting stuck. This PR finally does the update. It looks a little messy because I tried to make all NERSC machines look about the same. We only care about pm-cpu/pm-gpu, but I went ahead and kept as many settings as possible the same on the other machines. I could split the PR into "changes for machines that matter (pm-cpu/pm-gpu)" and "changes for other NERSC machines".

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6702/
on branch gh-pages at 2024-10-21 15:17 UTC

<command name="load">PrgEnv-intel</command>
<command name="load">intel</command>
<command name="load">PrgEnv-intel/8.5.0</command>
<command name="load">intel/2024.1.0</command>

Will this change answers for Intel?

Never mind. It's just on alvarez.

ndkeen commented Oct 21, 2024

Right, but let me clarify that in the top comment (no changes to Intel).

xylar added a commit to xylar/E3SM that referenced this pull request Oct 21, 2024
This is to match proposed updates to Perlmutter-CPU
E3SM-Project#6702
ndkeen added a commit that referenced this pull request Oct 21, 2024
…' into next (PR #6702)

On pm-cpu we were using updated Intel compiler and other module versions that were compatible, but had not yet updated the others due to #6516. After #6687, I think they are resolved and we can now update.

The main change here is updating gcc compiler, but other module versions are also updated at the same time.
Also, try to clean up the machine config settings across all NERSC machines to be more consistent.

module                      current        machine defaults (in this PR)
gcc                         12.2.0         12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0
cray-libsci                 23.02.1.1      23.12.5
cray-mpich                  8.1.25         8.1.28
craype                      2.7.20         2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.9
On muller-cpu, already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, can remove that work-around.

Removing FI_CXI_RX_MATCH_MODE=software for machines other than the primary pm-cpu. Will remove it on pm-cpu in another PR, as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine.

We might see some cases not BFB using GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu

ndkeen commented Oct 21, 2024

merged to next

ndkeen commented Oct 22, 2024

As expected, the ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu diffs with baseline -- merging to master

@ndkeen ndkeen merged commit 1bb0f9b into master Oct 22, 2024
9 checks passed
@ndkeen ndkeen deleted the ndk/machinefiles/pm-cpu-update-gcc-nvidia-amd-compilers branch October 22, 2024 16:50
xylar added a commit to xylar/E3SM that referenced this pull request Oct 24, 2024
This is to match proposed updates to Perlmutter-CPU
E3SM-Project#6702
ndkeen added a commit that referenced this pull request Oct 28, 2024
…ware' into next (PR #6702)

We had been using FI_CXI_RX_MATCH_MODE=software on pm-cpu to avoid some issues when Perlmutter was young.
There was no measurable performance difference, so we did not change this setting.
Now let's try removing it and using the default, which is FI_CXI_RX_MATCH_MODE=hybrid.

[BFB]
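
To check which match mode a job's environment would pass along after this change, something like the Python sketch below could be used. It only inspects the environment variable and does not query libfabric itself. The helper name `effective_match_mode` is hypothetical, and the fallback to "hybrid" when the variable is unset is taken from the commit message above.

```python
import os

# Per the commit message above, "hybrid" is the default when the variable is unset.
DEFAULT_MATCH_MODE = "hybrid"


def effective_match_mode(env=None):
    """Return the FI_CXI_RX_MATCH_MODE value in effect for the given environment."""
    env = os.environ if env is None else env
    return env.get("FI_CXI_RX_MATCH_MODE", DEFAULT_MATCH_MODE)


if __name__ == "__main__":
    print(effective_match_mode())
```

Passing a plain dict instead of `os.environ` makes it easy to compare the old setting (`{"FI_CXI_RX_MATCH_MODE": "software"}`) with the new unset case (`{}`).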

ndkeen commented Oct 28, 2024

Fixes #6677
