Update config_machines.xml and corresponding files #173

Merged: xylar merged 9 commits into E3SM-Project:main from the update-config-machines branch on Jul 12, 2024

Conversation

@altheaden altheaden commented Jul 9, 2024

In this PR, I updated the config_machines.xml file, and likewise updated the module names and versions in chrysalis_gnu_openmpi.yaml. I also added some missing environment variables in the various .csh and .sh files for chrysalis, pm-cpu, and pm-gpu.

Checklist

  • Testing comment in the PR documents testing used to verify the changes

@altheaden altheaden added the "in progress" label (This PR is not ready for review or merging) Jul 9, 2024

@xylar xylar left a comment

A few things to change for now.

mache/spack/chrysalis_gnu_openmpi.yaml (3 review threads, outdated, resolved)
@altheaden

@xylar Sorry about the comment confusion, I actually had several that I wanted your feedback on. I just re-wrote them so you could see them (instead of including them, accidentally, in a review). Would you mind looking at the other questions when you get a chance?

@altheaden altheaden force-pushed the update-config-machines branch from ffea3a4 to 6866c38 Compare July 9, 2024 17:34
@altheaden

Little progress update: Just added what I think is the last of the changes from the E3SM update, and I am about to start adding all the other missing environment variables.

@altheaden altheaden force-pushed the update-config-machines branch from 11c9380 to 74adc57 Compare July 10, 2024 23:25

xylar commented Jul 11, 2024

Everything looks good. Let's see how testing with the pr suite goes on Chrysalis and Perlmutter with these changes.

@altheaden

@xylar I forgot to finish and push the last bit of the pm files. These are basically the same changes, except I did add what appeared to be a missing NETCDF env var - I can remove this if necessary but I figured I'd add it for now since I was in the files anyways. I figured since there were other NETCDF env vars that it might be OK to add the extra one. I can start testing now.

@altheaden

@xylar Just set up the conda env on pm and it appears to have been successful! These are the flags I tested with for reference:
./configure_polaris_envs.py --conda ~/miniforge3/ --env_name polaris_test --verbose --update_spack --spack $SCRATCH/spack_test --tmpdir $SCRATCH/spack_tmp --compiler gnu --machine pm-cpu --mache_fork altheaden/mache --mache_branch update-config-machines
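
To repeat this setup for several compilers on one machine, the same command can be generated in a loop. The following is a minimal dry-run sketch that only prints the commands; the helper name print_configure_cmds is hypothetical, the flags and paths are copied from the command above, and it assumes the same conda environment name and Spack paths are acceptable for every compiler.

```shell
#!/bin/sh
# Hypothetical helper (not part of Polaris or mache): print, without running,
# one configure command per compiler for a given machine.
print_configure_cmds() {
    machine="$1"
    shift
    # Flags copied from the tested command; \$SCRATCH is escaped so it is
    # printed literally and expanded only when the command is actually run.
    common="--conda ~/miniforge3/ --env_name polaris_test --verbose --update_spack"
    paths="--spack \$SCRATCH/spack_test --tmpdir \$SCRATCH/spack_tmp"
    branch="--mache_fork altheaden/mache --mache_branch update-config-machines"
    for compiler in "$@"; do
        echo "./configure_polaris_envs.py $common $paths --compiler $compiler --machine $machine $branch"
    done
}

# Example: the three compilers listed for Perlmutter-CPU in this PR.
print_configure_cmds pm-cpu gnu intel nvidia
```

Printing rather than executing keeps the sketch safe to try on a login node; each printed line can be reviewed and pasted once it looks right.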

@altheaden

@xylar gnu run of the pr suite on pm passed all tests, intel run failed the 10km threads test but passed all others. I will take a look at the environment variables again in the morning and see if there's anything that might have screwed up threading in particular. Here is the log file from the failed run:
threads-out.log


xylar commented Jul 12, 2024

@altheaden, great! It seems like it might be worth having me explore whether commenting back in one or more of these flags makes things pass for that thread test:

## purposefully omitting OMP variables that cause trouble in ESMF
# export OMP_STACKSIZE=128M
# export OMP_PROC_BIND=spread
# export OMP_PLACES=threads
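
One way to experiment with these settings without editing the load script each time is a small toggle. This is only a sketch: the ENABLE_OMP_VARS variable and the set_omp_vars function are hypothetical and not part of mache; the three values are the ones commented out above.

```shell
#!/bin/sh
# Hypothetical toggle (not part of mache): set ENABLE_OMP_VARS=1 to restore
# the three OpenMP settings that the load script purposefully omits.
set_omp_vars() {
    if [ "${ENABLE_OMP_VARS:-0}" = "1" ]; then
        export OMP_STACKSIZE=128M
        export OMP_PROC_BIND=spread
        export OMP_PLACES=threads
    else
        # Default: leave the variables unset, matching the current script.
        unset OMP_STACKSIZE OMP_PROC_BIND OMP_PLACES
    fi
}
```

Running the threads test once with ENABLE_OMP_VARS=1 and once without would show whether these variables affect the result.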

In the meantime, I think we can proceed. We will just want to make a Polaris issue reporting that that particular test on that particular machine and compiler is not producing bit-for-bit results as expected. It's a little too early to add that report but if we see the same thing right before we're ready to merge the branch that updates to Polaris 0.4.0-alpha.1, we will post an issue and link to it in the comments about testing that version of Polaris.
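
For reference, the simplest form of a bit-for-bit check is a byte-level comparison of two output files. The sketch below is an illustration only and is not how Polaris validates results (its comparison is not shown in this thread); the function name bfb_check is hypothetical, and it assumes the files contain no run-specific metadata such as timestamps.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two output files are byte-identical.
bfb_check() {
    baseline="$1"
    candidate="$2"
    # cmp -s is silent and returns 0 only when the files match exactly.
    if cmp -s "$baseline" "$candidate"; then
        echo "PASS: bit-for-bit match"
    else
        echo "FAIL: files differ"
        return 1
    fi
}
```

A non-zero return makes the check easy to use in a script, for example `bfb_check run_a/output.nc run_b/output.nc || exit 1`, where both paths are placeholders.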

@xylar xylar left a comment

@altheaden, in my testing today I noticed 2 things that we need to fix.

mache/spack/pm-cpu_nvidia_mpich.sh (resolved)
mache/spack/pm-cpu_intel_mpich.sh (resolved)
@altheaden altheaden force-pushed the update-config-machines branch from 3663ae9 to 13c8563 Compare July 12, 2024 15:14
@altheaden

@xylar After the if-statement change, the intel test had the same results as I got last night.


xylar commented Jul 12, 2024

@altheaden, yes, that's what I saw, too. I also reran after commenting in the OMP* environment variables and I still see small differences. So this is a bug we're going to have to report to MPAS-Ocean developers to investigate.

@xylar xylar self-assigned this Jul 12, 2024
@altheaden altheaden removed the "in progress" label (This PR is not ready for review or merging) Jul 12, 2024
@altheaden altheaden added the "spack" label (Changes relate to creating conda and Spack environments, and creating a load script) Jul 12, 2024

xylar commented Jul 12, 2024

Testing

@altheaden and I have used this branch to build Polaris' spack environment for:

  • Chrysalis
    • intel
    • gnu
  • Perlmutter-CPU
    • gnu
    • intel
    • nvidia
  • Perlmutter-GPU
    • gnugpu
    • nvidiagpu (not quite done yet)

We have run the Polaris pr suite with the first 4 configurations (the last 3 are not supported). The threads test failed with intel on Perlmutter-CPU, but we don't think that is related to these changes; it will be reported for later investigation.

We ran the Omega CTests on Chrysalis with intel and gnu, and on Perlmutter-GPU with gnugpu. This required #174 and some modifications to Polaris that will be included in the next update.

@xylar xylar left a comment

This is amazing, @altheaden. Very nice work figuring all of this out and being so thorough.

@altheaden altheaden added the "config-machines" label (Changes to the config_machines.xml file) Jul 12, 2024
@xylar xylar merged commit bae6e7e into E3SM-Project:main Jul 12, 2024
6 checks passed
@xylar xylar deleted the update-config-machines branch July 12, 2024 17:13
Labels
config-machines: Changes to the config_machines.xml file
spack: Changes relate to creating conda and Spack environments, and creating a load script