Enable easier mache testing #523

Merged: 4 commits merged on Mar 1, 2023

Conversation

xylar
Collaborator

@xylar xylar commented Feb 3, 2023

The configure_compass_env.py script now takes two flags, --mache_fork and --mache_branch, which are used to clone the given fork and branch locally and then install mache from that clone. This saves developers the trouble of cloning mache themselves and building the mache conda package.
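A minimal sketch of how the two flags might be turned into a clone-and-install sequence. This is not the code from configure_compass_env.py: the helper name `mache_install_commands`, the GitHub URL pattern, the `build_mache/mache` target directory and the pip-based install step are all assumptions for illustration.

```python
import argparse


def mache_install_commands(fork, branch, build_dir="build_mache"):
    """Return shell commands that clone a mache fork/branch and install it.

    Hypothetical helper: the URL pattern and the pip-based install are
    assumptions, not taken from the compass deployment scripts.
    """
    url = f"https://github.com/{fork}.git"
    target = f"{build_dir}/mache"
    return [
        f"git clone -b {branch} {url} {target}",
        f"python -m pip install --no-deps {target}",
    ]


def parse_args(argv):
    """Parse only the two flags discussed in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mache_fork", help="e.g. xylar/mache")
    parser.add_argument("--mache_branch", help="branch to clone from the fork")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--mache_fork", "xylar/mache",
                       "--mache_branch", "update_cime_machines"])
    for command in mache_install_commands(args.mache_fork, args.mache_branch):
        print(command)
```

Returning the commands as strings, rather than running them, keeps the sketch free of side effects and easy to inspect.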

This merge also simplifies the deployment scripts by requiring python3, which I believe all systems have available without loading any modules. This allows us to remove some awkward imports and to use f-strings instead of format commands.
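As a small illustration of the f-string point, the two forms below build the same string. The command text is only an example modeled on the `mamba create` call seen in the logs of this PR, not a line copied from the deployment scripts.

```python
env_name = "compass_test"
python_version = "3.10"

# Style that also works on interpreters without f-string support
old_style = "mamba create -y -n {} python={}".format(env_name, python_version)

# Equivalent f-string (requires python 3.6 or later)
new_style = f"mamba create -y -n {env_name} python={python_version}"

assert old_style == new_style
```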

This merge also removes the cxx17 spack variant of Albany because this variant is no longer supported (Albany always uses the c++17 standard).

Checklist
changes look as expected

  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes

closes #520

@xylar xylar added clean-up dependencies and deployment Changes relate to creating conda and Spack environments, and creating a load script labels Feb 3, 2023
@xylar xylar requested a review from matthewhoffman February 3, 2023 06:38
@xylar xylar self-assigned this Feb 3, 2023
@xylar
Collaborator Author

xylar commented Feb 3, 2023

Testing

So far, I have only tested this on my Ubuntu laptop. I used:

./conda/configure_compass_env.py --env_name compass_test --mache_fork xylar/mache --mache_branch update_cime_machines

and could see in the logs that the expected fork and branch were cloned, and that mache 1.12.0rc1 was installed as expected.

We will need to verify that this works on all 6 supported machines.

@xylar xylar changed the title Simplify local mache Enable easier mache testing Feb 3, 2023
@matthewhoffman
Member

Might be a good activity for which to enlist the newly formed compass support group.

@xylar
Collaborator Author

xylar commented Feb 3, 2023

Yep, I think the easiest way to do that is add these changes to #522 for testing. That will be handy anyway. I'll modify the notes accordingly.

@xylar xylar mentioned this pull request Feb 3, 2023
48 tasks
@mark-petersen mark-petersen self-requested a review February 7, 2023 15:44
@xylar
Collaborator Author

xylar commented Feb 8, 2023

@jonbob, @mark-petersen, @matthewhoffman, @trhille and @darincomeau, I somewhat arbitrarily assigned each of you to a machine to test this on tomorrow as part of our follow-up spack testing. Please trade with someone if you don't have access to the machine I assigned you to (or have some other reason for preferring a different one).

@darincomeau
Collaborator

darincomeau commented Feb 9, 2023

I'm getting the following error on perlmutter from the conda/logs/bootstrap.log file:

running:
   source /global/u1/d/dcomeau/mambaforge/etc/profile.d/conda.sh
   source /global/u1/d/dcomeau/mambaforge/etc/profile.d/mamba.sh
   conda activate
   mamba create -y -n compass_test --override-channels -c conda-forge -c defaults -c e3sm/label/compass --file spec-file-nompi.txt python=3.10

/global/u1/d/dcomeau/mambaforge/etc/profile.d/mamba.sh: line 32: __add_sys_prefix_to_path: command not found
prefix already exists: /global/homes/d/dcomeau/.conda/envs/compass_test

CondaValueError: prefix already exists: /global/homes/d/dcomeau/.conda/envs/compass_test

the printed out traceback:

creating compass_test
Traceback (most recent call last):
  File "/global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/bootstrap.py", line 951, in <module>
    main()
  File "/global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/bootstrap.py", line 867, in main
    build_conda_env(
  File "/global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/bootstrap.py", line 294, in build_conda_env
    check_call(commands, logger=logger)
  File "/global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/shared.py", line 148, in check_call
    raise subprocess.CalledProcessError(process.returncode, commands)
subprocess.CalledProcessError: Command 'source /global/u1/d/dcomeau/mambaforge/etc/profile.d/conda.sh; source /global/u1/d/dcomeau/mambaforge/etc/profile.d/mamba.sh; conda activate; mamba create -y -n compass_test --override-channels -c conda-forge -c defaults -c e3sm/label/compass --file spec-file-nompi.txt python=3.10' returned non-zero exit status 1.
Traceback (most recent call last):
  File "./conda/configure_compass_env.py", line 121, in <module>
    main()
  File "./conda/configure_compass_env.py", line 117, in main
    bootstrap(activate_install_env, source_path, local_conda_build)
  File "./conda/configure_compass_env.py", line 32, in bootstrap
    check_call(command)
  File "/global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/shared.py", line 148, in check_call
    raise subprocess.CalledProcessError(process.returncode, commands)
subprocess.CalledProcessError: Command 'source /global/u1/d/dcomeau/mambaforge/etc/profile.d/conda.sh; source /global/u1/d/dcomeau/mambaforge/etc/profile.d/mamba.sh; conda activate compass_bootstrap; /global/cfs/cdirs/m1199/dcomeau/compass/compass/conda/bootstrap.py --env_name compass_test --mache_fork xylar/mache --mache_branch update_cime_machines' returned non-zero exit status 1.

@xylar
Collaborator Author

xylar commented Feb 9, 2023

@darincomeau, I think this might relate to you trying to use a shared conda base as your starting point? You ideally shouldn't have a ~/.conda/envs directory at all and this indicates that you used a shared base that you didn't have permission to install to, so it started installing in your home directory instead. But that's not really safe for us. We need full access to everything including the base environment to install mamba and other necessary tools. Please delete the ~/.conda directory and try the instructions again.
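A rough sketch of a pre-flight check for the situation described here. It is not part of compass: `check_conda_base` is a hypothetical helper, and the two conditions it tests (a conda base without write permission, and an existing ~/.conda/envs directory) are only the symptoms mentioned in this comment.

```python
import os
import tempfile


def check_conda_base(conda_base, home=None):
    """Return a list of human-readable problems with a conda setup.

    Hypothetical pre-flight check, not part of the compass scripts.
    """
    home = home or os.path.expanduser("~")
    problems = []
    # A shared base is typically not writable by the current user
    if not os.access(conda_base, os.W_OK):
        problems.append(f"no write permission for conda base {conda_base}")
    # conda falls back to ~/.conda/envs when it cannot write to the base
    user_envs = os.path.join(home, ".conda", "envs")
    if os.path.isdir(user_envs):
        problems.append(f"{user_envs} exists; environments may be landing there")
    return problems


if __name__ == "__main__":
    # Demonstrate with throwaway directories instead of a real conda install
    with tempfile.TemporaryDirectory() as base, \
            tempfile.TemporaryDirectory() as home:
        print(check_conda_base(base, home=home))
```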

@darincomeau
Collaborator

OK, removing that directory got me further. My environment must have been messed up by that previous error.

./conda/configure_compass_env.py --env_name compass_test --mache_fork xylar/mache --mache_branch update_cime_machines

Now executes cleanly, ends with

Writing:
   /global/cfs/cdirs/m1199/dcomeau/compass/compass/load_compass_test_pm-cpu_gnu_mpich.sh

I tried running that script, but get an error

Loading Spack environment...
load_compass_test_pm-cpu_gnu_mpich.sh: line 19: /global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/spack_for_mache_1.12.0/share/spack/setup-env.sh: No such file or directory

So it's still looking in system software for the spack environment. I do see conda/build_mache/mache, which is on your update_cime_machines branch.

I don't mean to clog this PR with my issues.

@xylar
Collaborator Author

xylar commented Feb 9, 2023

@darincomeau, I see, I think you are perhaps following the instructions based on what I ran in my own test? That was on my laptop, where spack isn't needed. On Perlmutter, you need to follow the instructions in the Google Doc, which means you need the --update_spack, --spack and --tmpdir flags, something like:

export TMPDIR=${PSCRATCH}/spack_temp
mkdir -p ${TMPDIR}

./conda/configure_compass_env.py \
    --conda ${CONDA_BASE} \
    --mache_fork xylar/mache \
    --mache_branch update_cime_machines \
    --update_spack \
    --spack ${PSCRATCH}/spack_test \
    --tmpdir ${TMPDIR} \
    --compiler gnu \
    --mpi mpich \
    --recreate

Your error results because you have a new mache version installed but the corresponding spack environment has never been created. That is precisely your job as a member of the "compass support group". The load script is looking in the directory for shared compass spack environments by default because you didn't specify --spack and it's assuming it already exists because you didn't specify --update_spack.
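The default behavior described here can be sketched roughly as follows. This is an illustration, not the logic in the compass scripts: `spack_setup_script` and `needs_update_spack` are hypothetical names, and the shared base path is simply the pm-cpu location that appears in the error message above.

```python
import os

# Shared location taken from the pm-cpu error message in this thread
SHARED_SPACK_BASE = "/global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack"


def spack_setup_script(mache_version, spack_base=None):
    """Path of the setup-env.sh that a load script would source.

    Hypothetical helper: falls back to the shared base when no
    --spack path is given, mirroring the behavior described above.
    """
    base = spack_base or SHARED_SPACK_BASE
    return os.path.join(base, f"spack_for_mache_{mache_version}",
                        "share", "spack", "setup-env.sh")


def needs_update_spack(mache_version, spack_base=None):
    """True if the spack clone for this mache version does not exist yet."""
    return not os.path.isfile(spack_setup_script(mache_version, spack_base))
```

With a mache version of 1.12.0 and no `spack_base`, the first function reproduces the path from the "No such file or directory" error, which is why the load script fails until the spack environment has been built.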

@darincomeau
Collaborator

darincomeau commented Feb 9, 2023

Ah, thanks @xylar, sorry I missed that part in the Google Doc. That explains why I wasn't seeing CONDA_BASE do anything...

Thanks for accommodating the lowest common denominator!

@xylar xylar force-pushed the simplify_local_mache branch from d257d7d to c34896d Compare February 9, 2023 20:42
@mark-petersen
Collaborator

For the record, this is what I did on chicoma. I started in a compass directory, on the head of xylar:simplify_local_mache. The following command then installs gnu mpich:

export  CONDA_BASE=/usr/projects/climate/mpeterse/miconda3
./conda/configure_compass_env.py  \
   --conda ${CONDA_BASE}  \
   --mache_fork xylar/mache  \
   --mache_branch update_cime_machines  \
   --update_spack   \
   --spack /lustre/scratch5/mpeterse/spack_test  \
   --tmpdir /lustre/scratch5/mpeterse/spack_tmp  \
   --compiler gnu  \
   --recreate

@xylar xylar force-pushed the simplify_local_mache branch from c34896d to f0f1d34 Compare February 9, 2023 21:02
@mark-petersen
Collaborator

With the last commits, my previous command works to completion on chicoma. Unfortunately, I still can't compile E3SM/master:

source load_dev_compass_1.2.0-alpha.4_chicoma-cpu_gnu_mpich.sh
cd /usr/projects/climate/mpeterse/repos/E3SM/master/components/mpas-ocean
make gfortran  OPENMP=true DEBUG=true
...
************ ERROR ************
Failed to compile a PIO test program
Please ensure the PIO environment variable is set to the PIO installation directory
************ ERROR ************

@xylar this is probably unrelated to this compass PR, but how did you get MPAS-Ocean to compile on chicoma?

@xylar
Collaborator Author

xylar commented Feb 10, 2023

@mark-petersen, as always, you're going to have to comment out the &> /dev/null in the PIO compilation test to find out what's wrong. From that error message there's almost literally no way to know what went wrong. I leave off the OPENMP=true because that's already set in the load script but otherwise I do the same as you.

You could try building main and see if you have the same problem. If so, I think it's likely something about how your environment is configured but I can try again to see whether I can reproduce it.

But debugging the build command and the actual error message is really the starting point.
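The general idea of keeping the compiler output instead of discarding it can be sketched like this. It is only an analogy to removing `&> /dev/null` from the PIO test: `run_and_report` is a hypothetical helper, and the failing command and its error text are made-up stand-ins for the real compile step.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return (ok, stderr) instead of hiding stderr."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # Stand-in for the PIO test compile: a command that fails with a message
    failing = [
        sys.executable, "-c",
        "import sys; sys.stderr.write('example compile error\\n'); sys.exit(1)",
    ]
    ok, stderr = run_and_report(failing)
    if not ok:
        # The captured text is what the redirect would have thrown away
        print(stderr.strip())
```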

Collaborator

@trhille trhille left a comment

Thanks for doing this, @xylar. I'm approving based on my testing on Cori, in which I built spack with Albany and successfully ran the MALI full_integration test suite. I also had a quick look through the code diffs and nothing jumped out at me, although I can't pretend that I understand all of it.

@mark-petersen
Collaborator

To see the actual PIO error messages on chicoma, I made this change to the stand-alone Makefile:

diff --git a/components/mpas-framework/Makefile b/components/mpas-framework/Makefile
index fe6d875ac0..427b81a448 100644
--- a/components/mpas-framework/Makefile
+++ b/components/mpas-framework/Makefile
@@ -915,8 +915,8 @@ pio_test:
-       @($(FC) pio1.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio1.out &> /dev/null && echo "=> PIO 1 detected") || \
-        ($(FC) pio2.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio2.out &> /dev/null && echo "=> PIO 2 detected") || \
+       @($(FC) pio1.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio1.out  && echo "=> PIO 1 detected") || \
+        ($(FC) pio2.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio2.out  && echo "=> PIO 2 detected") || \
         (echo "************ ERROR ************"; \
          echo "Failed to compile a PIO test program"; \
          echo "Please ensure the PIO environment variable is set to the PIO installation directory"; \
@@ -929,13 +929,13 @@ pio_test:
 ifeq "$(USE_PIO2)" "true"
-       @($(FC) pio2.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio2.out &> /dev/null) || \
+       @($(FC) pio2.f90 $(FCINCLUDES) $(FFLAGS) $(LDFLAGS) $(LIBS) -o pio2.out ) || \

We then figured out that the netcdf and pnetcdf paths have a gnu version number in the path before the lib directory. We needed to use environment variables that specify the correct path on chicoma:

load_dev_compass_1.2.0-alpha.4_chicoma-cpu_gnu_mpich.sh
69,71c69,71
< export NETCDF=$(dirname $(dirname $(which nc-config)))
< export NETCDFF=$(dirname $(dirname $(which nf-config)))
< export PNETCDF=$(dirname $(dirname $(which pnetcdf-config)))
---
> export NETCDF=$CRAY_NETCDF_HDF5PARALLEL_PREFIX
> export NETCDFF=$CRAY_NETCDF_HDF5PARALLEL_PREFIX
> export PNETCDF=$CRAY_PARALLEL_NETCDF_PREFIX

echo $CRAY_NETCDF_HDF5PARALLEL_PREFIX
/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1

ls $CRAY_NETCDF_HDF5PARALLEL_PREFIX
bin  include  lib  plugins
echo $CRAY_PARALLEL_NETCDF_PREFIX
/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1

ls $CRAY_PARALLEL_NETCDF_PREFIX
bin  include  lib
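The choice between the two approaches in the diff above can be sketched as a small lookup. This is not the fix that went into compass or mache: `library_prefix` is a hypothetical helper that prefers an explicit prefix variable (such as the Cray ones shown here) and otherwise falls back to two directory levels above the config tool, which is what `$(dirname $(dirname $(which nc-config)))` computes.

```python
import os
import shutil
from pathlib import Path


def library_prefix(env_var, config_tool):
    """Pick an install prefix for a library.

    Prefer an explicit environment variable (e.g. a Cray *_PREFIX);
    otherwise fall back to two levels above the config tool on PATH.
    Returns None if neither is available.
    """
    prefix = os.environ.get(env_var)
    if prefix:
        return prefix
    tool = shutil.which(config_tool)
    if tool is None:
        return None
    # <prefix>/bin/<tool> -> <prefix>
    return str(Path(tool).parent.parent)


if __name__ == "__main__":
    print(library_prefix("CRAY_NETCDF_HDF5PARALLEL_PREFIX", "nc-config"))
    print(library_prefix("CRAY_PARALLEL_NETCDF_PREFIX", "pnetcdf-config"))
```

On a machine where the config tool sits above the compiler-specific directory, the fallback would miss the `gnu/9.1` part of the path, which matches the problem reported in this comment.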

@xylar
Collaborator Author

xylar commented Feb 14, 2023

Thanks @mark-petersen. I will fix that in a separate PR.

@xylar
Collaborator Author

xylar commented Feb 16, 2023

Also, is there any chance you didn't source the load script since you created it again? Safest to start with a fresh terminal.

@darincomeau
Collaborator

Also, is there any chance you didn't source the load script since you created it again? Safest to start with a fresh terminal.

I did, but in the same terminal window. I think it's best if I pick this up again once Perlmutter comes back up.

Collaborator

@mark-petersen mark-petersen left a comment

I was able to successfully create the conda load script on chicoma with

export  CONDA_BASE=/usr/projects/climate/mpeterse/miconda3
./conda/configure_compass_env.py  \
   --conda ${CONDA_BASE}  \
   --mache_fork xylar/mache  \
   --mache_branch update_cime_machines  \
   --update_spack   \
   --spack /lustre/scratch5/mpeterse/spack_test  \
   --tmpdir /lustre/scratch5/mpeterse/spack_tmp  \
   --compiler gnu  \
   --recreate

source load_dev_compass_1.2.0-alpha.4_chicoma-cpu_gnu_mpich.sh

then compile MPAS-Ocean stand-alone, and run the nightly test suite. Thanks for the help @xylar!

@matthewhoffman
Member

On Anvil I did this:

./conda/configure_compass_env.py     --conda ${CONDA_BASE}     --mache_fork E3SM-Project/mache     --mache_branch main     --update_spack      --spack /lcrc/group/e3sm/ac.mhoffman/tmp/spack  --tmpdir /lcrc/group/e3sm/ac.mhoffman/tmp/spacktmp     --compiler gnu     --mpi openmpi --recreate

and the compass env creation worked successfully. However, when I tried to build MALI after loading the env script, I got this error:

/usr/bin/ld: src/libdycore.a(Interface_velocity_solver.o): undefined reference to symbol 'ompi_mpi_cxx_op_intercept'
/gpfs/fs1/software/centos7/spack-latest/opt/spack/linux-centos7-x86_64/gcc-8.2.0/openmpi-4.1.1-x5n4m36/lib/libmpi_cxx.so.40: error adding symbols: DSO missing from command line

I'll follow up with @xylar about this offline when I see him this morning.

@xylar
Collaborator Author

xylar commented Feb 17, 2023

@matthewhoffman, it looks like you're missing the --with_albany flag to ./conda/configure_compass_env.py. Did I just miss it?

@matthewhoffman
Member

@xylar, yes, you are right, I forgot to include the albany flag. I tried again after adding it, and spack is now clearly building trilinos and albany. After a few hours it died with an albany build error:

  >> 360    /lcrc/group/e3sm/ac.mhoffman/tmp/spacktmp/spack-stage/spack-stage-albany-develop-hflw55t6ascfv7jdjfitbyl3wcegai5o/spack-src/src/disc/stk/Albany_ExtrudedSTKMeshStruct.cpp:1057:36: internal compiler error: in tsubst_copy, at cp/pt.c:15478

I can try again with a different compiler or mpi. I had selected a combination from the compass docs that is supported for anvil, but it wasn't clear to me if that meant we should expect albany to build. What do you recommend?

(From my previous attempt without albany, it seems this PR works correctly. Do you agree that is sufficient for approving the PR?)

@xylar
Collaborator Author

xylar commented Feb 17, 2023

@matthewhoffman, yes, I'm seeing the same error on Anvil. That's new since I last built albany (maybe something to check in with Irina and Mauro about -- maybe related to Anvil having old compilers).

As long as you see that compass is trying to use a spack branch called spack_for_mache_1.12.0, you can consider this particular PR successfully tested. Thanks for working on this.

@xylar
Collaborator Author

xylar commented Feb 17, 2023

@jonbob, how is testing on Compy going for you? Anything I can do to help?

@xylar
Collaborator Author

xylar commented Feb 17, 2023

@matthewhoffman, I'm going to try again with mvapich instead of openmpi, just to see if that helps. I doubt it.

@darincomeau
Collaborator

darincomeau commented Feb 17, 2023

Safest to start with a fresh terminal.

@xylar, now that Perlmutter is back up, I tried again: I sourced the compass load script in an (obviously) new terminal window, and then building with gfortran worked. So that other library error I was seeing is indeed fixed.

compass suite -s -c ocean -t pr -p initially timed out at 1 hour, so I'm rerunning with 4 hours and will update this comment once completed.

EDIT: ok that time it completed in 22 minutes with PASS: All passed successfully!. So my PR approval stands - thanks!

Member

@matthewhoffman matthewhoffman left a comment

Creating a new compass environment on Anvil succeeded using this branch. Trying with albany support failed, but that appears to be unrelated to this branch.

@xylar xylar force-pushed the simplify_local_mache branch 3 times, most recently from 7401d14 to 0b6f924 Compare February 18, 2023 22:29
@jonbob
Collaborator

jonbob commented Feb 21, 2023

@xylar -- I am trying to push on testing this on compy today, but running into issues. I had checked out compass to get mambaforge, following the Compass Quick Start. Is it OK to use this same compass clone for the spack testing?

@xylar
Collaborator Author

xylar commented Feb 22, 2023

@jonbob, yes, you can use the same clone for this testing. You just need to add my fork as a remote and check out this branch from it.

@xylar
Collaborator Author

xylar commented Feb 28, 2023

@jonbob, I'm going to rebase this onto #545 because I don't think it's going to work without that fix. Let me know if you've had any luck so far, and let me know if there's anything I can do to help.

@xylar xylar force-pushed the simplify_local_mache branch 2 times, most recently from 2822028 to f9e2b6e Compare February 28, 2023 07:03
xylar and others added 4 commits February 28, 2023 14:23
The `configure_compass_env.py` script now takes two flags,
`--mache_fork` and `--mache_branch` that are used to clone the
fork and branch locally, then install mache from it.  This
saves the trouble of developers cloning it themselves and building
the mache conda package.

I believe all systems have python3 available.

This allows us to remove some awkward imports and to use f-strings
instead of format commands.
@xylar xylar force-pushed the simplify_local_mache branch from f9e2b6e to 0f0ab0f Compare February 28, 2023 20:24
Collaborator

@jonbob jonbob left a comment

Approved based on successful compy testing

@xylar xylar merged commit 3e757a1 into MPAS-Dev:main Mar 1, 2023
@xylar xylar deleted the simplify_local_mache branch March 1, 2023 03:06
@xylar
Collaborator Author

xylar commented Mar 1, 2023

Thanks everyone!!!

Labels
clean-up dependencies and deployment Changes relate to creating conda and Spack environments, and creating a load script
Development

Successfully merging this pull request may close these issues.

Add a --local_mache argument to configure_compass_env.py
6 participants