Refactoring run_neon for PLUMBER2 part1 #2315

Closed
TeaganKing wants to merge 7 commits

TeaganKing
Contributor

@TeaganKing TeaganKing commented Jan 11, 2024

Description of changes

  • 1. Remove unused functions and arguments
  • 2. Move arg_parse to its own file and NeonSite to its own file
  • 3. Unit tests for arg_parse and NeonSite

Additional (not-yet-ordered) expected changes (for future PR):

  • Build abstract TowerSite class
  • Generic class
  • Add NEON class with NEON-specific behavior
  • Create new methods and unit tests
  • Add PLUMBER class and behaviors
  • Integration testing

Contributors other than yourself, if any:
@ekluzek @adrifoster

Some additional resources

https://docs.google.com/document/d/19QAvUSJY0QJdFF5LrOFhU0JVhfT3QwvFQaf8IiPtXxU/edit#heading=h.cyie9ucyj2pr
https://docs.google.com/document/d/1i57JZgu6vWtdvr8rNMEnrtNIMIl8hB4jUeE4UvxNh38/edit

CTSM Issues Fixed (include github issue #):

Are answers expected to change (and if so in what way)? No. This will work towards supporting additional tower sites.

Note: this is a WIP PR. I changed branches from the original PR, #2259.

@TeaganKing TeaganKing added enhancement new capability or improved behavior of existing capability PR status: work in progress labels Jan 11, 2024
@TeaganKing TeaganKing changed the title New refactoring [WIP] Refactoring run_neon for PLUMBER2 Jan 11, 2024
@TeaganKing TeaganKing self-assigned this Jan 11, 2024
@TeaganKing TeaganKing changed the title [WIP] Refactoring run_neon for PLUMBER2 [WIP] Refactoring run_neon for PLUMBER2 part1 Jan 12, 2024
@TeaganKing
Contributor Author

After talking with @ekluzek , we plan to bring in this PR with a few minor modifications that I'll make shortly, and then create a new PR for building a generic base class, an abstract TowerSite class, and then a neon class with neon-specific behavior. A third PR will then create the PLUMBER specific class and any necessary changes to the base/tower-site classes.
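
The layering described above (a generic base, an abstract TowerSite class, then NEON- and PLUMBER-specific classes) could look roughly like the sketch below. This is only an illustration of the planned structure: the class names `NeonTowerSite` and `PlumberTowerSite`, the constructor arguments, and the case-name formats are all hypothetical and are not taken from the CTSM code.

```python
from abc import ABC, abstractmethod


class TowerSite(ABC):
    """Hypothetical abstract base class for any flux-tower site."""

    def __init__(self, name, start_year, end_year):
        self.name = name
        self.start_year = start_year
        self.end_year = end_year

    @abstractmethod
    def case_name(self, experiment=None):
        """Return the case name; each network decides its own format."""

    def __str__(self):
        # Shared behavior lives in the base class.
        return f"{type(self).__name__}({self.name}, {self.start_year}-{self.end_year})"


class NeonTowerSite(TowerSite):
    """Hypothetical NEON-specific subclass (naming format is illustrative)."""

    def case_name(self, experiment=None):
        suffix = f".{experiment}" if experiment else ""
        return f"{self.name}{suffix}.transient"


class PlumberTowerSite(TowerSite):
    """Hypothetical PLUMBER-specific subclass (naming format is illustrative)."""

    def case_name(self, experiment=None):
        suffix = f".{experiment}" if experiment else ""
        return f"plumber.{self.name}{suffix}"
```

Because `case_name` is abstract, `TowerSite` itself cannot be instantiated; only the network-specific subclasses can, which keeps shared logic in one place while forcing each network to define its own behavior.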

@TeaganKing TeaganKing changed the title [WIP] Refactoring run_neon for PLUMBER2 part1 Refactoring run_neon for PLUMBER2 part1 Jan 12, 2024
@TeaganKing
Contributor Author

TeaganKing commented Jan 12, 2024

This PR is now ready for review. Requesting review from either Erik or Sam, but it's fine to change that, too, if needed.

@wwieder
Contributor

wwieder commented Jan 12, 2024

Thanks, @TeaganKing.
I'd suggest @slevis-lmwg review this, as it's directly related to the NEON project he's supported on.
Is that OK with everyone?

@TeaganKing TeaganKing removed the request for review from ekluzek January 12, 2024 20:52
@slevis-lmwg
Contributor

@TeaganKing I will be happy to go over the code with you in a meeting. Pls feel free to send me an invite.

@wwieder
Contributor

wwieder commented Jan 16, 2024

Thanks @slevis-lmwg. It seems like @TeaganKing is planning on a series of PRs to enable PLUMBER runs. Hopefully this PR can come in pretty quickly (maybe the dev_branch or with other BFB PRs)? Once this PR is approved, can you also give some thought to how we can bring this to main dev efficiently?

@TeaganKing
Contributor Author

FYI @wwieder and @slevis-lmwg , I'm also planning to continue working off of this branch on the next PR. It'd be ideal if it comes in quickly, but not a roadblock to continued work on this project if it doesn't.

Collaborator

@ekluzek ekluzek left a comment

@TeaganKing and I went over this last week. The one change that I'm asking for right now is that some extra error checking be added to the new arg_parse.py, along with some unit tests for it. Basic things: check that any files entered exist, check that a directory actually is a directory -- that sort of thing.
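
One common way to do this kind of checking is with custom `type=` callables in argparse, sketched below. This is a generic pattern under stated assumptions, not the code in arg_parse.py: the helper names `existing_file` and `existing_dir` and the options `--input-file` and `--output-root` are hypothetical.

```python
import argparse
import os


def existing_file(path):
    """argparse type: accept the path only if it is an existing file."""
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(f"file does not exist: {path}")
    return os.path.abspath(path)


def existing_dir(path):
    """argparse type: accept the path only if it is an existing directory."""
    if not os.path.isdir(path):
        raise argparse.ArgumentTypeError(f"not a directory: {path}")
    return os.path.abspath(path)


def build_parser():
    """Build a small parser using the validators (option names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Path-validation sketch")
    parser.add_argument("--input-file", type=existing_file)
    parser.add_argument("--output-root", type=existing_dir)
    return parser
```

When a validator raises `ArgumentTypeError`, argparse prints a usage message and exits, so a unit test can assert on `SystemExit` for bad paths.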

Also @TeaganKing you should change the name of arg_parse.py to include something about run_neon. There could be arg_parse modules for lots of different tools, so they will need to be distinguished from each other.

The steps after what has currently been done for making base classes will be done in a part2 PR.

@TeaganKing
Contributor Author

Hi @ekluzek , thanks for the suggestions. I renamed the arg_parse file to be neon-specific; that seems like a great suggestion to make sure that it is easily recognizable as tied to run_neon. We may also want it to be more generic as we create the PLUMBER class, but maybe that can be renamed in the refactoring in the next PR when we do all of the towersite renaming and rename run_neon (I'd like them to be visibly linked for the time being, too).

RE additional tests, at the bottom of neon_arg_parse.py, there are checks to make sure the information in the sys arguments is valid. These include the following:

  • neon sites are valid site names
  • run_length specifications
  • check that base_case_root exists (I added this since our discussion, and I believe it addresses your point above)

Within the get_parser function, we have specified choices for neon-sites, run-type, and neon-version, so those should be very straightforward. Some arguments are boolean flags: overwrite, setup_only, rerun, no_batch, prism, run_from_postad. And experiment really only changes the name of the case using a string that the user provides (we discussed that users may want some special characters, e.g. '/', and that this should be fine as is).
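
For readers unfamiliar with the pattern, restricting values with `choices` and using `store_true` flags looks roughly like this. It is a minimal sketch, not the actual get_parser: the function name `get_example_parser` is hypothetical, and the site and run-type lists are illustrative placeholders rather than the values defined in neon_arg_parse.py.

```python
import argparse

# Illustrative placeholders only; the real lists live in neon_arg_parse.py.
EXAMPLE_SITES = ["ABBY", "BART", "HARV"]
EXAMPLE_RUN_TYPES = ["ad", "postad", "transient"]


def get_example_parser():
    """Build a parser whose options are constrained by choices and flags."""
    parser = argparse.ArgumentParser(description="choices/flags sketch")
    # argparse rejects any site name that is not in the choices list.
    parser.add_argument(
        "--neon-sites", nargs="+", choices=EXAMPLE_SITES + ["all"], default=["all"]
    )
    parser.add_argument("--run-type", choices=EXAMPLE_RUN_TYPES, default="transient")
    # Boolean flags default to False and become True when present.
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--setup-only", action="store_true")
    # Free-form string, used only to label the case.
    parser.add_argument("--experiment", default=None)
    return parser
```

With `choices` in place, an invalid site or run type is rejected by argparse itself, so no separate validation code is needed for those options.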

Additionally, we have unit tests provided in python/ctsm/test/test_unit_neon_arg_parse.py. I believe this covers every argument that goes through neon_arg_parse; if there are specific additional tests that you feel are needed to make this more robust, I am happy to implement them.
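
As a rough picture of what argument-parsing unit tests of this kind tend to look like, here is a self-contained sketch. It does not reproduce test_unit_neon_arg_parse.py: `parse_example_args` is a hypothetical stand-in for the real parser, and the options and defaults shown are assumptions.

```python
import argparse
import unittest


def parse_example_args(argv):
    """Hypothetical stand-in parser; NOT the neon_arg_parse API."""
    parser = argparse.ArgumentParser(prog="run_neon_example")
    parser.add_argument(
        "--run-type", choices=["ad", "postad", "transient"], default="transient"
    )
    parser.add_argument("--overwrite", action="store_true")
    return parser.parse_args(argv)


class TestExampleArgParse(unittest.TestCase):
    """Checks defaults, a boolean flag, and rejection of a bad choice."""

    def test_defaults(self):
        args = parse_example_args([])
        self.assertEqual(args.run_type, "transient")
        self.assertFalse(args.overwrite)

    def test_boolean_flag(self):
        args = parse_example_args(["--overwrite"])
        self.assertTrue(args.overwrite)

    def test_invalid_choice_exits(self):
        # argparse reports the error and raises SystemExit.
        with self.assertRaises(SystemExit):
            parse_example_args(["--run-type", "bogus"])


def run_example_tests():
    """Run the sketch's test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExampleArgParse)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_example_tests()` runs the three tests; in a real test module, the usual `unittest.main()` entry point would be used instead.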

@TeaganKing
Contributor Author

Future PR expected changes have been documented in #1487

@TeaganKing
Contributor Author

Following up on a conversation with @slevis-lmwg , I just ran make all. There are two sys errors, one in TestSysMeshMaskModifier that I think is unrelated to these changes, and the other is a permission error in TestSysRunNeon-- is this expected?

I also want to note there are many pylint errors not copied here; these are unrelated to this PR.

python3 ./run_ctsm_py_tests  --sys
................ERROR: FakeCase does not support getting value of 'GPU_OFFLOAD'
E..E
Stdout:
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------

======================================================================
ERROR: test_allInfo (test.test_sys_mesh_modifier.TestSysMeshMaskModifier)
This test specifies all the information that one may specify
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/work/tking/CTSM/python/ctsm/test/test_sys_mesh_modifier.py", line 66, in setUp
    subprocess.check_call(configure_cmd, shell=False)
  File "/glade/u/home/tking/.conda/envs/ctsm_pylib/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/glade/work/tking/CTSM/cime/CIME/scripts/configure' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/work/tking/CTSM/python/ctsm/test/test_sys_mesh_modifier.py", line 68, in setUp
    sys.exit(f"{e} ERROR using {configure_cmd}")
SystemExit: Command '/glade/work/tking/CTSM/cime/CIME/scripts/configure' returned non-zero exit status 1. ERROR using /glade/work/tking/CTSM/cime/CIME/scripts/configure

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/work/tking/CTSM/python/ctsm/test/test_sys_run_neon.py", line 57, in test_one_site
    main("")
  File "/glade/work/tking/CTSM/python/ctsm/site_and_regional/run_neon.py", line 224, in main
    cesmroot, output_root, res, compset, overwrite, setup_only
  File "/glade/work/tking/CTSM/python/ctsm/site_and_regional/neon_site.py", line 101, in build_base_case
    case.case_setup()
  File "/glade/work/tking/CTSM/cime/CIME/case/case_setup.py", line 499, in case_setup
    is_batch=is_batch,
  File "/glade/work/tking/CTSM/cime/CIME/utils.py", line 2477, in run_and_log_case_status
    rv = func()
  File "/glade/work/tking/CTSM/cime/CIME/case/case_setup.py", line 462, in <lambda>
    self, caseroot, clean=clean, test_mode=test_mode, reset=reset, keep=keep
  File "/glade/work/tking/CTSM/cime/CIME/case/case_setup.py", line 290, in _case_setup_impl
    case.load_env()
  File "/glade/work/tking/CTSM/cime/CIME/case/case.py", line 2157, in load_env
    self._loaded_envs = env_module.load_env(self, job=job, verbose=verbose)
  File "/glade/work/tking/CTSM/cime/CIME/XML/env_mach_specific.py", line 140, in load_env
    modules_to_load, force_method=force_method, verbose=verbose
  File "/glade/work/tking/CTSM/cime/CIME/XML/env_mach_specific.py", line 169, in _load_modules
    self._load_module_modules(modules_to_load, verbose=verbose)
  File "/glade/work/tking/CTSM/cime/CIME/XML/env_mach_specific.py", line 461, in _load_module_modules
    "module command {} failed with message:\n{}".format(cmd, errout),
  File "/glade/work/tking/CTSM/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: module command /glade/u/apps/dav/opt/lmod/7.7.29/libexec/lmod python purge  failed with message:
/bin/sh: /glade/u/apps/dav/opt/lmod/7.7.29/libexec/lmod: Permission denied

Stdout:
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------

----------------------------------------------------------------------
Ran 20 tests in 19.372s

FAILED (errors=2)

@slevis-lmwg
Contributor

@TeaganKing I went to my copy of "vanilla" ctsm5.1.dev165 and ran make all.
The tests worked there, other than the long list of pylint warnings.
I think you're right that the latter are unrelated to your PR (see #2255).

Contributor

@slevis-lmwg slevis-lmwg left a comment

@TeaganKing and I went over this PR, and it seems ready to me. My requests before approving were:

  • Move TODOs to a new issue.
  • Ensure that python unit and system tests pass.

@TeaganKing
Contributor Author

@slevis-lmwg thanks for doing that test on your side, too. Is there something in particular that I need to do to access /glade/u/apps/dav? Do I need to set up testing in a particular way to avoid this?

@slevis-lmwg
Contributor

For the tests to work for me, I had to do the following:

module load conda
conda activate ctsm_pylib
module load nco

If you have already done these things, I do not know what else to suggest...

@ekluzek
Collaborator

ekluzek commented Jan 22, 2024

@TeaganKing in order to get the ctsm_pylib installed, you have to run the py_env_create script at the top level of CTSM.

So you might need to do that before the conda activate in Sam's instructions above.

@wwieder wwieder mentioned this pull request Jan 23, 2024
@TeaganKing
Contributor Author

TeaganKing commented Jan 23, 2024

Thank you @slevis-lmwg and @ekluzek for these suggestions. I had previously activated ctsm_pylib, and that doesn't seem to be the issue. I'm also getting the same errors on a fresh checkout with the changes made in this PR. Since I'm a bit stumped on why these are occurring (and also don't want to be a roadblock for getting this PR merged), I'm curious if you are also getting the same errors when running make all?

@slevis-lmwg
Contributor

> Thank you @slevis-lmwg and @ekluzek for these suggestions. I had previously activated ctsm_pylib, and that doesn't seem to be the issue. I'm also getting the same errors on a fresh checkout with the changes made in this PR. Since I'm a bit stumped on why these are occurring (and also don't want to be a roadblock for getting this PR merged), I'm curious if you are also getting the same errors when running make all?

@TeaganKing in the meantime, could you also test whether you get the errors when you run vanilla dev165 (i.e. without your mods)?

@slevis-lmwg
Contributor

@TeaganKing I found a moment to try what you suggested and it worked for me. Thank you for suggesting this test. So I think I can proceed with the merge of this PR.

@TeaganKing
Contributor Author

Thanks @slevis-lmwg ! I guess you got to it before I did, and I'm glad that worked!

Labels
enhancement new capability or improved behavior of existing capability
4 participants