
Demonstrate the use of GitHub actions in E3SM testing #5949

Closed
wants to merge 2 commits

Conversation

mahf708
Contributor

@mahf708 mahf708 commented Sep 22, 2023

Test running basic tests on GitHub Actions (GHAs).

Please don't merge. This is just an illustration. Stay tuned. In the meantime, please let me know if there are issues you'd like addressed or if you have ideas for improvements.

Things I'd like to do before merging:

  • standardize the docker container build (abstract options and enable matrix build)
    • add missing items
    • build for different compiler/os/libs matrix (see the sketch after this list)
  • generalize the workflow to add run tests besides the build test
  • print full logs if job fails
  • chain some jobs to test restarting on two different instances
  • explore caching
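
As a rough illustration of the matrix idea above (a sketch only: the matrix axes, image tags, and test invocation are placeholders, not what this PR currently does):

# hypothetical matrix build; axes, tags, and the test command are illustrative
name: e3sm-matrix-build
on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        compiler: [gnu, clang]
        mpi: [openmpi, mpich]
    container:
      # assumes images tagged per compiler/MPI combination exist
      image: ghcr.io/mahf708/e3sm-imgs:${{ matrix.compiler }}-${{ matrix.mpi }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Run a small smoke test
        # placeholder invocation; the real workflow would pass the appropriate
        # machine/compiler arguments for the container environment
        run: ./cime/scripts/create_test SMS_Ln1.ne4_oQU480.F2010 --machine linux-generic --compiler ${{ matrix.compiler }}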

@mahf708 mahf708 closed this Sep 22, 2023
@mahf708

This comment was marked as outdated.

@bartgol
Contributor

bartgol commented Sep 22, 2023

I was able to build scream in a GitHub Action using a self-hosted runner on my workstation. I can get one set up from the e3sm-autotester user account (so that it can do ssh ops with its ) on mappy. Then we would be able to rely on all the modules that we already use for other testing (nightly, scream AT, ...). For security reasons (since it would run on SNL machines), we may have to check that a certain label is set, so that only people with write privileges can trigger the action.
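
A minimal sketch of what such a label gate could look like (the label name, runner labels, and module setup below are placeholders, not an agreed-upon configuration):

# sketch of a label-gated job on a self-hosted runner
name: scream-mappy-ci
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  build-and-test:
    # run only if a trusted user has applied the gating label
    if: contains(github.event.pull_request.labels.*.name, 'CI approved')
    runs-on: [self-hosted, mappy]   # assumes a runner registered with a "mappy" label
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Build scream with the usual modules
        run: |
          source /etc/profile.d/modules.sh   # assumes environment modules on the host
          # module load <same modules used for nightly testing>
          echo "build and test steps would go here"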

@mahf708
Contributor Author

mahf708 commented Sep 22, 2023

Thanks, yes, that will eventually be the goal. I would still like to get our basic and easy tests to run on public CI resources, since the reporting will be quicker and safer, and it'll be more generic.

@mahf708 mahf708 reopened this Sep 23, 2023
@mahf708 mahf708 changed the title from "add ghas SMS_Ln1.ne4_oQU480.F2010.linux-generic" to "copy WCYCL1850.ne4_oQU240 from circle ci to ghas" Sep 23, 2023
@mahf708 mahf708 changed the title from "copy WCYCL1850.ne4_oQU240 from circle ci to ghas" to "Demonstrate the use of GitHub actions in E3SM testing" Sep 23, 2023
@bartgol
Contributor

bartgol commented Sep 25, 2023

I see lots of wget errors in the action log, such as

wget failed with output:  and errput --2023-09-24 04:49:31--  https://web.lcrc.anl.gov/public/e3sm/inputdata/atm/waccm/phot/wasolar_ave.nc
Resolving web.lcrc.anl.gov (web.lcrc.anl.gov)... 140.221.70.30
Connecting to web.lcrc.anl.gov (web.lcrc.anl.gov)|140.221.70.30|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-09-24 04:49:31 ERROR 404: Not Found.

Indeed, those files do not seem to exist on the ANL server. Is this concerning?

build:
  runs-on: ubuntu-latest
  container:
    image: ghcr.io/mahf708/e3sm-imgs@sha256:2657196ea9eec7dbd04f7da61f4e7d5e566d4b501dff5881f0cb5125ba031158

Contributor

I'm assuming this container has all TPLs already installed? If so, what's the point of the apt-get calls below?

Contributor Author

Not all of them, because I was too lazy to rebuild the container 😉 Btw, here's the container repo: https://github.com/mahf708/e3sm-imgs. We definitely want to get rid of the extra apt-get calls once we finalize this.
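
For reference, a hypothetical example of the kind of stopgap step being discussed (the package names are illustrative, not taken from this PR):

    steps:
      # temporary: install packages in the workflow until they are baked into the image
      - name: Install missing packages
        run: |
          apt-get update
          apt-get install -y --no-install-recommends libnetcdf-dev libnetcdff-dev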

@mahf708
Contributor Author

mahf708 commented Sep 29, 2023

Yeah, I don't like those warnings, and I am somewhat concerned that they are happening. Only one of them appears to be actually fatal: #5953

<DIN_LOC_ROOT_CLMFORC>$ENV{HOME}/projects/e3sm/ptclm-data</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>$ENV{HOME}/projects/e3sm/scratch/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>$ENV{HOME}/projects/e3sm/baselines/$COMPILER</BASELINE_ROOT>
<CCSM_CPRNC>$CCSMROOT/tools/cprnc/build/cprnc</CCSM_CPRNC>

Contributor

@jgfouca A quick grep through the repo and CIME makes me think that CCSMROOT is no longer used, but is still present on some (unused?) machines in config_machines.xml. Is that correct? If so, I think this line should be changed.

Contributor Author

I believe you're right. For transparency, all I did was copy-paste the "singularity" MACH entry (even inadvertently keeping someone else's email...)

Contributor

Good strategy leaving the email of someone else. I shall use that trick in the future... :)

@bartgol
Contributor

bartgol commented Oct 4, 2023

Food for thought: by creating a self-hosted runner on E3SM machines, we could use GH Actions to automatically test PRs without the need for a next branch, and without waiting for overnight testing.

The workflow idea is as follows.

  • Keep the gh action that runs small tests on a GitHub runner. This verifies that there are no "obvious" errors (e.g., the code builds and runs without crashing).
  • Add a gh action that checks the PR was approved by an integrator (can use something like this one). This is necessary because we want to avoid running malicious code on our DOE machines. Having a trusted user approve the PR first gives us some confidence that the code is ok.
  • If the above two pass, kick off gh actions on self-hosted runners (can use a matrix).
  • Add an action that runs every time master is updated. For each open PR, disable merging if the last round of testing was done against a different base ref. That's to avoid merging a PR that passed before but doesn't pass after the last master update.
  • Make the 1st and 3rd steps trigger if the base branch changes. I don't think we can handle this from those gh actions (I can't find a trigger like "if base ref changes"). So the closest approach is to make the workflow in the step above add a special label (e.g., "RETEST") to all open PRs, then make the PR test workflows trigger if labels change (and "RETEST" is among them); see the sketch after this list. The drawback is that people can manually remove the label, but I assume we "trust" users with write access to the repo not to do that. The benefit of triggering on "RETEST" is that we can also use it to manually re-run a testing framework.
  • For BFB PRs only, if
    • every action passes
    • the base branch has not changed since the PR was created
    • automerge label is set
      we can go ahead and merge.
  • For non-BFB PRs, an integrator must use the force, and manually merge.
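
A minimal sketch of the relabeling step mentioned above, assuming the default GITHUB_TOKEN is allowed to edit PR labels (the workflow and label names are placeholders):

# placeholder workflow: apply "RETEST" to all open PRs whenever master is updated
name: retest-open-prs
on:
  push:
    branches: [master]

jobs:
  relabel:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: Add RETEST label to every open PR
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}
        run: |
          for pr in $(gh pr list --state open --json number --jq '.[].number'); do
            gh pr edit "$pr" --add-label RETEST
          done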

Notice that, by checking that the base ref has not changed, we no longer need next: all PRs are re-tested when another PR is merged. While this may somewhat reduce the number of PRs that can go in at the same time, we will gain by having testing throughout the day. We could also still use next as the base ref, of course.

The drawback of continuous testing (as opposed to nightly only) is the amount of work we throw at the E3SM machines. Each self-hosted runner can run one job at a time, so by having one runner per machine we limit how much of the machine we can use. However, on some machines (e.g., mappy), E3SM testing takes a long time (3h), and we don't want to hold the machine for that long during the day. That said, since mappy is not a production machine, we may not use it for CI, and just keep it for nightlies, for extra safety.

Edit: it occurred to me that certain clusters may not like having a self-hosted runner that is listening 24/7 (not sure if this is the case for chrys or pm). If that's the case, we can do it in two steps: 1) put the self-hosted runner on an SNL machine, and 2) make the action trigger one of the existing SNL Jenkins jobs that are currently used for nightly testing of next (possibly with slight mods). This would comply with possible rules from cluster admins requiring that we not keep cron jobs or any sort of never-ending job running.
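
A hedged sketch of step 2, assuming the Jenkins instance allows triggering builds via its REST endpoint and that suitable credentials are stored as secrets (the job name and secret names are placeholders):

      - name: Trigger the existing Jenkins nightly job
        env:
          JENKINS_URL: ${{ secrets.JENKINS_URL }}           # placeholder secrets
          JENKINS_USER: ${{ secrets.JENKINS_USER }}
          JENKINS_API_TOKEN: ${{ secrets.JENKINS_API_TOKEN }}
        run: |
          # POST to the job's build endpoint; the job name is illustrative
          curl -fsS -X POST --user "${JENKINS_USER}:${JENKINS_API_TOKEN}" \
            "${JENKINS_URL}/job/e3sm-next-testing/build"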

@mahf708
Contributor Author

mahf708 commented Oct 4, 2023

@rljacob
Member

rljacob commented Oct 4, 2023

The need to test overnight comes both from how long it takes to run the integration suite AND from the need to test on multiple busy platform-compiler combinations. We then put all the PRs from the day on next so we only have to run it once and not once per PR. The only way we can get to a test-per-PR workflow is to both have a suite that runs in a short amount of time AND a way to run it on all our machine-compiler combinations.

@rljacob
Member

rljacob commented Oct 4, 2023

That being said, one reason PRs spend a lot of time open is that integrators just forget to move them through the workflow. So even a way to get them onto next for overnight testing automatically, and then to master, would increase the velocity.

@mahf708
Contributor Author

mahf708 commented Oct 4, 2023

I believe containerization will help us a lot here. Let's discuss more tomorrow. Once Luca and I understand the exact needs and constraints of E3SM main more carefully, I am certain we will be able to significantly improve our workflow. There's a lot of potential here.

Before this PR, the common wisdom (at least as I heard it from others) was that we cannot get anything to run on public CI. This PR proves that wrong by showing five different standard F2010 tests passing on public CI (four of them within an hour).

@bartgol
Contributor

bartgol commented Oct 4, 2023

To be fair, those were ne4 tests. Nightly tests use ne30, which takes considerably longer than ne4...

We can discuss running small ne4 integration test suites (possibly on a variety of machines) before merging to next, to rule out initial build/run errors and unexpected diffs. Then, if these pass, automate the merge to next (somehow querying whether it's open). If this is already a good improvement over the current framework, it's worth doing.

@mahf708
Contributor Author

mahf708 commented Oct 8, 2023

This PR has reached its end of life as a demo. I will reissue a PR to take care of containers and GHAs at a later time. I will do so after the build system reaches a steady state from the cmake changes.

@mahf708 mahf708 closed this Oct 8, 2023
@mahf708 mahf708 deleted the SMS-ne4_oQU480-ghas branch October 8, 2023 13:00