Demonstrate the use of GitHub actions in E3SM testing #5949
Conversation
I was able to build scream in a GitHub action using a self-hosted runner on my workstation. I can get one set up from the
Thanks, yes, that will eventually be the goal. I would still like to get our basic and easy tests to run on public CI resources, since the reporting will be quicker and safer and it'll be more generic.
I see lots of wget errors in the action log, such as
Indeed, those files do not seem to exist on the ANL server. Is this concerning?
build:
  runs-on: ubuntu-latest
  container:
    image: ghcr.io/mahf708/e3sm-imgs@sha256:2657196ea9eec7dbd04f7da61f4e7d5e566d4b501dff5881f0cb5125ba031158
I'm assuming this container has all TPLs already installed? If so, what's the point of the apt-get calls below?
Not all of them, because I was too lazy to rebuild the container 😉 Btw, here's the container repo: https://github.com/mahf708/e3sm-imgs. We definitely want to get rid of the extra apt-get calls once we finalize this.
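For context, an add-on install step inside a container-based job typically looks like the minimal sketch below. The package names are placeholders (not the exact set used in this PR); this is the kind of step that would disappear once the container image includes all TPLs.

jobs:
  build:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/mahf708/e3sm-imgs@sha256:2657196ea9eec7dbd04f7da61f4e7d5e566d4b501dff5881f0cb5125ba031158
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Sketch only: install whatever is not yet baked into the image.
      # Package names here are illustrative placeholders.
      - name: Install extra TPLs (to be removed once the image is complete)
        run: |
          apt-get update
          apt-get install -y --no-install-recommends libnetcdf-dev libnetcdff-dev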
Yeah, I don't like those warnings and I am somewhat concerned that they are happening. Only one of them appears to be actually fatal: #5953
<DIN_LOC_ROOT_CLMFORC>$ENV{HOME}/projects/e3sm/ptclm-data</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>$ENV{HOME}/projects/e3sm/scratch/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>$ENV{HOME}/projects/e3sm/baselines/$COMPILER</BASELINE_ROOT>
<CCSM_CPRNC>$CCSMROOT/tools/cprnc/build/cprnc</CCSM_CPRNC>
@jgfouca A quick grep through the repo and CIME makes me think that CCSMROOT is no longer used, but is still present on some (unused?) machines in config_machines.xml. Is that correct? If so, I think this line should be changed.
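If that's the case, a possible replacement would point CCSM_CPRNC at a path that does not rely on CCSMROOT. The sketch below follows the $ENV{HOME} layout used in the surrounding entries; the exact cprnc build location is an assumption, not the path this PR settled on.

<!-- Sketch only: hypothetical cprnc location, matching the $ENV{HOME} layout above -->
<CCSM_CPRNC>$ENV{HOME}/projects/e3sm/tools/cprnc/build/cprnc</CCSM_CPRNC>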
I believe you're right. For transparency, all I did was copy-paste the "singularity" MACH entry (even inadvertently keeping someone else's email...)
Good strategy leaving the email of someone else. I shall use that trick in the future... :)
Food for thought: by creating a self-hosted runner on E3SM machines, we could use GH Actions to automatically test PRs without the need for a next branch and without waiting for overnight testing. The workflow idea is as follows.

Notice that, by checking that the base ref has not changed, we no longer need

The drawback of continuous testing (as opposed to nightly only) is the amount of work we throw at the E3SM machines. Each self-hosted runner can run one job at a time, so by having one runner per machine we limit how much of the machine we can use. However, on some machines (e.g., mappy), E3SM testing takes a long time (3h), and we don't want to hold the machine for that long during the day. That said, since mappy is not a production machine, we may not use it for CI and just keep it for nightlies, for extra safety.

Edit: it occurred to me that certain clusters may not like having a self-hosted runner that is listening 24/7 (not sure if this is the case for chrys or pm). If that's the case, we can do it in two steps: 1) put the self-hosted runner on a SNL machine, and 2) make the action trigger one of the existing SNL Jenkins jobs that are currently used to do nightly testing of
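A minimal sketch of what such a PR-triggered job on a self-hosted runner could look like is below. The "mappy" runner label, the target branch, and the choice of test suite are all assumptions for illustration; the exact suite (and whether to gate on base-ref freshness) would still need to be decided.

name: pr-testing
on:
  pull_request:
    branches: [master]
jobs:
  integration:
    # Runs on a self-hosted runner registered on an E3SM machine;
    # the "mappy" label is illustrative, not an existing runner.
    runs-on: [self-hosted, mappy]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Placeholder: run whatever small (e.g., ne4) suite we settle on.
      - name: Run a small test suite
        run: ./cime/scripts/create_test e3sm_developer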
Btw, GitHub offers a native "Merge Queue" feature: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue
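For reference, a workflow opts into the merge queue by listening for the merge_group event; a minimal sketch, with the job body reduced to a placeholder:

on:
  merge_group:
jobs:
  required-checks:
    runs-on: ubuntu-latest
    steps:
      # Placeholder: the required PR checks would run here.
      - run: echo "run the required checks for the merge queue"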
The need to test overnight comes from both how long it takes to run the integration suite AND the need to test on multiple busy platform/compiler combinations. We then put all the PRs from the day on next so we only have to run it once and not once per PR. The only way we can get to a test-per-PR workflow is to both have a suite that runs in a short amount of time AND a way to run it on all our machine-compiler combinations.
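On the "all our machine-compiler combinations" point, GitHub Actions can at least express that fan-out natively via a job matrix. In the sketch below the machine and compiler values are placeholders, and each machine would still need its own registered self-hosted runner with a matching label.

jobs:
  suite:
    strategy:
      fail-fast: false
      matrix:
        machine: [chrysalis, pm-cpu]   # placeholders: runner labels, one per machine
        compiler: [intel, gnu]         # placeholders
    runs-on: [self-hosted, "${{ matrix.machine }}"]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Placeholder suite/flags; shown only to illustrate the matrix fan-out.
      - run: ./cime/scripts/create_test e3sm_integration --compiler ${{ matrix.compiler }}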
That being said, one reason PRs spend a lot of time open is that Integrators simply forget to move them through the workflow. So even a way to get them onto next for overnight testing automatically, and then to master, would increase the velocity.
I believe containerization will help us a lot here. Let's discuss more tomorrow. If Luca and I understand the exact needs and constraints of E3SM main more carefully, I am certain we will be able to significantly improve our workflow. There's a lot of potential here. Before this PR, the common wisdom (at least as I heard it from others) was that we cannot get anything to run on public CI. This PR proves that wrong by showing five different standard F2010 tests passing on public CI (four of them within an hour).
To be fair, those were ne4 tests. Nightly tests use ne30, which takes considerably longer than ne4... We can discuss running small ne4 integration test suites (possibly on a variety of machines) before merging to next, to rule out initial build/run errors and unexpected diffs. Then, if these pass, automate the merge to next (somehow querying whether it's open). If this is already a good improvement over the current framework, it's worth doing.
This PR has reached its end of life as a demo. I will reissue a PR to take care of containers and GHAs at a later time. I will do so after the build system reaches a steady state from the cmake changes. |
Testing running basic tests on GHAs.
Please don't merge. This is just an illustration. Stay tuned. In the meantime, please let me know if there are issues you'd like addressed or if you have ideas for improvements.
Things I'd like to do before merging: