Demonstrate the use of GitHub actions in E3SM testing #5949
Conversation
I was able to build scream in a GitHub action using a self-hosted runner on my workstation. I can get one set up from the
Thanks, yes, that will eventually be the goal. I would still like to get our basic and easy tests to run on public CI resources, since the reporting will be quicker and safer and it'll be more generic.
I see lots of wget errors in the action log, such as
Indeed, those files do not seem to exist on the ANL server. Is this concerning?
build:
  runs-on: ubuntu-latest
  container:
    image: ghcr.io/mahf708/e3sm-imgs@sha256:2657196ea9eec7dbd04f7da61f4e7d5e566d4b501dff5881f0cb5125ba031158
I'm assuming this container has all TPLs already installed? If so, what's the point of the apt-get calls below?
Not all of them, because I was too lazy to rebuild the container 😉 Btw, here's the container repo: https://github.com/mahf708/e3sm-imgs. We definitely want to get rid of the extra apt-get calls once we finalize this.
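For context, an add-on install step inside a container-based job typically looks like the minimal sketch below. The package names are placeholders (not the exact set used in this PR); this is the kind of step that would disappear once the container image includes all TPLs.

jobs:
  build:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/mahf708/e3sm-imgs@sha256:2657196ea9eec7dbd04f7da61f4e7d5e566d4b501dff5881f0cb5125ba031158
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Sketch only: install whatever is not yet baked into the image.
      # Package names here are illustrative placeholders.
      - name: Install extra TPLs (to be removed once the image is complete)
        run: |
          apt-get update
          apt-get install -y --no-install-recommends libnetcdf-dev libnetcdff-dev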
Yeah, I don't like those warnings and I am somewhat concerned that they are happening. Only one of them appears to be actually fatal: #5953
<DIN_LOC_ROOT_CLMFORC>$ENV{HOME}/projects/e3sm/ptclm-data</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>$ENV{HOME}/projects/e3sm/scratch/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>$ENV{HOME}/projects/e3sm/baselines/$COMPILER</BASELINE_ROOT>
<CCSM_CPRNC>$CCSMROOT/tools/cprnc/build/cprnc</CCSM_CPRNC>
@jgfouca A quick grep through the repo and CIME makes me think that CCSMROOT is no longer used, but is still present on some (unused?) machines in config_machines.xml. Is that correct? If so, I think this line should be changed.
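If that's the case, a possible replacement would point CCSM_CPRNC at a path that does not rely on CCSMROOT. The sketch below follows the $ENV{HOME} layout used in the surrounding entries; the exact cprnc build location is an assumption, not the path this PR settled on.

<!-- Sketch only: hypothetical cprnc location, matching the $ENV{HOME} layout above -->
<CCSM_CPRNC>$ENV{HOME}/projects/e3sm/tools/cprnc/build/cprnc</CCSM_CPRNC>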
I believe you're right. For transparency, all I did was copy-paste the "singularity" MACH entry (even inadvertently keeping someone else's email...)
Good strategy leaving the email of someone else. I shall use that trick in the future... :)
Food for thought: by creating a self-hosted runner on E3SM machines, we could use GH Actions to automatically test PRs without the need for a next branch and without waiting for overnight testing. The workflow idea is as follows.

Notice that, by checking that the base ref has not changed, we no longer need

The drawback of continuous testing (as opposed to nightly only) is the amount of work we throw at the E3SM machines. Each self-hosted runner can run one job at a time, so by having one runner per machine we limit how much of the machine we can use. However, on some machines (e.g., mappy), E3SM testing takes a long time (3h), and we don't want to hold the machine for that long during the day. That said, since mappy is not a production machine, we may not use it for CI and just keep it for nightlies, for extra safety.

Edit: it occurred to me that certain clusters may not like having a self-hosted runner that is listening 24/7 (not sure if this is the case for chrys or pm). If that's the case, we can do it in two steps: 1) put the self-hosted runner on a SNL machine, and 2) make the action trigger one of the existing SNL Jenkins jobs that are currently used to do nightly testing of
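A minimal sketch of what such a PR-triggered job on a self-hosted runner could look like is below. The "mappy" runner label, the target branch, and the choice of test suite are all assumptions for illustration; the exact suite (and whether to gate on base-ref freshness) would still need to be decided.

name: pr-testing
on:
  pull_request:
    branches: [master]
jobs:
  integration:
    # Runs on a self-hosted runner registered on an E3SM machine;
    # the "mappy" label is illustrative, not an existing runner.
    runs-on: [self-hosted, mappy]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Placeholder: run whatever small (e.g., ne4) suite we settle on.
      - name: Run a small test suite
        run: ./cime/scripts/create_test e3sm_developer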
Btw, GitHub offers a native "Merge Queue" feature: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue
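For reference, a workflow opts into the merge queue by listening for the merge_group event; a minimal sketch, with the job body reduced to a placeholder:

on:
  merge_group:
jobs:
  required-checks:
    runs-on: ubuntu-latest
    steps:
      # Placeholder: the required PR checks would run here.
      - run: echo "run the required checks for the merge queue"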
The need to test overnight comes from both how long it takes to run the integration suite AND the need to test on multiple busy platform/compiler combinations. We then put all the PRs from the day on next so we only have to run it once and not once per PR. The only way we can get to a test-per-PR workflow is to both have a suite that runs in a short amount of time AND a way to run it on all our machine-compiler combinations.
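On the "all our machine-compiler combinations" point, GitHub Actions can at least express that fan-out natively via a job matrix. In the sketch below the machine and compiler values are placeholders, and each machine would still need its own registered self-hosted runner with a matching label.

jobs:
  suite:
    strategy:
      fail-fast: false
      matrix:
        machine: [chrysalis, pm-cpu]   # placeholders: runner labels, one per machine
        compiler: [intel, gnu]         # placeholders
    runs-on: [self-hosted, "${{ matrix.machine }}"]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive
      # Placeholder suite/flags; shown only to illustrate the matrix fan-out.
      - run: ./cime/scripts/create_test e3sm_integration --compiler ${{ matrix.compiler }}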
That being said, one reason PRs spend a lot of time open is that Integrators simply forget to move them through the workflow. So even a way to get them onto next for overnight testing automatically, and then to master, would increase the velocity.
I believe containerization will help us a lot here. Let's discuss more tomorrow. If Luca and I understand the exact needs and constraints of E3SM main more carefully, I am certain we will be able to significantly improve our workflow. There's a lot of potential here. Before this PR, the common wisdom (at least as I heard it from others) was that we cannot get anything to run on public CI. This PR proves that wrong by showing five different standard F2010 tests passing on public CI (four of them within an hour).
To be fair, those were ne4 tests. Nightly tests use ne30, which takes considerably longer than ne4... We can discuss running small ne4 integration test suites (possibly on a variety of machines) before merging to next, to rule out initial build/run errors and unexpected diffs. Then, if these pass, automate the merge to next (somehow querying whether it's open). If this is already a good improvement over the current framework, it's worth doing.
This PR has reached its end of life as a demo. I will reissue a PR to take care of containers and GHAs at a later time. I will do so after the build system reaches a steady state from the cmake changes. |
Testing running basic tests on GHAs.
Please don't merge. This is just an illustration. Stay tuned. In the meantime, please let me know if there are issues you'd like addressed or if you have ideas for improvements.
Things I'd like to do before merging: