Add overrun qos for pm-cpu/pm-gpu #6709

Merged 1 commit on Oct 25, 2024

Conversation

ndkeen
Contributor

@ndkeen ndkeen commented Oct 24, 2024

Add overrun as an option for submitting jobs via CIME on pm-cpu/pm-gpu.
Currently, jobs can submit to the overrun qos with case.submit -a="-q overrun".
However, without this PR, tests like this will fail:

create_test SMS_D.ne4pg2_oQU480.F2010 --compiler gnu -q overrun

[bfb]

@ndkeen ndkeen self-assigned this Oct 24, 2024
@ndkeen ndkeen added the Machine Files, BFB (PR leaves answers BFB), pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels Oct 24, 2024
@ndkeen ndkeen requested a review from rljacob October 24, 2024 21:51

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6709/
on branch gh-pages at 2024-10-24 21:53 UTC

@rljacob
Member

rljacob commented Oct 24, 2024

So with this we could change the test script to run in the overrun queue?

ndkeen added a commit that referenced this pull request Oct 24, 2024
@ndkeen
Contributor Author

ndkeen commented Oct 25, 2024

Yes. As with our previous experience using the shared qos, something in CIME appears to be unhappy when a job requests a qos that is not specified in env_batch.xml. It does not give a great error message when that happens, but adding overrun as an option seems to allow it to work. I suppose it's nice that CIME catches a potential problem when a job tries to request a non-existent qos, but it makes things less flexible when a center/machine adds or removes a qos.
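For readers unfamiliar with this kind of check, here is a minimal Python sketch of validating a requested qos against a list declared in XML. It is not CIME's actual code: the XML fragment, the qos names in it, and the functions `known_qos` and `check_qos` are all hypothetical and only illustrate why a qos that is missing from the list gets rejected, and what a clearer error message might look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of a batch config file. The qos names
# and attributes are placeholders, not the real pm-cpu machine file.
BATCH_XML = """
<batch_system MACH="pm-cpu">
  <queues>
    <queue default="true">regular</queue>
    <queue>debug</queue>
    <queue>shared</queue>
    <queue>overrun</queue>
  </queues>
</batch_system>
"""


def known_qos(xml_text):
    """Return the qos names listed in <queue> elements of the XML text."""
    root = ET.fromstring(xml_text)
    return [q.text.strip() for q in root.iter("queue") if q.text]


def check_qos(requested, xml_text=BATCH_XML):
    """Return the requested qos, or raise ValueError if it is not listed."""
    available = known_qos(xml_text)
    if requested not in available:
        raise ValueError(
            f"qos '{requested}' is not listed for this machine; "
            f"known qos values: {', '.join(available)}"
        )
    return requested
```

In this sketch, `check_qos("overrun")` succeeds only because `overrun` appears in the fragment; removing that one `<queue>` line would make the same request raise an error, which mirrors the behavior described above.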

Note: I'm purposefully using the term qos instead of queue, as that is what it's actually called, even though we (and others) often say queue. That is, there is no such thing as a "debug queue"; it's the "debug qos".

Even though the tests don't run on pm-cpu, I will merge to master.

@ndkeen ndkeen merged commit bf31b4e into master Oct 25, 2024
9 checks passed
@ndkeen ndkeen deleted the ndk/machinesfiles/pm-cpu-add-overrun branch October 25, 2024 17:36