
v3.NARRM PE-layouts on Chrysalis #6700

Draft: wants to merge 4 commits into master

@amametjanov (Member) commented Oct 18, 2024

v3.NARRM WCYCL PE-layouts on Chrysalis:

  • Tiny: 10 nodes -- ~1 sypd
  • XSmall: 20 nodes -- ~2 sypd
  • Small: 30 nodes -- ~3 sypd
  • SMedium: 40 nodes -- ~4 sypd
  • Medium: 50 nodes -- ~5 sypd
  • XMedium: 64 nodes -- ~6 sypd

Todo:

  • Large: 100 nodes -- ~10 sypd
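To compare cost across layouts, the approximate throughputs above can be converted into node-hours per simulated year (nodes × 24 / SYPD). The following is a minimal shell sketch that assumes the rounded SYPD values listed (Large is left out because it is still a target). `node_hours_per_sim_year` is a hypothetical helper, not part of CIME.

```sh
#!/bin/sh
# Hypothetical helper: node-hours charged per simulated year.
# A run at S simulated years per day (SYPD) on N nodes costs N * 24 / S
# node-hours per simulated year.
node_hours_per_sim_year() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * 24 / s }'
}

# Rounded layouts from the list above: name, nodes, SYPD.
for layout in "Tiny 10 1" "XSmall 20 2" "Small 30 3" "SMedium 40 4" "Medium 50 5" "XMedium 64 6"; do
  set -- $layout   # intentional word splitting into name/nodes/sypd
  printf '%-8s %3s nodes  %s node-hours/sim-year\n' "$1" "$2" "$(node_hours_per_sim_year "$2" "$3")"
done
```

With these rounded inputs, Tiny through Medium all come out at 240 node-hours per simulated year and XMedium at 256, which suggests close to linear scaling up to 50 nodes.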

[BFB]

- Tiny:   10 nodes, ~1 sypd
- XSmall: 20 nodes, ~2 sypd
@amametjanov added the Machine Files, BFB (PR leaves answers BFB), RRM (Regionally refined model), and Chrysalis labels on Oct 18, 2024
@amametjanov self-assigned this on Oct 18, 2024

github-actions bot commented Oct 18, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6700/
on branch gh-pages at 2024-10-29 04:31 UTC

@rljacob requested a review from @tangq on October 21, 2024 15:49

@tangq (Contributor) left a comment

These two PE layouts are great for short tests. Are we going to have larger layouts, e.g., S, M, and L, for longer runs, or will those be added in a separate PR?

- Small:   30 nodes, ~3 sypd
- SMedium: 40 nodes, ~4 sypd
- Medium:  50 nodes, ~5 sypd
- Small:    64 nodes, ~1.8 sypd
- SMedium:  96 nodes, ~2.5 sypd
- Medium:  128 nodes, ~3.2 sypd
@tangq (Contributor) commented Oct 29, 2024

Hi @amametjanov, I notice that the XMedium (64 nodes) layout gives ~6 SYPD. This is very close to what I get (5.92 SYPD) with every component spread across the same 60 nodes (https://pace.ornl.gov/exp-details/200339). Is that because ATM takes most of the time, so we don't gain much by putting the other components on nodes separate from ATM?

@amametjanov (Member, Author)

Yes, ATM uses ~30 seconds/mday out of a total of ~39 seconds/mday.
In your runs, OCN uses ~3.7 seconds/mday.
I saw 6.1 SYPD in Ld10-long runs.
I think you'll observe more than 6.1 SYPD in the longer Ly5-long jobs.
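These timings line up with the quoted throughput. A model day that costs T wall-clock seconds gives 86400 / T model days per wall-clock day, or 86400 / (365 T) SYPD, assuming a 365-day (no-leap) model year. The sketch below applies that conversion to the ~39 s/mday total and to ATM's ~30 s/mday alone; the latter is only a rough ceiling for the case where every non-ATM cost is hidden behind ATM at the same ATM task count. `sypd_from_sec_per_mday` is a hypothetical helper.

```sh
#!/bin/sh
# Hypothetical helper: convert wall-clock seconds per model day into
# simulated years per day (SYPD), assuming a 365-day (no-leap) model year.
sypd_from_sec_per_mday() {
  awk -v t="$1" 'BEGIN { printf "%.2f\n", 86400 / (t * 365) }'
}

# Approximate figures quoted above.
total=39   # s/mday, whole coupled model
atm=30     # s/mday, ATM alone

echo "SYPD at ${total} s/mday: $(sypd_from_sec_per_mday "$total")"
awk -v a="$atm" -v t="$total" 'BEGIN { printf "ATM share of the model day: %.0f%%\n", 100 * a / t }'
# Ceiling if only ATM cost wall-clock time (other components fully overlapped).
echo "ATM-only ceiling: $(sypd_from_sec_per_mday "$atm") SYPD"
```

This gives about 6.07 SYPD for 39 s/mday, an ATM share of about 77%, and roughly 7.89 SYPD as the ATM-only ceiling, which is consistent with moving the other components off the ATM nodes changing the total only modestly.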
This can be tried out with a few xmlchange commands to match this PE layout:

```
$ ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :   3840/     1;      0      1
ATM :   3840/     1;      0      1
LND :   2304/     1;   1536      1
ICE :   1536/     1;      0      1
OCN :    256/     1;   3840      1
ROF :   2304/     1;   1536      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
IAC :      1/     1;      0      1
ESP :      1/     1;      0      1
```
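The node count of this layout follows from its PE ranges: ATM and CPL occupy PEs 0-3839, ICE runs on 0-1535 alongside LND/ROF on 1536-3839, and OCN gets its own PEs 3840-4095. The sketch below computes the total from a ./pelayout listing. It assumes 64 MPI tasks per Chrysalis node and single-threaded components (NTHRDS=1, as shown); `layout_nodes` is a hypothetical helper, not a CIME tool.

```sh
#!/bin/sh
# Hypothetical helper: estimate PEs and nodes needed by a layout in ./pelayout format.
# Assumes NTHRDS=1 for every component and $1 MPI tasks per node.
# A component occupies PEs ROOTPE .. ROOTPE + (NTASKS - 1) * PSTRIDE.
layout_nodes() {
  awk -v per_node="$1" '
    BEGIN { npes = 0 }
    $2 == ":" {                      # component rows only; skips the header
      ntasks = $3 + 0; root = $5 + 0; stride = $6 + 0
      last = root + (ntasks - 1) * stride
      if (last + 1 > npes) npes = last + 1
    }
    END { print npes, int((npes + per_node - 1) / per_node) }'
}

layout_nodes 64 <<'EOF'
Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :   3840/     1;      0      1
ATM :   3840/     1;      0      1
LND :   2304/     1;   1536      1
ICE :   1536/     1;      0      1
OCN :    256/     1;   3840      1
ROF :   2304/     1;   1536      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
IAC :      1/     1;      0      1
ESP :      1/     1;      0      1
EOF
```

For the layout above this prints 4096 PEs and 64 nodes: 3840 PEs (60 nodes) for ATM/CPL and the components stacked under them, plus 256 PEs (4 nodes) for OCN.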

and

```sh
$ ./case.setup -r && ./xmlchange BUILD_COMPLETE=TRUE && ./preview_run
$ ./case.submit
```

@tangq (Contributor) commented Oct 31, 2024

Yes, we can try it with the ongoing Ly-5 test run.

@amametjanov, could you paste the xmlchange commands here? (I am not sure which ones are needed.)

Do I have to run ./case.submit again after ./case.setup -r && ./xmlchange BUILD_COMPLETE=TRUE && ./preview_run? Doing so would lose the job's existing place in the queue (the wait time is 1-2 days). I'd prefer to keep that place if we can skip ./case.submit. Thanks.

@amametjanov (Member, Author)

Because the node count changes from 60 to 64, ./case.submit is necessary (queue priority accumulates slowly, though).
Here is one way to change the PE layout:

```sh
$ ./xmlchange NTASKS_CPL=3840,NTASKS_ATM=3840,NTASKS_ICE=1536,NTASKS_OCN=256,NTASKS_LND=2304,NTASKS_ROF=2304,NTASKS_GLC=1,NTASKS_WAV=1,NTASKS_IAC=1,NTASKS_ESP=1
$ ./xmlchange ROOTPE_OCN=3840,ROOTPE_LND=1536,ROOTPE_ROF=1536
```
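The two commands above set NTASKS for every component but ROOTPE only for OCN, LND, and ROF, so they reproduce the listed layout only if the case's other ROOTPEs are already 0 and all NTHRDS are already 1. A hedged sketch that builds the full set of xmlchange arguments (NTASKS, NTHRDS, and ROOTPE per component) from a ./pelayout-style listing follows; `pelayout_to_xmlchange` is a hypothetical helper, and it assumes the ./pelayout output format shown earlier in this thread.

```sh
#!/bin/sh
# Hypothetical helper: turn ./pelayout-style rows ("COMP : NTASKS/ NTHRDS; ROOTPE PSTRIDE")
# into three comma-separated argument strings for ./xmlchange
# (NTASKS_*, then NTHRDS_*, then ROOTPE_*), one per output line.
pelayout_to_xmlchange() {
  awk '
    $2 == ":" {                      # component rows only
      nt = nt sep "NTASKS_" $1 "=" ($3 + 0)
      th = th sep "NTHRDS_" $1 "=" ($4 + 0)
      rp = rp sep "ROOTPE_" $1 "=" ($5 + 0)
      sep = ","
    }
    END { print nt; print th; print rp }'
}

pelayout_to_xmlchange <<'EOF'
ATM :   3840/     1;      0      1
OCN :    256/     1;   3840      1
EOF
# NTASKS_ATM=3840,NTASKS_OCN=256
# NTHRDS_ATM=1,NTHRDS_OCN=1
# ROOTPE_ATM=0,ROOTPE_OCN=3840

# Possible use inside a case directory (not executed here):
#   ./pelayout | pelayout_to_xmlchange | while read -r args; do ./xmlchange "$args"; done
```

Generating the arguments from one listing keeps NTASKS, NTHRDS, and ROOTPE in sync with the layout being copied, instead of relying on the case's previous values for the settings that are not typed by hand.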

@tangq (Contributor) commented Oct 31, 2024

OK, I cancelled the job and submitted a new one with this 64-node layout (jobid = 618254).

Labels: BFB (PR leaves answers BFB), Chrysalis, Machine Files, RRM (Regionally refined model)