Memory bottleneck in chgres_cube #633

Closed
GeorgeGayno-NOAA opened this issue Feb 25, 2022 · 12 comments · Fixed by #766

@GeorgeGayno-NOAA
Collaborator

GeorgeGayno-NOAA commented Feb 25, 2022

Users occasionally get out-of-memory issues when running chgres_cube for large domains. Almost always, this happens during the regridding of the 3-D winds to the edges of the grid box.

! Interpolate winds to 'd' grid.

I suspect this is because the ESMF field for winds is 4-dimensional (it holds the x, y and z wind components at every vertical level).

Interpolating each wind component separately or as a field bundle would likely save memory.
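A rough size estimate shows why splitting the field could help. The sketch below is not taken from chgres_cube; it assumes six square tiles of res x res points, 8-byte values, and one value per cell (it ignores the extra row of points on the grid-box edges). The function name and its defaults are hypothetical, and the totals are for the whole grid, not per MPI task.

```python
def wind_field_bytes(res, levels, components=3, tiles=6, bytes_per_value=8):
    """Total bytes for a wind field on a cubed-sphere grid (all tasks combined).

    Assumptions (not from the issue): 8-byte reals, one value per grid cell.
    """
    points_per_level = tiles * res * res  # six res x res tiles
    return points_per_level * levels * components * bytes_per_value


GIB = 1024 ** 3

# C1152 with 128 levels, the grid used for testing later in this thread.
combined = wind_field_bytes(1152, 128)              # all three components in one field
single = wind_field_bytes(1152, 128, components=1)  # one component at a time

print(f"combined field: {combined / GIB:.1f} GiB")
print(f"one component:  {single / GIB:.1f} GiB")
```

Under these assumptions a single component is one third the size of the combined field, so regridding the components one at a time should lower the peak size of any one field or message, although the memory used internally by the regridding itself may not scale the same way.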

@GeorgeGayno-NOAA GeorgeGayno-NOAA added the enhancement New feature or request label Feb 25, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 25, 2022
@GeorgeGayno-NOAA GeorgeGayno-NOAA self-assigned this Feb 25, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 28, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 28, 2022
@GeorgeGayno-NOAA
Collaborator Author

I set up a case using GFSv16 netcdf data as input for a C1152 L128 grid. My test script and config file are on Dell: /gpfs/dell2/emc/modeling/noscrub/George.Gayno/ufs_utils.git/chgres_mem

Using 'develop' at 570ea39 required 8 nodes/6 tasks per node. Using the branch at f584c91 only required 4 nodes/6 tasks per node.

Will try additional tests.

@GeorgeGayno-NOAA
Collaborator Author

Tried C3072 L65 on Dell. Using 'develop' required 30 nodes/6 tasks per node. Using the branch required 20 nodes/6 tasks per node.

@GeorgeGayno-NOAA
Collaborator Author

Here is the error I get from 'FieldRegrid', which is solved by doubling the number of nodes:

Fatal error in MPI_Irecv: Invalid count, error stack:
MPI_Irecv(170): MPI_Irecv(buf=0x2b5d79127010, count=-219953152, MPI_BYTE, src=1, tag=0, comm=0x84000002, request=0x4835ba0) failed
MPI_Irecv(107): Negative count, value is -219953152

According to the ESMF group (@rsdunlapiv), this is the result of using 32-bit pointers in some ESMF routines.
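A negative count is what a size larger than 2^31 - 1 looks like once it is stored in a signed 32-bit integer. The sketch below illustrates that wraparound with the value from the error message. The actual message size is not given in the log, so `candidate` is only the smallest size that would wrap to the reported value; the helper name `as_int32` is hypothetical.

```python
def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value (two's complement wraparound)."""
    return (n + 2**31) % 2**32 - 2**31


INT32_MAX = 2**31 - 1

reported = -219953152         # count from the MPI_Irecv error above
candidate = reported + 2**32  # smallest non-negative size that wraps to it (assumption)

print(candidate)                        # candidate size in bytes
print(candidate > INT32_MAX)            # too large for a signed 32-bit count
print(as_int32(candidate) == reported)  # wraps back to the reported value
```

If this reading is right, adding nodes helps because each task then exchanges a smaller message, which keeps the count below the 32-bit limit.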

@GeorgeGayno-NOAA
Collaborator Author

The ESMF group recommends a switch to ESMF v8.3 to help fix this. I just tried v8.3 on Hera using develop at f658c1e and all chgres regression tests passed. Will open an issue to upgrade to v8.3.

@GeorgeGayno-NOAA
Collaborator Author

The ESMF group provided a test branch that fixes this - https://github.com/esmf-org/esmf/tree/feature/large-messages

I cloned and compiled this on Hera here: /scratch1/NCEPDEV/da/George.Gayno/noscrub/esmf.git/esmf

@GeorgeGayno-NOAA
Collaborator Author

On Hera, I compiled 'develop' at 2a07b2c for use as the 'control'.

For the 'test', I compiled 'develop' using the updated ESMF branch. This was done by modifying the build module as follows:

< setenv("ESMFMKFILE","/scratch1/NCEPDEV/da/George.Gayno/noscrub/esmf.git/esmf/lib/libO/Linux.intel.64.intelmpi.default/esmf.mk")
---
> esmf_ver=os.getenv("esmf_ver") or "8.2.1b04"
> load(pathJoin("esmf", esmf_ver))
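Because the build picks up ESMF through the ESMFMKFILE variable, it can be worth confirming which ESMF version a build environment actually points at. The sketch below reads NAME=value lines from an esmf.mk file. It assumes the file contains an ESMF_VERSION_STRING entry; the helper name and the sample text are illustrative only, and the sample is used when ESMFMKFILE is not set.

```python
import os


def read_esmf_mk(text):
    """Collect NAME=value assignments from the text of an esmf.mk file."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


# Illustrative content only; a real esmf.mk has many more entries.
sample = """\
# ESMF application makefile fragment
ESMF_VERSION_STRING=8.3.1
ESMF_VERSION_MAJOR=8
"""

mkfile = os.environ.get("ESMFMKFILE")
if mkfile and os.path.isfile(mkfile):
    with open(mkfile) as handle:
        text = handle.read()
else:
    text = sample  # fall back to the illustrative sample

print(read_esmf_mk(text).get("ESMF_VERSION_STRING", "unknown"))
```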

The test case was a C1152 grid using 128 vertical levels. All config files and scripts are here: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/chgres_memory

@GeorgeGayno-NOAA
Collaborator Author

Running the 'control' with 7 nodes/6 tasks per node resulted in this error (see "log.fail.7nodes.develop"):

33: Fatal error in MPI_Irecv: Invalid count, error stack:
33: MPI_Irecv(170): MPI_Irecv(buf=0x2b367e533010, count=-1980497920, MPI_BYTE, src=33, tag=0, comm=0x84000002, request=0x5be22e0) failed
33: MPI_Irecv(107): Negative count, value is -1980497920

Rerunning with 8 nodes/6 tasks per node was successful. See "log.pass.8nodes.develop".
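When comparing several runs, a small script can pull these failures out of the log files. The sketch below matches only the "Negative count" line format shown above; the leading number is assumed to identify the MPI task, and the function name is hypothetical.

```python
import re

# Matches lines such as "33: MPI_Irecv(107): Negative count, value is -1980497920".
# The leading "<number>: " prefix is optional.
NEGATIVE_COUNT = re.compile(
    r"^(?:(?P<rank>\d+):\s*)?(?P<call>MPI_\w+)\(\d+\): "
    r"Negative count, value is (?P<count>-\d+)"
)


def find_negative_counts(lines):
    """Return (rank, call, count) tuples for each negative-count error line."""
    hits = []
    for line in lines:
        match = NEGATIVE_COUNT.match(line.strip())
        if match:
            rank = int(match["rank"]) if match["rank"] is not None else None
            hits.append((rank, match["call"], int(match["count"])))
    return hits


log = [
    "33: Fatal error in MPI_Irecv: Invalid count, error stack:",
    "33: MPI_Irecv(170): MPI_Irecv(buf=0x2b367e533010, count=-1980497920, "
    "MPI_BYTE, src=33, tag=0, comm=0x84000002, request=0x5be22e0) failed",
    "33: MPI_Irecv(107): Negative count, value is -1980497920",
]

print(find_negative_counts(log))
```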

@GeorgeGayno-NOAA
Collaborator Author

Running the 'test' (which used the updated ESMF branch) was successful using only 5 nodes/6 tasks per node. See "log.pass.5nodes.new.esmf.branch".

So, using the new ESMF test branch eliminates the MPI error and reduces the resources needed to run large grids.
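For reference, the node counts quoted in the comments above can be tabulated as follows. This is only arithmetic on the reported numbers; they are the smallest configurations reported to work, not necessarily true minima, and the labels and function name are my own.

```python
TASKS_PER_NODE = 6  # every run quoted here used 6 tasks per node

# (label, nodes before, nodes after) as reported in the comments above
runs = [
    ("C1152 L128, Dell, code branch", 8, 4),
    ("C3072 L65, Dell, code branch", 30, 20),
    ("C1152 L128, Hera, ESMF test branch", 8, 5),
]


def summarize(before, after, tasks_per_node=TASKS_PER_NODE):
    """Return (tasks before, tasks after, percent fewer nodes)."""
    saved = 100.0 * (before - after) / before
    return before * tasks_per_node, after * tasks_per_node, saved


for label, before, after in runs:
    t0, t1, saved = summarize(before, after)
    print(f"{label}: {t0} -> {t1} tasks ({saved:.1f}% fewer nodes)")
```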

@GeorgeGayno-NOAA
Collaborator Author

Update from the ESMF team (Gerhard):

The large-message fix will be part of the upcoming v8.3.1 patch release. I will 
let you know once it's released. Of course the fix will also go into ESMF develop 
toward the 8.4 release. 

@GeorgeGayno-NOAA
Collaborator Author

ESMF v8.3.1 was officially released: https://github.com/esmf-org/esmf/releases/tag/v8.3.1

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 7, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 7, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 9, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 10, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 10, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 10, 2022
@GeorgeGayno-NOAA
Collaborator Author

Anning Cheng was trying to create a C3072 L128 grid using the gdas_init utility on Cactus. The wind fields in the coldstart files were not correct. I was able to repeat the problem using develop at 711a4dc. I then upgraded to ESMF v8.4.0bs08, but the problem persisted. I ran with 8 nodes/18 tasks per node, and I requested memory of 500 GB. A plot of the problem is attached:

[Attached image: Screenshot (24)]

@GeorgeGayno-NOAA
Collaborator Author

So, the way I create the ESMF fields for 3-d winds must have some other problems. As a test, I merged the latest updates from develop to the bug_fix/chgres_memory branch. I compiled 57792e3 on Cactus, then reran the test from the previous comment. The wind fields looked correct.

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Nov 29, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 7, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 9, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 9, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 13, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 19, 2022
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 13, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 27, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 30, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 30, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 30, 2023
Update unit tests for new specification of wind fields.

Fixes ufs-community#633.
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 31, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 31, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Jan 31, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 1, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 2, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 2, 2023
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Feb 2, 2023
1 participant