writing native grid atmf history files is too slow in FV3ATM #2439

Closed
junwang-noaa opened this issue Sep 13, 2024 · 12 comments · Fixed by NOAA-EMC/fv3atm#876 or #2463
Assignees
Labels
bug Something isn't working

Comments

@junwang-noaa
Collaborator

Description

The G-W gdas fcst job slows down significantly when the option to write native grid history files is turned on. Besides the resource issue on the write grid component, the model also writes native grid atmf history files significantly slower than it writes Gaussian grid atmf history files or native grid restart files. The timing from Dave's test is shown below:

nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./atmf003.nc write time is 18.91891 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_atmf003.nc write time is 184.79446 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_sfcf003.nc write time is 36.00565 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./sfcf003.nc write time is 36.36828 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_core.res.nc write time is 5.30265 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_srf_wnd.res.nc write time is 0.01886 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_tracer.res.nc write time is 7.70513 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.phy_data.nc write time is 7.28120 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.sfc_data.nc write time is 3.23882 at fcst 03:00
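
The slowdown in the history file lines above can be quantified with a short script. This is a minimal sketch: the regular expression and the helper name `parse_write_times` are illustrative assumptions about the log format, based only on the lines shown here.

```python
import re

# Four history-file lines copied from the timing log above.
LOG = """\
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./atmf003.nc write time is 18.91891 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_atmf003.nc write time is 184.79446 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_sfcf003.nc write time is 36.00565 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./sfcf003.nc write time is 36.36828 at fcst 03:00
"""

# Assumed line shape: "<file> write time is <seconds> at fcst <hh:mm>".
PATTERN = re.compile(r"(\S+) write time is\s+([0-9.]+) at fcst")


def parse_write_times(text):
    """Return a {file name: write time in seconds} mapping (hypothetical helper)."""
    return {m.group(1): float(m.group(2)) for m in PATTERN.finditer(text)}


times = parse_write_times(LOG)
ratio = times["./cubed_sphere_grid_atmf003.nc"] / times["./atmf003.nc"]
print(f"native grid atmf write is {ratio:.1f}x slower than Gaussian grid atmf")
```

For these lines the native grid atmf write takes roughly 9.8 times as long as the Gaussian grid atmf write, while the two sfcf files take about the same time.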

To Reproduce:

Additional context

Output

@junwang-noaa added the bug label Sep 13, 2024
@junwang-noaa
Collaborator Author

@DavidHuber-NOAA provided a HR4 gdasfcst test case on dogwood at:

/lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_w_native_rundir

@DusanJovic-NOAA
Collaborator

I noticed that model_configure in the above run directory (/lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_w_native_rundir) sets the lossy compression (quantization) parameters as follows:

$ grep quantize /lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_w_native_rundir/model_configure 
quantize_mode:           'quantize_bitround'
quantize_nsd:            5

The quantize_nsd parameter for 'quantize_bitround' mode specifies the number of significant bits (5 in this case). 5 bits is very low and probably not enough for fields like temperature. This has nothing to do with the native grid file write time, but I wanted to check whether this is really intended.
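
To make the concern about 5 significant bits concrete, here is a minimal sketch of bit rounding on a 32-bit float. It assumes that "significant bits" means explicit mantissa bits of an IEEE 754 single-precision value and uses simple round-half-up on the bit pattern; the function name `bitround_float32` is hypothetical, and this is not the netCDF library's implementation.

```python
import struct


def bitround_float32(value, keep_bits):
    """Round a value to `keep_bits` explicit float32 mantissa bits.

    Illustrative only: assumes a finite input and rounds half up on the
    magnitude bits; NaN and infinity are not handled.
    """
    if not 1 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 1 and 23")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - keep_bits  # number of low mantissa bits to discard
    if drop > 0:
        half = 1 << (drop - 1)                     # half of the discarded range
        mask = ~((1 << drop) - 1) & 0xFFFFFFFF     # clears the discarded bits
        bits = (bits + half) & mask
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for temperature in (287.65, 290.0):
    print(temperature, bitround_float32(temperature, 5), bitround_float32(temperature, 14))
```

Under these assumptions, values between 256 and 512 are spaced 8 apart when only 5 mantissa bits are kept, so temperatures of 287.65 K and 290.0 K both become 288.0 K. With 14 bits the spacing in the same range is about 0.016 K.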

@DavidHuber-NOAA
Collaborator

@aerorahul the quantize_nsd and quantize_bitround fields were updated at NOAA-EMC/global-workflow@386ce38. Just checking if 5 digits is enough for our needs.

@aerorahul
Contributor

@junwang-noaa and @aerorahul had a conversation on what these should be.
If we need more fine-grain control based on resolution/run, we can. Just let us know what those values should be.

@junwang-noaa
Collaborator Author

@DusanJovic-NOAA the quantize_nsd and quantize_bitround configurations correspond to the previous nbits=14 setting in our customized lossy compression code. The physics group evaluated results for nbits settings from 12 to 32 and decided to use nbits=14 in GFSv16. quantize_nsd=5 corresponds to nbits=14.

@DusanJovic-NOAA
Collaborator

@DavidHuber-NOAA Can you sync the input data for the test case you provided to Cactus? I see these errors:

    72.1536.grb
 FATAL ERROR: in opening file
 /lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb
 FATAL ERROR: in opening file
 /lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb

@DavidHuber-NOAA
Collaborator

@DusanJovic-NOAA This test case was run on Dogwood and I do not have access to it now that it is in production. However, I just created a fresh clone into develop. Let me know if that works for you. If not, I will rerun the case on Cactus.

@DusanJovic-NOAA
Collaborator

> @DusanJovic-NOAA This test case was run on Dogwood and I do not have access to it now that it is in production. However, I just created a fresh clone into develop. Let me know if that works for you. If not, I will rerun the case on Cactus.

Thanks. It works, but I had to change the directory names in input.nml.

ls: cannot access '/lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb': No such file or directory

but the same path with lowercase 'david.huber' does exist.

@DusanJovic-NOAA
Collaborator

I found that writing the native history files is noticeably faster if I change the chunk sizes, specifically:

diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index b016415..03a9d57 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -398,14 +398,14 @@ contains
             par_access = NF90_COLLECTIVE
             if (rank == 2 .and. ichunk2d(grid_id) > 0 .and. jchunk2d(grid_id) > 0) then
                if (is_cubed_sphere) then
-                  chunksizes = [im, jm, tileCount, 1]
+                  chunksizes = [im, jm, 1, 1]
                else
                   chunksizes = [ichunk2d(grid_id), jchunk2d(grid_id),            1]
                end if
                ncerr = nf90_def_var_chunking(ncid, varids(i), NF90_CHUNKED, chunksizes) ; NC_ERR_STOP(ncerr)
             else if (rank == 3 .and. ichunk3d(grid_id) > 0 .and. jchunk3d(grid_id) > 0 .and. kchunk3d(grid_id) > 0) then
                if (is_cubed_sphere) then
-                  chunksizes = [im, jm, lm, tileCount, 1]
+                  chunksizes = [im, jm, 1, 1, 1]
                else
                   chunksizes = [ichunk3d(grid_id), jchunk3d(grid_id), min(kchunk3d(grid_id),fldlev(i)), 1]
                end if

Can you apply this change in the code, recompile, and rerun your test?
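
The difference between the old and new chunk shapes in the diff above can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: `im = jm = 768` for a C768 tile, `lm = 127` vertical levels, six tiles, and 4-byte values. None of these numbers appear in the diff itself, and the helper name `chunk_bytes` is hypothetical.

```python
from math import prod

BYTES_PER_VALUE = 4  # assumption: fields are stored as 32-bit floats


def chunk_bytes(chunksizes, bytes_per_value=BYTES_PER_VALUE):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(chunksizes) * bytes_per_value


im = jm = 768    # assumption: C768 tile is 768 x 768 points
lm = 127         # assumption: number of vertical levels
tile_count = 6   # a cubed sphere has six tiles

old_3d = [im, jm, lm, tile_count, 1]  # chunk shape before the change
new_3d = [im, jm, 1, 1, 1]            # chunk shape after the change

old_bytes = chunk_bytes(old_3d)
new_bytes = chunk_bytes(new_3d)
print(f"old 3D chunk: {old_bytes / 2**30:.2f} GiB")
print(f"new 3D chunk: {new_bytes / 2**20:.2f} MiB")
print(f"old / new: {old_bytes // new_bytes}")
```

With these assumed dimensions, the old 3D chunk covers a whole variable at one time step (about 1.67 GiB uncompressed), while the new chunk covers a single level of a single tile (2.25 MiB). The thread does not state why the smaller chunks write faster; the numbers only show how different the two layouts are.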

@DavidHuber-NOAA
Collaborator

@DusanJovic-NOAA Thanks for the quick attention on this. I gave your code changes a try and ran a fresh forecast with native grid writes enabled at C768. This significantly reduced the runtime from ~60 minutes to ~23 minutes. I copied the run directory into /lfs/h2/emc/global/noscrub/david.huber/keep/gdasfcst_fast_native and the log file can be found here: /lfs/h2/emc/global/noscrub/david.huber/para/COMROOT/fix_slow_writes/logs/2021122018/gdasfcst_seg0.log.

@DusanJovic-NOAA
Collaborator

@DavidHuber-NOAA Thank you for checking. @junwang-noaa should we update the code in develop with these changes?

@junwang-noaa
Collaborator Author

@DusanJovic-NOAA Thanks for debugging the issue. The timing looks good now. Please update the develop branch.
