Implementation of ROMS+CICE coupling #4

Open

uturuncoglu opened this issue Mar 20, 2024 · 13 comments

@uturuncoglu
Collaborator

This is a generic issue for discussion about ROMS+CICE coupling. Here is a recent conversation with Hernan:


Hernan:

I designed two test cases: LAKE_ICE and LAKE_ERIE.

The LAKE_ICE is an idealized application to test the ROMS native sea ice model. I ran it for three years. See the following link for the roms_test repository:

https://github.com/myroms/roms_test/tree/main/lake_ice/Forward

I haven't finished configuring LAKE_ERIE. I already have the grid and downloaded the ECMWF forcing, but I haven't created the initial conditions yet. The plan was to spin up for a few years, then test the ROMS native sea ice model and use the same test for coupling ROMS-CICE in the UFS. I stopped doing this because I was asked to develop the decimation/interpolation scheme for ROMS 4D-Var inner loops that our group urgently needs to run ECCOFS. I am almost done with it, and I will come back to LAKE_ERIE.

@uturuncoglu
Collaborator Author

Follow-up question: Okay, that is great. Let me know when the ERIE case is ready. BTW, I just wonder if it is possible to configure ROMS to get atmospheric forcing from CDEPS under UFS Coastal, but using the ROMS-provided CICE in the first round. If that runs, it provides a baseline or reference point for us. Then we could try to replace the internal CICE with the one provided by the UFS Weather Model. Let me know what you think.
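
For reference, a minimal run-sequence sketch for such a baseline (CDEPS DATM forcing ROMS, with ROMS running its native sea ice model internally) might look roughly like the block below; the coupling interval and remap method are placeholders, not a tested configuration:

runSeq::
@3600
  ATM
  ATM -> OCN :remapMethod=redist
  OCN
@
::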

@uturuncoglu
Collaborator Author

uturuncoglu commented Mar 20, 2024

This is also related to #5.

@janahaddad janahaddad moved this to In Progress in ufs-coastal project Mar 21, 2024
@janahaddad janahaddad moved this from In Progress to Todo in ufs-coastal project Mar 21, 2024
@janahaddad janahaddad moved this from Todo to Backlog in ufs-coastal project Sep 23, 2024
@janahaddad janahaddad moved this from Backlog to Todo in ufs-coastal project Sep 23, 2024
@SmithJos13

@uturuncoglu I'm going to revive this thread so we have a place to communicate.

The directory I'm currently working in on Hercules is:

/work2/noaa/vdatum/jsmith/dev/ufs-coastal-cice-dev/tests/run_dir/cice_dev_cdeps2cice_debug_intel

The base directory for my current version of UFS Coastal is:

/work2/noaa/vdatum/jsmith/dev/ufs-coastal-cice-dev/

The input forcing / input grids are located in:

cdeps: /work2/noaa/vdatum/jsmith/ufs_coast_setup/Forcing/
grids for cice: /work2/noaa/vdatum/jsmith/ufs_coast_setup/Grids/CICE/

To build my current case I use:

./rt.sh -l rt_cice_dev.conf -a vdatum -k -c -n "cice_dev_cdeps2cice_debug intel"

The UFS run configuration is:

runSeq::
@@[DT_CICE]
  OCN
  ATM
  OCN -> ICE :remapMethod=redist
  ATM -> ICE :remapMethod=redist
  ICE
@
::

Finally, the error I'm experiencing: during the run phase the model hangs and does not advance any further. In the PET logs I have the following errors:

PET1

20241028 170356.355 ERROR            PET1 OCN-TO-ICE:src/addon/NUOPC/src/NUOPC_Connector.F90:1603 Invalid argument  - Ambiguous connection status, multiple connections with identical bondLevel found for: cpl_scalars
20241028 170356.355 ERROR            PET1 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3099 Invalid argument  - Phase 'IPDv05p2b' Initialize for connectorComp 2 -> 3: OCN-TO-ICE did not return ESMF_SUCCESS
20241028 170356.355 ERROR            PET1 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1423 Invalid argument  - Passing error in return code
20241028 170356.355 ERROR            PET1 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Invalid argument  - Passing error in return code
20241028 170356.355 ERROR            PET1 UFS.F90:394 Invalid argument  - Aborting UFS

PET2

20241028 170356.355 ERROR            PET2 OCN-TO-ICE:src/addon/NUOPC/src/NUOPC_Connector.F90:1603 Invalid argument  - Ambiguous connection status, multiple connections with identical bondLevel found for: cpl_scalars
20241028 170356.355 ERROR            PET2 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3099 Invalid argument  - Phase 'IPDv05p2b' Initialize for connectorComp 2 -> 3: OCN-TO-ICE did not return ESMF_SUCCESS
20241028 170356.355 ERROR            PET2 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1423 Invalid argument  - Passing error in return code
20241028 170356.355 ERROR            PET2 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Invalid argument  - Passing error in return code
20241028 170356.355 ERROR            PET2 UFS.F90:394 Invalid argument  - Aborting UFS

Appreciate any input that you have!

@SmithJos13

SmithJos13 commented Oct 29, 2024

Ahhh okay!

I think I found one issue (I still have more that I'm working on). I'm working from: https://earthsystemmodeling.org/docs/nightly/develop/NUOPC_refdoc/node3.html

The error is stemming from:

After the first stage, there may be ambiguous Field pairs present. Ambiguous Field pairs are those that map different producer Fields (i.e. Fields in the importState of a Connector) to the same consumer Field (i.e. a Field in the exportState of a Connector). While the NUOPC Layer supports having multiple consumer Fields connected to a single producer Field, it does not support the opposite condition. The second stage of Field pairing is responsible for disambiguating Field pairs with the same consumer.

So switching verbosity to high yields the following conflict:

20241029 145140.936 INFO             PET0   ATM-TO-ICE: ProducerConnection (bondLevelMax):         1
20241029 145140.936 INFO             PET0   ATM-TO-ICE:      importStandardNameList(i= 14): cpl_scalars
20241029 145140.936 INFO             PET0   ATM-TO-ICE:         importNamespaceList(i= 14): ATM
20241029 145140.936 INFO             PET0   ATM-TO-ICE:            importCplSetList(i= 14): __UNSPECIFIED__
20241029 145140.936 INFO             PET0   ATM-TO-ICE:      exportStandardNameList(j= 26): cpl_scalars
20241029 145140.936 INFO             PET0   ATM-TO-ICE:         exportNamespaceList(j= 26): ICE
20241029 145140.936 INFO             PET0   ATM-TO-ICE:            exportCplSetList(j= 26): __UNSPECIFIED__
20241029 145140.936 INFO             PET0   ATM-TO-ICE: bondLevel= 1

and

20241029 145140.942 INFO             PET1   OCN-TO-ICE: ProducerConnection (bondLevelMax):         1
20241029 145140.942 INFO             PET1   OCN-TO-ICE:      importStandardNameList(i= 10): cpl_scalars
20241029 145140.942 INFO             PET1   OCN-TO-ICE:         importNamespaceList(i= 10): OCN
20241029 145140.942 INFO             PET1   OCN-TO-ICE:            importCplSetList(i= 10): __UNSPECIFIED__
20241029 145140.942 INFO             PET1   OCN-TO-ICE:      exportStandardNameList(j= 26): cpl_scalars
20241029 145140.942 INFO             PET1   OCN-TO-ICE:         exportNamespaceList(j= 26): ICE
20241029 145140.942 INFO             PET1   OCN-TO-ICE:            exportCplSetList(j= 26): __UNSPECIFIED__
20241029 145140.942 INFO             PET1   OCN-TO-ICE: bondLevel= 1

which is in line with the error: there are two producers of cpl_scalars and one consumer of cpl_scalars, thus the ambiguity.
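
Schematically, the advertise calls involved (the same calls quoted further down in this comment) look like this: both data components export the scalar field under the same standard name, while the CICE cap advertises a single matching import field, so without a mediator both connectors compete for the same consumer field.

! Schematic excerpt only, not runnable on its own; flds_scalar_name resolves
! to "cpl_scalars" in all three caps here.
! CDEPS DATM / DOCN caps (advertise phase):
call dshr_fldList_add(fldsExport, trim(flds_scalar_name))
! CICE cap (advertise phase):
call fldlist_add(fldsToIce_num, fldsToIce, trim(flds_scalar_name))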

So, going to the CDEPS cap for the OCN component,

CDEPS-interface/CDEPS/docn/docn_datamode_copyall_mod.F90

and commenting out the following (line 61):

!call dshr_fldList_add(fldsExport, trim(flds_scalar_name))

gets rid of the error and the model gets past this issue!

This is not the right solution, though, since it leads to there being no cpl_scalars in the OCN state bundle and the model falls over.

A second proposed fix is to comment out the following line:

    !call fldlist_add(fldsToIce_num, fldsToIce, trim(flds_scalar_name))

in

/work2/noaa/vdatum/jsmith/dev/ufs-coastal-cice-dev

which avoids the issue listed above and yields the same error listed below.

@SmithJos13

SmithJos13 commented Oct 29, 2024

That being said, the model still crashes.

More specifically, I get the following standard error:

2: Obtained 10 stack frames.
2: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-rdwrsed/lib/libpioc.so(print_trace+0x29) [0x150e6df41ba9]
2: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-rdwrsed/lib/libpioc.so(pio_err+0xa7) [0x150e6df41b57]
2: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-rdwrsed/lib/libpioc.so(PIOc_Init_Intracomm+0x5ed) [0x150e6df44dad]
2: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-rdwrsed/lib/libpioc.so(PIOc_Init_Intracomm_from_F90+0x14) [0x150e6df44764]
2: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-rdwrsed/lib/libpiof.so(piolib_mod_mp_init_intracom_+0xd6) [0x150e6ded1d16]
2: /work2/noaa/vdatum/jsmith/dev/stmp/jsmith/FV3_RT/rt_3779924/cice_dev_cdeps2cice_debug_intel/./fv3.exe() [0x28e42c9]
2: /work2/noaa/vdatum/jsmith/dev/stmp/jsmith/FV3_RT/rt_3779924/cice_dev_cdeps2cice_debug_intel/./fv3.exe() [0x2884069]
2: /work2/noaa/vdatum/jsmith/dev/stmp/jsmith/FV3_RT/rt_3779924/cice_dev_cdeps2cice_debug_intel/./fv3.exe() [0x1d9ecb9]
2: /work2/noaa/vdatum/jsmith/dev/stmp/jsmith/FV3_RT/rt_3779924/cice_dev_cdeps2cice_debug_intel/./fv3.exe() [0x1b55f62]
2: /work2/noaa/vdatum/jsmith/dev/stmp/jsmith/FV3_RT/rt_3779924/cice_dev_cdeps2cice_debug_intel/./fv3.exe() [0xadbb94]
2: Abort(-1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
0: slurmstepd: error: *** STEP 2977625.0 ON hercules-01-64 CANCELLED AT 2024-10-29T15:15:56 ***

It must be a PIO error or a NetCDF error? I don't know though; I will update as I find out more.

I've isolated the error to the following:

CDEPS-interface/CDEPS/streams/dshr_strdata_mod.F90

and the line

call pio_read_darray(pioid, varid, per_stream%stream_pio_iodesc, data_dbl1d, rcode)

This seems to be causing a ton of issues in CICE...
Does UFS require PIO to work, or is there a way to build the model using just NetCDF?

@uturuncoglu
Collaborator Author

@SmithJos13 Okay. I could not find time to look at this, but I will eventually check and let you know. Sorry.

@SmithJos13

No worries, I'm going to keep working on this. I think I'm making some headway! I'll provide periodic updates!

@SmithJos13

SmithJos13 commented Oct 31, 2024

I've built the model in a "serial" format by only using one PET. I accomplished this by setting:

ATM_pet_bounds = OCN_pet_bounds = ICE_pet_bound = '0 0' 

This was to diagnose which model component was causing the crash. I found a couple of errors in my ATM cap and fixed those. Now the model is crashing again in the ICE component. I'm currently trying to build the CICE component without PIO by changing the following line,

set(CICE_IO "PIO" CACHE STRING "CICE OPTIONS: Choose IO options.") ==> set(CICE_IO "NetCDF" CACHE STRING "CICE OPTIONS: Choose IO options.")

in

/work2/noaa/vdatum/jsmith/dev/ufs-coastal-cice-dev/CICE-interface/CMakeLists.txt
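
Since CICE_IO is declared as a CMake cache variable, it might also be possible to override it at configure time instead of editing the file, assuming the build system forwards extra CMake arguments (I have not verified that the rt.sh/compile path does):

cmake -DCICE_IO=NetCDF <usual UFS CMake options> <source dir>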

Getting a little closer; now it looks like there is an issue with the export pointers,

20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_tauy is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_lat is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_sen is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_lwup is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_evap is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Faii_swnet is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_melth is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_swpen is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_swpen_vdr is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_swpen_vdf is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_swpen_idr is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_swpen_idf is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_meltw is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_salt is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_taux is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_tauy is not connected.
20241031 145133.099 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_bcpho is not connected.
20241031 145133.100 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_bcphi is not connected.
20241031 145133.100 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Export Field = Fioi_flxdst is not connected.
20241031 145133.100 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_dhdx is connected using mesh without ungridded dimension
20241031 145133.100 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_dhdy is connected using mesh without ungridded dimension
20241031 145133.100 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_t is connected using mesh without ungridded dimension
20241031 145133.101 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_s is connected using mesh without ungridded dimension
20241031 145133.101 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_u is connected using mesh without ungridded dimension
20241031 145133.101 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = So_v is connected using mesh without ungridded dimension
20241031 145133.102 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Fioo_q is connected using mesh without ungridded dimension
20241031 145133.102 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_z is connected using mesh without ungridded dimension
20241031 145133.102 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_u is connected using mesh without ungridded dimension
20241031 145133.102 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_v is connected using mesh without ungridded dimension
20241031 145133.103 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_shum is connected using mesh without ungridded dimension
20241031 145133.103 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_tbot is connected using mesh without ungridded dimension
20241031 145133.103 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_pbot is connected using mesh without ungridded dimension
20241031 145133.104 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_swvdr is connected using mesh without ungridded dimension
20241031 145133.104 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_swvdf is connected using mesh without ungridded dimension
20241031 145133.104 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_swndr is connected using mesh without ungridded dimension
20241031 145133.105 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_swndf is connected using mesh without ungridded dimension
20241031 145133.105 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_lwdn is connected using mesh without ungridded dimension
20241031 145133.105 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_rain is connected using mesh without ungridded dimension
20241031 145133.105 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_snow is connected using mesh without ungridded dimension
20241031 145133.106 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_ptem is not connected.
20241031 145133.106 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Sa_dens is not connected.
20241031 145133.106 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_bcph is not connected.
20241031 145133.106 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_dstwet is not connected.
20241031 145133.106 INFO             PET0   (ice_import_export:fld_list_realize)(ice_import_export:realize_fields):CICE_Import Field = Faxa_dstdry is not connected.
20241031 145133.106 INFO             PET0   (ice_comp_nuopc):(InitializeRealize)Debug pnt 13
20241031 145133.106 INFO             PET0   ice_export called
20241031 145133.110 INFO             PET0   (field_getfldptr): ERROR data not allocated
20241031 145133.110 ERROR            PET0 ice_shr_methods.F90:340   Failure  - Passing error in return code
20241031 145133.110 ERROR            PET0 ice_import_export.F90:1074   Failure  - Passing error in return code
20241031 145133.110 ERROR            PET0 ice_comp_nuopc.F90:967   Failure  - Passing error in return code
20241031 145133.110 ERROR            PET0 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2901   Failure  - Phase 'IPDv01p3' Initialize for modelComp 3: ICE did not return ESMF_SUCCESS
20241031 145133.110 ERROR            PET0 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1985   Failure  - Passing error in return code
20241031 145133.110 ERROR            PET0 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:489   Failure  - Passing error in return code
20241031 145133.111 ERROR            PET0 UFS.F90:394   Failure  - Aborting UFS

Maybe the export pointers need to be initialized at the top of the ice_import_export.F90 file?
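
If uninitialized export field data turns out to be the problem, a minimal sketch of the kind of initialization I have in mind (a hypothetical helper, not code from the CICE cap; it assumes a 1-D mesh field with no ungridded dimension) would be:

! Hypothetical helper: zero a realized export field so the first ice_export
! call does not read uninitialized data. Illustration only, not CICE cap code.
subroutine zero_export_field(state, fldname, rc)
  use ESMF, only : ESMF_State, ESMF_Field, ESMF_StateGet, ESMF_FieldGet, &
                   ESMF_KIND_R8, ESMF_SUCCESS
  implicit none
  type(ESMF_State),  intent(inout) :: state
  character(len=*),  intent(in)    :: fldname
  integer,           intent(out)   :: rc
  type(ESMF_Field)                 :: field
  real(ESMF_KIND_R8), pointer      :: dataptr(:)

  rc = ESMF_SUCCESS
  call ESMF_StateGet(state, itemName=trim(fldname), field=field, rc=rc)
  if (rc /= ESMF_SUCCESS) return
  call ESMF_FieldGet(field, farrayPtr=dataptr, rc=rc)  ! 1-D mesh field assumed
  if (rc /= ESMF_SUCCESS) return
  dataptr(:) = 0.0_ESMF_KIND_R8
end subroutine zero_export_field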

Yeah, I don't know what to do now. I can get the model to advance further in the initialization phase by introducing the mediator, but then I think that is going to open up a whole new stack of problems. The issue that I run into there is that the data being sent to the model is not on the same time step? This could be an issue caused by commenting out

  !call fldlist_add(fldsToIce_num, fldsToIce, trim(flds_scalar_name))

I've also tried turning on the component compliance checker, and that was uninformative.

@uturuncoglu
Collaborator Author

@SmithJos13 I could not find your email to send an invitation. Could you send it to me?

@janahaddad janahaddad moved this from Todo to In Progress in ufs-coastal project Nov 15, 2024
@SmithJos13

SmithJos13 commented Nov 25, 2024

Okay! I've done a comparison with the standalone CICE+DATM+DOCN+CMEPS run; here are the results for the 2018-2019 season:

[attached figure: 2018_2019_NUOP_Fi]

Now, some updates on the profiling:

********
IMPORTANT: Large deviations between Connector times on different PETs
are typically indicators of load imbalance in the system. The following
Connectors in this profile may indicate a load imbalance:
         - [ICE-TO-MED] RunPhase1
********

Region                                                                  PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
  [ESMF]                                                                160    160    1        53.6415     53.6388     157     53.6438     29
    [UFS Driver Grid Comp] RunPhase1                                    160    160    1        42.1470     42.1455     158     42.1476     80
      [ICE] RunPhase1                                                   80     80     30       41.5085     41.3196     112     41.7550     85
      [ICE-TO-MED] RunPhase1                                            160    160    30       20.6333     0.0051      106     41.7312     51
      [ATM] RunPhase1                                                   40     40     30       1.2654      1.2235      31      1.3079      5
        (cdeps_datm_comp):(ModelAdvance)                                40     40     30       1.2623      1.2205      31      1.3049      5
          datm_run                                                      40     40     30       1.2617      1.2199      31      1.3042      5
            DATM_RUN                                                    40     40     30       1.2616      1.2199      31      1.3042      5
              datm_strdata_advance                                      40     40     30       1.2327      1.1911      31      1.2742      5
                datm_strd_adv_total                                     40     40     30       1.2326      1.1911      31      1.2742      5
                  datm_strd_adv_readLBUB                                40     40     30       1.2072      1.1659      31      1.2485      3
                    datm_readLBUB_UB_readpio                            40     40     3        1.2066      1.1654      31      1.2479      5
                    datm_readLBUB_fbound                                40     40     3        0.0003      0.0003      19      0.0003      21
                    datm_readLBUB_setup                                 40     40     30       0.0000      0.0000      17      0.0000      3
                    datm_readLBUB_filemgt                               40     40     30       0.0000      0.0000      10      0.0003      0
                  datm_strd_adv_tint                                    40     40     30       0.0253      0.0246      27      0.0266      0
              datm_dfield_copy                                          40     40     30       0.0193      0.0191      28      0.0200      37
              datm_datamode                                             40     40     30       0.0095      0.0094      24      0.0107      0
      [ATM-TO-MED] RunPhase1                                            160    160    30       0.1361      0.0450      16      0.4816      159
      [MED-TO-ICE] RunPhase1                                            160    160    30       0.0551      0.0106      5       0.1334      157
      [MED] med_phases_post_atm                                         160    160    30       0.0512      0.0326      158     0.0792      90
        MED:(med_phases_post_atm)                                       160    160    30       0.0483      0.0300      158     0.0762      90
          MED:(med_phases_post_atm) map_atm2ice                         160    160    30       0.0478      0.0294      158     0.0756      90
            MED: (med_map_mod:med_map_field_packed)                     160    160    30       0.0476      0.0293      158     0.0755      90
              MED: (med_map_mod:med_map_field_packed) map               160    160    30       0.0241      0.0061      157     0.0526      90
              MED: (med_map_mod:med_map_field_packed) copy from src     160    160    30       0.0103      0.0099      88      0.0123      85
              MED: (med_map_mod:med_map_field_packed) copy to dest      160    160    30       0.0097      0.0094      86      0.0117      85
          MED:(med_phases_history_write_comp_aux)                       160    160    30       0.0001      0.0001      159     0.0001      57
          MED:(med_phases_history_write_inst_comp)                      160    160    30       0.0000      0.0000      158     0.0001      33
          MED:(med_phases_history_write_comp_avg)                       160    160    30       0.0000      0.0000      105     0.0000      2
      [MED] med_phases_post_ocn                                         160    160    30       0.0465      0.0191      158     0.0715      145
        MED:(med_phases_post_ocn)                                       160    160    30       0.0437      0.0166      158     0.0687      145
          MED:(med_phases_post_ocn) map_ocn2ice                         160    160    30       0.0432      0.0161      158     0.0682      145
            MED: (med_map_mod:med_map_field_packed)                     160    160    30       0.0431      0.0160      158     0.0681      145
              MED: (med_map_mod:med_map_field_packed) map               160    160    30       0.0302      0.0035      157     0.0551      145
              MED: (med_map_mod:med_map_field_packed) copy from src     160    160    30       0.0054      0.0051      93      0.0065      131
              MED: (med_map_mod:med_map_field_packed) copy to dest      160    160    30       0.0052      0.0050      148     0.0063      85
          MED:(med_phases_history_write_comp_aux)                       160    160    30       0.0001      0.0001      151     0.0001      63
          MED:(med_phases_history_write_inst_comp)                      160    160    30       0.0000      0.0000      156     0.0000      120
          MED:(med_phases_history_write_comp_avg)                       160    160    30       0.0000      0.0000      66      0.0000      98
      [OCN-TO-MED] RunPhase1                                            160    160    30       0.0416      0.0177      48      0.1013      26

I think this is all the relevant info needed. @uturuncoglu, do you know why the ICE->MED phase is so slow when the MED->ICE phase is so fast? I thought there might be an issue with PIO (since I disabled that for CICE), but even with it enabled there are still issues. I've tried bumping the number of tasks and that has no effect. Is there a setting I need to enable in UFS.configure or somewhere else?

@uturuncoglu
Collaborator Author

@SmithJos13 That is great progress. I am not sure why ICE->MED is taking so much time. In your case you have only a data atmosphere and a data ocean, right? I don't think it is related to PIO, since this is the time for the connector; it is just transferring data from one component to another using redistribution. If you could also provide the following information, that would be great:

  • What is the PET distribution in your case?
  • What is the run sequence (incl. coupling interval)?

BTW, it would be nice to run the case a couple of times and see if the issue is persistent. You might be hitting an issue with the system.

@SmithJos13

SmithJos13 commented Nov 25, 2024

Yes, that is correct. I'm only sending data.

Here are the relevant lines from UFS.configure:

# MED #
MED_model:                      cmeps
MED_petlist_bounds:             0 159
MED_compute_tasks:              1
MED_omp_num_threads:            1
MED_attributes::
  ATM_model = datm
  OCN_model = docn
  ICE_model = cice6
  coupling_mode = coastal
  pio_typename = PNETCDF
::

# ATM #
ATM_model:                      datm
ATM_petlist_bounds:             0 9
ATM_omp_num_threads:            1
ATM_compute_tasks:              10
ATM_attributes::
  DumpFields = false
  mesh_atm  = /work2/noaa/vdatum/jsmith/ufs_coast_setup/4km/mesh/mesh.cice.4km.nc
  diro = "."
  logfile = datm.log
  write_restart_at_endofrun = .true.
::

# OCN #
OCN_model:                      docn
OCN_petlist_bounds:             10 19
OCN_omp_num_threads:            1
OCN_compute_tasks:              10
OCN_attributes::
  mesh_ocn = /work2/noaa/vdatum/jsmith/ufs_coast_setup/4km/mesh/mesh.cice.4km.nc
  logfile = docn.log
  write_restart_at_endofrun = .true.
::

# ICE #
ICE_model:                      cice6
ICE_petlist_bounds:             20 159
ICE_omp_num_threads:            1
ICE_attributes::
  ProfileMemory  = false
  OverwriteSlice = false
  mesh_ice = /work2/noaa/vdatum/jsmith/ufs_coast_setup/4km/mesh/mesh.cice.4km.nc
  eps_imesh = 5.0e-1
::


# CMEPS concurrent warm run sequence
# MED med_phases_prep_ocn_avg
runSeq::
@1200
   MED med_phases_prep_atm
   MED med_phases_prep_ocn_avg
   MED med_phases_prep_ice
   MED -> ICE :remapMethod=redist
   MED -> ATM :remapMethod=redist
   MED -> OCN :remapMethod=redist
   ATM
   OCN
   ICE
   ICE -> MED :remapMethod=redist
   ATM -> MED :remapMethod=redist
   OCN -> MED :remapMethod=redist
   MED med_phases_post_ice
   MED med_phases_post_atm
   MED med_phases_post_ocn
   MED med_phases_history_write
   MED med_phases_restart_write
@
::

Running it again still produces the following:

********
IMPORTANT: Large deviations between Connector times on different PETs
are typically indicators of load imbalance in the system. The following
Connectors in this profile may indicate a load imbalance:
         - [ICE-TO-MED] RunPhase1
********

Region                                                                  PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
  [ESMF]                                                                160    160    1        58.1983     58.1870     102     59.3576     20
    [UFS Driver Grid Comp] RunPhase1                                    160    160    1        44.0727     44.0696     159     44.0738     97
      [ICE] RunPhase1                                                   140    140    30       41.5490     41.4204     84      43.2113     20
      [ICE-TO-MED] RunPhase1                                            160    160    30       5.2503      0.0049      123     43.1342     15
      [ATM-TO-MED] RunPhase1                                            160    160    30       1.9294      0.1321      0       2.1024      122

@uturuncoglu
Collaborator Author

@SmithJos13 As you can see from the log, ICE-TO-MED is very minimal; it is just 5 sec (mean). It seems that the slowest component here is ICE (about 41 sec). If you don't mind, could you try increasing the number of cores for ICE? You might see some improvement there. You could also try disabling the history write on the ice side (just put a very large number for the history interval in the CICE config file) to check the contribution of I/O. If the bottleneck is coming from I/O, I could find the right configuration for PIO. Then, maybe another test could be aligning ICE with MED.
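
For reference, a sketch of the two suggested changes; the PET range and namelist values are illustrative only and should be checked against the CICE documentation (to my understanding, histfreq = 'x' turns off the corresponding history stream).

In UFS.configure (aligning ICE with MED):

ICE_petlist_bounds:             0 159

In ice_in (setup_nml), to effectively disable history output for this test (or, alternatively, set a very large histfreq_n):

  histfreq   = 'x','x','x','x','x'
  histfreq_n = 1, 1, 1, 1, 1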
