Running clm-fates with 1PFT at f45 and larger PE layout fails in driver... #242

Closed
ekluzek opened this issue Jul 11, 2017 · 3 comments

Comments

@ekluzek
Collaborator

ekluzek commented Jul 11, 2017

The following configuration with fates-clm master fails right away, before initialization finishes.

./create_newcase -case 1PFTfastPEclm-fatesr244f45GSWP3 -res f45_f45 -compset 2000_DATM%GSWP3_CLM45%ED_SICE_SOCN_RTM_SGLC_SWAV --user-compset --pesfile ../../components/clm/cime_config/config_pes.xml --run-unsupported

user_nl_clm:
fates_paramfile='/glade/p/work/jkshuman/FATES_data/fates_params.c170526_troptree.nc'

./xmlchange STOP_OPTION=nyears,STOP_N=5,DATM_CLMNCEP_YR_ALIGN=1985,DATM_CLMNCEP_YR_START=1985,DATM_CLMNCEP_YR_END=2004

diff ../../../components/clm/src/fates/main/EDTypesMod.F90 SourceMods/src.clm/
25c25
<   integer, parameter :: numpft_ed = 2             ! number of PFTs used in ED. 
---
>   integer, parameter :: numpft_ed = 1             ! number of PFTs used in ED. 


@bandre-ucar @jkshuman

@ekluzek
Collaborator Author

ekluzek commented Jul 11, 2017

The traceback that is given in the cesm.log file is...

144:MPT: #1  0x00002aaab01e979c in mpi_sgi_system (command=<optimized out>) at sig.c:98
144:MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:339
144:MPT: #3  0x00002aaab0143fae in print_traceback (ecode=0) at abort.c:227
144:MPT: #4  0x00002aaab0144025 in MPI_SGI_abort () at abort.c:90
144:MPT: #5  0x00002aaab0182e01 in errors_are_fatal (comm=<optimized out>, 
144:MPT:     code=<optimized out>) at errhandler.c:230
144:MPT: #6  0x00002aaab0183194 in MPI_SGI_error (comm=114, code=15) at errhandler.c:58
144:MPT: #7  0x00002aaab0141141 in MPI_SGI_request_test (request=0x7fffffff6024, 
144:MPT:     status=0x3441190 <mpi_sgi_status_ignore>, set=0x7fffffff6020, 
144:MPT:     rc=0x7fffffff601c) at req.c:1401
144:MPT: #8  0x00002aaab0141220 in MPI_SGI_request_wait (request=0x7fffffff6024, 
144:MPT:     status=0x3441190 <mpi_sgi_status_ignore>, set=0x7fffffff6020, 
144:MPT:     gen_rc=0x7fffffff601c) at req.c:1659
144:MPT: #9  0x00002aaab01edf7d in MPI_SGI_recv (buf=<optimized out>, 
144:MPT:     count=<optimized out>, type=<optimized out>, des=<optimized out>, 
144:MPT:     tag=<optimized out>, comm=<optimized out>, 
144:MPT:     status=0x3441190 <mpi_sgi_status_ignore>) at sugar.c:40
144:MPT: #10 0x00002aaab015a6dc in MPI_SGI_bcast_basic (comm=<optimized out>, 
144:MPT:     root=<optimized out>, type=<optimized out>, count=<optimized out>, 
144:MPT:     buffer=<optimized out>) at bcast.c:252
144:MPT: #11 MPI_SGI_bcast (
144:MPT:     buffer=0x34466c0 <cplcomp_exchange_mod_mp_seq_mctext_avextend_$ILIST.0.9>, 
144:MPT:     count=4096, type=19, root=15, comm=114) at bcast.c:482
144:MPT: #12 0x00002aaab015ac09 in MPI_SGI_bcast_topo (
144:MPT:     buffer=0x34466c0 <cplcomp_exchange_mod_mp_seq_mctext_avextend_$ILIST.0.9>, 
144:MPT:     count=4096, type=19, root=540, comm=9, force=0) at bcast.c:340
144:MPT: #13 0x00002aaab015a768 in MPI_SGI_bcast (
144:MPT:     buffer=0x34466c0 <cplcomp_exchange_mod_mp_seq_mctext_avextend_$ILIST.0.9>, 
144:MPT:     count=4096, type=19, root=540, comm=9) at bcast.c:457
144:MPT: #14 0x00002aaab015ad52 in PMPI_Bcast (
144:MPT:     buffer=0x34466c0 <cplcomp_exchange_mod_mp_seq_mctext_avextend_$ILIST.0.9>, 
144:MPT:     count=-42580, type=19, root=540, comm=9) at bcast.c:93
144:MPT: #15 0x00002aaab015af7a in pmpi_bcast__ ()
144:MPT:    from /glade/u/apps/opt/mpt/2.15-sgi715a158/lib/libmpi.so
144:MPT: #16 0x000000000042bba5 in cplcomp_exchange_mod_mp_seq_mctext_avextend_ ()
144:MPT:     at /glade/u/home/erik/fates-clm/cime/src/drivers/mct/main/cplcomp_exchange_mod.F90:833
144:MPT: #17 0x000000000042d297 in cplcomp_exchange_mod_mp_seq_mctext_avinit_ ()
144:MPT:     at /glade/u/home/erik/fates-clm/cime/src/drivers/mct/main/cplcomp_exchange_mod.F90:392
144:MPT: #18 0x0000000000426fb4 in component_mod_mp_component_init_cx_ ()
144:MPT:     at /glade/u/home/erik/fates-clm/cime/src/drivers/mct/main/component_mod.F90:347
144:MPT: #19 0x00000000004187cc in cesm_comp_mod_mp_cesm_init_ ()
144:MPT:     at /glade/u/home/erik/fates-clm/cime/src/drivers/mct/main/cesm_comp_mod.F90:1211
144:MPT: #20 0x0000000000424853 in MAIN__ ()
144:MPT:     at /glade/u/home/erik/fates-clm/cime/src/drivers/mct/main/cesm_driver.F90:62
144:MPT: #21 0x000000000040835e in main ()
144:MPT: (gdb) A debugging session is active.

@ekluzek
Collaborator Author

ekluzek commented Jul 11, 2017

The PE layout is:

scripts/1PFTfastPEclm-fatesr244f45GSWP3> ./xmlquery NTASKS,ROOTPE

Results in group mach_pes
	NTASKS: ['CPL:540', 'ATM:540', 'LND:540', 'ICE:540', 'OCN:540', 'ROF:540', 'GLC:540', 'WAV:540', 'ESP:540']
	ROOTPE: ['CPL:36', 'ATM:0', 'LND:36', 'ICE:36', 'OCN:36', 'ROF:36', 'GLC:36', 'WAV:36', 'ESP:36']

The same case runs fine with the default PE layout (as per @jkshuman).

@ekluzek
Collaborator Author

ekluzek commented Jul 12, 2017

OK, I didn't notice this at first, but the PE layout above is bad: it should have ATM_NTASKS=-1 rather than -15. With the correct PE layout it runs fine.

So basically the driver is telling us "you have a bad PE layout" in a completely obtuse and obfuscated way. I'll add an issue for this in cime.
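
For anyone comparing the two layouts, here is a minimal sketch of the arithmetic. It is not CIME code. It assumes that a negative NTASKS value means whole nodes and that a node has 36 tasks, which is consistent with the 540 = 15 x 36 reported by xmlquery above but is not stated in this thread. The helper names (`pe_range`, `overlaps`) are hypothetical, and the sketch only shows which PE ranges each layout implies. It does not claim that the overlap is what triggers the MPI abort.

```python
# Hypothetical helpers for illustration only; none of this is part of CIME.

TASKS_PER_NODE = 36  # assumption: 540 tasks in the xmlquery output = 15 nodes * 36


def pe_range(ntasks, rootpe):
    """Return the inclusive (first, last) global PE indices used by a component."""
    return (rootpe, rootpe + ntasks - 1)


def overlaps(a, b):
    """True if two inclusive PE ranges share at least one PE."""
    return a[0] <= b[1] and b[0] <= a[1]


# Layout as reported by xmlquery: ATM gets 540 tasks (-15 nodes) starting at PE 0,
# while LND and the other components get 540 tasks starting at PE 36.
reported_atm = pe_range(15 * TASKS_PER_NODE, 0)
reported_lnd = pe_range(540, 36)

# Layout with ATM_NTASKS=-1 (one node), assuming every other value is unchanged.
corrected_atm = pe_range(1 * TASKS_PER_NODE, 0)
corrected_lnd = pe_range(540, 36)

print("reported :", reported_atm, reported_lnd, overlaps(reported_atm, reported_lnd))
print("corrected:", corrected_atm, corrected_lnd, overlaps(corrected_atm, corrected_lnd))
```

Under these assumptions, ATM in the reported layout spans PEs 0 to 539 and shares most of them with the components rooted at PE 36. With ATM_NTASKS=-1, ATM sits on PEs 0 to 35 and the other components start right after it.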
