Runtime errors with ERS.f09_g16_g.MALISIA #6600

Closed
ndkeen opened this issue Sep 10, 2024 · 10 comments · Fixed by #6627
Labels
GCP google cloud platform pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments

ndkeen (Contributor) commented Sep 10, 2024

On gcp12, I'm seeing an error with a test that was working before. It fails during init.

e3sm.log:

10: [gcpe3sm12-compute-test-11:08117] *** An error occurred in MPI_Bcast
10: [gcpe3sm12-compute-test-11:08117] *** reported by process [932315137,10]
10: [gcpe3sm12-compute-test-11:08117] *** on communicator MPI COMMUNICATOR 58 DUP FROM 27
10: [gcpe3sm12-compute-test-11:08117] *** MPI_ERR_TRUNCATE: message truncated
10: [gcpe3sm12-compute-test-11:08117] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
10: [gcpe3sm12-compute-test-11:08117] ***    and potentially your MPI job)


gcpe3sm12-login-c2d-standard-32% cat run/log.landice.0010.d****.err 
----------------------------------------------------------------------
Beginning MPAS-landice Error Log File for task      10 of      16
    for domain ID 8134542
    Opened at 2024/09/10 00:08:37
----------------------------------------------------------------------

ERROR: MPAS IO Error: Bad return value from PIO


gcpe3sm12-login-c2d-standard-32% tail run/glc.log.79762.240910-000831
 Flood fill initialized 5071 cells to global seedMask
   Added 23 new cells to global mask
   Added 0 new cells to global mask
 Flood fill complete.
 Flood fill initialized 7925 cells to global seedMask
   Added 97 new cells to global mask
   Added 0 new cells to global mask
 Flood fill complete.
 Iceberg-detection flood-fill complete. Removed 0 iceberg cells.
 Notice: Nonzero velocity has been calculated on 'uphill' margin edge(s).  normalVelocity has been set to 0 at these location(s).  Number of edges affected on this processor: 4


ndkeen added the GCP google cloud platform label Sep 10, 2024
ndkeen (Contributor, Author) commented Sep 10, 2024

OK, this may be a system issue and/or intermittent.
Running the same test with the same checkout passes.
I tried both OPT and DEBUG:

ERS.f09_g16_g.MALISIA.gcp12_gnu
ERS_D.f09_g16_g.MALISIA.gcp12_gnu

I will leave this open until I see a few more daily passes.

jonbob (Contributor) commented Sep 10, 2024

Thanks @ndkeen -- please let me know what you figure out. Do the log or error files give you any more clues?

ndkeen (Contributor, Author) commented Sep 10, 2024

Noting what looks like a different failure, but with the same test: on pm-cpu, the test has been failing for the last 4-5 attempts.
ERS.f09_g16_g.MALISIA.pm-cpu_gnu

 35: MPICH ERROR [Rank 35] [job id 30335681.1] [Tue Sep 10 03:18:10 2024] [nid006251] - Abort(2170894) (rank 35 in comm 0): Fatal error in PMPI_Bcast: Message truncated, error stack:
 35: PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7ffd7a2e5bbc, count=1, MPI_INT, root=0, comm=comm=0xc400007d) failed
 35: PMPI_Bcast(431)..........: 
 35: MPIR_CRAY_Bcast(493).....: 
 35: MPIR_CRAY_Bcast_Tree(162): 
 35: progress_recv(174).......: Message from rank 32 and tag 2 truncated; 4 bytes received but buffer size is 6
 35: 
 35: aborting job:
 35: Fatal error in PMPI_Bcast: Message truncated, error stack:
 35: PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7ffd7a2e5bbc, count=1, MPI_INT, root=0, comm=comm=0xc400007d) failed
 35: PMPI_Bcast(431)..........: 
 35: MPIR_CRAY_Bcast(493).....: 
 35: MPIR_CRAY_Bcast_Tree(162): 
 35: progress_recv(174).......: Message from rank 32 and tag 2 truncated; 4 bytes received but buffer size is 6


And in log.landice.0035.d0001.err:

----------------------------------------------------------------------
Beginning MPAS-landice Error Log File for task      35 of     128
    for domain ID       1
    Opened at 2024/09/10 03:18:10
----------------------------------------------------------------------

ERROR: MPAS IO Error: Bad return value from PIO
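
For context (an assumption on my part, not confirmed in this thread): MPI_ERR_TRUNCATE on a collective like MPI_Bcast usually means the participating ranks did not agree on the message size, which can happen when one rank takes an error branch (here, the MPAS/PIO failure) and falls out of step on subsequent collectives. A minimal, hypothetical C sketch of that failure mode, not MALI or PIO code (mpicc/mpirun invocation assumed):

/* Hypothetical reproducer: two ranks fall out of step on collectives, so an
 * MPI_Bcast on one rank is matched against a differently sized MPI_Bcast on
 * the other. MPICH reports this on the receiving side as MPI_ERR_TRUNCATE
 * ("Message truncated"), similar to the aborts above.
 * Assumed build/run: mpicc bcast_mismatch.c -o bcast_mismatch && mpirun -n 2 ./bcast_mismatch
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char label[8] = "ok";  /* 8-byte payload broadcast by the root */
    int  status   = 0;     /* 4-byte payload expected by the other rank */

    if (rank == 0) {
        /* Root broadcasts an 8-byte message, then a 4-byte one. */
        MPI_Bcast(label, 8, MPI_CHAR, 0, MPI_COMM_WORLD);
        MPI_Bcast(&status, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        /* This rank skipped the first broadcast (imagine an early return
         * after an I/O error), so its 4-byte receive is matched against the
         * root's 8-byte message: MPI_ERR_TRUNCATE and a fatal abort under
         * the default MPI_ERRORS_ARE_FATAL handler. */
        MPI_Bcast(&status, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}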

ndkeen changed the title from "ERROR: MPAS IO Error: Bad return value from PIO with ERS.f09_g16_g.MALISIA.gcp12_gnu" to "Runtime errors with ERS.f09_g16_g.MALISIA" Sep 10, 2024
jonbob (Contributor) commented Sep 10, 2024

We merged a MALI PR around the time it started having issues. Can you tell if it's the first run that fails, or the second one, which is a restart?

jonbob (Contributor) commented Sep 10, 2024

I'm suspicious it's a gnu compiler thing -- the same test passes with intel. I'm testing it with gnu on chrysalis right now.

jonbob (Contributor) commented Sep 10, 2024

It passed on chrysalis with gnu, so it's not that.

ndkeen (Contributor, Author) commented Sep 10, 2024

I am pretty sure that in both cases (gcp12 and pm-cpu), it was the second run of ERS (the restart run) that failed.

ndkeen (Contributor, Author) commented Sep 11, 2024

Today's gcp12 test, ERS.f09_g16_g.MALISIA.gcp12_gnu, failed in the same way as described above.

I can also reproduce the failure on pm-cpu:

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep6/ERS.f09_g16_g.MALISIA.pm-cpu_gnu.gh6600

ndkeen (Contributor, Author) commented Sep 12, 2024

I just ran some tests on gcp12 where I increased the number of MPI tasks, and it does look like the chance of this error increases as the number of tasks goes up. At 512 tasks, I hit the same failure 3 times in a row on this machine. The default tests only use 16 tasks for GLC.

I made a complete copy of the case (with the 3 failures) on Perlmutter here:

/pscratch/sd/n/ndk/gcp12/ERS_P512.f09_g16_g.MALISIA.gcp12_gnu.20240912_202719_9wivr5

FWIW, I also just tested ERS_D_P512.f09_g16_g.MALISIA.gcp12_gnu and it passed 3 times in a row.

ndkeen added the pm-cpu Perlmutter at NERSC (CPU-only nodes) label Sep 16, 2024
jonbob (Contributor) commented Sep 16, 2024

Let me check with the MALI people -- something is not right. Thanks for all the testing.

jonbob added a commit that referenced this issue Sep 23, 2024
MALI update to fix issues from earlier PR causing sporadic test failures

Including a variable that was deactivated in the globalStats stream
caused sporadic failures during the second run of some ERS tests on
several platform/compiler combinations. That variable is now only
included when MALI is using Albany. Also updates a namelist default
that had been missed but does not change answers.

Fixes #6600

[NML] for configurations with MALI
[BFB]
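
For illustration only: the fix described above amounts to making one entry of the globalStats stream conditional on the Albany build. A hypothetical sketch in the generic MPAS streams XML format -- the stream attributes and variable names below are placeholders, not the actual change merged in the fixing PR:

<!-- Hypothetical streams.landice fragment; names and intervals are placeholders. -->
<stream name="globalStatsOutput"
        type="output"
        filename_template="globalStats.nc"
        output_interval="0001-00-00_00:00:00">
    <var name="totalIceVolume"/>
    <!-- Assumed to be listed only for builds where MALI uses Albany; in
         non-Albany builds the line is absent, so the stream never asks PIO
         to write a deactivated (unallocated) field. -->
    <var name="albanyOnlyDiagnostic"/>
</stream>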
jonbob closed this as completed in 39d5295 Sep 24, 2024