Runtime errors with ERS.f09_g16_g.MALISIA #6600

Closed
ndkeen opened this issue Sep 10, 2024 · 10 comments · Fixed by #6627
Labels
GCP google cloud platform pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments

ndkeen (Contributor) commented Sep 10, 2024

On gcp12, I'm seeing an error with a test that was working before. It fails during init.

e3sm.log:

10: [gcpe3sm12-compute-test-11:08117] *** An error occurred in MPI_Bcast
10: [gcpe3sm12-compute-test-11:08117] *** reported by process [932315137,10]
10: [gcpe3sm12-compute-test-11:08117] *** on communicator MPI COMMUNICATOR 58 DUP FROM 27
10: [gcpe3sm12-compute-test-11:08117] *** MPI_ERR_TRUNCATE: message truncated
10: [gcpe3sm12-compute-test-11:08117] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
10: [gcpe3sm12-compute-test-11:08117] ***    and potentially your MPI job)


gcpe3sm12-login-c2d-standard-32% cat run/log.landice.0010.d****.err 
----------------------------------------------------------------------
Beginning MPAS-landice Error Log File for task      10 of      16
    for domain ID 8134542
    Opened at 2024/09/10 00:08:37
----------------------------------------------------------------------

ERROR: MPAS IO Error: Bad return value from PIO


gcpe3sm12-login-c2d-standard-32% tail run/glc.log.79762.240910-000831
 Flood fill initialized 5071 cells to global seedMask
   Added 23 new cells to global mask
   Added 0 new cells to global mask
 Flood fill complete.
 Flood fill initialized 7925 cells to global seedMask
   Added 97 new cells to global mask
   Added 0 new cells to global mask
 Flood fill complete.
 Iceberg-detection flood-fill complete. Removed 0 iceberg cells.
 Notice: Nonzero velocity has been calculated on 'uphill' margin edge(s).  normalVelocity has been set to 0 at these location(s).  Number of edges affected on this processor: 4


ndkeen added the GCP google cloud platform label Sep 10, 2024
ndkeen (Contributor, Author) commented Sep 10, 2024

OK, this may be a system issue and/or intermittent.
Running the same test with the same checkout passes.
I tried both OPT and DEBUG:

ERS.f09_g16_g.MALISIA.gcp12_gnu
ERS_D.f09_g16_g.MALISIA.gcp12_gnu

I will leave this open until I see a few more daily passes.

jonbob (Contributor) commented Sep 10, 2024

Thanks @ndkeen -- please let me know what you figure out. Do the log or error files give you any more clues?

ndkeen (Contributor, Author) commented Sep 10, 2024

Noting what looks like a different failure, but with the same test: on pm-cpu, the test has been failing for the last 4-5 attempts.
ERS.f09_g16_g.MALISIA.pm-cpu_gnu

 35: MPICH ERROR [Rank 35] [job id 30335681.1] [Tue Sep 10 03:18:10 2024] [nid006251] - Abort(2170894) (rank 35 in comm 0): Fatal error in PMPI_Bcast: Message truncated, error stack:
 35: PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7ffd7a2e5bbc, count=1, MPI_INT, root=0, comm=comm=0xc400007d) failed
 35: PMPI_Bcast(431)..........: 
 35: MPIR_CRAY_Bcast(493).....: 
 35: MPIR_CRAY_Bcast_Tree(162): 
 35: progress_recv(174).......: Message from rank 32 and tag 2 truncated; 4 bytes received but buffer size is 6
 35: 
 35: aborting job:
 35: Fatal error in PMPI_Bcast: Message truncated, error stack:
 35: PMPI_Bcast(446)..........: MPI_Bcast(buf=0x7ffd7a2e5bbc, count=1, MPI_INT, root=0, comm=comm=0xc400007d) failed
 35: PMPI_Bcast(431)..........: 
 35: MPIR_CRAY_Bcast(493).....: 
 35: MPIR_CRAY_Bcast_Tree(162): 
 35: progress_recv(174).......: Message from rank 32 and tag 2 truncated; 4 bytes received but buffer size is 6


And in log.landice.0035.d0001.err:

----------------------------------------------------------------------
Beginning MPAS-landice Error Log File for task      35 of     128
    for domain ID       1
    Opened at 2024/09/10 03:18:10
----------------------------------------------------------------------

ERROR: MPAS IO Error: Bad return value from PIO
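
For context (an assumption on my part, not confirmed in this thread): MPI_ERR_TRUNCATE on a collective like MPI_Bcast usually means the participating ranks did not agree on the message size, which can happen when one rank takes an error branch (here, the MPAS/PIO failure) and falls out of step on subsequent collectives. A minimal, hypothetical C sketch of that failure mode, not MALI or PIO code (mpicc/mpirun invocation assumed):

/* Hypothetical reproducer: two ranks fall out of step on collectives, so an
 * MPI_Bcast on one rank is matched against a differently sized MPI_Bcast on
 * the other. MPICH reports this on the receiving side as MPI_ERR_TRUNCATE
 * ("Message truncated"), similar to the aborts above.
 * Assumed build/run: mpicc bcast_mismatch.c -o bcast_mismatch && mpirun -n 2 ./bcast_mismatch
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char label[8] = "ok";  /* 8-byte payload broadcast by the root */
    int  status   = 0;     /* 4-byte payload expected by the other rank */

    if (rank == 0) {
        /* Root broadcasts an 8-byte message, then a 4-byte one. */
        MPI_Bcast(label, 8, MPI_CHAR, 0, MPI_COMM_WORLD);
        MPI_Bcast(&status, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        /* This rank skipped the first broadcast (imagine an early return
         * after an I/O error), so its 4-byte receive is matched against the
         * root's 8-byte message: MPI_ERR_TRUNCATE and a fatal abort under
         * the default MPI_ERRORS_ARE_FATAL handler. */
        MPI_Bcast(&status, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}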

ndkeen changed the title from "ERROR: MPAS IO Error: Bad return value from PIO with ERS.f09_g16_g.MALISIA.gcp12_gnu" to "Runtime errors with ERS.f09_g16_g.MALISIA" Sep 10, 2024
jonbob (Contributor) commented Sep 10, 2024

We merged a MALI PR around the time it started having issues. Can you tell if it's the first run that fails, or the second one, which is a restart?

jonbob (Contributor) commented Sep 10, 2024

I'm suspicious it's a gnu compiler thing -- the same test passes with intel. I'm testing it with gnu on chrysalis right now.

jonbob (Contributor) commented Sep 10, 2024

It passed on chrysalis with gnu, so it's not that.

ndkeen (Contributor, Author) commented Sep 10, 2024

I am pretty sure that in both cases (gcp12 and pm-cpu), it was the second run of ERS (the restart run) that failed.

ndkeen (Contributor, Author) commented Sep 11, 2024

Today's gcp12 test, ERS.f09_g16_g.MALISIA.gcp12_gnu, failed in the same way as described above.

I can also reproduce the failure on pm-cpu:

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep6/ERS.f09_g16_g.MALISIA.pm-cpu_gnu.gh6600

ndkeen (Contributor, Author) commented Sep 12, 2024

I just ran some tests on gcp12 where I increased the number of MPI tasks, and it does look like the chance of this error increases as the number of tasks goes up. At 512 tasks, I hit the same failure 3 times in a row on this machine. The default tests only use 16 tasks for GLC.

I made a complete copy of the case (with the 3 failures) on Perlmutter here:

/pscratch/sd/n/ndk/gcp12/ERS_P512.f09_g16_g.MALISIA.gcp12_gnu.20240912_202719_9wivr5

FWIW, I also just tested ERS_D_P512.f09_g16_g.MALISIA.gcp12_gnu and it passed 3 times in a row.

ndkeen added the pm-cpu Perlmutter at NERSC (CPU-only nodes) label Sep 16, 2024
jonbob (Contributor) commented Sep 16, 2024

Let me check with the MALI people -- something is not right. Thanks for all the testing.

jonbob added a commit that referenced this issue Sep 23, 2024
MALI update to fix issues from earlier PR causing sporadic test failures

Including a variable that was deactivated in the globalStats stream
caused sporadic failures during the second run of some ERS tests on
several platform/compiler combinations. That variable is now only
included when MALI is using Albany. Also updates a namelist default
that had been missed but does not change answers.

Fixes #6600

[NML] for configurations with MALI
[BFB]
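
For illustration only: the fix described above amounts to making one entry of the globalStats stream conditional on the Albany build. A hypothetical sketch in the generic MPAS streams XML format -- the stream attributes and variable names below are placeholders, not the actual change merged in the fixing PR:

<!-- Hypothetical streams.landice fragment; names and intervals are placeholders. -->
<stream name="globalStatsOutput"
        type="output"
        filename_template="globalStats.nc"
        output_interval="0001-00-00_00:00:00">
    <var name="totalIceVolume"/>
    <!-- Assumed to be listed only for builds where MALI uses Albany; in
         non-Albany builds the line is absent, so the stream never asks PIO
         to write a deactivated (unallocated) field. -->
    <var name="albanyOnlyDiagnostic"/>
</stream>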
jonbob closed this as completed in 39d5295 Sep 24, 2024