maint-2.0 branch appears broken on Perlmutter #6195
Thanks for finding a reproducer. I can also get the same fail using ERS, but running longer. So then the issue is: what changed? I looked again at the changes I made recently to the machine config on maint-2.0 and I don't see anything that would cause trouble. I then used some previous maint-2.0 checkouts from the past -- back to July -- and all of them have this same failure pattern (fail with ERS_Lm3, fail compare with ERS_Ld31). I then tried the same tests with a recent E3SM master, and at least these two tests both PASS. So I'm curious whether this ever worked on pm-cpu with maint-2.0?
If I run the ERS test with DEBUG, I get:
Which looks familiar -- will search GitHub.
Thanks for looking into this, @ndkeen. I can say that as of last October I was successfully using a maint-2.0-like version of the code with some of my own developments living on a branch in my E3SM fork (beharrop@f113954). It has the most recent SCORPIO updates for maint-2.0 (E3SM-Project/scorpio@de0b1ca), so I would think it would handle IO the same. I was able to run a decade of simulations one year at a time with that code. I will try running your tests as well. I also know a colleague who has been running E3SMv2 (again with modifications) for an RGMA project and is getting the same error I am. Her issue only just started last week, which tracks with what @milenaveneziani is seeing.
Running the
I don't see the fail with maint-2.1, which may be closer to maint-2.0. I have not been able to get the fail with several E3SM master checkouts I have from several months ago. I also don't see a fail with
Could it be related to the
It is most certainly having an issue with a NaN. The routine where it fails is:
And it looks like the input array carr has a value that is NaN. But that does not explain why something would have been working with maint-2.0 before, but not now.
I already tried adding the fix noted in PR #5811 -- but that only impacts tests using that user_nl_elm.
Yeah, I noticed the fix to #5811 was just a swap for a corrected restart file, but that's what had me thinking that perhaps the problem is related to how the model is writing restart files with the current maint-2.0 branch. Following that chain of reasoning, I set up a few tests. Unfortunately, I haven't had any luck so far coming up with a reproducer of something that did restart OK before.
FWIW, I get the same error with ne30. I also tried several tests with reduced optimization flags, to no avail.
The NaN problem was from when we turned on debugging and NaN trapping: it found NaNs in some restart files that had to be removed. But maint-2.0 has the old debug flags, so that shouldn't matter, and Bryce wasn't doing a debug compile. Adding @dqwu since it's a SCORPIO error.
Rob: it still doesn't explain why the users are claiming this was working before and now it does not. I can try to merge in that PR, but it touches a lot of files and I think maybe someone else should do that. Also, maint-2.1 seems to work and I do not think it has the land PR either...
Trying on chrysalis
Try going back to c16b21f on the maint-2.0 branch. That's before the SCORPIO update.
I already tried a July 21st checkout of maint-2.0 with the same results.
I do have one idea that I'm trying now -- nope, that wasn't it.
With GNU, I don't get the NaN crash, but I do see both tests fail compare:
I don't know if this is helpful at all, but last night I restarted my job (going back to pnetcdf for all components) and I first got this error while writing the mpas-si restart file:
After a series of those (for a bunch of sea ice variables), I got the usual error:
For PIO_IOTYPE_PNETCDF, "NetCDF: Numeric conversion not representable" usually means that the write buffer passed to the PnetCDF write APIs contains invalid floating-point values. @jayeshkrishna I remember a similar issue was reported before.
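For reference, err=-60 is NC_ERANGE, whose error string is the "NetCDF: Numeric conversion not representable" message seen in the logs. One way to act on this advice is to sanity-check a buffer before it reaches the write call; below is a minimal sketch in Python/numpy rather than the model's Fortran, with the helper name and the example variable purely hypothetical.

import numpy as np

def check_write_buffer(name, buf):
    """Hypothetical helper: raise if a to-be-written array contains NaN or Inf."""
    arr = np.asarray(buf, dtype=np.float64)
    n_bad = int(np.count_nonzero(~np.isfinite(arr)))
    if n_bad:
        raise ValueError(f"{name}: {n_bad} non-finite values in write buffer")

# Example: flag a bad buffer before the I/O layer ever sees it, e.g.
# check_write_buffer("TWS_MONTH_BEGIN", tws_month_begin_buffer)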
Those are old issues, but Bryce's case was working back in October, so what changed since then?
Let me reiterate here that the branch I'm working with (close to v2.0, with changes to use Intel on pm-cpu) worked just fine on Jan 15, but it started throwing those errors on the 19th.
Bryce, can you show us your case that was working in October?
Peter Schwartz reported a similar issue for a recently failed CDash test:
Yeah, it's usually NaNs in the user buffer (the array being written out).
So, is it possible that the Intel compiler and PIO were more 'forgiving' about writing these NaNs prior to the Jan 17 downtime, for some reason?
I guess there were no NaNs in the write buffer prior to the Jan 17 downtime. Machine environment changes on pm might produce NaNs for the same test, triggering "NetCDF: Numeric conversion not representable (err=-60)".
Jim Foucar confirmed a similar issue for a new CDash test on mappy with GNU compiler:
Is it the case that an error will not be given if the code writes a NaN to netcdf? I'm not seeing an error with DEBUG tests when writing restarts -- it's only when reading them back in. Is it easy to look for NaNs in a given netcdf file? If so, one could count the number of garbage values in files written/read before and after the Jan 17th maintenance.
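On the "is it easy to look for NaNs in a given netcdf file" question: a quick offline scan is straightforward. The sketch below assumes the netCDF4 and numpy Python packages (an assumption, not something from this thread) and simply counts non-finite values per floating-point variable, so restart/history files written before and after the Jan 17th maintenance could be compared.

# nan_scan.py -- count non-finite values in each floating-point variable
# of a NetCDF file. Minimal sketch; assumes netCDF4 + numpy are installed.
import sys
import numpy as np
from netCDF4 import Dataset

def scan(path):
    with Dataset(path, "r") as nc:
        for name, var in nc.variables.items():
            if getattr(var.dtype, "kind", "") != "f":   # skip non-floating-point variables
                continue
            data = var[:]
            if np.ma.isMaskedArray(data):               # masked entries are legitimate _FillValue
                data = data.filled(0.0)
            n_bad = int(np.count_nonzero(~np.isfinite(data)))
            if n_bad:
                print(f"{name}: {n_bad} non-finite values")

if __name__ == "__main__":
    scan(sys.argv[1])   # e.g. python nan_scan.py case.elm.r.0001-02-01-00000.nc

Running it on the same file written before and after the maintenance window would show whether the garbage values were already present earlier.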
As I mentioned above, I was using a version of maint-2.0 with my own modifications on top (beharrop@f113954) for the runs I did last October. The run I was doing lives here: |
Since in my case that particular error happens when writing elm history files, I checked some files that were written prior to Jan 17 (using |
It could also be due to uninitialized buffers used for the variable being written out to the file (after updates, these buffers might not be initialized to 0, etc.).
@jayeshkrishna, could you please elaborate on this:
Are you saying that the uninitialized buffers were not a problem before the update but are a problem now?
I am just speculating here, but updates to compilers, etc. can change compiler strategies (for performance, etc.) regarding uninitialized buffers (all buffers written out need to be initialized to valid/fill values).
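To make the failure mode Jayesh describes concrete, here is a small numpy sketch (Python stands in for the model's Fortran, and the fill value is assumed): an np.empty buffer that is only partially assigned hands unspecified bytes to the I/O layer, whereas a buffer pre-initialized to a fill value stays representable.

import numpy as np

FILL = 1.0e36          # stand-in for a NetCDF fill value; the model's actual value is assumed
n = 10

# Risky pattern: np.empty() leaves the memory contents unspecified, so any element
# the code never assigns reaches the write call as garbage (possibly NaN/Inf).
buf_bad = np.empty(n)
buf_bad[:5] = 3.14     # only half of the buffer is ever filled

# Safe pattern: pre-initialize the whole buffer, so unassigned elements are still
# valid, representable values when the variable is written out.
buf_ok = np.full(n, FILL)
buf_ok[:5] = 3.14

print(np.isfinite(buf_bad).all())   # not guaranteed -- depends on whatever was in memory
print(np.isfinite(buf_ok).all())    # always True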
I have been toying with some things to better understand what is happening, and thought I'd share. First, running with
@beharrop Yes, I believe I resolved that issue in master with this PR (#5311), so adding it to maint-2.0 may help.
OK, after looking at a different error on master with @peterdschwartz (oh, I see he already posted above), he noted that his change in #5311 might impact this issue. After I added those changes, both of these tests pass:
But that still does not explain how you are saying that it was working before the Jan 17th PM maintenance and not after. It seems like it would never have worked...
Thanks @peterdschwartz! I tried merging the changes from #5311 into my code and it can now run without error. I agree with @ndkeen that I don't understand why this only suddenly became a problem, but I am happy to have a fix available. Can the powers that be simply merge #5311 into maint-2.0 (and maint-2.1 if it needs it), or do we need a new PR?
Thank you all for pointing us to this! I just merged #5311 into my branch as well and am waiting for the job to start, hopefully successfully this time.
@peterdschwartz please put that fix on the maint-2.0 and maint-2.1 branches. There's probably no point in trying to figure out more about what changed. It's annoying when sysadmins change things under a module, but it happens.
On the other hand, system/compiler changes might help us find previously unnoticed issues like uninitialized buffers.
Add endwb to ELM restart file in maint-2.0. [BFB] Fixes #6195
I got similar errors for other variables when writing files, including AR, CWDN, TCS_MONTH_BEGIN, etc. @rljacob @ndkeen

384: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 474) to file (...elm.h0.1985-03.nc, ncid=4347) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (CWDN, varid=72) failed (Number of subarray requests/regions=1, Size of data local to this process = 432). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR...
If it's on latest master, you should make a new issue for it.
Fixed by #6206 |
This solution works for me. Thank you.
The maint-2.0 branch (b583f7d) is having problems on Perlmutter (CPU nodes). Any run that restarts will fail with the following PIO error in the e3sm.log file.
PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file (./20240129.restart_test2.maint_20.elm.h0.0001-02.nc, ncid=143) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (TWS_MONTH_BEGIN, varid=205) failed (Number of subarray requests/regions=1, Size of data local to this process = 433). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/pscratch/sd/b/beharrop/temp/e3sm_code/20240129/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 395)
I tried to reproduce this with a shorter test using ./create_test ERS.ne4_oQU240.F2010, but that did not reproduce the error: ERS.ne4_oQU240.F2010.pm-cpu_intel (Overall: PASS). I have also been unable to reproduce this error on either Chrysalis or Compy. The smallest reproducible setup I've been able to come up with that hits this error is the following. I have been using the Intel compiler and haven't tested any of the others yet. Has anyone else encountered anything like this?