
[CMIP6 CMOR-ization & ESGF-publication] NorESM2-MM - piControl #140

Closed
matsbn opened this issue Nov 21, 2019 · 74 comments

@matsbn

matsbn commented Nov 21, 2019

Mandatory information:

Full path to the case(s) of the experiment on NIRD
/projects/projects/NS9560K/noresm/cases
/projects/projects/NS9560K/FRAM/noresm/cases

experiment_id
piControl

model_id
NorESM2-MM

CASENAME(s) and years to be CMORized
N1850frc2_f09_tn14_20191001, 1200-1299
N1850frc2_f09_tn14_20191012, 1300-1449
N1850frc2_f09_tn14_20191113, 1450-1699

Optional information

parent_experiment_id
piControl-spinup

parent_experiment_rip
r1i1p1f1

parent_time_units
'days since 0001-01-01'

branch_method
'Hybrid-restart from year 1200-01-01 of piControl-spinup'

other information

@matsbn
Author

matsbn commented Nov 21, 2019

The full path to the case(s) of the experiment on NIRD should be

/projects/NS9560K/noresm/cases
/projects/NS9560K/FRAM/noresm/cases

with case N1850frc2_f09_tn14_20191001 in /projects/NS9560K/noresm/cases and N1850frc2_f09_tn14_20191012 in /projects/NS9560K/FRAM/noresm/cases.

@YanchunHe
Collaborator

A note on the post-processing of the MM experiments:

The processing of the NorESM2-MM experiments is currently slow due to two factors:

One reason is, of course, the high-resolution and high-frequency output.

The other is that the cmor tool has crashed many times at seemingly arbitrary points,

normally with a simple error such as HDF error,

or with something more detailed:

*** Error in `./noresm2cmor3': free(): invalid pointer: 0x00002b3f39edcd68 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x2b3f39b97489]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5MM_xfree+0xb)[0x2b3f3d80c38b]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(+0x204d3d)[0x2b3f3d848d3d]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5O_msg_reset+0x62)[0x2b3f3d84b2c2]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__link_release_table+0x4f)[0x2b3f3d7b042f]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__dense_iterate+0xac)[0x2b3f3d7a65dc]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__obj_iterate+0x131)[0x2b3f3d7b93d1]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G_iterate+0xe6)[0x2b3f3d7ad886]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5Literate+0x12c)[0x2b3f3d7f9f0c]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(+0xeed38)[0x2b3f38acdd38]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(NC4_open+0x2ee)[0x2b3f38acf02e]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(NC_open+0x28f)[0x2b3f38a0c99f]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(nc_open+0x17)[0x2b3f38a0c707]
/opt/netcdf-4.6.1-intel/lib/libnetcdff.so.6(nf_open_+0x9c)[0x2b3f3856ce7c]
/opt/netcdf-4.6.1-intel/lib/libnetcdff.so.6(netcdf_mp_nf90_open_+0x132)[0x2b3f38597fc2]

...

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
noresm2cmor3       00000000005EB46A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B3F399095D0  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B4C207  gsignal               Unknown  Unknown
libc-2.17.so       00002B3F39B4D8F8  abort                 Unknown  Unknown
libc-2.17.so       00002B3F39B8ED27  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B97489  Unknown               Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D80C38B  H5MM_xfree            Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D848D3D  Unknown               Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D84B2C2  H5O_msg_reset         Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7B042F  H5G__link_release     Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7A65DC  H5G__dense_iterat     Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7B93D1  H5G__obj_iterate      Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7AD886  H5G_iterate           Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7F9F0C  H5Literate            Unknown  Unknown
libnetcdf.so.13.1  00002B3F38ACDD38  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00002B3F38ACF02E  NC4_open              Unknown  Unknown
libnetcdf.so.13.1  00002B3F38A0C99F  NC_open               Unknown  Unknown
libnetcdf.so.13.1  00002B3F38A0C707  nc_open               Unknown  Unknown
libnetcdff.so.6.1  00002B3F3856CE7C  nf_open_              Unknown  Unknown
libnetcdff.so.6.1  00002B3F38597FC2  netcdf_mp_nf90_op     Unknown  Unknown
noresm2cmor3       000000000048B937  m_utilities_mp_ge         791  m_utilities.F
noresm2cmor3       000000000048A387  m_utilities_mp_sc         686  m_utilities.F
noresm2cmor3       00000000004DFF5B  Unknown               Unknown  Unknown
noresm2cmor3       000000000055451C  MAIN__                     55  noresm2cmor.F
noresm2cmor3       000000000040DE6E  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B383D5  __libc_start_main     Unknown  Unknown
noresm2cmor3       000000000040DD69  Unknown               Unknown  Unknown

It looks like it crashes during NetCDF file reading, but I don't think this is a problem with the file itself.

What I plan to do is to try changing the optimisation from -O2 to -O0 in the compiler flags.
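
A minimal sketch of the intended change, assuming the flag is set in a Makefile in the noresm2cmor build directory (the file name and flag location are assumptions on my side):

sed -i 's/-O2/-O0/g' Makefile    # drop optimisation to rule out optimiser-related bugs
make clean && make               # rebuild noresm2cmor3 with the new flags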

Ingo, do you agree, or do you have any other ideas? @IngoBethke

@IngoBethke
Collaborator

I experienced Matlab and nco crashes on NIRD today and wonder whether the login nodes had some resource issues today.

-O2 is usually safe, so I don't think -O0 will have any effect other than making the code slow. In any case, the crash above was in the HDF5 library which is compiled with -O2.

It is hard to identify the problem if the crashes do not occur at the same point. If the crashes occur only for long simulations with extensive file scanning, then it can be worthwhile checking the code for missing netcdf close statements (a very typical bug). There is a user limit on how many files a user can have open at the same time (check with ulimit -a), and if you are running several instances of the tool in parallel then a missing close statement can cause a crash at a seemingly arbitrary position.
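
A quick way to check this hypothesis on a running instance (just a sketch; it assumes the binary is named noresm2cmor3 and that we are on a Linux login node):

pid=$(pgrep -u $USER noresm2cmor3 | head -1)
ls /proc/$pid/fd | wc -l    # number of open file descriptors; compare with the limit from 'ulimit -n'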

First, I would recommend testing with just a single instance of noresm2cmor3 per node and using top to monitor the memory consumption.

If the crashes always occur during reading of the same input file, then I usually use ncdump (ideally from the same library installation as used in noresm2cmor) to dump the entire content of the input file. In most cases this will reproduce the problem.
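
For example (the file path below is only illustrative); dumping header and data forces every variable to be read, and the output can simply be discarded:

ncdump /projects/NS9560K/FRAM/noresm/cases/N1850frc2_f09_tn14_20191012/atm/hist/suspect_file.nc > /dev/null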

@monsieuralok
Collaborator

Hi Yanchun, could you come to my office so we can try to debug it a bit?
Alok

@YanchunHe
Collaborator

Many thanks for the reply, Ingo!

I experienced Matlab and nco crashes on NIRD today and wonder whether the login nodes had some resource issues today.

This has happened not only yesterday, but for quite some days now. So it should not be a problem with the disk.

-O2 is usually safe, so I don't think -O0 will have any effect other than making the code slow. In any case, the crash above was in the HDF5 library which is compiled with -O2.

It is hard to identify the problem if the crashes do not occur at the same point. If the crashes occur only for long simulations with extensive file scanning, then it can be worthwhile checking the code for missing netcdf close statements (a very typical bug). There is a user limit on how many files a user can have open at the same time (check with ulimit -a), and if you are running several instances of the tool in parallel then a missing close statement can cause a crash at a seemingly arbitrary position.

This is the MM piControl run; the simulation is not that long compared to some other simulations, and each cmor task processes only 10 years of data. Some jobs finish successfully for some 10-yr spans, but others don't.

So I don't know whether the 'netcdf close' statements matter in such a situation?
I see that ulimit -n is 1048576, so the limit on the number of open files is not very strict.

Each time I only submit 8 cmor tasks, either as 8 parallel threads or as 8 separate serial jobs, but the problem can occur in both situations.

First, I would recommend testing with just a single instance of noresm2cmor3 per node and using top to monitor the memory consumption.

I also tried a single instance of noresm2cmor3; it failed as well, but that looks like it was due to another temporary disk problem, Stale file handle.

It is hard to monitor the memory consumption, since it takes quite long until the crash, but maybe I can use some automatic logging of the memory consumption.
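
For example, a simple logging loop (a sketch; the process name noresm2cmor3 and the one-minute interval are just assumptions):

while true; do
  date +%FT%T >> mem.log
  ps -u $USER -o pid,rss,comm | grep noresm2cmor3 >> mem.log   # RSS in kB per process
  sleep 60
done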

If the crashes always occur during reading of the same input file, then I usually use ncdump (ideally from the same library installation as used in noresm2cmor) to dump the entire content of the input file. In most cases this will reproduce the problem.

I will check if this is reproducible, e.g., whether the crashes occur while reading the same file.

@YanchunHe
Collaborator

Hi Yanchun, could you come to my office so we can try to debug it a bit?
Alok

Good, I will talk to you around 13:00.

@YanchunHe
Collaborator

I also tried a single instance of noresm2cmor3; it failed as well, but that looks like it was due to another temporary disk problem, Stale file handle.

By the way, the Stale file handle problem occurs just because the temporary fram:/cluster/NS9560K is not mounted properly to the NIRD mount point /projects/NS9560K/FRAM/.

I will try to change to another login node of NIRD.

@YanchunHe YanchunHe added the FRAM label Nov 30, 2019
@YanchunHe
Collaborator

Hi Ingo and Alok,

I tried again with both 8 MPI tasks for the historical run and one serial task for piControl. Both of them now finish the job successfully.

I monitored the maximum memory usage: the MPI tasks take at most 3.0 GB and the serial task takes at most 6.5 GB. Therefore, there should be no memory leak in this case, and we don't need to debug this now. @monsieuralok

I suspect the 'HDF error' problem is likely caused by the instability of the temporary disk mounted from FRAM to nird:/projects/NS9560K/FRAM, which I have noticed is quite unstable.

During some days last week and over the weekend, /projects/NS9560K/FRAM was only mounted on the login0 node of NIRD, so I could only run the post-processing for MM (and some other LM experiments) on the login0 node. I wrote to Sigma2 support, and it is now available on the other nodes as well.

The post-processing of the MM experiments should hopefully progress faster this week. @matsbn

@YanchunHe
Collaborator

This 'HDF error' still occurs very often for the experiments stored on the temporary /projects/NS9560K/FRAM, for both the NorESM2-MM and NorESM2-LM experiments.

I strongly suspect this is due to instability when reading from these data.

I will launch a noresm2cmor task with debug mode on; maybe @monsieuralok can help to debug this.

@YanchunHe
Collaborator

I submitted the job before lunch; it crashed again at around 12:00.

At some point before that, I used the ls command, and it again showed file system errors:

yanchun@login-nird-0:~
$ ls
ls: cannot access ftp: Stale file handle
ls: cannot access workshop: Stale file handle
ls: cannot access mld_diff_new-old.nc: Stale file handle
ls: cannot access logs: Stale file handle
ls: cannot access archive: Stale file handle
ls: cannot access cmor2.log.v20191108b: Stale file handle
ls: cannot access Datasets: Stale file handle
ls: cannot access mld_diff_new-old.pdf: Stale file handle
ls: cannot access mld_diff_new-old.png: Stale file handle
...

This Stale file handle error happens very often, I am afraid.

The noresm2cmor program aborted, this time very likely because of this.

But there is no error reported (no HDF error this time either).
Log files are:

  • /projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/historical/debug/cmor_debug.log
  • /projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/historical/debug/cmor_debug.err

We have to find another solution for the experiments stored on /projects/NS9560K/FRAM; otherwise it wastes too many resources and too much time, crashing again and again.

I wonder if it is possible to run noresm2cmor on FRAM and transfer the data to NIRD?

Or, ideally, transfer this model output to NIRD somewhere, and then delete it?

Or wait until the new storage on NIRD is ready for these data.

Any other ideas?

@oyvindseland
Collaborator

I am sorry I have not followed this discussion well, but I saw it now because I was asked today about when any of the NorESM scenarios might be found on ESGF.

I wonder if it is possible to run noresm2cmor on FRAM and transfer the data to NIRD?

How much work would it be to make the script work there? Does anyone have an idea?

Or, ideally, transfer this model output to NIRD somewhere, and then delete it?

Is the temporary disk stable enough to copy from NIRD to FRAM, e.g. the LM control, to get some free space on NIRD? We should probably use rsync via the internet and not try to copy the data directly.
Or do we need to rsync them to the work disk on Fram and then to the temporary disk, of course checking that the data is kept intact all the time?

Or ideally transfer these model output
Run noresm2cmor locally at Nersc, Norce or MET, i.e. copying the data to local disks?

Or wait until the new storage on NIRD is ready for these data.

Probably the best solution, but the time-line is uncertain and it is not good for the use of NorESM2 data in the MIPs.

@oyvindseland
Collaborator

I made a wrong citation for one of the suggestions:

Or ideally transfer these model output
Run noresm2cmor locally at Nersc, Norce or MET, i.e. copying the data to local disks?

@YanchunHe
Collaborator

I would like to copy the data to NIRD temporarily.

The disk quota of NS9034K is 260T, and 200T of it is now used for cmorized data.

I don't know if this is allowed for this project? @IngoBethke

If so, maybe Jan can help to copy the experiments there, and I can do the processing.

I can come up with a detailed list of experiments (or, partially, only some of the years of the experiments).

@matsbn
Author

matsbn commented Dec 10, 2019

I think copying data temporarily to NS9034K could be an idea, and I actually discussed this option with @oyvindseland this afternoon. It will be a balancing act between the space used for raw data and the space needed for the cmorized output.

@YanchunHe
Collaborator

I think copying data temporarily to NS9034K could be an idea, and I actually discussed this option with @oyvindseland this afternoon. It will be a balancing act between the space used for raw data and the space needed for the cmorized output.

This sounds good! But you may soon need to ask for more space for the NS9034K project.

Mats, would you invite/ask Jan to join this repository, so that he can subscribe and be notified?

I will update the different issues if they need to be copied to NS9034K.

@YanchunHe
Collaborator

YanchunHe commented Dec 10, 2019

The following periods of piControl model output (see paths and case names in the first comment above) need to be copied to:
/tos-project1/NS9034K/noresm/cases

The first and second years indicate the start and end of the period;
e.g., 1320 1329 means all years from 1320 to 1329, inclusive.

1320 1329
1330 1339
1340 1349
1350 1359
1360 1369
1370 1379
1410 1419
1420 1429
1430 1439
1440 1449

The files should be organized in the same folder structure as the original model output.

Experiments that need rsync to NIRD NS9034K are labelled with Rsync.

@YanchunHe
Collaborator

I am wondering whether you prefer to sync all model output to NS9034K or only those years that have not been cmorized successfully? @matsbn

@jgriesfeller
Collaborator

Hi,
JFYI, at this point I am not part of the NS9034K group on nird and can therefore not write to /tos-project1/NS9034K/noresm/cases.

@jgriesfeller
Collaborator

I ran a very tiny speed test for transferral between FRAM and NIRD.

Over the internet I get roughly 90 MB/s; using the NFS mount I get roughly 170 MB/s.
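
For reference, one rough way to compare the two paths (the host name and file name are only illustrative; this just times a single file copy):

time scp fram.sigma2.no:/cluster/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012/atm/hist/one_file.nc /tmp/   # over the internet
time cp /projects/NS9560K/FRAM/noresm/cases/N1850frc2_f09_tn14_20191012/atm/hist/one_file.nc /tmp/             # via the NFS mount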

In any case, transferring the 25 TB of the N1850frc2_f09_tn14_20191012 directory will take a significant amount of time.

@matsbn
Author

matsbn commented Dec 10, 2019

I have added @jgriesfeller to the NS9034K project. Before Sigma2 has more disks installed, the chance of getting more space for NS9034K is very limited. It is for the same reason that we are out of space on NS9560K and are dealing with the temporary /cluster/NS9560K solution.

@jgriesfeller
Collaborator

Thanks Mats, I can write to /tos-project1/NS9034K/noresm/cases now.

Shall I transfer the data now or not? Do we really need all 25TB?

I also wonder if I should tell Sigma2 about our experience with NFS mounts here at MET. Basically, NFS4 showed similar problems here, while NFS3 was much more stable. What do you think?

@jgriesfeller
Collaborator

Just to summarise what Mats and I have just discussed on the phone in conjunction with this thread:
I will transfer the data for the needed years to /tos-project1/NS9034K/noresm/cases, keeping the current file structure on Fram, but only copying the years needed.
Since N1850frc2_f09_tn14_20191001 is not on Fram anymore, I will just do that for N1850frc2_f09_tn14_20191012.
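
A sketch of how this can be done with rsync, run from NIRD and pulling directly from Fram; the /./ marker makes --relative (-R) keep the path below cases, and the year ranges and file-name patterns are only illustrative:

for yr in $(seq 1320 1379) $(seq 1410 1449); do
  rsync -avR "fram.sigma2.no:/cluster/NS9560K/noresm/cases/./N1850frc2_f09_tn14_20191012/*/hist/*.${yr}-*.nc" \
        /tos-project1/NS9034K/noresm/cases/
done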

@matsbn
Author

matsbn commented Dec 10, 2019

A slightly embarrassing fact is that the first 50 years of N1850frc2_f09_tn14_20191012 are actually on NIRD already. This means the time slices 1320-1329, 1330-1339 and 1340-1349 can be found under /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012. I gave the path to the full dataset on /cluster/NS9560K since I initially assumed that having the complete dataset available would be more convenient for the processing. When the unstable mount issues appeared, I failed to see that part of this experiment could be processed more efficiently using the already transferred data. Sorry about that!

A significant portion (100 of 120 years) of the 1pctCO2 and abrupt-4xCO2 NorESM2-MM experiments is also on NIRD. I will comment on this under the relevant CMOR and ESGF publishing requests.

@YanchunHe
Collaborator

Cmorized with additional iLAMB variables (#262), AERday zg500 (#263) and corrected fNup (#251).

They are ready to be published to ESGF.

data path

  • /projects/NS9034K/CMIP6/.cmorout/NorESM2-MM/piControl
  • /projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

version

  • v20210203

sha256sum
/projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

  • .r1i1p1f1.sha256sum_v20210203
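
The checksums can be verified with, for example (assuming the paths inside the checksum file are relative to this directory):

cd /projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl
sha256sum -c .r1i1p1f1.sha256sum_v20210203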

@monsieuralok
Collaborator

@YanchunHe published

@YanchunHe
Collaborator

New dataset version to fix issues #269, #270, #271, #272, #273 is ready to be published:

data path

  • /projects/NS9034K/CMIP6/.cmorout/NorESM2-MM/piControl
  • /projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

version

  • v20210319

sha256sum
/projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

  • .r1i1p1f1.sha256sum_v20210319

Note: 6-hourly data are only available for years 1300-1450.

@YanchunHe YanchunHe reopened this Apr 13, 2021
@monsieuralok
Collaborator

@YanchunHe published

@monsieuralok
Collaborator

@YanchunHe retracted

@YanchunHe
Collaborator

It seems there are no 3-hourly precipitation data available in the model output, so no cmorization will be done for MM piControl, as discussed in issue #41.

@YanchunHe YanchunHe reopened this Jun 30, 2023
@YanchunHe
Collaborator

No, some years do have 3-hourly precipitation in MM piControl.

@YanchunHe
Collaborator

CMORized additional 3-hourly precipitation dataset for NorESM2-MM piControl, only for years 1300-1309.

data path

  • /projects/NS9034K/CMIP6/.cmorout/NorESM2-MM/piControl
  • /projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

version

  • v20230616

sha256sum
/projects/NS9034K/CMIP6/CMIP/NCC/NorESM2-MM/piControl

  • .r1i1p1f1.sha256sum_v20230616

@monsieuralok
Collaborator

@YanchunHe published

YanchunHe added a commit that referenced this issue Oct 3, 2023
NorESM2-MM ssp585 with extension: #151, #326; NorESM2-MM piControl (#140), historical (#143) ssp126 (#152), ssp245 (#149), ssp370 (#150) with 3-hourly precipitation.