[CMIP6 CMOR-ization & ESGF-publication] NorESM2-MM - piControl #140
The full path to the case(s) of the experiment on NIRD should be /projects/NS9560K/noresm/cases, with case N1850frc2_f09_tn14_20191001 in /projects/NS9560K/noresm/cases and N1850frc2_f09_tn14_20191012 in /projects/NS9560K/FRAM/noresm/cases. |
A note on the post-processing of the MM experiments: the processing of the NorESM2-MM experiments is currently slow due to two factors. One reason is of course the high-resolution and high-frequency output. The other is that the cmor tool has crashed many times at seemingly arbitrary points, normally with a simple error message, or sometimes with something more detailed:
It looks like it crashes during NetCDF file reading, but I think this should not be a problem with the file itself. What I plan to do is to try to change the optimisation from -O2 to -O0 in the compiler flags. Ingo, do you agree, and do you have any ideas? @IngoBethke |
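For illustration, a minimal sketch of how the optimisation level could be lowered when rebuilding; the FCFLAGS variable and Makefile layout are assumptions, not the actual noresm2cmor build setup:

```bash
# Hypothetical rebuild with optimisation disabled; adapt the variable
# name to the real noresm2cmor Makefile.
make clean
make FCFLAGS="-O0 -g"   # -g keeps debug symbols for a readable backtrace
```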
I experienced Matlab and nco crashes on NIRD today and wonder whether the login nodes had some resource issues. -O2 is usually safe, so I don't think -O0 will have any effect other than making the code slow. In any case, the crash above was in the HDF5 library, which is compiled with -O2. It is hard to identify the problem if the crashes do not occur at the same point. If the crashes occur only for long simulations with extensive file scanning, then it can be worthwhile checking the code for missing netCDF close statements (a very typical bug). There is a limit on how many open files a user can have at the same time (check with ulimit -a), and if you are running several instances of the tool in parallel, then a missing close statement can cause a crash at a seemingly arbitrary position. First, I would recommend testing with just a single instance of noresm2cmor3 per node and using top to monitor the memory consumption. If the crashes always occur during reading of the same input file, then I usually use ncdump (ideally from the same library installation as used in noresm2cmor) to dump the entire content of the input file. In most cases this will reproduce the problem. |
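A minimal sketch of the checks suggested above; the input path is a placeholder:

```bash
# Check the per-user limit on simultaneously open files.
ulimit -n

# Try dumping the entire content of a suspect input file (placeholder
# path); ideally use the ncdump from the same netCDF installation that
# noresm2cmor was linked against.
ncdump /path/to/suspect_input.nc > /dev/null && echo "read OK"
```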
Hi Yanchun, could you come to my office so we can try to debug it a bit? |
Many thanks for the reply, Ingo!
This happened not only yesterday, but for quite some days now, so it should not be a problem with the disk.
This is the MM piControl run; the simulation is not that long compared to some other simulations, and each cmor task processes only 10 years of data. Some jobs finish successfully for some 10-year spans, but others don't. So I don't know whether the 'netcdf close' statements matter in such a situation? Each time I only submit 8 cmor tasks, either as 8 parallel threads or as 8 different serial jobs, and this problem can happen in both situations.
I also tried a single instance of noresm2cmor3, and it failed as well, though that looked like it was due to another temporary disk problem. It is hard to monitor the memory consumption, since it takes quite long until the crash, but one could use some automatic logging of the memory consumption (see the sketch after this comment).
I will check whether this is reproducible, e.g., whether the crashes occur while reading the same file. |
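A rough sketch of such automatic memory logging, assuming the process is named noresm2cmor3 and a one-minute sampling interval:

```bash
# Hypothetical memory logger: record the resident set size of all
# noresm2cmor3 processes once a minute until they exit.
while pgrep -x noresm2cmor3 > /dev/null; do
    { date; ps -C noresm2cmor3 -o pid,rss,vsz,comm; } >> mem.log
    sleep 60
done
```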
Good, I will talk to you around 13:00. |
Btw, I will try to change to another login node of NIRD. |
Hi Ingo and Alok, I tried again, both 8 MPI tasks for the historical run and one serial task for piControl. Both of them now finish successfully. I monitored the maximum memory occupation: the MPI threads take at most 3.0 GB and the serial task takes at most 6.5 GB. Therefore, there should be no memory leak in this case, and we don't need to debug this now. @monsieuralok I suspect the 'HDF error' problem is likely caused by the instability of the temporary disk mounted from FRAM to nird:/projects/NS9560K/FRAM, which is quite unstable, as I have noticed. During some days of last week and the weekend, /projects/NS9560K/FRAM was only mounted on the login0 node of NIRD, so I could only run the post-processing on the login0 node for MM (and some other LM experiments). I wrote to Sigma2 support, and the mounts are now available on the other nodes. The post-processing of the MM experiments should hopefully progress faster this week. @matsbn |
This 'HDF error' still occurs very often for the experiments stored on the temporary /projects/NS9560K/FRAM, for both the NorESM2-MM and NorESM2-LM experiments. I strongly suspect this is due to instability when reading from these data. I will launch a noresm2cmor task with debug mode on; maybe @monsieuralok can help to debug into this. |
I submitted the job before lunch; it crashed again at ca. 12:00. At some point before that, I used
The noresm2cmor program aborted, this time very likely due to this, but there was no error reported (no HDF error either this time).
We have to find another solution for the experiments stored on /projects/NS9560K/FRAM; otherwise too many resources and too much time are wasted on runs that crash again and again. I wonder whether it is possible to run noresm2cmor on FRAM and transfer the data to NIRD? Or, ideally, transfer this model output to NIRD somewhere, and then delete it? Or wait until the new storage on NIRD is available for these data. Any other ideas? |
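If running on FRAM and pushing the results over were the chosen route, a hedged rsync sketch (all paths, the host name, and the incoming directory are placeholders):

```bash
# Hypothetical push of cmorized output from FRAM to NIRD over ssh.
# --partial lets an interrupted transfer resume instead of restarting.
rsync -av --partial --progress \
    /cluster/work/users/$USER/cmorized/ \
    login.nird.sigma2.no:/projects/NS9034K/incoming/
```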
I am sorry I have not followed this discussion well, but I saw it now because I was asked today about when any of the NorESM scenarios might be found on ESGF.
How much work is it to make the script work? Does anyone have an idea?
Is the temporary disk stable enough to copy from NIRD to FRAM, e.g. the LM control?
Probably the best solution, but with an uncertain timeline, and not good for the use of NorESM2 data in the MIPs. |
Made a wrong citation for one of the suggestions. |
I would like to copy the data to NIRD temporarily. There is a 260 TB disk quota for NS9034K, of which 200 TB is now used for cmorized data. I don't know whether this is allowed for this project? @IngoBethke If so, maybe Jan can help to copy the experiments there, and I can do the processing. I can come up with a detailed list of experiments (or, partially, some of the years of the experiments). |
I think copying data temporarily to NS9034K could be an idea, and I actually discussed this option with @oyvindseland this afternoon. It will be a balancing act between the space used for raw data and the space needed for the cmorized output. |
This sounds good! But you may soon need to ask for more space for the NS9034K project. Mats, would you invite/ask Jan to join this repository, so that he can subscribe and be notified? I will update the different issues if data need to be copied to NS9034K. |
The following period of piControl model output (see path and case names in the first post above) needs to be copied. The first and second years indicate the start and end of the period: 1320 1329. Files should be organized in the same folder structure as the original model output (see the sketch below). Experiments that need rsync to NIRD NS9034 are labelled with |
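A possible rsync invocation preserving the folder structure; the component/hist directory layout and the file-name pattern are assumptions about the raw output:

```bash
# Sketch: copy years 1320-1329 of the case while keeping the folder
# structure relative to the cases directory (-R preserves the relative
# paths given on the command line).
cd /projects/NS9560K/FRAM/noresm/cases
rsync -avR N1850frc2_f09_tn14_20191012/*/hist/*.132?-*.nc \
    /projects/NS9034K/noresm/cases/
```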
Wondering whether you prefer to sync all model output to NS9034K, or only those years that were not cmorized successfully? @matsbn |
Hi, |
I ran a very small speed test for transfers between FRAM and NIRD. Over the internet I get roughly 90 MB/s; using the NFS mount I get roughly 170 MB/s. In any case, transferring the 25 TB of the N1850frc2_f09_tn14_20191012 directory will take a significant amount of time. |
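For scale, a rough back-of-envelope of the transfer time at those rates (decimal units, ignoring protocol overhead):

```bash
# 25 TB = 25,000,000 MB; divide by rate (MB/s), then by 3600 for hours.
echo $(( 25 * 1000 * 1000 / 170 / 3600 )) hours   # NFS mount: ~40 h
echo $(( 25 * 1000 * 1000 / 90  / 3600 )) hours   # internet:  ~77 h
```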
I have added @jgriesfeller to the NS9034K project. Before Sigma2 have more disks installed, the chance of getting more space for NS9034K is very limited. It is for the same reason that we are out of space on NS9560K and are dealing with the temporary /cluster/NS9560K solution. |
Thanks Mats, I can write to /tos-project1/NS9034K/noresm/cases now. Shall I transfer the data now or not? Do we really need all 25 TB? I also wonder whether I should tell Sigma2 about our experience with NFS mounts here at MET: basically, NFSv4 showed similar problems here, while NFSv3 was much more stable. What do you think? |
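For what it's worth, a sketch of how the NFS version in use could be checked, and how an admin could force NFSv3; the server export below is a placeholder, and remounting would be up to Sigma2, not regular users:

```bash
# Show whether the FRAM mount uses nfs (v3) or nfs4, and its options.
mount | grep NS9560K

# Hypothetical admin-side remount forcing NFSv3 (placeholder export):
# mount -t nfs -o vers=3 fram.sigma2.no:/cluster/NS9560K /projects/NS9560K/FRAM
```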
Just to summarise what Mats and I have just talked about on the phone in conjunction with this thread: |
A slightly embarrassing fact is that the first 50 years of N1850frc2_f09_tn14_20191012 are actually on NIRD already. This means the time slices 1320-1329, 1330-1339 and 1340-1349 can be found under /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012. I gave the path to the full dataset on /cluster/NS9560K since I initially assumed that having the complete dataset available would be more convenient for the processing. When the unstable mount issues appeared, I failed to see that part of this experiment could be processed more efficiently using the already transferred data. Sorry about that! A significant portion (100 of 120 years) of the 1pctCO2 and abrupt-4xCO2 NorESM2-MM experiments is also on NIRD. I will comment about this under the relevant CMOR and ESGF publishing requests. |
Cmorized with additional iLAMB variables (#262), AERday. They are ready to be published to ESGF (a checksum sketch follows after this comment).
data path
version
sha256sum
|
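A minimal sketch of how such checksums can be generated and later verified; the dataset directory is a placeholder:

```bash
# Generate sha256 checksums for all NetCDF files of a dataset version,
# then verify them (e.g. after transfer or before publication).
cd /projects/NS9034K/path/to/dataset
find . -name '*.nc' -exec sha256sum {} + > checksums.sha256
sha256sum -c checksums.sha256
```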
@YanchunHe published |
New dataset version to fix issues #269, #270, #271, #272, #273 is ready to be published:
data path
version
sha256sum
Note: 6-hourly data are only available for years 1300-1450. |
@YanchunHe published |
@YanchunHe retracted |
It seems there are no 3-hourly precipitation data available in the model output, so no cmorization will be done for MM piControl, as discussed in issue #41 |
No, some years have 3-hourly precipitation in MM piControl. |
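One way to check which years carry 3-hourly precipitation, as a sketch; the file name and the variable names (PRECT/PRECC/PRECL) are assumptions about the CAM output:

```bash
# Inspect the header of a high-frequency history file for precipitation
# fields and the time axis (placeholder file name).
ncdump -h N1850frc2_f09_tn14_20191012.cam.h3.1300-01-01-00000.nc \
    | grep -Ei 'prec|time'
```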
CMORized an additional 3-hourly precipitation dataset for NorESM2-MM piControl, only for years 1300-1309.
data path
version
sha256sum
|
@YanchunHe published |
Mandatory information:
Full path to the case(s) of the experiment on NIRD
/projects/NS9560K/noresm/cases
/projects/NS9560K/FRAM/noresm/cases
experiment_id
piControl
model_id
NorESM2-MM
CASENAME(s) and years to be CMORized
N1850frc2_f09_tn14_20191001, 1200-1299
N1850frc2_f09_tn14_20191012, 1300-1449
N1850frc2_f09_tn14_20191113, 1450-1699
Optional information
parent_experiment_id
piControl-spinup
parent_experiment_rip
r1i1p1f1
parent_time_units
'days since 0001-01-01'
branch_method
'Hybrid-restart from year 1200-01-01 of piControl-spinup'
other information
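For reference, the parent metadata above typically ends up as global attributes in the cmorized files; a quick way to inspect them (the file name is a hypothetical example following the CMIP6 naming convention):

```bash
# List the parent/branch global attributes of a cmorized file.
ncdump -h tas_Amon_NorESM2-MM_piControl_r1i1p1f1_gn_120001-129912.nc \
    | grep -E 'parent|branch'
```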