Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor bit-for-bit diagnostics #300

Merged
merged 14 commits into from
Mar 27, 2019
Merged

Conversation

apcraig
Copy link
Contributor

@apcraig apcraig commented Mar 18, 2019

Refactor bit-for-bit diagnostics, #204

  • Developer(s): tcraig

  • Are the code changes bit for bit, different at roundoff level, or more substantial? bit-for-bit, #77addedd, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks.
    All results are bit-for-bit but due to some new additions and some refactoring of the tests, some tests show missing regression results.

  • Does this PR create or have dependencies on Icepack or any other models? Yes, requires latest version of Icepack included in this PR.

  • Is the documentation being updated with this PR? (Y/N) Y

  • Other Relevant Details:

Removed all references to cpps DITTO, REPRODUCIBLE, and the old bfbflag code. Added new algorithms to carry out the global sum diagnostics leveraging various implementations. There is NO global gather to generate a bit-for-bit sum. bfbflag is now a character string and the supported options are off, lsum4, lsum8, lsum16, reprosum, and ddpdd. The algorithms are described briefly in the documentation. off==lsum8 and this option does the global sum via a local reduction then a global scalar MPI allreduce. reprosum is the recommended setting for bit-for-bit diagnostics.

This has been implemented primarily in the ice_global_reductions.F90 code. The same algorithms are implemented in the mpi and serial versions and the mpi and serial global reductions are bit-for-bit with each other with bfbflag=reprosum. The implementations work for different pe counts, different blocks sizes, mpi, threading on and off, and serial. Prior implementations only worked on a subset of these conditions.

Some new tests and scripts were added to test the log files. The logbfb test is identical to the smoke test except for comparisons to other cases, the log file is compared instead of the restart. reprosum_suite is a suite of tests that test the log file diagnostics with bfbflag='reprosum'.

The comparebfb.csh script was refactored to simplify the implementation and the log comparison was separated to a different script. In addition, there were some errors introduced in October 2018 that broke some of the comparebfb testing and this has been fixed.

A new feature to allow sleeps between test case build/submission was added to try to manage dependencies between cases better. This is not a perfect solution. The problem is that when one test is comparing it's results to another, the second has to complete after the first is done, but this does not always happen depending on how jobs are flowing thru the queue.

Obsolete cpps were removed and a bug that broke the cpps implementation in the build was fixed.

The decomp test was refactored to be more compatible with the updated comparebfb.csh script. It also now should work with multiple binary files. As a result, the directory structure of the results is slightly different and regression testing with prior versions results in missing data. This will self correct as we test in the future.

Fixed a bug in a few of the diagnostic output fields. This bug created non-reproducible results when running threaded tests. There was an error in zeroing out of the al*_init fields in threaded regions in ice_step_mod.F90.

Added 8 byte integer kinds, needed for the reprosum and ddpdd algorithms. Also use real*16 sums and reduction in the "lsum16" option. PGI does not support this type, so had to add a NO_R16 cpp for pgi builds.

Update travis to support build for c code.

Refactor Macros file to better support serial and parallel builds as well as building c code. Also needed a small update to the Makefile which had "cc" hardwired as the c compiler command.

Refactor poll_queue script and fix --bgen default bug with single tests, #299, #301.

Update job launching to avoid mpi launches with serial code. Also, work towards using serial (vs mpi) builds when possible.

@duvivier
Copy link
Contributor

Doc looks okay, not sure why Travis CI failed though.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 18, 2019

I just relaunched the travis test, looks like it was failing on the input data download. Will monitor it.

@duvivier
Copy link
Contributor

Ok. Let me know if there's something wrong with the FTP again. I haven't touched it lately and I don't think Dave has, so hopefully there's not a system issue.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 18, 2019

ftp worked this time, but I'm running into another problem that I have to fix. I added a c file and travis doesn't yet know how to compile c. I need to update the travis env. Hope to fix this today.

Copy link
Contributor

@dabail10 dabail10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not changed anything with the input files. Everything is still there. The code looks fine as far as I can tell.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 18, 2019

I am going to refactor the Macros files a bit to better support both serial and parallel compiles as well as compiling c code. More changes coming. Will try to test on most platforms as I go.

@eclare108213
Copy link
Contributor

Thanks @apcraig.
When you do that, could you provide the option for serial runs to be launched without using mpirun, please? thx

@apcraig
Copy link
Contributor Author

apcraig commented Mar 18, 2019

@eclare108213, good idea. We have that for a few machines already, will try to extend that. I may need you to do some testing on your local machines. I'll email you when I need some help with that.

@apcraig apcraig requested a review from phil-blain March 20, 2019 18:24
Copy link
Member

@phil-blain phil-blain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reviewing for the issues mentioned in #301, #299 and at the end of #281. Everything seems to be resolved on my machine (cesium).
The only thing is I had to add ICE_MACHINE_QSTAT to my env file, it was missing for some reason. I've already added it to my sandbox but haven't issued a PR for that yet ; if you want to add it to this PR, go ahead:
setenv ICE_MACHINE_QSTAT "qstat "
(not sure if the space is needed, some env files have it and some don't...)
It would need to be added also to JF's machine (fram). For now we are just running on our local workstations, not on our clusters. I plan to add machines files for our clusters in the near future.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 21, 2019

@phil-blain, great to hear things seem to be working for you. I can add missing ICE_MACHINE_QSTAT settings in the various machine files. If we need to update them later, we can.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 21, 2019

@eclare108213 , are you OK with merging this PR?

@apcraig
Copy link
Contributor Author

apcraig commented Mar 26, 2019

@eclare108213 , just another ping. I think this PR is ready, but it's big enough that I prefer to have your OK before merging. I will start a round of testing once this is merged, before the automated weekend testing.

@eclare108213
Copy link
Contributor

I have started another set of tests on badger. In the last set, 16 of them failed, but they all seem to be due to machine flakiness, e.g.

badger_intel_decomp_gx3_4x2x25x29x5.bfb/logs/cice.runlog.190320-155405:ORTE has lost communication with a remote daemon.

I suspect this is not a code-generated problem. Will update when test results come in.

Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the cpp SERIAL_REMOVE_MPI mean that we can use a single-source ice_global_reductions.F90 instead of having copies in comm/mpi and comm/serial? I don't understand why mpi stuff is added to comm/serial/ice_global_reductions.F90 -- the original point of having an mpi and a serial version was so that the serial version would be completely mpi-free. Nowadays that's probably not necessary, but I did find it handy on the loft laptop to be able to run quick, serial tests.

Are we allowed to redistribute cicecore/cicedynB/infrastructure/ice_shr_reprosum86.c ?
I don't understand what that code does, or when/how it interacts with CICE. Is that documented somewhere?

@apcraig
Copy link
Contributor Author

apcraig commented Mar 27, 2019

@eclare108213 , these are two good questions.

Regarding the SERIAL_REMOVE_MPI, one of the problems with having separate serial and mpi directories is that things get out of sync. Stuff gets added to one, but not the other, and so forth. In addition, for these global sums, it's important (if we want global sums to be reproducible in serial and mpi mode, which they are not with bfbflag='reprosum') that the implementations in serial and mpi mirror each other. So I decided to basically take the mpi version that I implemented first and add the ability to turn off the MPI parts and then drop it into the serial directory with a directive turned on that does just that. The two files are basically identical now which makes it easy to maintain moving forward. It would also be possible to merge the serial and mpi directories if we wanted to and to use the file we have now for mpi and serial mode. So anyway, there were a few reasons why it is not bad that the mpi and serial versions of ice_global_reductions.F90 are similar. A bigger question is whether we want to continue to maintain the serial and mpi directories or whether it's better to have one with MPI cpp'ed out. I noticed a lot of differences in the global_reductions code in the two directories before I refactored them. I wonder if there are other differences (like in handling the tripole halo, etc) that might actually be missing or wrong in mpi or (more likely) serial. Generally, people aren't good at maintaining two versions of things, especially when they are only using one. I think this is an important question.

On the question about ice_shr_reprosum86.c. Yes, this can be distributed. It doesn't do anything unless the code is triggered by cpps set by the machine (or possibly the user). I inquired about removing it, but decided we should keep it. What is does is deal with integer size mismatches between fortran and c that exist in some older x86 hardware implementations and this is important to deal with if you are going to use the reprosum algorithms to get bit-for-bit correct results. A side benefit is that including a piece of c code highlighted some problems in our build that got fixed. I can remove it and then some bfbflag options might not work correctly on some architectures. I believe the community is unlikely to encounter those architectures. Having said that, if we don't like carry around the small c code, then that might be a reason to get rid of it. Otherwise, there is some benefit (both to the computation and general build capabilities) to keep it in the repo. But I'm open on this.

@eclare108213
Copy link
Contributor

I gravitate toward "simpler" when possible, and so my choices for both of these questions would be to simplify and document. The mpi and serial directories were hand-me-downs from POP, and I'm happy to get rid of them if the codes in them are sufficiently similar. I think a large reason for doing it that way in the first place was to reduce cpp usage, so if you have to add a cpp anyhow for reproducibility, and it can be used to merge the mpi and serial codes, then it makes sense to merge them. That sounds like a big project though, suitable for its own PR. In the meantime, does it make sense to just have one copy of ice_global_reductions.F90? It would need to be moved up in the directory structure.

Re the c code, I'd prefer not to carry it around, especially if our users are unlikely to be running on the particular architectures that wouldn't be BFB. I'd rather put a note in the documentation (there's a 'known issues' section somewhere), and perhaps include a pointer to where a user could get this code if he or she desperately needed it. I did not happen to see where it interfaced into CICE, but perhaps that part could be commented out with instructions for what to do if it were needed. I'm also open to keeping it, if there are genuine, ongoing benefits to having it, or if my suggestion to document and/or leave comments in the code doesn't make sense.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 27, 2019

Good comments.

I don't think we want to move to a single global_reductions file until we've thought that thru a bit. In particular, we'd have to add cpps to the build and we'd want to come up with a clean set of cpps to do that. We wouldn't want to keep the cpp name that is there. That name is particularly ugly for good reasons. I would prefer to keep the current files separate (even if they are nearly the same) until we decide to merge mpi and serial, at which point it should be a separate PR. I think it would be a mistake to have a mixed strategy of some cpps and some split directories. Right now we have split directories, the cpps that is in global_reductions is local, so it doesn't act like a model cpp. It is true that the main tradeoff for merging serial and mpi is adding cpps. But maybe it's just one, like NO_MPI that is triggered by the serial setting. That setting would trigger a cpp and not a separate file path as it doesn't now. I'm not arguing for merging or not merging right now, I think we need to think about it and get more input. In fact, one of the problems with serial and mpi is they are NOT the same and that worries me. Merging them would be some work with pros and cons. But I'm thinking there are more pros than cons at this point.

I'm OK pulling out the c code. What I would propose is that we create a new issue for it, and I do a bit more research on this implementation. This is basically Pat Worley's code and I had some emails with him a couple weeks ago about just this issue and we went back and forth about whether it should stay. I think he felt it didn't hurt anything and it was hard to argue against that. Let me create an issue at high priority to sort this out and then we can remove it if we decide that's OK.

How does that sound?

Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 27, 2019

See issues #288 and #302.

@apcraig apcraig merged commit d417b97 into CICE-Consortium:master Mar 27, 2019
phil-blain added a commit to phil-blain/Icepack that referenced this pull request Apr 24, 2019
* This is similar to the machine files cleanup in CICE-Consortium/CICE#300
* Remove MPICC, MPIFC since there is no MPI in Icepack
apcraig pushed a commit to CICE-Consortium/Icepack that referenced this pull request May 9, 2019
* Move makdep compilation to Makefile

* makdep compiled from a makdep rule in the Makefile
* Add CFLAG_HOST macro
* makdep is only compiled if needed
* This addresses CICE-Consortium/CICE#306

* Fix Travis-CI build

* Update machine files to define SCC,SFC,CC,FC

* This is similar to the machine files cleanup in CICE-Consortium/CICE#300
* Remove MPICC, MPIFC since there is no MPI in Icepack

* Use $(DEPGEN) instead of makdep as target (Makefile)

* Update documentation
@apcraig apcraig deleted the bfb branch August 17, 2022 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants