OMP threading in CICE #114
Let's look to see whether @mhrib addressed the ice_dyn_* loops in his refactoring.
See also #128.
I did find issues with the same OMP loops (and a few more), but no solution other than commenting them out, as here. See also #252.
In addition to the ones that Mads (MHRI) found, I found OMP issues in ice_history and ice_grid. I commented out all OMP directives in these two files, which saved the model from crashing when running with the Intel and GNU compilers. I have not found solutions nor the specific locations of these bugs within the files.
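To illustrate the kind of bug usually behind this (a minimal sketch with made-up array names, not the actual ice_history or ice_grid code): a work variable left off the PRIVATE list is shared across all threads, so results depend on thread count and the run can crash or silently change answers. Compile with e.g. `gfortran -fopenmp`.

```fortran
! Hypothetical sketch of the failure mode, not actual CICE code.  If "worka" is
! dropped from the PRIVATE list it is shared by default, and threads overwrite
! each other's values; listing it (or commenting out the directive) removes the race.
program omp_private_demo
   implicit none
   integer, parameter :: nx = 100, ny = 100, nblocks = 8
   real(8) :: aice(nx,ny,nblocks), vice(nx,ny,nblocks), work1(nx,ny,nblocks)
   real(8) :: worka
   integer :: i, j, iblk

   call random_number(aice)
   call random_number(vice)

   !$OMP PARALLEL DO PRIVATE(iblk,i,j,worka)   ! worka must be private per thread
   do iblk = 1, nblocks
      do j = 1, ny
         do i = 1, nx
            worka = aice(i,j,iblk)*vice(i,j,iblk)
            work1(i,j,iblk) = worka
         enddo
      enddo
   enddo
   !$OMP END PARALLEL DO

   print *, 'sum(work1) =', sum(work1)
end program omp_private_demo
```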
I am uploading a set of slides here from a LANL training course on OpenMP profiling and debugging that I attended last week. Most of it is old news, but the profiling and debugging info at the end might be useful as we move forward with this task.
I have created a perf_suite that will be PR'ed soon. This runs a fixed suite of tests that attempt to assess CICE performance at different task and thread counts. It basically does three things.
This is all done with the gx1 grid, roundrobin decomposition, 2-day runs, and the basic out-of-the-box configuration. The idea is not to optimize the performance of CICE but to compare the performance of CICE on different hardware, different compilers, and different task/thread counts for a very fixed problem. This is, in part, a starting point for further OMP tuning. I attach an Excel spreadsheet, CICE_OMP_perf.xlsx, that shows the results from testing on Narwhal with 4 compilers and Cheyenne with 3 compilers in table and graph form. This is for hash 9fb518e of CICE dated Dec 21, 2021, but also includes the Narwhal port and the perf_suite (which will be PR'ed soon).

There are lots of interesting insights. With regard to OMP, we see that in this version of CICE (which has lots of OMP loops turned off that still need debugging), OMP is still doing something. In these tests, OMP is never faster than just using all MPI for the same total PE count. But for a given MPI task count, threaded runs are faster than the same MPI task count run single threaded (e.g. 16x4 vs 16x1), at least on Narwhal. Cheyenne shows less benefit from threading. This establishes a performance baseline and provides a starting point to improve OMP performance, probably using Narwhal gnu or cray to continue OMP tuning efforts.
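For reference, the MxN notation (e.g. 16x4) means 16 MPI tasks each running 4 OpenMP threads. The sketch below shows schematically where the thread-level parallelism sits in such a hybrid run: each MPI task threads over its locally owned blocks. The names (nblocks_local, step_block) are illustrative, not CICE's actual decomposition code.

```fortran
! Schematic of hybrid MPI+OpenMP work distribution assumed by the 16x4-style runs:
! the MPI decomposition assigns blocks to each task (not shown), and OpenMP then
! threads over the task's local block loop.  Illustrative names only.
program hybrid_layout_demo
   use omp_lib, only: omp_get_max_threads
   implicit none
   integer, parameter :: nblocks_local = 16   ! blocks owned by this MPI task
   integer :: iblk

   print *, 'threads available on this task:', omp_get_max_threads()

   !$OMP PARALLEL DO PRIVATE(iblk)
   do iblk = 1, nblocks_local
      call step_block(iblk)        ! per-block physics; independent across blocks
   enddo
   !$OMP END PARALLEL DO

contains

   subroutine step_block(iblk)
      integer, intent(in) :: iblk
      ! placeholder for the per-block work done each timestep
      print *, 'advancing block', iblk
   end subroutine step_block

end program hybrid_layout_demo
```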
Note that CICE_OMP_perf.xlsx has an error: the 4x16 run is actually 8x16. I've fixed the error in perf_suite in my sandbox for future use. Ignore the 4x16 results for now.
I attach an updated OMP results table and graphs, CICE_OMP_perf.xlsx. This also has a second sheet that shows all timing info for the threaded and unthreaded tests. If you look closely, you can see that Advection is just about the only section that threads reasonably. Column and Dynamics do not thread well, and maybe not at all. I'll try to understand this better.
For the dynamics part, most of the OMP directives have been commented out, including the one in the subcycling iteration.
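Schematically, the subcycling structure looks like the sketch below (illustrative names and loop counts, not the actual ice_dyn_evp code). With the inner directive commented out, the per-block stress and momentum work runs serially inside every subcycle, which is where much of the dynamics threading loss comes from.

```fortran
! Schematic of the EVP subcycling loop.  The inner block loop is the natural OpenMP
! target; when its directive is commented out, that work runs serially inside every
! subcycle.  The subcycle count and routine names here are assumptions.
program evp_subcycle_demo
   implicit none
   integer, parameter :: ndte = 120        ! number of EVP subcycles (assumed value)
   integer, parameter :: nblocks = 8
   integer :: ksub, iblk

   do ksub = 1, ndte                        ! subcycling loop: inherently serial
      !$OMP PARALLEL DO PRIVATE(iblk)       ! block loop: parallel across threads
      do iblk = 1, nblocks
         call stress_and_stepu(iblk)        ! placeholder for stress + momentum update
      enddo
      !$OMP END PARALLEL DO
      ! halo updates between subcycles would go here (serial / MPI)
   enddo

contains

   subroutine stress_and_stepu(iblk)
      integer, intent(in) :: iblk
      ! placeholder for per-block stress tensor and velocity updates
   end subroutine stress_and_stepu

end program evp_subcycle_demo
```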
This has largely been addressed in #680 and apcraig#64. There are still some known issues in VP and 1d EVP.
I will close this; VP and 1d EVP have their own issues. FYI, I added the omp_suite and perf_suite to check OpenMP and evaluate performance.
A few problematic OMP loops were unthreaded due to reproducibility problems found during testing; grep for TCXOMP to find them. These are in ice_dyn_eap, ice_dyn_evp, and ice_transport_remap. One issue may be thread safety in icepack_ice_strength, but that requires additional debugging.
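As an illustration of the suspected failure mode (a guess at the mechanism, not the actual Icepack code): if a routine called from inside a threaded block loop writes to module-level or SAVEd scratch storage, all threads race on it, and making the scratch local to the routine (or passing it in) restores thread safety.

```fortran
! Illustration of the suspected thread-safety issue, not actual icepack_ice_strength
! code: a module-level work array written by a routine called from a threaded block
! loop is shared by all threads and therefore racy.  The sketched fix is to make the
! scratch storage local to the routine so each call (and thread) gets its own copy.
module strength_demo
   implicit none
   real(8) :: scratch_shared(100)            ! module-level: UNSAFE under threading
contains
   subroutine strength_unsafe(iblk, result)
      integer, intent(in)  :: iblk
      real(8), intent(out) :: result
      scratch_shared = real(iblk,8)          ! all threads write the same array: race
      result = sum(scratch_shared)
   end subroutine strength_unsafe

   subroutine strength_safe(iblk, result)
      integer, intent(in)  :: iblk
      real(8), intent(out) :: result
      real(8) :: scratch(100)                ! local per call, so private per thread
      scratch = real(iblk,8)
      result = sum(scratch)
   end subroutine strength_safe
end module strength_demo

program strength_race_demo
   use strength_demo
   implicit none
   integer, parameter :: nblocks = 8
   real(8) :: strength(nblocks)
   integer :: iblk

   !$OMP PARALLEL DO PRIVATE(iblk)
   do iblk = 1, nblocks
      call strength_safe(iblk, strength(iblk))   ! swap in strength_unsafe to see the race
   enddo
   !$OMP END PARALLEL DO

   print *, strength
end program strength_race_demo
```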
More generally, we need to review and validate that threading is working properly in CICE and Icepack.
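A generic way to validate an individual loop is to run it with 1 thread and with N threads and require bit-for-bit identical output, which is the same kind of check the omp_suite mentioned above is meant to automate at the model level. A minimal sketch of the idea (not a Consortium tool):

```fortran
! Generic sketch of a threading-validation check: run the same kernel
! single-threaded and multi-threaded and require bit-for-bit agreement.
program omp_bfb_check
   use omp_lib, only: omp_set_num_threads, omp_get_max_threads
   implicit none
   integer, parameter :: n = 100000
   real(8) :: a(n), out1(n), outn(n)
   integer :: nthreads

   call random_number(a)
   nthreads = omp_get_max_threads()          ! remember the full thread count

   call omp_set_num_threads(1)               ! serial reference
   call kernel(a, out1)

   call omp_set_num_threads(nthreads)        ! threaded run
   call kernel(a, outn)

   if (all(out1 == outn)) then
      print *, 'bit-for-bit: PASS'
   else
      print *, 'bit-for-bit: FAIL, max diff =', maxval(abs(out1 - outn))
   endif

contains

   subroutine kernel(x, y)
      real(8), intent(in)  :: x(:)
      real(8), intent(out) :: y(:)
      integer :: i
      !$OMP PARALLEL DO PRIVATE(i)
      do i = 1, size(x)
         y(i) = x(i)**2 + 1.0d0
      enddo
      !$OMP END PARALLEL DO
   end subroutine kernel

end program omp_bfb_check
```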