-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hydro optimizations: Explicit work aggregation, Comm optimizations, explicit SIMD instructions #426
Conversation
Does not deadlock but is not yet wired up
By getting src/grid.cpp directly from master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check all newly added files for the license header.
As the rocm 4.5 installation seems to be broken
As the libatomic gcc bug on Rostam seems to be resolved
As we disabled the legacy circleci tests for now
The license information were missing in quite a few files actually! I have added them accordingly in 4236a52 |
I adapted the Jenkins pipelines to the changes on Rostam and removed the legacy tests on CircleCI. I left the script for CircleCI in case we actually want to update these tests and re-enable them, but in their current form they're redundant anyway since the Jenkins tests also cover the same things (and far more). Once the remaining pipelines are completed, this PR should be good to go! @diehlpk Could you have another look at it? I recommend that this PR should be merged before #425, as we need the Jenkins pipeline fixes there as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This PR contains 3 major changes to the hydro module:
Added explicit work aggregation for the two major hydro compute kernel (reconstruct, flux) using the new executors added in Work aggregation experimental SC-SGS/CPPuddle#12.
The maximum number of aggregated kernel launches can be steered, CLI parameter
--max_executor_slices=<unsigned int>
.This feature is yet to be considered experimental, though it passed all tests and increases performance substantially. This PR also adds some new tests to continually test different aggregation sizes. The work aggregation works for all hydro GPU kernel implementations (CUDA, HIP, Kokkos).
Improved communication in the hydro solver! Notably, exchanging the hydro_boundaries (and subsequently the hydro_amr_boundaries) now takes the HPX localities into account. If the neighbor in question is on the same locality, no HPX action will be used for communication! Further, no communication buffer will be used. Instead, the memory of the neighbor is accessed directly (with some new local hpx promises/futures to avoid any races). The old communication implementation can still be used with the parameter
--optimize_local_communication=0
.Added first hydro SIMD implementation for flux and reconstruct kernels. The kernels still work on the GPU (using the Kokkos SIMD double specialization), but support a wide range of CPU SIMD instructions sets as we can both use std::experimental::simd and the Kokkos SIMD types on CPU (tested mostly with AVX512 so far). This also makes the previous experimental Vc flux kernel obsolete. Hence, I removed it. Furthermore, while adding this SIMD implementation I also cleaned up the experimental scalar kernels for flux and reconstruct which were leftovers from the initial hydro GPU port (removing a lot of code duplication in the process.)
While I am still actively working on the implementation of 1. and 3., right now is a good point to merge this back into master before the PR gets too large/unwieldy as everything is working and the tests pass. I will add subsequent optimizations / code cleanups in separate PRs.
Some additional features in this PR:
I add this as a draft PR, as some CI pipelines still show problems. The problems seem to be caused by the changed module environment on rostam (g++ 10 and clang 12 modules crash during loading) and obsolete g++/HPX versions on CircleCI. However, all tests pass locally, so it is likely we can fix this on the server-side independent of the PR (fix the modules on Rostam, upgrade CircleCI dependencies).