ROCm 3.3 support #3623
Codecov Report
@@           Coverage Diff            @@
##           python   #3623    +/-    ##
========================================
- Coverage      87%     87%     -1%
========================================
  Files         524     512     -12
  Lines       23409   22036   -1373
========================================
- Hits        20595   19371   -1224
+ Misses       2814    2665    -149
No idea what's wrong with the ek_fluctuations test. Philox works, LB thermalization statistics are correct, EK thermalization statistics are incorrect. The responsible function does not look suspicious in any way (espresso/src/core/grid_based_algorithms/electrokinetics_cuda.cu, lines 1346 to 1414 in 8813c4c).
Is anyone familiar with that test?
Note: while tinkering with |
this flag was added by Clang 10
Interesting. I'll bisect optimization flags tomorrow to find out what is causing it. EDIT: |
Also make sure that the EK uses only 32-bit floats and that it calls its own RNG wrapper, not the LB's.
Fixed now. That |
What should we do now with regard to ROCm support? We aren't the only ones affected by the disastrous ROCm versioning strategy. Only supporting ROCm 3.3 might be an issue for users who pinned an earlier ROCm version on their system to stabilize their environment. Should we test both ROCm 3.0 and 3.3 in CI? We could test one of them on a weekly schedule to limit the workload. Testing only 3.3 in CI runs the risk of missing a regression for 3.0.
This has become less of an issue now that you can install multiple ROCm versions side-by-side. The changes in this merge request certainly won't break compatibility with 3.0. Testing 3.0 only wouldn't be sufficient either as someone might be pinning an even older version. Testing every release since 2.0 isn't an option either. That means we can only support older versions on a best-effort basis and simply guard any problematic changes with version checks. Looking back at my ROCm compatibility patches since v2.0, they primarily deal with CMake issues. Silent breakage like the ek_fluctuations test is a rather rare thing.
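For illustration, a version guard of the kind mentioned above could look roughly like the sketch below. It is not taken from this merge request: `ROCM_VERSION_MAJOR`, `ROCM_VERSION_MINOR`, `rocm_version()` and `needs_rocm_3_3_workaround()` are hypothetical names, and the sketch assumes the build system passes the detected ROCm version to the compiler as two integer macros.

```cpp
// Sketch only: ROCM_VERSION_MAJOR / ROCM_VERSION_MINOR are hypothetical macros,
// assumed to be passed by the build system (e.g. -DROCM_VERSION_MAJOR=3).
// The defaults below only exist so that the sketch compiles on its own.
#ifndef ROCM_VERSION_MAJOR
#define ROCM_VERSION_MAJOR 3
#endif
#ifndef ROCM_VERSION_MINOR
#define ROCM_VERSION_MINOR 3
#endif

// Pack (major, minor) into one integer so that versions can be compared
// with the usual relational operators.
constexpr int rocm_version(int major, int minor) {
  return major * 1000 + minor;
}

constexpr int detected_rocm_version =
    rocm_version(ROCM_VERSION_MAJOR, ROCM_VERSION_MINOR);

// Hypothetical run-time guard: true for ROCm 3.3 and later.
constexpr bool needs_rocm_3_3_workaround(int version) {
  return version >= rocm_version(3, 3);
}

// The same check at preprocessing time, for code that does not even
// compile on one side of the version boundary.
#if (ROCM_VERSION_MAJOR * 1000 + ROCM_VERSION_MINOR) >= 3003
// code path for ROCm >= 3.3 goes here
#else
// fallback for older releases, kept on a best-effort basis
#endif
```

Packing the version into a single integer keeps each guard to one comparison, so a change that only misbehaves on some releases can be fenced off without touching the code path used by older installations.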
The `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` issue has been worked around by properly setting `HCC_PATH` on the CMake side.

The shutdown issue has been worked around by replacing interrupts with polling (suggested at ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what. It's not the missing `cudaStreamDestroy` though.

Fixes #3620 (according to `ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000`).

Fixes #3587 (according to `ctest -R ek_charged_plate --repeat-until-fail 100`).

TODO