Support for ROCm 3.1.0 #156
That's a different issue. It talks about an actual version difference, whereas in our case it fails to identify the version.
I couldn't find anything related to this specific issue, though. Could you please post links?
Yes, they are unrelated (gpuowl, DaVinci Resolve, etc.). The issue tracker doesn't have anything about hanging programs or
Since our C++ unit tests work and the Python tests don't, perhaps they changed something in the linking process that only breaks libraries but not executables? We should probably have a quick look at why they hang and file a bug report if necessary. I'm not worried about the HCC_HOME issue; they'll probably discover and fix that one on their own pretty quickly.
The timeout happens because the Python interpreter hangs when calling
Thanks, I will test locally without Docker and see if we need to change anything in the destruction order to make it work again.
The "hcc not found" error is an actual bug in https://github.com/ROCm-Developer-Tools/HIP/blob/master/bin/hipcc_cmake_linker_helper, which was not adapted to the changed paths. One can run

As for the hanging Python interpreter, I suspect the problem is introduced in the final linking stage. I cannot reproduce the issue outside of Espresso unless I use g++ instead of hipcc to link hipcc-generated `.o` files into a `.so` file. That causes the resulting `.so` file to have a `.kernel_ir` section instead of a `.kernel` section. However, libEspresso.so has a `.kernel` section, so that is likely not the cause.
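The section check described above can be scripted (GNU binutils' `readelf -S` prints the same section list). Below is a minimal Python sketch, assuming a 64-bit little-endian ELF file (the usual case on x86-64 Linux) and ignoring extended section numbering; `elf_section_names` is a hypothetical helper, not part of ESPResSo or ROCm.

```python
import struct
import sys

# Markers mentioned in the thread: hipcc-linked objects carry `.kernel`,
# while linking hipcc-generated objects with g++ yields `.kernel_ir`.
KERNEL_SECTIONS = (".kernel", ".kernel_ir")


def elf_section_names(path):
    """Return the section names of a 64-bit little-endian ELF file.

    Extended section numbering (e_shnum == 0 or e_shstrndx == 0xffff)
    is not handled, and the whole file is read into memory.
    """
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    if data[4] != 2 or data[5] != 1:  # ELFCLASS64 and ELFDATA2LSB
        raise ValueError(f"{path} is not a 64-bit little-endian ELF file")

    # ELF64 header: e_shoff at 0x28; e_shentsize, e_shnum, e_shstrndx at 0x3A
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 0x3A)

    def section_header(index):
        # Elf64_Shdr: sh_name at +0x00, sh_offset and sh_size at +0x18
        base = e_shoff + index * e_shentsize
        (sh_name,) = struct.unpack_from("<I", data, base)
        sh_offset, sh_size = struct.unpack_from("<QQ", data, base + 0x18)
        return sh_name, sh_offset, sh_size

    # Section names are byte offsets into the section-header string table
    _, strtab_offset, strtab_size = section_header(e_shstrndx)
    strtab = data[strtab_offset:strtab_offset + strtab_size]

    names = []
    for index in range(e_shnum):
        sh_name, _, _ = section_header(index)
        end = strtab.index(b"\0", sh_name)
        names.append(strtab[sh_name:end].decode("ascii", errors="replace"))
    return names


if __name__ == "__main__":
    # Usage: python3 elf_sections.py libEspresso.so other.so ...
    for path in sys.argv[1:]:
        present = set(elf_section_names(path))
        found = [name for name in KERNEL_SECTIONS if name in present]
        print(f"{path}: {', '.join(found) or 'no kernel section'}")
```

Parsing the headers directly avoids depending on binutils being installed inside a minimal Docker image; where `readelf` is available, it is the simpler tool.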
Offline discussion with @mkuron: we'll roll back to ROCm 3.0 until ROCm 3.1 gets patched.
Description of changes:
- move logic to import packages from `CMakeLists.txt` to dedicated helper files `cmake/Find<package>.cmake` for `find_package()`
- enforce the Cython version requested in `CMakeLists.txt`
- CMake now fails if `WITH_CUDA` is set to true but no CUDA-capable compiler is found
- CMake now fails if `WITH_CLANG_TIDY` is set to true but Clang-tidy is not found or its version doesn't match the Clang compiler version
- drop deprecated `FindCUDA` in favor of native CUDA support in CMake 3.10 (required for #3445)
- add partial support for ROCm 3.1 (closes #3571, required for espressomd/docker#156)
In ROCm 3.0 and 3.1, environment variables for hipcc and hcc are overridden by incorrect paths (espressomd/docker#156). This causes CMake to generate an incorrect linking command for EspressoCore.so: in `/opt/rocm/bin/hipcc_cmake_linker_helper /opt/rocm -fPIC ...`, either the path `/opt/rocm` is an empty string, or both the linker path and the path `/opt/rocm` are empty strings. Calling `find_package()` twice with an overridden `HCC_PATH` fixes the linking command.
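A malformed link line like this could be caught before it runs. Below is a minimal Python sketch, assuming the argument layout of the example command (helper script first, install path second); `check_linker_helper_command` is a hypothetical helper, not part of CMake, HIP, or ESPResSo. It relies on the fact that an empty variable expanded unquoted into a shell command disappears after word splitting, so the following flag slides into its slot.

```python
import shlex


def check_linker_helper_command(command):
    """Return a list of problems found in a
    `hipcc_cmake_linker_helper <install path> ...` command line.

    Hypothetical layout taken from the example command:
    argv[0] is the linker helper and argv[1] the install path.
    """
    argv = shlex.split(command)
    problems = []
    # Both paths empty: the command starts directly with a flag
    if not argv or argv[0].startswith("-"):
        problems.append("linker path is missing")
        return problems
    # Install path empty: the first flag takes its position
    if len(argv) < 2 or argv[1].startswith("-"):
        problems.append("install path is missing")
    return problems


if __name__ == "__main__":
    good = "/opt/rocm/bin/hipcc_cmake_linker_helper /opt/rocm -fPIC -shared"
    bad = "/opt/rocm/bin/hipcc_cmake_linker_helper  -fPIC -shared"
    print(check_linker_helper_command(good))  # []
    print(check_linker_helper_command(bad))   # ['install path is missing']
```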
Libraries in ROCm 3.1.0 are now separated into subfolders, causing the weekly build to fail in CI (logfile). After updating the CMake logic to support both 3.0.x and 3.1.0 (jngrad/espresso@d24236a107e), CMake no longer failed and the compilation went fine (rocFFT and thrust are now visible to the compiler), but the linking stage failed:
The Perl wrapper script `/opt/rocm-3.1.0/hip/bin/hipcc` has an issue at line 246. Setting the environment variable with `export HCC_HOME=/opt/rocm/hcc` doesn't change the error message: something overrides the value of that environment variable with `/opt/rocm-3.1.0/hip` before we hit line 246. A quick fix with symlinks (jngrad/espresso-docker@07665f1415) made the error message vanish; however, all the Python tests now time out on 3.1.0 (the C++ unit tests don't). The Python tests run just fine on 3.0.0, so it's not a regression from the new CMake logic.

The error message is documented in a 2017 wiki page of ROCmSoftwarePlatform/hipBLAS (link), where the proposed solution is to recompile HIP, but that advice is probably obsolete. Two users reported multiple issues with ROCm 3.1.0 on the RadeonOpenCompute/ROCm issue tracker last week.
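To see which `HCC_HOME` value actually contains the compiler, one could test each candidate directory, e.g. the exported `/opt/rocm/hcc` versus the overriding `/opt/rocm-3.1.0/hip`. Below is a minimal Python sketch, assuming the layout `<HCC_HOME>/bin/hcc`; `hcc_binary` is a hypothetical helper and does not reproduce the actual check at line 246 of `hipcc`.

```python
import os


def hcc_binary(hcc_home):
    """Return the hcc executable under hcc_home, or None if absent.

    Assumes (hypothetically) the layout <HCC_HOME>/bin/hcc.
    """
    if not hcc_home:
        return None
    candidate = os.path.join(hcc_home, "bin", "hcc")
    # The compiler must exist as a regular file and be executable
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    # The exported value and the overriding value from the text
    for home in ("/opt/rocm/hcc", "/opt/rocm-3.1.0/hip"):
        print(f"{home}: {hcc_binary(home) or 'hcc not found'}")
```

On an affected 3.1.0 installation, one would expect only the first directory to resolve, which is consistent with a symlink under the HIP prefix making the error vanish.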
@mkuron, should we just wait?