Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ROCm 3.1.0 #156

Closed
jngrad opened this issue Mar 6, 2020 · 8 comments · Fixed by #157
Closed

Support for ROCm 3.1.0 #156

jngrad opened this issue Mar 6, 2020 · 8 comments · Fixed by #157

Comments

@jngrad
Copy link
Member

jngrad commented Mar 6, 2020

Libraries in ROCm 3.1.0 are now separated in subfolders, causing the weekly build to fail CI (logfile). After updating the CMake logic to support both 3.0.x and 3.1.0 (jngrad/espresso@d24236a107e), CMake didn't fail anymore and the compilation went fine (rocFFT and thrust are now visible to the compiler), but the linking stage failed:

[ 46%] Linking HIP shared library EspressoCore.so
Can't exec "/opt/rocm-3.1.0/hip/bin/hcc": No such file or directory at /opt/rocm-3.1.0/hip/bin/hipcc line 248.
Use of uninitialized value $HCC_VERSION in pattern match (m//) at /opt/rocm-3.1.0/hip/bin/hipcc line 249.
Use of uninitialized value $HCC_VERSION_MAJOR in substitution (s///) at /opt/rocm-3.1.0/hip/bin/hipcc line 252.
Can't exec "/opt/rocm-3.1.0/hip/bin/hcc-config": No such file or directory at /opt/rocm-3.1.0/hip/bin/hipcc line 264.
Use of uninitialized value $HCC_VERSION_MAJOR in string eq at /opt/rocm-3.1.0/hip/bin/hipcc line 269.
Use of uninitialized value $HCC_VERSION in string ne at /opt/rocm-3.1.0/hip/bin/hipcc line 935.
Use of uninitialized value $HCC_VERSION in concatenation (.) or string at /opt/rocm-3.1.0/hip/bin/hipcc line 936.
HIP (/opt/rocm-3.1.0/hip) was built using hcc 3.1.20023-6d267cfb-6a70953f87a, but you are using /opt/rocm-3.1.0/hip/hcc with version  from hipcc. Please rebuild HIP including cmake or update HCC_HOME variable.
Died at /opt/rocm-3.1.0/hip/bin/hipcc line 937.

The perl wrapper script /opt/rocm-3.1.0/hip/bin/hipcc has an issue at line 246:

$HCC_HOME=$ENV{'HCC_HOME'} // $hipConfig{'HCC_HOME'} // "$ROCM_PATH/hcc";

Setting the environment variable doesn't change the error message:

export HCC_HOME=/opt/rocm/hcc

Something overrides the value of that environment variable with /opt/rocm-3.1.0/hip before we hit line 246. A quick fix with symlinks (jngrad/espresso-docker@07665f1415) made the error message vanish, however all the python tests now time out on 3.1.0 (the C++ unit tests don't). The python tests run just fine on 3.0.0, so it's not a regression from the new CMake logic.

The error message is documented in a 2017 wiki page of ROCmSoftwarePlatform/hipBLAS (link) and the proposed solution is to recompile HIP, but it's probably obsolete. Two users have reported multiple issues with ROCm 3.1.0 on the RadeonOpenCompute/ROCm issue tracker last week.

@mkuron should we just wait?

@mkuron
Copy link
Member

mkuron commented Mar 7, 2020

The error message is documented in a 2017 wiki page of ROCmSoftwarePlatform/hipBLAS (link) and the proposed solution is to recompile HIP, but it's probably obsolete.

That‘s a different issue. There it talks about an actual version difference, while in our case it fails to identify the version.

Two users have reported multiple issues with ROCm 3.1.0 on the RadeonOpenCompute/ROCm issue tracker last week.

I couldn‘t find aynthing related to this specific issue though. Could you please post links?

@jngrad
Copy link
Member Author

jngrad commented Mar 7, 2020

I couldn‘t find aynthing related to this specific issue though.

Yes, they are unrelated (gpuowl, Davinci Resolve, etc.). The issue tracker doesn't have anything about hanging programs or HCC_HOME-related error message. It didn't leave me with a good impression when I saw multiple packages broke. But looking closer, gpuowl already broke on 3.0.0.

@mkuron
Copy link
Member

mkuron commented Mar 7, 2020

Since our C++ unit tests work and the Python tests don't, perhaps they changed something about the linking process that only breaks libraries but not executables? We should probably have a quick look at why they hang and make a bug report if necessary.

I'm not worried about the HCC_HOME thing, they'll probably discover and fix that one on their own pretty quickly.

@jngrad
Copy link
Member Author

jngrad commented Mar 9, 2020

The timeout happens because the python interpreter hangs out when calling exit(). You can execute or import a sample in pypresso and the simulation will run just fine, but the interpreter cannot quit.

@mkuron
Copy link
Member

mkuron commented Mar 9, 2020

Thanks, I will test locally without Docker and see if we need to change anything in the destruction order to make it work again.

@mkuron
Copy link
Member

mkuron commented Mar 9, 2020

The hcc not found error is an actual bug in https://github.com/ROCm-Developer-Tools/HIP/blob/master/bin/hipcc_cmake_linker_helper, which was not adapted to the changed paths. One can run /opt/rocm-3.1.0/hip/bin/hipcc_cmake_linker_helper /opt/rocm-3.1.0/hip --version to reproduce the message outside of a build.

The Python interpreter hanging problem is that hipGetSymbolAddress and hipMemcpyToSymbol return hipErrorInvalidSymbol. Some parts of Espresso fail to correctly clean up after themselves if that happens and cause HIP's shutdown procedure to hang.

I suspect this error is introduced in the final linking stage. I cannot reproduce the issue outside of Espresso unless I use g++ instead of hipcc for linking hipcc-generated .o files into a .so file. That causes the resulting .so file to have a .kernel_ir section instead of a .kernel section. libEspresso.so has a .kernel section, so that is likely not the cause.

@mkuron
Copy link
Member

mkuron commented Mar 10, 2020

The hipErrorInvalidSymbol problem disappeared overnight. Now I get failing EK and LB tests (with NaNs, so probably some data doesn't get transferred, possibly the same problem as ROCm/HIP#1920), succeeding P3M-GPU tests and all of them hang at shutdown with backtraces like this:

#0  0x00007f20fea36ac5 in core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#1  0x00007f20fea28a59 in HSA::hsa_signal_wait_relaxed(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#2  0x00007f20fea19109 in amd::AqlQueue::~AqlQueue() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#3  0x00007f20fea19379 in amd::AqlQueue::~AqlQueue() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#4  0x00007f20fea14701 in amd::GpuAgent::~GpuAgent() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#5  0x00007f20fea14869 in amd::GpuAgent::~GpuAgent() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#6  0x00007f20fea3ebaf in core::Runtime::Unload() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#7  0x00007f20fea3f2d5 in core::Runtime::Release() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#8  0x00007f20fea254ad in HSA::hsa_shut_down() ()
   from /opt/rocm-3.1.0/lib/libhsa-runtime64.so.1
#9  0x00007f20ff4d9601 in ~HSAContext ()
    at /data/jenkins-workspace/compute-rocm-rel-3.1/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3743
#10 0x00007f20ff4d9335 in ShutdownImpl ()
    at /data/jenkins-workspace/compute-rocm-rel-3.1/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:5797
#11 0x00007f20ff4bd592 in ~RuntimeImpl ()
    at /data/jenkins-workspace/compute-rocm-rel-3.1/external/hcc-tot/lib/mcwamp.cpp:51

@jngrad
Copy link
Member Author

jngrad commented Mar 10, 2020

Offline discussion with @mkuron: we'll rollback to ROCm 3.0 until ROCm 3.1 gets patched.

kodiakhq bot added a commit to espressomd/espresso that referenced this issue Mar 12, 2020
Description of changes:
- move logic to import packages from `CMakeLists.txt` to dedicated helper files `cmake/Find<package>.cmake` for `find_package()`
- enforce the Cython version requested in `CMakeLists.txt`
- CMake now fails if `WITH_CUDA` is set to true but no CUDA-capable compiler is found
- CMake now fails if `WITH_CLANG_TIDY` is set to true but Clang-tidy is not found or its version doesn't match the Clang compiler version
- drop deprecated `FindCUDA` in favor of native CUDA support in CMake 3.10 (required for  #3445)
- add partial support for ROCm 3.1 (closes #3571, required for espressomd/docker#156)
jngrad added a commit to jngrad/espresso that referenced this issue Mar 13, 2020
In ROCm 3.0 and 3.1, environment variables for hipcc and hcc are
overriden by incorrect paths (espressomd/docker#156). This causes
CMake to generate an incorrect linking command for EspressoCore.so:
in `/opt/rocm/bin/hipcc_cmake_linker_helper /opt/rocm -fPIC ...`,
either path `/opt/rocm` is an empty string, or both the linker path
and path `/opt/rocm` are empty strings. Calling `find_package()`
twice with an overriden `HCC_PATH` fixes the linking command.
jngrad added a commit to jngrad/espresso that referenced this issue Mar 13, 2020
In ROCm 3.0 and 3.1, environment variables for hipcc and hcc are
overriden by incorrect paths (espressomd/docker#156). This causes
CMake to generate an incorrect linking command for EspressoCore.so:
in `/opt/rocm/bin/hipcc_cmake_linker_helper /opt/rocm -fPIC ...`,
either path `/opt/rocm` is an empty string, or both the linker path
and path `/opt/rocm` are empty strings. Calling `find_package()`
twice with an overriden `HCC_PATH` fixes the linking command.
jngrad added a commit to jngrad/espresso that referenced this issue Mar 14, 2020
In ROCm 3.0 and 3.1, environment variables for hipcc and hcc are
overriden by incorrect paths (espressomd/docker#156). This causes
CMake to generate an incorrect linking command for EspressoCore.so:
in `/opt/rocm/bin/hipcc_cmake_linker_helper /opt/rocm -fPIC ...`,
either path `/opt/rocm` is an empty string, or both the linker path
and path `/opt/rocm` are empty strings. Calling `find_package()`
twice with an overriden `HCC_PATH` fixes the linking command.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants