-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Develop Stream 2024-07-01 #377
Develop Stream 2024-07-01 #377
Conversation
For some reason, the CI builds are failing because the inline assembly for thread_load.hpp aren't being formed correctly:
However, I am unable to reproduce this compile time error with the same ROCM build used in our CI, nor in even newer builds. It's bizarre. I think we can merge in this PR, and if this error is legitimate, then it should show itself again in other environments and we can attempt to reproduce the problem then. |
@Beanavil @Naraenda Upon consultation with @lawruble13 and looking closely at the compile error log:
|
…prim as much as possible
This reverts commit 894f60cae177096bcf1000e5dd321174faf2476b.
…unction attributes
17b72bd
to
0399aa6
Compare
@stanleytsang-amd I just rebased the branch 👍🏼 |
Wow, that was quick. Thanks! I'll rerun CI now and keep you posted. |
Further analysis revealed that a change from 41f6679 seems to be the trigger - namely the moving of the \t and the removal of the hash symbol as part of a fix required for gfx12. It compiled and functioned fine in past testing, but newer compiler builds seem to have changed that, resulting in the above build error. |
@Beanavil @Naraenda #390 fixed the build error, but it's now exposed a potential problem with the DeviceSegmentedReduce tests, they appear to be hanging on our CI on multiple platforms. It doesn't say which test, but probably it is the new LargeIndices test? What is the expected GPU memory usage of that test? The reason I ask is because both of the failing GPUs so far have 32GB of memory, and while I unfortunately don't have access to a MI100 or Navi32 at the moment to test, I have tested on a 48GB Navi31 and a MI300 with 192GB of memory and both execute the DeviceSegmentedReduce tests just fine. The MI200 pipeline on CI is still waiting for the next available executor, I'll check back later. I'll see if I can find a MI100 or Navi32 to work on in the meantime. |
It is also hanging on MI200, so the hang doesn't seem 100% correlated with GPU memory capacity/usage. However, the I just tried the test on Navi32 with the same ROCm build and it passed... @Beanavil @Naraenda Assuming we can't get this resolved before the end of the week, I'm considering merging in rocPRIM PR 575 tomorrow (and rocThrust) if the rocPRIM build issue is resolved (see my latest comment in the rocPRIM PR) and all tests pass, to get a head start on deeper testing for rocPRIM. I don't see explicit mention of this PR being dependent on rocPRIM PR 575, is that actually the case? Can we safely merge in rocPRIM without breaking hipCUB? |
@stanleytsang-amd sorry for the late reply:
In our CI we test MI100 (also Vega20 and V620), and all the tests passed successfully.
Hmm, could we get some more details about the platforms on which it's hanging? I'll see if I can reproduce it within the docker image you mentioned in the meantime.
Let me double-check this just to be sure. Update: it builds fine 👍🏼 |
Thanks for verifying the independence of this hipCUB PR from the rocPRIM PR. The Linux platforms we test on are:
The docker image I sent you yesterday is the same base image used in 1). |
This is a bit unorthodox but I think we will merge in this PR, in order to expose it to a larger set of testing environments - if the hang is indeed legitimate, it's my hope that one of these environments will actually allow us to reproduce the hang locally. |
An update to the TwiddleIn/Out functions from PR ROCm#377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log.
An update to the TwiddleIn/Out functions from PR ROCm#377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log.
An update to the TwiddleIn/Out functions from PR #377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log.
* Revert Bit Twiddle change from PR #377 (#397) An update to the TwiddleIn/Out functions from PR #377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log. * Remove website URL from comments (#398) Referencing or using code from some websites is prohibited in this repository. This change removes an informational reference in the comments. * Add gfx1151 target (#399) (#401) Co-authored-by: Stanley Tsang <[email protected]> * Spolifroni amd/624 changelogcleanup upcoming (#411) * edited to conform to standards * edited to conform to standards * updated the changelog for 6.3 (#418) --------- Co-authored-by: amd-garydeng <[email protected]> Co-authored-by: Stanley Tsang <[email protected]> Co-authored-by: spolifroni-amd <[email protected]>
* Revert Bit Twiddle change from PR ROCm#377 (ROCm#397) An update to the TwiddleIn/Out functions from PR ROCm#377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log. * Remove website URL from comments (ROCm#398) Referencing or using code from some websites is prohibited in this repository. This change removes an informational reference in the comments. * Add gfx1151 target (ROCm#399) (ROCm#401) Co-authored-by: Stanley Tsang <[email protected]> * Spolifroni amd/624 changelogcleanup upcoming (ROCm#411) * edited to conform to standards * edited to conform to standards * updated the changelog for 6.3 (ROCm#418) --------- Co-authored-by: amd-garydeng <[email protected]> Co-authored-by: Stanley Tsang <[email protected]> Co-authored-by: spolifroni-amd <[email protected]>
* Resolve "hipCUB compilation fails with latest rocPRIM changes" * Rework device_histogram test * Const-qualify half_t::operator+/* * Update CUB/Thrust/libcu++ to 2.5.0 * Add tets for large number of items for hipcub::DeviceSelect::If * Update example_device_radix_sort.cu * fix: reset error code in device_radix_sort test after out-of-memory error * Expose DeviceSelect::FlaggedIf * Add test for DeviceSelect::FlaggedIf * Add benchmark for DeviceSelect::FlaggedIf * Set c++ version to 17 and create warning * Fix ambiguous variable error * Fix nodiscard warnings * Set CI tests for both c++14 and 17 * Fix nodiscard warnings in example * Examples clang-format * Fixed clang format and dates * temp fix: wrong error on cuda machines * Ignore error from hipGetLastError to prevent warning * Deleted declaration of unecessary hipError_t result * Deleted declaration of unecessary hipError_t result * Format amending * Format amending * Format amending * Cleanup * Formatting * Added wrapper for BindTexture * Fixed some issues in the test * Removed unnecessary code * Mergeback 6.3 fixes (#420) * Revert Bit Twiddle change from PR #377 (#397) An update to the TwiddleIn/Out functions from PR #377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log. * Remove website URL from comments (#398) Referencing or using code from some websites is prohibited in this repository. This change removes an informational reference in the comments. * Add gfx1151 target (#399) (#401) Co-authored-by: Stanley Tsang <[email protected]> * Spolifroni amd/624 changelogcleanup upcoming (#411) * edited to conform to standards * edited to conform to standards * updated the changelog for 6.3 (#418) --------- Co-authored-by: amd-garydeng <[email protected]> Co-authored-by: Stanley Tsang <[email protected]> Co-authored-by: spolifroni-amd <[email protected]> * Add parameter to specify rocPRIM branch to use (#403) * added link to documentation (#416) --------- Co-authored-by: Balint Soproni <[email protected]> Co-authored-by: Beatriz Navidad Vilches <[email protected]> Co-authored-by: Saiyang Zhang <[email protected]> Co-authored-by: Wayne Franz <[email protected]> Co-authored-by: amd-garydeng <[email protected]> Co-authored-by: Stanley Tsang <[email protected]> Co-authored-by: spolifroni-amd <[email protected]> Co-authored-by: Lauren Wrubleski <[email protected]>
* Revert Bit Twiddle change from PR #377 (#397) An update to the TwiddleIn/Out functions from PR #377 seems to be causing a build failure in onnxruntime. This change reverts the single commit (0721c2c) that made those changes. We can re-apply the change with an appropriate fix in the future. Note: the commits in the PR were squashed, so that commit will not show up in the log. * Remove website URL from comments (#398) Referencing or using code from some websites is prohibited in this repository. This change removes an informational reference in the comments. * Add gfx1151 target (#399) (#401) Co-authored-by: Stanley Tsang <[email protected]> * Spolifroni amd/624 changelogcleanup upcoming (#411) * edited to conform to standards * edited to conform to standards * updated the changelog for 6.3 (#418) * added support gfx1151 and gfx12 to default gpu list * updated changelog * fixed minor grammar mistakes in changelog * Update CHANGELOG.md Co-authored-by: spolifroni-amd <[email protected]> * Update CHANGELOG.md Co-authored-by: spolifroni-amd <[email protected]> * Remove gfx940,gfx941 targets (#424) * Update version for 6.4 * Add extended tests changelog entry --------- Co-authored-by: Wayne Franz <[email protected]> Co-authored-by: amd-garydeng <[email protected]> Co-authored-by: spolifroni-amd <[email protected]> Co-authored-by: NguyenNhuDi <[email protected]> Co-authored-by: Val Movsik <[email protected]>
config.hpp
half
andbfloat16
to hipCUB'sWarpReduce
andScan
teststest_hipcub_thread.cpp
build errors withhalf
andbfloat16
typeshalf
andbfloat16
to hipCUB'sThread
testIteratorCategory
to rocPRIM's backendDeviceSegmentedReduce
TwiddleIn/Out
implementation (for rocPRIM's backend) to use rocPRIM'sradix_key_codec