Fix worker streams in OLS-eig executing in an unsafe order #4539
@@ -228,11 +252,15 @@ class OlsTest : public ::testing::TestWithParam<OlsInputs<T>> {
   T intercept, intercept2, intercept3;
 };
 
-const std::vector<OlsInputs<float>> inputsf2 = {
-  {0.001f, 4, 2, 2, 0}, {0.001f, 4, 2, 2, 1}, {0.001f, 4, 2, 2, 2}};
+const std::vector<OlsInputs<float>> inputsf2 = {{hconf::NON_BLOCKING_ONE, 0.001f, 4, 2, 2, 0},
Without the changes in this PR, are these assertions able to reliably reproduce the problem?
No, this bug seems to be very elusive. I managed to reproduce it only under some specific conditions in Python, but then it disappeared again after I made some further changes to optimize preProcessData
(perhaps due to the changed pattern of calls using the main stream / rmm resources).
Still, I hope the changes in these tests will help to find other stream-related bugs, if there are any more.
Thanks @achirkin for fixing this problem! It looks good, I just have a few smaller comments.
Thanks Artem for the update, the PR looks good to me.
If I understand correctly, this bug affects LinearRegression, and the workaround is either to pass a handle explicitly, LinearRegression(handle=Handle(), ...), or to use a different algorithm, LinearRegression(algorithm='qr', ...). @achirkin would the first option (passing handle) be the preferred workaround?
Yes, I think it would have less impact on performance. Thanks for the suggestion! Also I should note that I could only reproduce the problem when both
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #4539 +/- ##
===============================================
Coverage ? 85.73%
===============================================
Files ? 239
Lines ? 19585
Branches ? 0
===============================================
Hits ? 16791
Misses ? 2794
Partials ? 0
@gpucibot merge
…4539)
The latest version of the "eig" OLS solver has a bug producing garbage results under some conditions. When at least one worker stream is used to run some operations concurrently, for sufficiently large workset sizes, the memory allocation in the main stream may finish later than the worker stream starts to use it.
This PR adds more ordering between the main and the worker streams, fixing this and some other theoretically possible edge cases.
Authors:
- Artem M. Chirkin (https://github.com/achirkin)
Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
URL: rapidsai#4539
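The race described in the commit message can be illustrated with a small host-side toy model. This is only a sketch, not the cuML code and not the CUDA runtime: `Stream`, `Event`, `drain` and `run_scenario` are hypothetical names invented for the illustration. A "stream" here is a FIFO of operations, an "event" is a flag set when the recording operation runs, and the scheduler deliberately favours the worker stream to mimic the unlucky timing. In real CUDA code the `record`/`wait` pair roughly corresponds to `cudaEventRecord` followed by `cudaStreamWaitEvent`.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: an "event" is a flag that flips once the op recording it has run.
struct Event { bool done = false; };

struct Op {
  std::function<bool()> ready;  // may this op start now?
  std::function<void()> run;    // the work itself
};

// Toy model only: a "stream" is a FIFO queue; ops in one stream run in order.
struct Stream {
  std::deque<Op> ops;
  void enqueue(std::function<void()> run) {
    ops.push_back(Op{[] { return true; }, std::move(run)});
  }
  // Mark the event as done when the stream reaches this point.
  void record(Event& e) { enqueue([&e] { e.done = true; }); }
  // Block this stream until the event has been recorded elsewhere.
  void wait(Event& e) {
    ops.push_back(Op{[&e] { return e.done; }, [] {}});
  }
};

// Runs one op at a time, always preferring the earliest stream in `order`
// whose head op is ready. Stops when nothing can make progress.
void drain(std::vector<Stream*> order) {
  for (bool progressed = true; progressed;) {
    progressed = false;
    for (Stream* s : order) {
      if (!s->ops.empty() && s->ops.front().ready()) {
        Op op = std::move(s->ops.front());
        s->ops.pop_front();
        op.run();
        progressed = true;
        break;
      }
    }
  }
}

// Returns true if the worker found the workspace allocated when it ran.
bool run_scenario(bool order_streams) {
  Stream main_stream, worker;
  Event allocated;
  std::vector<float> workspace;  // empty means "not allocated yet"
  bool worker_saw_memory = false;

  main_stream.enqueue([&] { workspace.assign(4, 0.0f); });  // "allocation"
  main_stream.record(allocated);
  if (order_streams) worker.wait(allocated);  // the extra ordering
  worker.enqueue([&] { worker_saw_memory = !workspace.empty(); });

  drain({&worker, &main_stream});  // adversarial: worker is scheduled first
  return worker_saw_memory;
}
```

Calling `run_scenario(false)` lets the worker touch the workspace before the main stream has allocated it, while `run_scenario(true)` makes the worker wait on the event recorded after the allocation. The model is deterministic on purpose; on a real GPU the unordered case only fails under particular timing, which matches how elusive the bug was in the discussion above.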