[ML] Close results stream before data frame analytics job stops #67828
Conversation
Investigating the failures in elastic#67581, it looked like after restarting the regression job the process was started but no data was loaded, so the process was stuck waiting for data. Looking into the code, this can be explained by the fact that `AnalyticsResultProcessor` counts down its completion latch before it closes the results stream. This means the job may go to `stopped` state while the output stream is still alive, which on Windows results in the directory with the named pipes staying around. When the job is then started again, which the test does immediately, the old pipes are reused and thus the data is not sent to the new process. This commit fixes this by refactoring ML processes to consume and close the output stream in their close method, so that calling code is not responsible for doing that. Fixes elastic#67581
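For illustration, here is a minimal sketch of the ordering problem described above, using hypothetical class and method names rather than the actual Elasticsearch ML classes. The point is only that counting a completion latch down before the results stream is closed lets a waiter treat the job as stopped while the stream, and the named pipe behind it, is still open.

```java
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not the real AnalyticsResultProcessor.
class ResultProcessorSketch {

    private final CountDownLatch completionLatch = new CountDownLatch(1);

    // Problematic ordering: the latch is released while the process output
    // stream (backed by a named pipe on Windows) is still open.
    void processBuggy(InputStream processOutput) throws Exception {
        try {
            // ... read and handle results from processOutput ...
        } finally {
            completionLatch.countDown();  // job may now transition to "stopped"
            processOutput.close();        // pipe is torn down only afterwards
        }
    }

    // Safer ordering: drain and close the stream first, then release waiters,
    // so reaching "stopped" implies the pipe is already gone.
    void processFixed(InputStream processOutput) throws Exception {
        try {
            // ... read and handle results from processOutput ...
        } finally {
            while (processOutput.read() != -1) {
                // drain any remaining bytes
            }
            processOutput.close();
            completionLatch.countDown();
        }
    }

    void awaitCompletion() throws InterruptedException {
        completionLatch.await();
    }
}
```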
Pinging @elastic/ml-core (:ml)
@@ -117,7 +117,6 @@ public void process(AnalyticsProcess<AnalyticsResult> process) {
             completeResultsProgress();
         }
         completionLatch.countDown();
-        process.consumeAndCloseOutputStream();
Here is the offending code piece: the latch was counted down before we closed the output stream.
@elasticmachine update branch
It turns out this fix is not correct. We cannot consume and close the output stream in the process close method. The first thing close does is close the process's input stream, which signals the process to terminate. However, the process might write out results as part of terminating. Thus, the result processors need to keep being able to read results, and only after all those results have been written out should the output stream be closed. I'll come back with a different fix.
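To make the required ordering concrete, here is a rough sketch of the shutdown sequence this comment implies, again with hypothetical names and plain Java streams rather than the real ML process abstraction: closing the process's input stream signals termination, the remaining results are drained to end-of-stream, and only then are the output stream closed and the completion latch released.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the shutdown ordering, not the eventual fix.
class ProcessShutdownSketch {

    private final OutputStream processInput;   // data we feed into the process
    private final InputStream processOutput;   // results the process writes back
    private final CountDownLatch completionLatch = new CountDownLatch(1);

    ProcessShutdownSketch(OutputStream processInput, InputStream processOutput) {
        this.processInput = processInput;
        this.processOutput = processOutput;
    }

    void shutdown() throws Exception {
        // 1. Closing the input stream signals the process to terminate ...
        processInput.close();

        // 2. ... but the process may still flush results while terminating,
        //    so keep reading until end-of-stream instead of closing right away.
        byte[] buffer = new byte[8192];
        while (processOutput.read(buffer) != -1) {
            // handle or discard the final results
        }

        // 3. Only now is it safe to close the output stream (and its named pipe) ...
        processOutput.close();

        // 4. ... and finally release anyone waiting for the job to reach "stopped".
        completionLatch.countDown();
    }
}
```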