Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Close results stream before data frame analytics job stops #67828

Conversation

dimitris-athanasiou
Copy link
Contributor

Investigating the failures in #67581 it looked like after restarting
the regression job the process was started but no data was loaded.
So the process was getting stuck waiting for data.

Looking into the code it looks like this can be explained by the
fact that AnalyticsResultProcessor counts down its completion
latch before it closes the results stream. This means the job
may go to stopped state while the out stream is still alive,
which on windows results to the directory with the named pipes
staying around. Then when the job is started again, which the
test does immediately, the old pipes are used and thus the
data is not sent to the the new process.

This commit fixes this while refactoring ML processes to consume
and close the out stream in their close method so that calling
code is not responsible for doing that.

Fixes #67581

Investigating the failures in elastic#67581 it looked like after restarting
the regression job the process was started but no data was loaded.
So the process was getting stuck waiting for data.

Looking into the code it looks like this can be explained by the
fact that `AnalyticsResultProcessor` counts down its completion
latch before it closes the results stream. This means the job
may go to `stopped` state while the out stream is still alive,
which on windows results to the directory with the named pipes
staying around. Then when the job is started again, which the
test does immediately, the old pipes are used and thus the
data is not sent to the the new process.

This commit fixes this while refactoring ML processes to consume
and close the out stream in their close method so that calling
code is not responsible for doing that.

Fixes elastic#67581
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@@ -117,7 +117,6 @@ public void process(AnalyticsProcess<AnalyticsResult> process) {
completeResultsProgress();
}
completionLatch.countDown();
process.consumeAndCloseOutputStream();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the offending code piece. Latch down was counted down before we close output stream.

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou
Copy link
Contributor Author

It turns out this fix is not correct. We cannot consume and close the output stream in the process close method.

The first thing close does it to close the processes input stream. This signals the process termination. However, the process might be writing out results as a result of terminating. Thus, we need the result processors to keep being able to read results and only after those results are all written out close the output stream. I'll come back with a different fix.

@dimitris-athanasiou dimitris-athanasiou deleted the close-results-stream-before-dfa-job-stops branch January 21, 2021 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] RegressionIT.testStopAndRestart failure
3 participants