Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] MachineLearningLicensingTests] [testAutoCloseJobWithDatafeed fails with NPE #51285

Closed
henningandersen opened this issue Jan 22, 2020 · 4 comments · Fixed by #51302
Closed
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@henningandersen
Copy link
Contributor

MachineLearningLicensingTests.testAutoCloseJobWithDatafeed fails rarely with NPE. Can be reproduced by adding a sleep in the onResponse method here and using following repro line:

./gradlew ':x-pack:plugin:ml:test' --tests "org.elasticsearch.license.MachineLearningLicensingTests.testAutoCloseJobWithDatafeed" -Dtests.seed=8505B129BBE04142 -Dtests.security.manager=true -Dtests.locale=en-SG -Dtests.timezone=Asia/Hovd -Dcompiler.java=13

It failed on a couple of PR builds, for instance this one:

https://gradle-enterprise.elastic.co/s/doocabu3vhmdm/tests/7bifs6bt3pims-kf5sbxwq5fdpm

The assertion error in that build is new code, but caused by the NullPointerException:

Caused by: java.lang.NullPointerException
	at org.elasticsearch.xpack.ml.datafeed.DatafeedManager.getJobId(DatafeedManager.java:271)
	at org.elasticsearch.xpack.ml.datafeed.DatafeedManager.getJobState(DatafeedManager.java:275)
	at org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner.runWhenJobIsOpened(DatafeedManager.java:498)
	at org.elasticsearch.xpack.ml.datafeed.DatafeedManager$1.onResponse(DatafeedManager.java:94)
	at org.elasticsearch.xpack.ml.datafeed.DatafeedManager$1.onResponse(DatafeedManager.java:91)
	at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:162)
@henningandersen henningandersen added >test-failure Triaged test failures from CI :ml Machine learning v8.0.0 labels Jan 22, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor

droberts195 commented Jan 22, 2020

@henningandersen how long did you sleep for to make it easily reproducible? I tried a variety of values from 20ms to 9 seconds and none of them reproduced the problem. (10 seconds or more causes an unrelated failure because an assertBusy wait would need extending with such a long sleep.)

Having said that I can see a theoretical flaw in the code, and a reason why adding a sleep would make it more likely to happen. So I am happy to fix that today.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 22, 2020
The ID of the datafeed's associated job was being obtained
frequently by looking up the datafeed task in a map that
was being modified in other threads.  This could lead to
NPEs if the datafeed stopped running at an unexpected time.

This change reduces the number of places where a datafeed's
associated job ID is looked up to avoid the possibility of
failures when the datafeed's task is removed from the map
of running tasks during multi-step operations in other
threads.

Fixes elastic#51285
@droberts195
Copy link
Contributor

I think #51302 will fix this, but it's hard to be certain as I couldn't reproduce it locally.

droberts195 added a commit that referenced this issue Jan 22, 2020
The ID of the datafeed's associated job was being obtained
frequently by looking up the datafeed task in a map that
was being modified in other threads.  This could lead to
NPEs if the datafeed stopped running at an unexpected time.

This change reduces the number of places where a datafeed's
associated job ID is looked up to avoid the possibility of
failures when the datafeed's task is removed from the map
of running tasks during multi-step operations in other
threads.

Fixes #51285
droberts195 added a commit that referenced this issue Jan 22, 2020
The ID of the datafeed's associated job was being obtained
frequently by looking up the datafeed task in a map that
was being modified in other threads.  This could lead to
NPEs if the datafeed stopped running at an unexpected time.

This change reduces the number of places where a datafeed's
associated job ID is looked up to avoid the possibility of
failures when the datafeed's task is removed from the map
of running tasks during multi-step operations in other
threads.

Fixes #51285
@henningandersen
Copy link
Contributor Author

Thanks @droberts195 , you are right, it is only surfaced due to the added assertion in my PR (sorry, should have double checked against master before reporting here). It looks like the resulting NPE is ignored in AllocatedPersistentTask.completeAndNotifyIfNeeded since the task is already completed.

I used a sleep of 1 second.

I checked after merging in your changes locally and it seems to work fine, thanks for fixing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI v8.0.0-alpha1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants