[CI] MachineLearningIT testStopDatafeed fails on close/cleanup #64463

Closed
hendrikmuhs opened this issue Nov 2, 2020 · 3 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@hendrikmuhs

Build scan: https://gradle-enterprise.elastic.co/s/lsc4liio5ogt6

Repro line:

./gradlew ':client:rest-high-level:asyncIntegTest' --tests "org.elasticsearch.client.MachineLearningIT.testStopDatafeed" \
  -Dtests.seed=6BE3C7FF3E3CC718 \
  -Dtests.security.manager=true \
  -Dtests.locale=de-LU \
  -Dtests.timezone=America/Louisville \
  -Druntime.java=8

Reproduces locally?: no

Applicable branches: 7.10

Failure history:

Failure excerpt:

10:13:38 org.elasticsearch.client.MachineLearningIT > testStopDatafeed FAILED
10:13:38     ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot open job [test-stop-datafeed1] because it has already been opened]]; nested: ElasticsearchException[Elasticsearch exception [type=resource_already_exists_exception, reason=task with id {job-test-stop-datafeed1} already exist]];
10:13:38         at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1907)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1884)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1641)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1598)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1568)
10:13:38         at org.elasticsearch.client.MachineLearningClient.openJob(MachineLearningClient.java:357)
10:13:38         at org.elasticsearch.client.MachineLearningIT.openJob(MachineLearningIT.java:2675)
10:13:38         at org.elasticsearch.client.MachineLearningIT.testStopDatafeed(MachineLearningIT.java:715)
10:13:38 
10:13:38         Caused by:
10:13:38         ElasticsearchException[Elasticsearch exception [type=resource_already_exists_exception, reason=task with id {job-test-stop-datafeed1} already exist]]
10:13:38             at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
10:13:38             at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
10:13:38             at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
10:13:38             at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
10:13:38             at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:179)
10:13:38             ... 8 more
10:13:38 
10:13:38     java.lang.RuntimeException: Had to resort to force-closing jobs, something went wrong?
10:13:38         at org.elasticsearch.client.MlTestStateCleaner.closeAllJobs(MlTestStateCleaner.java:105)
10:13:38         at org.elasticsearch.client.MlTestStateCleaner.deleteAllJobs(MlTestStateCleaner.java:85)
10:13:38         at org.elasticsearch.client.MlTestStateCleaner.clearMlMetadata(MlTestStateCleaner.java:55)
10:13:38         at org.elasticsearch.client.MachineLearningIT.cleanUp(MachineLearningIT.java:226)
10:13:38 
10:13:38         Caused by:
10:13:38         java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-236 [ACTIVE]
10:13:38             at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:850)
10:13:38             at org.elasticsearch.client.RestClient.performRequest(RestClient.java:275)
10:13:38             at org.elasticsearch.client.RestClient.performRequest(RestClient.java:262)
10:13:38             at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1628)
10:13:38             at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1598)
10:13:38             at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1568)
10:13:38             at org.elasticsearch.client.MachineLearningClient.closeJob(MachineLearningClient.java:400)
10:13:38             at org.elasticsearch.client.MlTestStateCleaner.closeAllJobs(MlTestStateCleaner.java:96)
10:13:38             ... 3 more
10:13:38 
10:13:38             Caused by:
10:13:38             java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-236 [ACTIVE]
10:13:38                 at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
10:13:38                 at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
10:13:38                 at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
10:13:38                 at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
10:13:38                 at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
10:13:38                 at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
10:13:38                 at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
10:13:38                 at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
10:13:38                 at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
10:13:38                 at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
10:13:38                 at java.lang.Thread.run(Thread.java:748)

This resource_already_exists_exception looks suspicious:

10:13:38 org.elasticsearch.client.MachineLearningIT > testStopDatafeed FAILED
10:13:38     ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot open job [test-stop-datafeed1] because it has already been opened]]; nested: ElasticsearchException[Elasticsearch exception [type=resource_already_exists_exception, reason=task with id {job-test-stop-datafeed1} already exist]];
10:13:38         at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1907)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1884)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1641)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1598)
10:13:38         at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1568)
10:13:38         at org.elasticsearch.client.MachineLearningClient.openJob(MachineLearningClient.java:357)
10:13:38         at org.elasticsearch.client.MachineLearningIT.openJob(MachineLearningIT.java:2675)
10:13:38         at org.elasticsearch.client.MachineLearningIT.testStopDatafeed(MachineLearningIT.java:715)
10:13:38 
10:13:38         Caused by:
10:13:38         ElasticsearchException[Elasticsearch exception [type=resource_already_exists_exception, reason=task with id {job-test-stop-datafeed1} already exist]]
10:13:38             at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
10:13:38             at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
10:13:38             at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
10:13:38             at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
10:13:38             at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:179)
10:13:38             ... 8 more

hendrikmuhs added the >test-failure and :ml labels on Nov 2, 2020

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

I strongly suspect this is just a particular case of #58286. This message suggests that the cluster was running extremely slowly - it took over 60 seconds to send one message between nodes:

java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection

@benwtrent
Member

I am going to close this issue. It seems most likely that the test suffered slowness from the darwin test runner.

If this happens again, we should re-open AND include the cluster logs for the test run for more investigation.
