[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

tlrx · 2020-03-25T12:59:49Z

The test AsyncSearchActionTests.testCancellation failed on CI on the 7.x branch:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCancellation FAILED
    java.lang.AssertionError: Shard [.async-search][0] is still locked after 5 sec waiting
        at __randomizedtesting.SeedInfo.seed([DD4D761E3C3EE063:B7A2968A281D7E5]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2483)
        at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:600)
        at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2143)

Build scan: https://gradle-enterprise.elastic.co/s/4ksqilghmk6ii

This is the first time this test fails on CI. I'm not sure this is really related to async search as the logs mentioned some version conflict exception:

[2020-03-25T03:22:06,439][ERROR][o.e.x.s.TransportSubmitAsyncSearchAction] [node_s0] failed to store async-search [FkdmWEtzRnhjUVZ1cE9ya090RUE1TVEaWWR1SU5FRV9RcTZuYTBqa0kyYVVDZzo1NTQ=]
  org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:63916][indices:data/write/update[s]]
  Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [GfXKsFxcQVupOrkOtEA5MQ]: version conflict, required seqNo [0], primary term [1]. but no document was found
       at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1065) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]

It does not reproduce locally with:

./gradlew ':x-pack:plugin:async-search:test' --tests "org.elasticsearch.xpack.search.AsyncSearchActionTests.testCancellation" \
  -Dtests.seed=DD4D761E3C3EE063 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-NI \
  -Dtests.timezone=Atlantic/Faroe \
  -Dcompiler.java=13

This also looks different from #53360.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-03-25T12:59:51Z

Pinging @elastic/es-search (:Search/Search)

DaveCTurner · 2020-03-26T15:19:54Z

I saw another failure of this test. It looks like there's a problem during test cleanup where the .async-search index is deleted and then re-created:

1> [2020-03-26T21:29:32,687][INFO ][o.e.x.s.AsyncSearchActionTests] [testCancellation] [AsyncSearchActionTests#testCancellation]: cleaning up after test |  
-- | --
  | 1> [2020-03-26T21:29:32,736][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [test-async/4xCsJ1nESwa4J36TnnOulA] deleting index |  
  | 1> [2020-03-26T21:29:32,736][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.async-search/WeVwFd18SPWmCrHQx0qgSg] deleting index |  
  | 1> [2020-03-26T21:29:32,848][WARN ][o.e.d.c.m.MetaDataCreateIndexService] [node_s0] index name [.async-search] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices |  
  | 1> [2020-03-26T21:29:32,851][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_s0] [.async-search] creating index, cause [api], templates [random_index_template], shards [1]/[0], mappings [_doc] |  
  | 1> [2020-03-26T21:29:32,851][INFO ][o.e.c.r.a.AllocationService] [node_s0] updating number_of_replicas to [1] for indices [.async-search] |  
  | 1> [2020-03-26T21:29:32,876][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [random_index_template] |  
  | 1> [2020-03-26T21:29:32,894][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [ilm-history] |  
  | 1> [2020-03-26T21:29:32,931][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [ilm-history] for index patterns [ilm-history-2*] |  
  | 1> [2020-03-26T21:29:32,945][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [.slm-history] |  
  | 1> [2020-03-26T21:29:32,972][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [.slm-history] for index patterns [.slm-history-2*] |  
  | 1> [2020-03-26T21:29:33,068][INFO ][o.e.c.r.a.AllocationService] [node_s0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.async-search][0]]]). |  
  | 1> [2020-03-26T21:29:38,202][INFO ][o.e.x.s.AsyncSearchActionTests] [testCancellation] after test

DaveCTurner · 2020-03-26T15:38:49Z

Muted in master, 7.x and 7.7: 1d3a8de f48e8f3 7491f7f

mark-vieira · 2020-03-27T15:58:59Z

There are a number of failures that seem to be specific to our MacOS agents. They typically look like this:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCleanupOnFailure FAILED
    java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>
        at __randomizedtesting.SeedInfo.seed([4C0451DF11A87EF0:C71070CDB7AD6138]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.elasticsearch.test.ESIntegTestCase.lambda$clearClusters$2(ESIntegTestCase.java:551)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:917)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:890)
        at org.elasticsearch.test.ESIntegTestCase.clearClusters(ESIntegTestCase.java:549)
        at org.elasticsearch.test.ESIntegTestCase.buildAndPutCluster(ESIntegTestCase.java:523)
        at org.elasticsearch.test.ESIntegTestCase.beforeInternal(ESIntegTestCase.java:381)
        at org.elasticsearch.test.ESIntegTestCase.setupTestCluster(ESIntegTestCase.java:2192)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

This doesn't happen locally for me on my MBP so it might be infra related but I don't know enough about these tests to infer what that might be. Seems isolated to AsyncSearchActionTests.

Let me know if I should open a separate issue for this.

javanna · 2020-03-27T16:04:26Z

thanks for your comment @mark-vieira I will look into this.

mark-vieira · 2020-03-27T16:18:34Z

Ah, I forgot to include a link to an example build. Here you are @javanna.

https://gradle-enterprise.elastic.co/s/msim2zdegwhzg

benwtrent · 2020-03-30T11:39:55Z

More MacOS failures like this comment:

https://gradle-enterprise.elastic.co/s/7rni7zuujym2m (7.x)
https://gradle-enterprise.elastic.co/s/wr5bjpe6hejig (7.7)

jimczi · 2020-03-30T11:45:21Z

I seems that we have two types of failure here. One that fails with:

java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>

and another one that fails with:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCancellation FAILED
    java.lang.AssertionError: Shard [.async-search][0] is still locked after 5 sec waiting

They are maybe linked but the former is weird since it fails with a rest layer assertion on tests that should only use the transport client. I suspect that the leaked channels are coming from other tests that failed to cleanup their connections. @javanna were you able to reproduce ?

`AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests` is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent. The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing. This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels. Relates to elastic#54180

ywangd · 2020-04-01T02:15:15Z

I commented and re-openned #53360. But decided to move it here since this issue is more recent and active. I do want to emphasize that the errors are not just for testCancellation as it says in the title. In fact, ~~the new failures do not even include testCancellation~~ (I noticed it's muted). They are for the following test methods:

AsyncSearchActionTests classMethod
AsyncSearchActionTests testCleanupOnFailure
AsyncSearchActionTests testDeleteCancelRunningTask
AsyncSearchActionTests testDeleteCleanupIndex
AsyncSearchActionTests testInvalidId
AsyncSearchActionTests testMaxMinAggregation
AsyncSearchActionTests testNoIndex
AsyncSearchActionTests testRestartAfterCompletion
AsyncSearchActionTests testTermsAggregation

It started to fail for "darwin-compatibility" since March 27 for master, 7.x and 7.7. On CI, all tests of this class fail all the time for "darwin-compatibility". The trend can be seen here.

But I cannot reproduce locally:

./gradlew ':x-pack:plugin:async-search:test' --tests "org.elasticsearch.xpack.search.AsyncSearchActionTests.testNoIndex" \
  -Dtests.seed=49A4746586643385 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-MX \
  -Dtests.timezone=SystemV/MST7MDT \
  -Dcompiler.java=13

Sample error message

:x-pack:plugin:async-search:test » org.elasticsearch.xpack.search.AsyncSearchActionTests » testNoIndex (11.246s)
6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>
java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>

Build scan:
https://gradle-enterprise.elastic.co/s/q4uaac6hl4jvu

Build log:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/613/console

Failure email notification:
https://groups.google.com/a/elastic.co/forum/?utm_medium=email&utm_source=footer#!topic/build-elasticsearch/vxu0iDmfzcg

javanna · 2020-04-01T07:55:43Z

Thanks @ywangd you are right that there isn't a proper issue for the failure that you saw. I opened a PR: #54520 . The reason why you don't see testCancellation fail is because it's muted due to its failures ;)

…sts (#54520) `AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests` is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent. The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing. This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels. Relates to #54180

…initial response We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation. With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created. Closes elastic#54180

…nse (#54619) We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation. With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created. Closes #54180

tlrx added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels Mar 25, 2020

javanna self-assigned this Mar 25, 2020

DaveCTurner added a commit that referenced this issue Mar 26, 2020

AwaitsFix for #54180

1d3a8de

DaveCTurner added a commit that referenced this issue Mar 26, 2020

AwaitsFix for #54180

f48e8f3

DaveCTurner added a commit that referenced this issue Mar 26, 2020

AwaitsFix for #54180

7491f7f

javanna mentioned this issue Mar 31, 2020

[TEST] rename AsyncSearchActionTests to IT and move it out of unit tests #54520

Merged

ywangd mentioned this issue Apr 1, 2020

[CI] AsyncSearchActionTests fails unpredictably #53360

Closed

javanna mentioned this issue Apr 1, 2020

Async search: create internal index only before storing initial response #54619

Merged

javanna closed this as completed in #54619 Apr 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

tlrx commented Mar 25, 2020

elasticmachine commented Mar 25, 2020

DaveCTurner commented Mar 26, 2020

DaveCTurner commented Mar 26, 2020

mark-vieira commented Mar 27, 2020

javanna commented Mar 27, 2020

mark-vieira commented Mar 27, 2020

benwtrent commented Mar 30, 2020 •

edited

Loading

jimczi commented Mar 30, 2020

ywangd commented Apr 1, 2020 •

edited

Loading

javanna commented Apr 1, 2020

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

Comments

tlrx commented Mar 25, 2020

elasticmachine commented Mar 25, 2020

DaveCTurner commented Mar 26, 2020

DaveCTurner commented Mar 26, 2020

mark-vieira commented Mar 27, 2020

javanna commented Mar 27, 2020

mark-vieira commented Mar 27, 2020

benwtrent commented Mar 30, 2020 • edited Loading

jimczi commented Mar 30, 2020

ywangd commented Apr 1, 2020 • edited Loading

javanna commented Apr 1, 2020

benwtrent commented Mar 30, 2020 •

edited

Loading

ywangd commented Apr 1, 2020 •

edited

Loading