Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

Closed
tlrx opened this issue Mar 25, 2020 · 10 comments · Fixed by #54619
Closed

[CI] AsyncSearchActionTests.testCancellation fails on CI #54180

tlrx opened this issue Mar 25, 2020 · 10 comments · Fixed by #54619
Assignees
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@tlrx
Copy link
Member

tlrx commented Mar 25, 2020

The test AsyncSearchActionTests.testCancellation failed on CI on the 7.x branch:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCancellation FAILED
    java.lang.AssertionError: Shard [.async-search][0] is still locked after 5 sec waiting
        at __randomizedtesting.SeedInfo.seed([DD4D761E3C3EE063:B7A2968A281D7E5]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2483)
        at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:600)
        at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2143)

Build scan: https://gradle-enterprise.elastic.co/s/4ksqilghmk6ii

This is the first time this test fails on CI. I'm not sure this is really related to async search as the logs mentioned some version conflict exception:

[2020-03-25T03:22:06,439][ERROR][o.e.x.s.TransportSubmitAsyncSearchAction] [node_s0] failed to store async-search [FkdmWEtzRnhjUVZ1cE9ya090RUE1TVEaWWR1SU5FRV9RcTZuYTBqa0kyYVVDZzo1NTQ=]
  org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:63916][indices:data/write/update[s]]
  Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [GfXKsFxcQVupOrkOtEA5MQ]: version conflict, required seqNo [0], primary term [1]. but no document was found
       at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1065) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]

It does not reproduce locally with:

./gradlew ':x-pack:plugin:async-search:test' --tests "org.elasticsearch.xpack.search.AsyncSearchActionTests.testCancellation" \
  -Dtests.seed=DD4D761E3C3EE063 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-NI \
  -Dtests.timezone=Atlantic/Faroe \
  -Dcompiler.java=13

This also looks different from #53360.

@tlrx tlrx added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels Mar 25, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@javanna javanna self-assigned this Mar 25, 2020
@DaveCTurner
Copy link
Contributor

I saw another failure of this test. It looks like there's a problem during test cleanup where the .async-search index is deleted and then re-created:

1> [2020-03-26T21:29:32,687][INFO ][o.e.x.s.AsyncSearchActionTests] [testCancellation] [AsyncSearchActionTests#testCancellation]: cleaning up after test |  
-- | --
  | 1> [2020-03-26T21:29:32,736][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [test-async/4xCsJ1nESwa4J36TnnOulA] deleting index |  
  | 1> [2020-03-26T21:29:32,736][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.async-search/WeVwFd18SPWmCrHQx0qgSg] deleting index |  
  | 1> [2020-03-26T21:29:32,848][WARN ][o.e.d.c.m.MetaDataCreateIndexService] [node_s0] index name [.async-search] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices |  
  | 1> [2020-03-26T21:29:32,851][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_s0] [.async-search] creating index, cause [api], templates [random_index_template], shards [1]/[0], mappings [_doc] |  
  | 1> [2020-03-26T21:29:32,851][INFO ][o.e.c.r.a.AllocationService] [node_s0] updating number_of_replicas to [1] for indices [.async-search] |  
  | 1> [2020-03-26T21:29:32,876][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [random_index_template] |  
  | 1> [2020-03-26T21:29:32,894][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [ilm-history] |  
  | 1> [2020-03-26T21:29:32,931][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [ilm-history] for index patterns [ilm-history-2*] |  
  | 1> [2020-03-26T21:29:32,945][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [.slm-history] |  
  | 1> [2020-03-26T21:29:32,972][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [.slm-history] for index patterns [.slm-history-2*] |  
  | 1> [2020-03-26T21:29:33,068][INFO ][o.e.c.r.a.AllocationService] [node_s0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.async-search][0]]]). |  
  | 1> [2020-03-26T21:29:38,202][INFO ][o.e.x.s.AsyncSearchActionTests] [testCancellation] after test


DaveCTurner added a commit that referenced this issue Mar 26, 2020
DaveCTurner added a commit that referenced this issue Mar 26, 2020
DaveCTurner added a commit that referenced this issue Mar 26, 2020
@DaveCTurner
Copy link
Contributor

Muted in master, 7.x and 7.7: 1d3a8de f48e8f3 7491f7f

@mark-vieira
Copy link
Contributor

There are a number of failures that seem to be specific to our MacOS agents. They typically look like this:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCleanupOnFailure FAILED
    java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>
        at __randomizedtesting.SeedInfo.seed([4C0451DF11A87EF0:C71070CDB7AD6138]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.elasticsearch.test.ESIntegTestCase.lambda$clearClusters$2(ESIntegTestCase.java:551)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:917)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:890)
        at org.elasticsearch.test.ESIntegTestCase.clearClusters(ESIntegTestCase.java:549)
        at org.elasticsearch.test.ESIntegTestCase.buildAndPutCluster(ESIntegTestCase.java:523)
        at org.elasticsearch.test.ESIntegTestCase.beforeInternal(ESIntegTestCase.java:381)
        at org.elasticsearch.test.ESIntegTestCase.setupTestCluster(ESIntegTestCase.java:2192)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

This doesn't happen locally for me on my MBP so it might be infra related but I don't know enough about these tests to infer what that might be. Seems isolated to AsyncSearchActionTests.

Let me know if I should open a separate issue for this.

@javanna
Copy link
Member

javanna commented Mar 27, 2020

thanks for your comment @mark-vieira I will look into this.

@mark-vieira
Copy link
Contributor

Ah, I forgot to include a link to an example build. Here you are @javanna.

https://gradle-enterprise.elastic.co/s/msim2zdegwhzg

@benwtrent
Copy link
Member

benwtrent commented Mar 30, 2020

@jimczi
Copy link
Contributor

jimczi commented Mar 30, 2020

I seems that we have two types of failure here. One that fails with:

java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>

and another one that fails with:

org.elasticsearch.xpack.search.AsyncSearchActionTests > testCancellation FAILED
    java.lang.AssertionError: Shard [.async-search][0] is still locked after 5 sec waiting

They are maybe linked but the former is weird since it fails with a rest layer assertion on tests that should only use the transport client. I suspect that the leaked channels are coming from other tests that failed to cleanup their connections. @javanna were you able to reproduce ?

javanna added a commit to javanna/elasticsearch that referenced this issue Mar 31, 2020
`AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests`  is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent.

The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing.

This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels.

Relates to elastic#54180
@ywangd
Copy link
Member

ywangd commented Apr 1, 2020

I commented and re-openned #53360. But decided to move it here since this issue is more recent and active. I do want to emphasize that the errors are not just for testCancellation as it says in the title. In fact, the new failures do not even include testCancellation (I noticed it's muted). They are for the following test methods:

AsyncSearchActionTests classMethod
AsyncSearchActionTests testCleanupOnFailure
AsyncSearchActionTests testDeleteCancelRunningTask
AsyncSearchActionTests testDeleteCleanupIndex
AsyncSearchActionTests testInvalidId
AsyncSearchActionTests testMaxMinAggregation
AsyncSearchActionTests testNoIndex
AsyncSearchActionTests testRestartAfterCompletion
AsyncSearchActionTests testTermsAggregation

It started to fail for "darwin-compatibility" since March 27 for master, 7.x and 7.7. On CI, all tests of this class fail all the time for "darwin-compatibility". The trend can be seen here.

But I cannot reproduce locally:

./gradlew ':x-pack:plugin:async-search:test' --tests "org.elasticsearch.xpack.search.AsyncSearchActionTests.testNoIndex" \
  -Dtests.seed=49A4746586643385 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-MX \
  -Dtests.timezone=SystemV/MST7MDT \
  -Dcompiler.java=13

Sample error message

:x-pack:plugin:async-search:test » org.elasticsearch.xpack.search.AsyncSearchActionTests » testNoIndex (11.246s)
6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>
java.lang.AssertionError: 6 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<6>

Build scan:
https://gradle-enterprise.elastic.co/s/q4uaac6hl4jvu

Build log:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/613/console

Failure email notification:
https://groups.google.com/a/elastic.co/forum/?utm_medium=email&utm_source=footer#!topic/build-elasticsearch/vxu0iDmfzcg

@javanna
Copy link
Member

javanna commented Apr 1, 2020

Thanks @ywangd you are right that there isn't a proper issue for the failure that you saw. I opened a PR: #54520 . The reason why you don't see testCancellation fail is because it's muted due to its failures ;)

javanna added a commit that referenced this issue Apr 1, 2020
…sts (#54520)

`AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests`  is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent.

The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing.

This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels.

Relates to #54180
javanna added a commit that referenced this issue Apr 1, 2020
…sts (#54520)

`AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests`  is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent.

The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing.

This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels.

Relates to #54180
javanna added a commit that referenced this issue Apr 1, 2020
…sts (#54520)

`AsyncSearchActionTests` currently fails quite often. That is since the introduction of `RestSubmitAsyncSearchActionTests` which indirectly manipulates the channels being tracked in `RestCancellableNodeClient`. There are channels left in the map after `RestSubmitAsyncSearchActionTests`  is run, and later `AsyncSearchActionTests` checks that there are no channels in the map which makes each test method fail. This is particularily hard to reproduce as the order in which tests are run appears to be platform dependent.

The test cluster assertion that there are no channels in the map only makes sense in the context of internal cluster tests, while there may be collisions with unit tests that register http channels as part of their testing.

This can be solved by renaming `AsyncSearchActionTests` to `AsyncSearchActionIT`. This way it won't be run as part of unit tests but rather within another JVM where the number of channels is `0` and such assumption holds, because there are no expected manual manipulation of the channels.

Relates to #54180
javanna added a commit to javanna/elasticsearch that referenced this issue Apr 1, 2020
…initial response

We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation.

With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created.

Closes elastic#54180
javanna added a commit that referenced this issue Apr 10, 2020
…nse (#54619)

We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation.

With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created.

Closes #54180
javanna added a commit that referenced this issue Apr 10, 2020
…nse (#54619)

We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation.

With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created.

Closes #54180
javanna added a commit that referenced this issue Apr 10, 2020
…nse (#54619)

We currently create the .async-search index if necessary before performing any action (index, update or delete). Truth is that this is needed only before storing the initial response. The other operations are either update or delete, which will anyways not find the document to update/delete even if the index gets created when missing. This also caused `testCancellation` failures as we were trying to delete the document twice from the .async-search index, once from `TransportDeleteAsyncSearchAction` and once as a consequence of the search task being completed. The latter may be called after the test is completed, but before the cluster is shut down and causing problems to the after test checks, for instance if it happens after all the indices have been cleaned up. It is totally fine to try to delete a response that is no longer found, but not quite so if such call will also trigger an index creation.

With this commit we remove all the calls to createIndexIfNecessary from the update/delete operation, and we leave one call only from storeInitialResponse which is where the index is expected to be created.

Closes #54180
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants