[CI] IndicesQueryCache.close assertion failure in internal cluster tests #37117
Pinging @elastic/ml-core
The assertion that's failing is in `server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java` at line 191 (commit 0000e7c).
Can anyone from @elastic/es-search offer any advice on what might cause this? Is it because ML is doing some operation on the index while it's being closed?
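For readers without the source open, the failing check is an emptiness assertion in `close()` of roughly this shape (a paraphrase with assumed names, not the exact code at that line):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Paraphrased sketch of the close-time check that is tripping.
class QueryCacheCloseSketch implements AutoCloseable {
    // Per-segment stats; eviction callbacks are expected to have
    // removed every entry before the cache is closed.
    private final Map<Object, Object> stats2 = new IdentityHashMap<>();

    @Override
    public void close() {
        // Fails if some segment's stats were never cleaned up, or if
        // their removal is not yet visible to the closing thread.
        assert stats2.isEmpty() : "stats2 not empty on close: " + stats2;
    }
}
```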
One more thing: many failures of this type seem to occur on the Zulu JVM. However, I found one that isn't: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/165/consoleText So there may be something about the Zulu JVM that makes these failures more likely to occur, but it's not the root cause of the problem.
There was a failure in
In this test a node is shut down, tripping the assertion; the failure doesn't kill the node but leaves it half running. Later there is this error message from the shut-down node:
The working theory is that some request hits the node during shutdown. I've added some debug logging around the likely candidates in 92b519b.
@jpountz could you take a look at the assertion failures in
and
Talking to @droberts195, he thinks these failures from today look related. It's a different test, though:
The logs contain similar failures:
I changed the title of the issue because the things that all these failures have in common are:
I spent all day trying to reproduce this, but I was unable to. The only thing I could think of that would cause it is related to what @droberts195 says about thread visibility: specifically, maybe making the count at `server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java` line 209 (commit 431c4fd) `volatile`. I'm curious if anyone else has any ideas.
Thanks for looking at this @dakrone. I agree that `volatile` looks like the most plausible explanation. I just had a look at the base class, and the equivalent counts there are already `volatile`.
So I guess the base class either had the same problem at one point or else its original author realised the potential problem with visibility of changes. Out of all the thousands of CI builds that run we only seem to get a failure of this assertion about twice a week. None of the 3 tests that have failed because of it have been muted, so it seems like the best way forward is just to make that `count` volatile. One more thing though: can somebody who is more familiar with this code confirm that the methods that update the count are only ever called while holding the cache's lock?
The super methods of both assert that the current thread holds the cache's lock, so I think the behaviour here is fine. +1 to making `count` volatile.
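Putting those two observations together, the pattern under discussion looks roughly like the sketch below: the count is only ever written while holding the cache's lock, but the close-time assertion reads it from another thread without the lock, so without `volatile` that read has no visibility guarantee. All names here are illustrative, not the actual `IndicesQueryCache` code.

```java
// Illustrative sketch of the visibility problem and the volatile fix.
class StatsAndCountSketch {
    private final Object lock = new Object();

    // Before the fix this was a plain int: all writes happen under
    // `lock`, but close() reads it from another thread without the
    // lock, so the read has no visibility guarantee. Declaring the
    // field volatile fixes that.
    private volatile int count;

    private void updateCount(int delta) {
        // The super methods in the real code assert exactly this:
        // updates only ever happen while the cache's lock is held.
        assert Thread.holdsLock(lock);
        count += delta;
    }

    void onCache() {
        synchronized (lock) {
            updateCount(1);
        }
    }

    void onEviction() {
        synchronized (lock) {
            updateCount(-1);
        }
    }

    void close() {
        // Runs on a different thread, without taking the lock.
        assert count == 0 : "count was " + count;
    }
}
```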
I opened #38714 to do this.
The assertion that the `stats2` map is empty in `IndicesQueryCache.close` has been observed to fail very occasionally in internal cluster tests. The likely cause is a cross-thread visibility problem for a count variable. This change makes that count `volatile`. Relates #37117
@droberts195 thanks for opening the PR. I'm going to close this and assume your fix worked (so we don't forget to resolve this). We can always re-open it if we see more of this failure.
No worries at all @droberts195. I feel like we have an actual bug that the current behavior of
The ML memory tracker does searches against ML results and config indices. These searches can be asynchronous, and if they are running while the node is closing then they can cause problems for other components. This change adds a `stop()` method to the `MlMemoryTracker` that waits for in-flight searches to complete. Once `stop()` has returned, the `MlMemoryTracker` will not kick off any new searches. The `MlLifeCycleService` now calls `MlMemoryTracker.stop()` before stopping the node. Fixes #37117
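The fix as described waits for in-flight searches before letting shutdown proceed. Below is a minimal sketch of one way to implement that kind of wait, using a `Phaser`, which suits this because parties can register and deregister dynamically as searches start and finish. All names are illustrative; this is not the actual `MlMemoryTracker` code.

```java
import java.util.concurrent.Phaser;

// Illustrative sketch only: tracks in-flight async searches so that
// stop() can block until they finish and reject any new ones.
class InFlightSearchTracker {
    // Phaser with one registered "owner" party; each in-flight search
    // registers itself and arrives/deregisters when it completes.
    private final Phaser inFlight = new Phaser(1);
    private volatile boolean stopped = false;

    /** Called before kicking off an async search; false means stopping. */
    boolean tryStartSearch() {
        if (stopped) {
            return false; // no new searches once stop() has been called
        }
        inFlight.register();
        if (stopped) { // re-check to close the race with stop()
            inFlight.arriveAndDeregister();
            return false;
        }
        return true;
    }

    /** Called from the async search's completion or failure handler. */
    void searchFinished() {
        inFlight.arriveAndDeregister();
    }

    /** Blocks until all in-flight searches complete. */
    void stop() {
        stopped = true;
        // The owner party arrives and waits for all registered searches.
        inFlight.arriveAndAwaitAdvance();
    }
}
```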
This happened again, in a build that included my change to stop the ML memory tracker as part of node shutdown: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/1179/consoleText
The exception has moved on from previous occurrences; now it says there is no master node:
The log confirms that the ML memory tracker is stopped before this happens:
build-stats link
This test is already muted in master (see #37117), backporting this mute to 7.0 as it's failing there as well.
This commit adds the following:
- more tests to IndicesServiceCloseTests; one of them found a bug in the order in which `IndicesQueryCache#onClose` and `IndicesService.indicesRefCount#decRef` are called.
- made `IndicesQueryCache.stats2` a synchronized map. All writes to it are already protected by the lock of the Lucene cache, but the final read from an assertion in `IndicesQueryCache#close()` was not, so this change should avoid any potential visibility issues.
- human-readable `toString`s to make debugging easier.

Relates #37117
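As a rough before/after illustration of the synchronized-map part of that change (the field and key types are assumptions for the sketch, not copied from the real class):

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

class StatsMapFixSketch {
    // Before: a plain map. Writes were guarded by the Lucene cache's
    // lock, but the unlocked read in close()'s assertion was not.
    // private final Map<Object, Object> stats2 = new IdentityHashMap<>();

    // After: a synchronized wrapper, so the close-time read synchronizes
    // with prior writes and cannot observe a stale, non-empty view.
    private final Map<Object, Object> stats2 =
            Collections.synchronizedMap(new IdentityHashMap<>());
}
```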
Since the fixes of #39099 21 days ago the build stats link only shows one failure on 7.x and above. (It has failed a few more times on 6.6 and 6.7, but they don't have that fix.) That one failure was in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8fips,nodes=immutable&&linux&&docker/82/, which ran on a FIPS JVM; a number of other tests related to splitting indices also failed in that build, so there may be some interaction there. So it looks like the combination of #39111 and #39099 has fixed the problems that led to this issue being raised. #39111 is in 6.6.2 and above; #39099 is in 7.1.0 and above.
The timing suggests these failures are related to the change moving the ML config out of the cluster state.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/152/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/149/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+periodic/57/console
The log contains an assertion failure:
Then the suite fails with a thread leak error.
thread-leak.txt
Does not reproduce: