[CI] IndicesQueryCache.close assertion failure in internal cluster tests #37117
Pinging @elastic/ml-core
The assertion that's failing is in `server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java` at line 191 (commit 0000e7c).
Can anyone from @elastic/es-search offer any advice on what might cause this? Is it because ML is doing some operation on the index while it's being closed?
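For readers without the source open, the failing check is an emptiness assertion in `close()` of roughly this shape (a paraphrase with assumed names, not the exact code at that line):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Paraphrased sketch of the close-time check that is tripping.
class QueryCacheCloseSketch implements AutoCloseable {
    // Per-segment stats; eviction callbacks are expected to have
    // removed every entry before the cache is closed.
    private final Map<Object, Object> stats2 = new IdentityHashMap<>();

    @Override
    public void close() {
        // Fails if some segment's stats were never cleaned up, or if
        // their removal is not yet visible to the closing thread.
        assert stats2.isEmpty() : "stats2 not empty on close: " + stats2;
    }
}
```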
One more thing: many failures of this type seem to occur on the Zulu JVM. However, I found one that isn't: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/165/consoleText So there may be something about the Zulu JVM that makes these failures more likely to occur, but it's not the root cause of the problem.
There was a failure in
In this test a node is shut down, tripping the assertion; the failure doesn't kill the node but leaves it half running. Later there is this error message from the shut-down node:
The working theory is that some request hits the node during shutdown. I've added some debug logging around the likely candidates in 92b519b.
@jpountz could you take a look at the assertion failures in
and
Talking to @droberts195, he thinks these failures from today look related. It's a different test, though:
The logs contain similar failures:
I changed the title of the issue because the things that all these failures have in common are:
I spent all day trying to reproduce this, but I was unable to. The only thing I could think of that would cause it is related to what @droberts195 says about thread visibility: specifically, maybe making the count at `server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java` line 209 (commit 431c4fd) `volatile`. I'm curious if anyone else has any ideas.
Thanks for looking at this @dakrone. I agree that `volatile` looks like the most plausible explanation. I just had a look at the base class, and the equivalent counts there are already `volatile`.
So I guess the base class either had the same problem at one point or else its original author realised the potential problem with visibility of changes. Out of all the thousands of CI builds that run we only seem to get a failure of this assertion about twice a week. None of the 3 tests that have failed because of it have been muted, so it seems like the best way forward is just to make that `count` volatile. One more thing though: can somebody who is more familiar with this code confirm that the methods that update the count are only ever called while holding the cache's lock?
The super methods of both assert that the current thread holds the cache's lock, so I think the behaviour here is fine. +1 to making `count` volatile.
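Putting those two observations together, the pattern under discussion looks roughly like the sketch below: the count is only ever written while holding the cache's lock, but the close-time assertion reads it from another thread without the lock, so without `volatile` that read has no visibility guarantee. All names here are illustrative, not the actual `IndicesQueryCache` code.

```java
// Illustrative sketch of the visibility problem and the volatile fix.
class StatsAndCountSketch {
    private final Object lock = new Object();

    // Before the fix this was a plain int: all writes happen under
    // `lock`, but close() reads it from another thread without the
    // lock, so the read has no visibility guarantee. Declaring the
    // field volatile fixes that.
    private volatile int count;

    private void updateCount(int delta) {
        // The super methods in the real code assert exactly this:
        // updates only ever happen while the cache's lock is held.
        assert Thread.holdsLock(lock);
        count += delta;
    }

    void onCache() {
        synchronized (lock) {
            updateCount(1);
        }
    }

    void onEviction() {
        synchronized (lock) {
            updateCount(-1);
        }
    }

    void close() {
        // Runs on a different thread, without taking the lock.
        assert count == 0 : "count was " + count;
    }
}
```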
I opened #38714 to do this.
The assertion that the `stats2` map is empty in `IndicesQueryCache.close` has been observed to fail very occasionally in internal cluster tests. The likely cause is a cross-thread visibility problem for a count variable. This change makes that count `volatile`. Relates #37117
@droberts195 thanks for opening the PR. I'm going to close this and assume your fix worked (so we don't forget to resolve this). We can always re-open it if we see more of this failure.
No worries at all @droberts195. I feel like we have an actual bug that the current behavior of
The ML memory tracker does searches against ML results and config indices. These searches can be asynchronous, and if they are running while the node is closing then they can cause problems for other components. This change adds a `stop()` method to the `MlMemoryTracker` that waits for in-flight searches to complete. Once `stop()` has returned, the `MlMemoryTracker` will not kick off any new searches. The `MlLifeCycleService` now calls `MlMemoryTracker.stop()` before stopping the node. Fixes #37117
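The fix as described waits for in-flight searches before letting shutdown proceed. Below is a minimal sketch of one way to implement that kind of wait, using a `Phaser`, which suits this because parties can register and deregister dynamically as searches start and finish. All names are illustrative; this is not the actual `MlMemoryTracker` code.

```java
import java.util.concurrent.Phaser;

// Illustrative sketch only: tracks in-flight async searches so that
// stop() can block until they finish and reject any new ones.
class InFlightSearchTracker {
    // Phaser with one registered "owner" party; each in-flight search
    // registers itself and arrives/deregisters when it completes.
    private final Phaser inFlight = new Phaser(1);
    private volatile boolean stopped = false;

    /** Called before kicking off an async search; false means stopping. */
    boolean tryStartSearch() {
        if (stopped) {
            return false; // no new searches once stop() has been called
        }
        inFlight.register();
        if (stopped) { // re-check to close the race with stop()
            inFlight.arriveAndDeregister();
            return false;
        }
        return true;
    }

    /** Called from the async search's completion or failure handler. */
    void searchFinished() {
        inFlight.arriveAndDeregister();
    }

    /** Blocks until all in-flight searches complete. */
    void stop() {
        stopped = true;
        // The owner party arrives and waits for all registered searches.
        inFlight.arriveAndAwaitAdvance();
    }
}
```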
This happened again, in a build that included my change to stop the ML memory tracker as part of node shutdown: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/1179/consoleText
The exception has moved on from previous occurrences; now it says there is no master node:
The log confirms that the ML memory tracker is stopped before this happens:
build-stats link
This test is already muted in master (see #37117), backporting this mute to 7.0 as it's failing there as well.
This commit adds the following:
- more tests to IndicesServiceCloseTests; one of them found a bug in the order in which `IndicesQueryCache#onClose` and `IndicesService.indicesRefCount#decRef` are called.
- made `IndicesQueryCache.stats2` a synchronized map. All writes to it are already protected by the lock of the Lucene cache, but the final read from an assertion in `IndicesQueryCache#close()` was not, so this change should avoid any potential visibility issues.
- human-readable `toString`s to make debugging easier.

Relates #37117
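As a rough before/after illustration of the synchronized-map part of that change (the field and key types are assumptions for the sketch, not copied from the real class):

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

class StatsMapFixSketch {
    // Before: a plain map. Writes were guarded by the Lucene cache's
    // lock, but the unlocked read in close()'s assertion was not.
    // private final Map<Object, Object> stats2 = new IdentityHashMap<>();

    // After: a synchronized wrapper, so the close-time read synchronizes
    // with prior writes and cannot observe a stale, non-empty view.
    private final Map<Object, Object> stats2 =
            Collections.synchronizedMap(new IdentityHashMap<>());
}
```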
Since the fixes of #39099 21 days ago the build stats link only shows one failure on 7.x and above. (It has failed a few more times on 6.6 and 6.7, but they don't have that fix.) That one failure was in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8fips,nodes=immutable&&linux&&docker/82/, which ran on a FIPS JVM; a number of other tests related to splitting indices also failed in that build, so there may be some interaction there. So it looks like the combination of #39111 and #39099 has fixed the problems that led to this issue being raised. #39111 is in 6.6.2 and above; #39099 is in 7.1.0 and above.
The timing suggests these failures are related to the change moving the ML config out of the cluster state.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/152/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/149/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+periodic/57/console
The log contains an assertion failure:
Then the suite fails with a thread leak error.
thread-leak.txt
Does not reproduce: