
JDK G1 bug crashes with references [in]to jdk.internal.vm.FillerArray, when upgrading to 8.13.0 or 8.13.1 #106987

Closed
ChrisHegarty opened this issue Apr 2, 2024 · 28 comments

@ChrisHegarty
Contributor

After upgrading Elasticsearch from 8.12.2 to 8.13.0, we see random node failures with the following message:

[2024-03-31T00:01:29,450][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [xxx] fatal error in thread [elasticsearch[xxx][write][T#7]], exiting
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.concurrent.locks.Lock
at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:43) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.index.translog.Translog.add(Translog.java:578) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1223) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1072) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:997) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:915) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:378) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:235) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:305) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:151) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:79) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:216) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1570) ~[?:?]

It happens intermittently on all nodes, and the service stops after this.

Looking into the logs, the exception seems to happen for different tasks (the first one was a refresh and this one is a write operation):

[2024-04-01T15:21:46,691][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [xxx] fatal error in thread [elasticsearch[xxx][refresh][T#2]], exiting
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Collection
	at org.apache.lucene.index.ReadersAndUpdates.getNumDVUpdates(ReadersAndUpdates.java:168) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.ReaderPool.anyDocValuesChanges(ReaderPool.java:384) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:5776) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:455) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.FilterDirectoryReader.isCurrent(FilterDirectoryReader.java:133) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.index.engine.Engine.refreshNeeded(Engine.java:1093) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:3919) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:3915) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:998) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1134) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:137) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) ~[?:?]
ChrisHegarty added the :Core/Infra/Core, jvm, bug, and Team:Core/Infra labels Apr 2, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ChrisHegarty
Contributor Author

The following JVM options are set. The JVM is the bundled one, and jvm.options has these:
-Xmx28g
-Xms28g
-XX:+UseG1GC
--add-modules=jdk.incubator.vector
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:HeapDumpPath=data
-XX:ErrorFile=logs/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m
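
For anyone comparing setups: a quick way to dump the flags a running node actually picked up is jcmd (a sketch; replace <pid> with the Elasticsearch process id):

jcmd <pid> VM.flags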

@ChrisHegarty
Contributor Author

ChrisHegarty commented Apr 2, 2024

Sometimes this brings down the cluster, sometimes the cluster appears to recover. It probably depends on exactly where this exception happens.

Here are some snippets of stacktraces that we see:

java.lang.ClassCastException: class Ljdk.internal.vm.FillerArray; cannot be cast to class
  java.nio.ByteBuffer (Ljdk.internal.vm.FillerArray; and java.nio.ByteBuffer are in module java.base of loader 'bootstrap') 
  at [email protected]/io.netty.buffer.PoolChunk.allocate(PoolChunk.java:354)  
  at [email protected]/io.netty.buffer.PoolChunkList.allocate(PoolChunkList.java:108) 
  at [email protected]/io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204)
...
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the
  requested interface java.util.concurrent.locks.Lock
  at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:43) ~[elasticsearch-8.13.0.jar:?]
  at org.elasticsearch.index.translog.Translog.add(Translog.java:578) ~[elasticsearch-8.13.0.jar:?]
  at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1223) ~[elasticsearch-8.13.0.jar:?]
  at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1072) ~[elasticsearch-8.13.0.jar:?]
  ...

In one particular case, I see hundreds of these, all appearing around the same time:

java.lang.ClassCastException: class Ljdk.internal.vm.FillerArray; cannot be cast to class 
  org.elasticsearch.index.engine.LiveVersionMap$VersionLookup (Ljdk.internal.vm.FillerArray; is in module java.base of loader 'bootstrap'; org.elasticsearch.index.engine.LiveVersionMap$VersionLookup is in module [email protected] of loader 'app')
  at [email protected]/co.elastic.elasticsearch.stateless.engine.StatelessLiveVersionMapArchive.getRamBytesUsed(StatelessLiveVersionMapArchive.java:156)
  at [email protected]/org.elasticsearch.index.engine.LiveVersionMap.ramBytesUsedForRefresh(LiveVersionMap.java:483)
  at [email protected]/org.elasticsearch.index.engine.InternalEngine.getIndexBufferRAMBytesUsed(InternalEngine.java:2573)
  at [email protected]/org.elasticsearch.index.shard.IndexShard.getIndexBufferRAMBytesUsed(IndexShard.java:2355)
...
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the
  requested interface java.util.Map
  at [email protected]/org.apache.lucene.index.ReadersAndUpdates.getNumDVUpdates(ReadersAndUpdates.java:168)
  at [email protected]/org.apache.lucene.index.ReaderPool.anyDocValuesChanges(ReaderPool.java:384)
  at [email protected]/org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:5776)
  at [email protected]/org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:455)
  at [email protected]/org.apache.lucene.index.FilterDirectoryReader.isCurrent(FilterDirectoryReader.java:133)
  at [email protected]/org.elasticsearch.index.engine.Engine.refreshNeeded(Engine.java:1093)
  ...

@ChrisHegarty
Copy link
Contributor Author

Linking the JDK issue: https://bugs.openjdk.org/browse/JDK-8329528

ChrisHegarty changed the title from "ElasticsearchUncaughtExceptionHandler exception after upgrading to 8.13.0" to "java.lang.ClassCastException: class Ljdk.internal.vm.FillerArray; cannot be cast to class, after upgrading to 8.13.0" Apr 2, 2024
@aydasraf

aydasraf commented Apr 3, 2024

We are seeing the same behavior of crashes and restarts; we upgraded from 8.9.0.

[2024-04-03T08:08:49,474][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [prod-elasticsearch-hot-tier-2] fatal error in thread [elasticsearch[prod-elasticsearch-hot-tier-2][refresh][T#7]], exiting
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Map
	at org.apache.lucene.index.ReadersAndUpdates.getNumDVUpdates(ReadersAndUpdates.java:168) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.ReaderPool.anyDocValuesChanges(ReaderPool.java:384) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:5776) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:455) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.FilterDirectoryReader.isCurrent(FilterDirectoryReader.java:133) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.index.engine.Engine.refreshNeeded(Engine.java:1093) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:3919) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:3915) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:998) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1134) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:137) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) ~[?:?]

Any known workarounds that may reduce the impact?

@ldematte
Contributor

ldematte commented Apr 4, 2024

Since this seems very likely to be a JDK issue: as a workaround, where possible (i.e. self-hosted clusters), would it make sense to use a local JDK 21 installation in place of the bundled JDK 22?
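
For example, for an archive install, something along these lines should do it (a sketch; ES_JAVA_HOME is the documented override, and the JDK path below is just an example):

# Point Elasticsearch at a locally installed JDK 21 instead of the bundled JDK 22
export ES_JAVA_HOME=/usr/lib/jvm/jdk-21.0.2
bin/elasticsearch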

@aydasraf

aydasraf commented Apr 4, 2024

@ldematte Thank you for getting back to me; this is in fact what I did: I built a custom Docker image with JDK 21.0.3 (beta), which is the release that has a potential fix (https://bugs.openjdk.org/browse/JDK-8319548). Still observing the outcome.

@ChrisHegarty
Contributor Author

@ldematte Thank you for getting back to me; this is in fact what I did: I built a custom Docker image with JDK 21.0.3 (beta), which is the release that has a potential fix (https://bugs.openjdk.org/browse/JDK-8319548). Still observing the outcome.

Ah this is very interesting.

To confirm: the issue is still happening even with JDK 21.0.3, correct? To be precise, since there are multiple JDK vendors, can you please post the output of java -version for this JDK?
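
For example, against a bundled JDK it would be something like this (the path is illustrative and varies by install type):

/usr/share/elasticsearch/jdk/bin/java -version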

Additionally, can you please post the stack traces, even if they look the same (sometimes there are small differences, and also differences in the failure sites).

@aydasraf

aydasraf commented Apr 4, 2024

@ChrisHegarty, We are currently observing whether this fix prevents the crash from happening. No crashes so far, but that is due to minimal load on the cluster; peak load is about to start, and midday is generally where things go nasty. So I will keep you posted on whether it works or breaks.

I used an Adoptium nightly build:

openjdk version "21.0.3-beta" 2024-04-16
OpenJDK Runtime Environment Temurin-21.0.3+7-202403202002 (build 21.0.3-beta+7-ea)
OpenJDK 64-Bit Server VM Temurin-21.0.3+7-202403202002 (build 21.0.3-beta+7-ea, mixed mode, sharing)

If new traces appear, I will post them here as well.

@romain-chanu

romain-chanu commented Apr 5, 2024

We have seen a similar stack trace happening through the pruneDeletedTombstones method:

java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Map
	at org.elasticsearch.index.engine.LiveVersionMap.pruneTombstones(LiveVersionMap.java:437) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.pruneDeletedTombstones(InternalEngine.java:2378) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.maybePruneDeletes(InternalEngine.java:1924) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:3940) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:3915) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:998) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1134) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:137) ~[elasticsearch-8.13.1.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) ~[?:?]

@romain-chanu

romain-chanu commented Apr 5, 2024

Other stack traces that may have led to data corruption (corrupted Lucene segment files):

java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Map
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:353) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.CodecReader.terms(CodecReader.java:132) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:415) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.<init>(PerThreadIDVersionAndSeqNoLookup.java:68) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.<init>(PerThreadIDVersionAndSeqNoLookup.java:111) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.lucene.uid.VersionsAndSeqNoResolver.getLookupState(VersionsAndSeqNoResolver.java:66) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.lucene.uid.VersionsAndSeqNoResolver.timeSeriesLoadDocIdAndVersion(VersionsAndSeqNoResolver.java:140) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.resolveDocVersion(InternalEngine.java:1021) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1333) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1310) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1172) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1072) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:997) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:915) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:378) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:235) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:305) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:151) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:79) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:216) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) ~[?:?]
java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Collection
	at org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:554) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.ReaderPool.writeAllDocValuesUpdates(ReaderPool.java:251) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3982) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:48) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:27) ~[elasticsearch-8.13.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:461) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:441) ~[elasticsearch-8.13.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.10.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-9.10.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2047) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$maybeRefresh$8(InternalEngine.java:2020) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:270) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:2020) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:3935) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:3915) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:998) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1134) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:137) ~[elasticsearch-8.13.0.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) ~[?:?]

@ChrisHegarty
Contributor Author

ChrisHegarty commented Apr 5, 2024

It seems possible that the Elasticsearch benchmarks are running into the same underlying JDK 22 bug. [EDIT: removed inaccessible link]

@tschatzl

tschatzl commented Apr 5, 2024

@ChrisHegarty: the above link for an elasticsearch-benchmarks repo does not work (404) and I can't find the correct repo myself. Could you fix it?

@ChrisHegarty
Contributor Author

ChrisHegarty commented Apr 5, 2024

the above link does not work and I can't find the correct repo myself. Could you fix it?

Unfortunately (my mistake), the aforementioned link is not public, sorry. There is little new information there anyway. What I did find there is an interesting hs_err_pidxxxxx.log, which I subsequently attached to the OpenJDK Jira issue.

Additionally, the fact that the crash was observed shortly after the upgrade helps us confirm that it is indeed specific to JDK 22.

@aydasraf

aydasraf commented Apr 5, 2024

@ChrisHegarty I can confirm that moving to JDK 21.0.3 solves the issue, and actually gives much better and more stable performance. JDK 22 is just nasty.

@ChrisHegarty
Contributor Author

ChrisHegarty commented Apr 5, 2024

I can confirm that moving to JDK 21.0.3 solves the issue

Thanks for confirming that a downgrade of the JDK (from 22 to 21.x) avoids the issue. I want to note that 21.0.3 is currently in Early Access (not yet GA'ed). For Elasticsearch, we're planning on downgrading (back) to JDK 21.0.2.

and actually gives much better and more stable performance. JDK 22 is just nasty

Yes, this is indeed a nasty bug. Its likely impact is much wider than Elastic.

ChrisHegarty changed the title from "java.lang.ClassCastException: class Ljdk.internal.vm.FillerArray; cannot be cast to class, after upgrading to 8.13.0" to "Elasticsearch 8.13 encounters a JDK G1 bug and crashes with references [in]to jdk.internal.vm.FillerArray" Apr 5, 2024
ChrisHegarty changed the title from "Elasticsearch 8.13 encounters a JDK G1 bug and crashes with references [in]to jdk.internal.vm.FillerArray" to "JDK G1 bug crashes with references [in]to jdk.internal.vm.FillerArray, when upgrading to 8.13.0 or 8.13.1" Apr 5, 2024
@jesslm

jesslm commented Apr 10, 2024

Hi, team! Did the downgrade back to JDK 21.0.2 happen in 8.13.2?

@aydasraf

aydasraf commented Apr 10, 2024

Hi, team! Did the downgrade back to JDK 21.0.2 happen in 8.13.2?

@jesslm, yes, the Docker images of Elasticsearch 8.13.2 were downgraded to JDK 21.0.2.

@jesslm

jesslm commented Apr 10, 2024

What about for Elasticsearch Service?

rjernst added a commit to rjernst/elasticsearch that referenced this issue May 15, 2024
This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates elastic#108571
relates elastic#106987
rjernst added a commit that referenced this issue May 15, 2024
This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates #108571
relates #106987
rjernst added a commit to rjernst/elasticsearch that referenced this issue May 15, 2024
This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates elastic#108571
relates elastic#106987
rjernst added a commit to rjernst/elasticsearch that referenced this issue May 15, 2024
This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates elastic#108571
relates elastic#106987
elasticsearchmachine pushed a commit that referenced this issue May 15, 2024
This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates #108571
relates #106987
elasticsearchmachine pushed a commit that referenced this issue May 16, 2024
* Update bundled JDK to Java 22 (again) (#108654)

This commit re-bumps the bundled JDK to Java 22 now that we have
a tested workaround for the G1GC bug
(https://bugs.openjdk.org/browse/JDK-8329528).

relates #108571
relates #106987

* copy main openjdk toolchain resolver

* use 2 lines for workaround

* fix test

* update adoptium test
vitam-prg pushed a commit to ProgrammeVitam/vitam that referenced this issue May 24, 2024
Story #12345: Ultimate COTS upgrade II

* Upgrade MongoDB 7.0.7 -> 7.0.8
* Upgrade ElasticSearch 7.17.19 -> 7.17.20
  * Resolve issue: elastic/elasticsearch#106987
* Upgrade Prometheus & Exporters

See merge request vitam/vitam!10009
@panthony

panthony commented Jul 31, 2024

Looks like this issue was reintroduced in later 7.17.x by #108654

I have a cluster on 7.17.22 that randomly crashed with:

java.lang.IncompatibleClassChangeError: Class Ljdk.internal.vm.FillerArray; does not implement the requested interface java.util.Collection
	at java.util.Collections$UnmodifiableCollection.stream(Collections.java:1131) ~[?:?]
	at org.elasticsearch.index.seqno.ReplicationTracker.getRetentionLeases(ReplicationTracker.java:250) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.getRetentionLeases(IndexShard.java:2638) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.syncRetentionLeases(IndexShard.java:2756) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.IndexService.lambda$sync$19(IndexService.java:967) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.lambda$runUnderPrimaryPermit$26(IndexShard.java:3496) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$23(IndexShard.java:3450) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:253) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:199) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3421) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3409) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.shard.IndexShard.runUnderPrimaryPermit(IndexShard.java:3499) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.IndexService.sync(IndexService.java:967) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.IndexService.syncRetentionLeases(IndexService.java:951) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.IndexService.access$900(IndexService.java:102) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.index.IndexService$AsyncRetentionLeaseSyncTask.runInternal(IndexService.java:1141) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:133) ~[elasticsearch-7.17.22.jar:7.17.22]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.17.22.jar:7.17.22]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) [?:?]

@ldematte
Contributor

#108654 made its way into 7.17.22, so if your cluster is on 7.17.20, it cannot possibly be it.
Also, 7.17.20 downgraded the JDK to version 21.0.2, the last known version that was not affected by the bug.
Are you running with the bundled JDK, or using your own Java version?
If the former, can you please double check the ES version?
If the latter, can you check your Java version, and ensure it is not one of those affected by https://bugs.openjdk.org/browse/JDK-8329528?

@panthony

panthony commented Jul 31, 2024

@ldematte My apologies, it's a typo; it's indeed 7.17.22:

{
  "name" : "xx",
  "cluster_name" : "xx",
  "cluster_uuid" : "P_JgtuvtRFSDKbuk-JdbaQ",
  "version" : {
    "number" : "7.17.22",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "38e9ca2e81304a821c50862dafab089ca863944b",
    "build_date" : "2024-06-06T07:35:17.876121680Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.3",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

And I do use the bundled Java version, which is:

/usr/share/elasticsearch/jdk/bin/java --version
openjdk 22.0.1 2024-04-16
OpenJDK Runtime Environment (build 22.0.1+8-16)
OpenJDK 64-Bit Server VM (build 22.0.1+8-16, mixed mode, sharing)

@ldematte
Contributor

ldematte commented Aug 1, 2024

This is very strange :/
I see the workaround was backported to 7.17 too (#108631) as well as the re-upgrade to JDK 22 (#108689)
@ChrisHegarty can you think of anything that can explain this?

@ldematte
Copy link
Contributor

ldematte commented Aug 1, 2024

@panthony can you verify in the ES logs that you can see -XX:+UnlockDiagnosticVMOptions -XX:G1NumCollectionsKeepPinned=10000000 in the Java options (they should appear very early in the logs after startup).
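
For example (a sketch; the log path varies by distribution and configuration):

grep -m1 "G1NumCollectionsKeepPinned" /var/log/elasticsearch/*.log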

@ChrisHegarty
Contributor Author

This is very strange :/ I see the workaround was backported to 7.17 too (#108631) as well as the re-upgrade to JDK 22 (#108689) @ChrisHegarty can you think of anything that can explain this?

I cannot. This issue should not be present when either:

  1. on a release < JDK 22.0.2 with the correct JVM flags set (as above), OR
  2. on a release >= JDK 22.0.2.

@panthony

panthony commented Aug 1, 2024

@ldematte If the log is supposed to be somewhere in "/var/log/elasticsearch/", it's nowhere to be found.

Edit:

I do not see this change on the VM where Elasticsearch is deployed:

https://github.com/elastic/elasticsearch/pull/108631/files#diff-93b9226e55b0c23873222857eac0940b5d8ae09d28d3bbf1a55e6d8a73133ba7

The file /etc/elasticsearch/jvm.options ends with:

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

I'll try to see why, thanks for your help.

Edit 2:

FYI, the original file from ES had been replaced by another version with slight tweaks in it; when ES was upgraded for security fixes, no diff was made on this file to check for important changes. 🤦🏻

Edit 3:

For the sake of completeness, the actual fix is:

6f20cba#diff-93b9226e55b0c23873222857eac0940b5d8ae09d28d3bbf1a55e6d8a73133ba7

When the two options are set on a single line, it crashes with UnlockDiagnosticVMOptions -XX:G1NumCollectionsKeepPinned=10000000
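
For reference, a sketch of how the stanza should look in jvm.options, one option per line (assuming the 22: version prefix, so the flags are applied only when running on JDK 22):

22:-XX:+UnlockDiagnosticVMOptions
22:-XX:G1NumCollectionsKeepPinned=10000000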

@ldematte
Contributor

ldematte commented Aug 2, 2024

Thanks @panthony for the update!
It seems like you found the root cause indeed; changes to these configuration files are always somewhat risky, since they can break things like in this case, and we have no reasonable way to "merge" them.

Btw, log location changes based on configuration, distribution, etc.
Here you can find a summary of where to expect them by default: https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html
Or even better, you can call _nodes/settings?pretty=true and look at path.logs
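
For example (assuming a node reachable on localhost:9200):

curl -s 'http://localhost:9200/_nodes/settings?pretty=true&filter_path=nodes.*.settings.path.logs'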
