Describe the bug
As part of rolling upgrades, we don't want to be locked to any version. Right now, in a mixed-version cluster where the primary sends segments written with a newer Lucene codec to its replicas, the replicas hit the error below, which causes a shard failure:
[2023-03-08T22:03:30,951][WARN ][o.o.i.c.IndicesClusterStateService] [node3] [my-index-000003][0] marking and sending shard failed due to [shard failure, reason [replication failure]]
org.opensearch.indices.replication.common.ReplicationFailedException: [my-index-000003][0]: Replication failed on (failed to clean after replication)
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:254) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:209) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.0.jar:2.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalArgumentException: Could not load codec 'Lucene95'. Did you forget to add lucene-backward-codecs.jar?
at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:515) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:404) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
... 26 more
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (a859c5b). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(SegmentInfos))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:500) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:370) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:209) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.5.0.jar:2.5.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.0.jar:2.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene95' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene94, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91, Lucene92]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.codecs.Codec.forName(Codec.java:118) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:511) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:404) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
... 26 more
To Reproduce
Steps to reproduce the behavior:
Set up a mixed cluster whose nodes use differing Lucene codec versions, with the primary on the higher version.
Index a couple of documents and force segment replication to take place.
Expected behavior
To avoid this shard failure, we need to add a compatibility check that logs a warning when the primary and the replica are on differing codec versions and then skips the segment replication instead of moving forward with it.
Risks: If the replica is not upgraded in time, it may fall too far behind the primary, which might eventually cause a shard failure anyway.
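A minimal sketch of what such a check could look like, written in plain Java so it runs without OpenSearch or Lucene on the classpath. The class name `CodecCompatibilityCheck` and the method `canReplicate` are hypothetical and are not part of the OpenSearch codebase. The replica's codec set is hard-coded here from the names listed in the error above; on a real node it could presumably be obtained from Lucene's `Codec.availableCodecs()`.

```java
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper, not an existing OpenSearch class.
public class CodecCompatibilityCheck {
    private static final Logger LOGGER =
        Logger.getLogger(CodecCompatibilityCheck.class.getName());

    /**
     * Returns true only if the replica can load the codec the primary wrote its
     * segments with. On a mismatch it logs a warning and returns false so the
     * caller can skip this replication round instead of failing the shard.
     */
    public static boolean canReplicate(String primaryCodec, Set<String> replicaCodecs) {
        if (replicaCodecs.contains(primaryCodec)) {
            return true;
        }
        LOGGER.warning(() -> "Skipping segment replication: primary codec '" + primaryCodec
            + "' is not available on this node; supported codecs: " + replicaCodecs);
        return false;
    }

    public static void main(String[] args) {
        // Codec names taken from the replica's classpath as reported in the error above.
        Set<String> replicaCodecs = Set.of(
            "Lucene94", "Lucene92", "Lucene91", "Lucene90",
            "Lucene87", "Lucene86", "Lucene84", "Lucene80", "Lucene70");

        // Same codec on both sides: replication can proceed.
        System.out.println(canReplicate("Lucene94", replicaCodecs));
        // Primary already upgraded: a warning is logged and replication is skipped.
        System.out.println(canReplicate("Lucene95", replicaCodecs));
    }
}
```

Returning a boolean rather than throwing keeps the decision with the caller, which matches the intent above: warn and skip, and leave the shard in place until the replica's node is upgraded.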