[SegmentReplication] RollingUpgrade - Add a compatibility check to avoid shard failure #6616

Closed
Poojita-Raj opened this issue Mar 10, 2023 · 0 comments · Fixed by #6730
Labels
bug Something isn't working distributed framework

Describe the bug
During a rolling upgrade, the cluster should not be locked to any single version. Currently, in a mixed-version cluster where a primary sends segments written with a newer Lucene codec to replicas on an older version, the replicas hit the error below, which causes a shard failure:

[2023-03-08T22:03:30,951][WARN ][o.o.i.c.IndicesClusterStateService] [node3] [my-index-000003][0] marking and sending shard failed due to [shard failure, reason [replication failure]]
org.opensearch.indices.replication.common.ReplicationFailedException: [my-index-000003][0]: Replication failed on  (failed to clean after replication)
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:254) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:209) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.5.0.jar:2.5.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.0.jar:2.5.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalArgumentException: Could not load codec 'Lucene95'. Did you forget to add lucene-backward-codecs.jar?
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:515) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:404) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
	... 26 more
	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (a859c5b). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(SegmentInfos))
		at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:500) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:370) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
		at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:209) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
		at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
		at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.5.0.jar:2.5.0]
		at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.5.0.jar:2.5.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene95' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene94, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91, Lucene92]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.codecs.Codec.forName(Codec.java:118) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:511) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:404) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:218) ~[opensearch-2.5.0.jar:2.5.0]
	... 26 more
 

To Reproduce
Steps to reproduce the behavior:

  1. Set up a mixed cluster whose nodes use different Lucene codec versions, with the primary on the higher version.
  2. Index a couple of documents and force segment replication to take place.

Expected behavior
To avoid this shard failure, we need to add a compatibility check that logs a warning when the primary and the replica are on different codec versions and skips the segment replication instead of moving forward with it.

Risks: If the replica is not upgraded in time, it may fall too far behind the primary, which could eventually cause a shard failure.
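
A minimal sketch of the kind of compatibility check described above, not the fix that was merged. The class, enum, and method names are hypothetical and are not OpenSearch APIs. It assumes the replica knows the primary's codec name before copying segments, and that the replica's supported codec names are passed in as a plain set (on a real node, Lucene's `Codec.availableCodecs()` could supply them).

```java
import java.util.Set;

/** Hypothetical sketch; none of these names exist in OpenSearch. */
final class CodecCompatibilityCheck {

    /** Outcome of the check for one replication round. */
    enum Decision { PROCEED, SKIP }

    private CodecCompatibilityCheck() {}

    /**
     * Decides whether a replica should copy segments from the primary.
     *
     * @param primaryCodec  codec name the primary writes segments with (assumed to be known up front)
     * @param replicaCodecs codec names the replica is able to load
     */
    static Decision check(String primaryCodec, Set<String> replicaCodecs) {
        if (replicaCodecs.contains(primaryCodec)) {
            return Decision.PROCEED;
        }
        // Warn and skip this round rather than failing the shard.
        System.err.println("WARN: skipping segment replication, primary codec ["
            + primaryCodec + "] is not available on this node " + replicaCodecs);
        return Decision.SKIP;
    }
}
```

With the codec names from the log above, `check("Lucene95", ...)` on the 2.5.0 replica would return `SKIP`, so the replica keeps serving its current segments until it is upgraded.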

