When a node (say, Node A) that hosts a primary shard with ongoing segment replication leaves the cluster, the replica shard on the target (Node B) fails with a NodeClosedException. With a replica count of one, this also turns the cluster red, because both the primary (on Node A) and the replica (on Node B) become unassigned. This can be resolved by handling the exception gracefully on the target when the node leaves the cluster.
[2023-03-07T22:11:37,498][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-4-54.us-west-2.compute.internal] replication failure
org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:365) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.7.0.jar:2.7.0]
at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.7.0.jar:2.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-5-14.us-west-2.compute.internal][10.0.5.14:9300][internal:index/shard/replication/get_segment_files]
Caused by: org.opensearch.transport.SendRequestTransportException: [ip-10-0-4-54.us-west-2.compute.internal][10.0.4.54:9300][internal:index/shard/replication/file_chunk]
at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:941) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:815) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:758) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.RetryableTransportClient$1.tryAction(RetryableTransportClient.java:91) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$1.doRun(RetryableAction.java:137) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction.run(RetryableAction.java:115) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.RetryableTransportClient.executeRetryableAction(RetryableTransportClient.java:106) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.RemoteSegmentFileChunkWriter.writeFileChunk(RemoteSegmentFileChunkWriter.java:117) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:148) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:97) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:149) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:129) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:117) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:98) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.7.0.jar:2.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.opensearch.node.NodeClosedException: node closed {ip-10-0-5-14.us-west-2.compute.internal}{TfIe0XASSY-qM9pY1gCmow}{mdJAYWGfSJmU13TV20PaMA}{10.0.5.14}{10.0.5.14:9300}{di}{shard_indexing_pressure_enabled=true}
at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:922) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:815) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:758) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.RetryableTransportClient$1.tryAction(RetryableTransportClient.java:91) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$1.doRun(RetryableAction.java:137) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction.run(RetryableAction.java:115) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.RetryableTransportClient.executeRetryableAction(RetryableTransportClient.java:106) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.RemoteSegmentFileChunkWriter.writeFileChunk(RemoteSegmentFileChunkWriter.java:117) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:148) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:97) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:149) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:129) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:117) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:98) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.7.0.jar:2.7.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.7.0.jar:2.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-03-07T22:11:37,878][INFO ][o.o.c.r.a.AllocationService] [seed] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[nyc_taxis][20], [nyc_taxis][20]]]).
Repro steps
Create a multi-node cluster with a large shard count.
Stop the OpenSearch process on one node while heavy indexing is in progress (the nyc_taxis workload from OpenSearch-Benchmark works well). The more shards on the stopped node, the higher the chance of an ongoing replication event. With a one-replica setup, this results in the cluster going red.
The cluster manager brings the primary up on Node B, which contains the previously copied files.
Expected
The shard should not be marked as failed on the target when the node containing the primary goes down.
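One possible shape for the graceful handling described above, as a rough sketch only: walk the cause chain of the replication failure on the target and, if a NodeClosedException is found, cancel the replication event instead of failing the shard. Everything here is hypothetical and self-contained: ReplicationFailureClassifier, the Action enum, and the nested NodeClosedException are stand-ins for illustration, not OpenSearch classes, and the actual fix in SegmentReplicationTargetService may look quite different.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class ReplicationFailureClassifier {

    /** Hypothetical stand-in for org.opensearch.node.NodeClosedException. */
    public static class NodeClosedException extends RuntimeException {
        public NodeClosedException(String message) {
            super(message);
        }
    }

    /** Hypothetical outcomes the target could choose between. */
    public enum Action { FAIL_SHARD, CANCEL_AND_KEEP_SHARD }

    /** Returns true if any exception in the cause chain is a NodeClosedException. */
    public static boolean isCausedByNodeClosed(Throwable failure) {
        // Identity-based visited set guards against a cyclic cause chain.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable t = failure; t != null && seen.add(t); t = t.getCause()) {
            if (t instanceof NodeClosedException) {
                return true;
            }
        }
        return false;
    }

    /** Assumed policy: a departing source node is not a reason to fail the replica. */
    public static Action onReplicationFailure(Throwable failure) {
        return isCausedByNodeClosed(failure) ? Action.CANCEL_AND_KEEP_SHARD : Action.FAIL_SHARD;
    }

    public static void main(String[] args) {
        // Mirrors the nesting in the log: replication failure -> transport failure -> node closed.
        Throwable nodeLeft = new RuntimeException("Segment Replication failed",
            new RuntimeException("send request failed", new NodeClosedException("node closed")));
        System.out.println(onReplicationFailure(nodeLeft));
        System.out.println(onReplicationFailure(new RuntimeException("unrelated failure")));
    }
}
```

With this policy the replica on Node B would stay assigned and remain eligible for promotion when the cluster manager reassigns the primary, while unrelated replication failures would still fail the shard as before.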