Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientTimeoutIT.testSegmentReplicationStatsWithTimeout test failure #6259

Closed
Tracked by #5669
dreamer-89 opened this issue Feb 9, 2023 · 2 comments
Closed
Tracked by #5669

Comments

@dreamer-89
Copy link
Member

dreamer-89 commented Feb 9, 2023

First identified in #6250 (comment), testSegmentReplicationStatsWithTimeout causing gradle test failure. The log output shows org.apache.lucene.store.LockObtainFailedException: exception on underlying store during round of segrep causing replication failure.

Failing seed

./gradlew ':server:internalClusterTest' --tests "org.opensearch.action.admin.ClientTimeoutIT.testSegmentReplicationStatsWithTimeout" -Dtests.seed=9A395F8DCBA089E7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-Latn -Dtests.timezone=Pacific/Chuuk -Druntime.java=19

Detailed log output

 [2023-02-09T12:24:39,976][WARN ][o.o.c.r.a.AllocationService] [node_t0] failing shard [failed shard, shard [test-index][2], node[_S7y00_sSWa3B91qZkXEwA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tCjMdNcOSJWi6pKgkV2klA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-09T02:24:39.869Z], failed_attempts[4], failed_nodes[[_S7y00_sSWa3B91qZkXEwA]], delayed=false, details[failed shard on node [_S7y00_sSWa3B91qZkXEwA]: shard failure, reason [replication failure], failure ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; ], allocation_status[no_attempt]], message [shard failure, reason [replication failure]], failure [ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; ], markAsStale [true]]
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:365) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[main/:?]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[main/:?]
  1> 	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[main/:?]
  1> 	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) ~[main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:208) ~[main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:169) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[main/:?]
  1> 	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[main/:?]
  1> 	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[main/:?]
  1> 	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[main/:?]
  1> 	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[main/:?]
  1> 	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1381) ~[main/:?]
  1> 	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[main/:?]
  1> 	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:1589) [?:?]
  1> Caused by: org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
  1> 	at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
  1> 	... 35 more
  1> Caused by: org.opensearch.common.io.stream.NotSerializableExceptionWrapper: execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock
  1> 	at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
  1> 	... 35 more
  1> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock
  1> 	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(Nativ  1> eFSLockFactory.java:126) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.opensearch.index.store.Store.renameTempFilesSafe(Store.java:433) ~[main/:?]
  1> 	at org.opensearch.indices.recovery.MultiFileWriter.renameAllTempFiles(MultiFileWriter.java:202) ~[main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$5(SegmentReplicationTarget.java:211) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[main/:?]
  1> 	... 25 more
  1> [2023-02-09T12:25:11,207][INFO ][o.o.a.a.ClientTimeoutIT  ] [testSegmentReplicationStatsWithTimeout] ensureGreen timed out, cluster state:
  1> cluster uuid: Qe1G0l-wRc-iUbLvqHvTJw [committed: true]
  1> version: 32
  1> state uuid: B2BEnwl7QPSdJF1nwoEkbQ
  1> from_diff: false
  1> meta data version: 4
  1>    coordination_metadata:
  1>       term: 1
  1>       last_committed_config: VotingConfiguration{_wrd0B39S5eGR1yCNpNLMw}
  1>       last_accepted_config: VotingConfiguration{_wrd0B39S5eGR1yCNpNLMw}
  1>       voting tombstones: []
  1>    [test-index/0Lk-vFiXS-CqZZJJEs1hLA]: v[4], mv[1], sv[1], av[1]
  1>       0: p_term [1], isa_ids [0rdU0HrFT-Cy5kkjqArBlQ]
  1>       1: p_term [1], isa_ids [rkn36s8iQkigiIB1hOSNEg]
  1>       2: p_term [1], isa_ids [FD1D8qrVS4KDbQORQRZqGg]
  1>       3: p_term [1], isa_ids [3ynJhPIWRmOjFTaFyIbhgQ]
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> nodes: 
  1>    {node_t2}{so0YOT1SSj2nHiIKDnnbcw}{rHW2NtdYSuyWvYCmXofmtw}{127.0.0.1}{127.0.0.1:40433}{d}{shard_indexing_pressure_enabled=true}
  1>    {node_t1}{_S7y00_sSWa3B91qZkXEwA}{b8e0TsfUTJuteFUPqCHyFg}{127.0.0.1}{127.0.0.1:37059}{d}{shard_indexing_pressure_enabled=true}
  1>    {node_t0}{_wrd0B39S5eGR1yCNpNLMw}{7fNqZbtGQN2WNkqIax0_Sg}{127.0.0.1}{127.0.0.1:36167}{m}{shard_indexing_pressure_enabled=true}, cluster-manager
  1> routing_table (version 29):
  1> -- index [[test-index/0Lk-vFiXS-CqZZJJEs1hLA]]
  1> ----shard_id [test-index][0]
  1> --------[test-index][0], node[so0YOT1SSj2nHiIKDnnbcw], [P], s[STARTED], a[id=0rdU0HrFT-Cy5kkjqArBlQ]
  1> --------[test-index][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-09T02:24:39.910Z], failed_attempts[5], failed_nodes[[_S7y00_sSWa3B91qZkXEwA]], delayed=false, details[failed shard on node [_S7y00_sSWa3B91qZkXEwA]: shard failure, reason [replication failure], failure ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/0/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/0/index/write.lock]; ], allocation_status[no_attempt]]
  1> ----shard_id [test-index][1]
  1> --------[test-index][1], node[_S7y00_sSWa3B91qZkXEwA], [P], s[STARTED], a[id=rkn36s8iQkigiIB1hOSNEg]
  1> --------[test-index][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-09T02:24:39.955Z], failed_attempts[5], failed_nodes[[so0YOT1SSj2nHiIKDnnbcw]], delayed=false, details[failed shard on node [so0YOT1SSj2nHiIKDnnbcw]: shard failure, reason [replication failure], failure ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t2/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/1/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t2/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/1/index/write.lock]; ], allocation_status[no_attempt]]
  1> ----shard_id [test-index][2]
  1> --------[test-index][2], node[so0YOT1SSj2nHiIKDnnbcw], [P], s[STARTED], a[id=FD1D8qrVS4KDbQORQRZqGg]
  1> --------[test-index][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-09T02:24:39.976Z], failed_attempts[5], failed_nodes[[_S7y00_sSWa3B91qZkXEwA]], delayed=false, details[failed shard on node [_S7y00_sSWa3B91qZkXEwA]: shard failure, reason [replication failure], failure ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t1/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/2/index/write.lock]; ], allocation_status[no_attempt]]
  1> ----shard_id [test-index][3]
  1> --------[test-index][3], node[_S7y00_sSWa3B91qZkXEwA], [P], s[STARTED], a[id=3ynJhPIWRmOjFTaFyIbhgQ]
  1> --------[test-index][3], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-09T02:24:39.936Z], failed_attempts[5], failed_nodes[[so0YOT1SSj2nHiIKDnnbcw]], delayed=false, details[failed shard on node [so0YOT1SSj2nHiIKDnnbcw]: shard failure, reason [replication failure], failure ReplicationFailedException[Segment Replication failed]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t2/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/3/index/write.lock]; nested: LockObtainFailedException[Lock held by this virtual machine: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.ClientTimeoutIT_9A395F8DCBA089E7-004/tempDir-002/node_t2/nodes/0/indices/0Lk-vFiXS-CqZZJJEs1hLA/3/index/write.lock]; ], allocation_status[no_attempt]]

Gradle check link

https://build.ci.opensearch.org/job/gradle-check/10895/

@ashking94
Copy link
Member

Is this same as #6255?

@Rishikesh1159
Copy link
Member

@ashking94 yes both are same. Closing this issue as this is already tracked in #6255

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants