-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Add cancellation support in RemoteStoreReplicationSource #9234
Conversation
server/src/test/java/org/opensearch/index/shard/SegmentReplicationIndexShardTests.java
Show resolved
Hide resolved
Compatibility status:
|
Gradle Check (Jenkins) Run Completed with:
|
Issue is not repro'able locally.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we run
Line 19 in 4ccbf9d
public class SegmentReplicationSuiteIT extends SegmentReplicationBaseIT { |
server/src/main/java/org/opensearch/indices/replication/RemoteStoreReplicationSource.java
Outdated
Show resolved
Hide resolved
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #9234 +/- ##
============================================
+ Coverage 71.02% 71.13% +0.11%
- Complexity 57365 57432 +67
============================================
Files 4776 4776
Lines 270809 270806 -3
Branches 39582 39582
============================================
+ Hits 192331 192647 +316
+ Misses 62271 62026 -245
+ Partials 16207 16133 -74
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Gradle Check (Jenkins) Run Completed with:
|
|
Gradle Check (Jenkins) Run Completed with:
|
@@ -159,13 +159,15 @@ public void startReplication(ActionListener<Void> listener) { | |||
// Get list of files to copy from this checkpoint. | |||
state.setStage(SegmentReplicationState.Stage.GET_CHECKPOINT_INFO); | |||
cancellableThreads.checkForCancel(); | |||
source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener); | |||
cancellableThreads.execute(() -> source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener)); | |||
|
|||
checkpointInfoListener.whenComplete(checkpointInfo -> { | |||
final List<StoreFileMetadata> filesToFetch = getFiles(checkpointInfo); | |||
state.setStage(SegmentReplicationState.Stage.GET_FILES); | |||
cancellableThreads.checkForCancel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - if we are moving to a sync cancel here and wrapping calls to the replication source I don't think this class needs the checkForCancel
checks anymore before invoking the source methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove checkForCancel
call before/after ReplicationSource call. The remaining checkForCancel
would help in fail-fast when cancellation happens after source calls e.g. cancellation just after getSegmentFiles
method call but before finalizeReplication
inside SegmentReplicationTarget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see remaining checkForCancel()
calls are right after the ReplicationSource
method calls (first statement inside SegmentReplicationTarget.getFiles, SegmentReplicationTarget.finalizeReplication), which means we can remove these calls as well. BUT, it is still possible that cancellation happens outside of replication source method calls, in which case resulting in non-graceful cancellations. I think we should also wrap getFiles
and finalizeReplication
method calls inside CancellationThreads.execute(). This is required as target contains heavy-weight operations such as finalizeReplication (reads commit from disk) and there are chances that cancellation happens during method execution.
server/src/test/java/org/opensearch/index/shard/RemoteIndexShardTests.java
Outdated
Show resolved
Hide resolved
…ionSource Signed-off-by: Suraj Singh <[email protected]> Self review Signed-off-by: Suraj Singh <[email protected]> Address review comments Signed-off-by: Suraj Singh <[email protected]> Address review comments Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
|
||
class TestRSReplicationSource extends RemoteStoreReplicationSource { | ||
|
||
private final Thread beforeCheckpoint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Created threads vs Runnable to more closely mimic the actual scenario where cancellations happen in separate threads. Sample trace showing segrep event & cancellation invoke in separate threads (note Thread %tid in logs below).
[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [org.opensearch.index.shard.RemoteIndexShardTests] [test][0] [Thread 42] Starting Replication Target: Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource] with thread opensearch[org.opensearch.index.shard.RemoteIndexShardTests][generic][T#3]
[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [[Thread-5]] [test][0] [Thread 46] Cancelling replication for target Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
[2023-08-21T19:42:05,090][ERROR][o.o.i.r.SegmentReplicationTargetService] [org.opensearch.index.shard.RemoteIndexShardTests] [Thread 42] Error during segment replication, Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
...
Gradle Check (Jenkins) Run Completed with:
|
@dreamer-89 FYI #9499 - Not sure if your changes here will fix this or not but would like to first determine the source of flakiness here. |
Discussed separately with @mch2, there is no data point around whether this change is needed for 2.10.0 release provided existing tests are stable and we are close to release. We may need to revisit this change, identify use-cases where it helps. I will keep the PR open in case we identify it is needed. |
This PR is stalled because it has been open for 30 days with no activity. |
Cancellation support was added as part of #10725, closing this stalled PR. |
Description
Adds cancellation to RemoteStoreReplicationSource and checks for event cancellation in
getCheckpointMetadata
andgetSegmentFiles
method calls. This is useful as it allows:Related Issues
Resolves #8089
Testing
One test
testDropRandomNodeDuringReplication
failure due to assertion trip which seems unrelated. Ran test againstmain
and I am still seeing this failure. So, this is un-related to this change.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.