Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Compatibility check for differing lucene codec versions #6730

Merged
merged 2 commits into from
Apr 5, 2023

Conversation

Poojita-Raj
Copy link
Contributor

Description

This change aims to fail segment replications between the primary and replica if they are utilizing differing lucene codec versions. This is to avoid the current behavior of failing the replica shard in such situations.

Issues Resolved

Resolves #6616

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -74,7 +74,8 @@ public void setUp() throws Exception {
ShardId testShardId = primary.shardId();

// This mirrors the creation of the ReplicationCheckpoint inside CopyState
testCheckpoint = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L);
testCheckpoint = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L, "Lucene95");
testCheckpoint2 = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L, "Lucene94");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we generate these constants from actual codecs (or helper classes). Is there any utility which provides an older version of codec

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Gradle Check (Jenkins) Run Completed with:

@Poojita-Raj : Looks like a legit failure.

> Task :server:test

org.opensearch.indices.replication.SegmentReplicationTargetServiceTests > testOnNewCheckpointFromNewPrimaryCancelOngoingReplication FAILED
    Argument(s) are different! Wanted:
    segmentReplicationTargetService.startReplication(
        ReplicationCheckpoint{shardId=[index][0], primaryTerm=100, segmentsGen=4, version=8, size=0, codec=Asserting},
        org.opensearch.index.shard.IndexShard@2a88af1e,
        <any>
    );
    -> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testOnNewCheckpointFromNewPrimaryCancelOngoingReplication(SegmentReplicationTargetServiceTests.java:240)
    Actual invocations have different arguments:
    segmentReplicationTargetService.startReplication(
        org.opensearch.indices.replication.SegmentReplicationTarget$MockitoMock$DfKM6oI5@6ceac65e
    );
    -> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testOnNewCheckpointFromNewPrimaryCancelOngoingReplication(SegmentReplicationTargetServiceTests.java:235)
    segmentReplicationTargetService.onNewCheckpoint(
        ReplicationCheckpoint{shardId=[index][0], primaryTerm=100, segmentsGen=4, version=8, size=0, codec=Asserting},
        org.opensearch.index.shard.IndexShard@2a88af1e
    );
    -> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.lambda$testOnNewCheckpointFromNewPrimaryCancelOngoingReplication$1(SegmentReplicationTargetServiceTests.java:225)
        at __randomizedtesting.SeedInfo.seed([387957A7584E7A9:DABA5C54F5D74498]:0)
        at app//org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testOnNewCheckpointFromNewPrimaryCancelOngoingReplication(SegmentReplicationTargetServiceTests.java:240)

@@ -61,6 +78,7 @@ public ReplicationCheckpoint(StreamInput in) throws IOException {
segmentsGen = in.readLong();
segmentInfosVersion = in.readLong();
length = in.readLong();
latestSupportedCodec = in.readString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be put behind engine version check else it will fail

  1. On target (on newer version 2.7 here) while deserializing request from node on older version. The target would expect more data and thus would fail with error "expecting more bytes but got EOF" error
  2. On target (on older version < 2.7) while deserializing request from node on higher version. The target wouldn't expect more data and thus would fail with error "more bytes" designating more data to consume for extra fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to hide both length (missed previously) and latestSupportedCodec behind version check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm adding it along with the test fix

Comment on lines 151 to 155
if (copyState.getCheckpoint()
.getLatestSupportedCodec()
.equalsIgnoreCase(request.getCheckpoint().getLatestSupportedCodec()) == false) {
// cancel(request.getTargetAllocationId(), "Requested unsupported codec version");
logger.trace("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec());
throw new CancellableThreads.ExecutionCancelledException(
new ParameterizedMessage("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec())
.toString()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Poojita-Raj : I think we need integration tests here verifying behavior

  1. Both nodes running on same OS version
  2. Source node running on higher lucene codec
  3. Target node running on higher lucene codec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am verifying 2 and 3 with the unit tests added in OngoingSegmentReplicationTests. Currently with this release, I'm unable to add in integration tests since it requires having this change in 2 different distributions of OS with differing codecs. But they can definitely be included in further releases.

.equalsIgnoreCase(request.getCheckpoint().getLatestSupportedCodec()) == false) {
// cancel(request.getTargetAllocationId(), "Requested unsupported codec version");
logger.trace("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec());
throw new CancellableThreads.ExecutionCancelledException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to have one integration test to verify behavior on target around handling this exception and verifying shard is not failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationAllocationIT.testSingleIndexShardAllocation

@codecov-commenter
Copy link

codecov-commenter commented Mar 23, 2023

Codecov Report

Merging #6730 (250fb8a) into main (95c6ed9) will decrease coverage by 0.09%.
The diff coverage is 64.04%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #6730      +/-   ##
============================================
- Coverage     70.78%   70.69%   -0.09%     
+ Complexity    59305    59260      -45     
============================================
  Files          4813     4823      +10     
  Lines        283781   284031     +250     
  Branches      40924    40954      +30     
============================================
- Hits         200864   200789      -75     
- Misses        66420    66701     +281     
- Partials      16497    16541      +44     
Impacted Files Coverage Δ
...ch/index/codec/customcodecs/CustomCodecPlugin.java 0.00% <0.00%> (ø)
...h/index/codec/customcodecs/CustomCodecService.java 0.00% <0.00%> (ø)
.../codec/customcodecs/CustomCodecServiceFactory.java 0.00% <0.00%> (ø)
...ustomcodecs/PerFieldMappingPostingFormatCodec.java 0.00% <0.00%> (ø)
...rc/main/java/org/opensearch/index/IndexModule.java 81.92% <ø> (ø)
...c/main/java/org/opensearch/index/IndexService.java 73.81% <0.00%> (-0.06%) ⬇️
...in/java/org/opensearch/indices/IndicesService.java 64.74% <ø> (+0.70%) ⬆️
...s/replication/SegmentReplicationTargetService.java 48.10% <0.00%> (ø)
.../java/org/opensearch/plugins/IndexStorePlugin.java 100.00% <ø> (ø)
...customcodecs/Lucene95CustomStoredFieldsFormat.java 25.00% <25.00%> (ø)
... and 11 more

... and 513 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@Poojita-Raj Poojita-Raj requested a review from dreamer-89 March 24, 2023 15:30
Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick look I'm not sure how compatible this will be w/ Plugins overriding CodecService


public static ReplicationCheckpoint empty(ShardId shardId) {
return new ReplicationCheckpoint(shardId);
}

private ReplicationCheckpoint(ShardId shardId) {
codecService = new CodecService(null, null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that plugins can implement a custom CodecService provided through getCustomCodecServiceFactory. When you create a new ReplicationCheckpoint I think you'll need to use the codecService defined by the indexSettings?

Copy link
Contributor Author

@Poojita-Raj Poojita-Raj Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the constructor for ReplicationCheckpoint to pick up the codecService set up for the shard. This would use the custom codec service if it's overridden by plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @nknize! Could you take a second look to see if the changes address your concerns?

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled

@@ -30,37 +32,59 @@ public class ReplicationCheckpoint implements Writeable, Comparable<ReplicationC
private final long segmentsGen;
private final long segmentInfosVersion;
private final long length;
private final String latestSupportedCodec;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This variabled name seems not appropriate as we are using the codec used in engine.

Copy link
Member

@dreamer-89 dreamer-89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you @Poojita-Raj for this change.

Let's wait for one more approval on this.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -147,6 +148,12 @@ void startSegmentCopy(GetSegmentFilesRequest request, ActionListener<GetSegmentF
*/
CopyState prepareForReplication(CheckpointInfoRequest request, FileChunkWriter fileChunkWriter) throws IOException {
final CopyState copyState = getCachedCopyState(request.getCheckpoint());
if (copyState.getCheckpoint().getCodec().equalsIgnoreCase(request.getCheckpoint().getCodec()) == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the case-insensitive comparison here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that was a remnant of a previous version where we passed in a string of codec name instead of codec service. I'll make the change.

}

public ReplicationCheckpoint(StreamInput in) throws IOException {
shardId = new ShardId(in);
primaryTerm = in.readLong();
segmentsGen = in.readLong();
segmentInfosVersion = in.readLong();
length = in.readLong();
if (in.getVersion().onOrAfter(Version.V_2_7_0)) {
length = in.readLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is length inside this version guard where previously it was not?

Also, how does this work? 2.7 doesn't understand this new field (yet). I'm guessing there is no bwc coverage for this functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length was a field introduced by somebody else recently but seems like they forgot to keep it in the version guard which is why it's included here.
No, there's no bwc coverage for this but there will be rolling upgrade coverage for this moving forward.

this.shardId = shardId;
primaryTerm = SequenceNumbers.UNASSIGNED_PRIMARY_TERM;
segmentsGen = SequenceNumbers.NO_OPS_PERFORMED;
segmentInfosVersion = SequenceNumbers.NO_OPS_PERFORMED;
length = 0L;
codec = codecService.codec("default").getName();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all you need here is the name, I would recommend passing in a String instead of a CodecService. Let the caller figure out how to get the name from the CodecService. I think this means you can probably avoid exposing the CodecService from IndexShard and instead only expose the codec service name.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 3, 2023

Gradle Check (Jenkins) Run Completed with:

@@ -494,6 +494,10 @@ public boolean isSystem() {
return indexSettings.getIndexMetadata().isSystem();
}

public CodecService codecService() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this become getCodecName() since it looks like the only thing this is used for is to get the name?

@github-actions
Copy link
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh

@@ -494,6 +494,10 @@ public boolean isSystem() {
return indexSettings.getIndexMetadata().isSystem();
}

public String getCodecName() {
return codecService.codec("default").getName();
Copy link
Member

@andrross andrross Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a CodecService.DEFAULT_CODEC constant that can be used here. Also, I'm wondering if this should be getDefaultCodecName() or getCodecName(String codec)? Currently it makes the decision to look up the "default" codec but it is not apparent in the method signature. At a minimum you should add documentation to this public method. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this naming makes sense. Added in the change + documentation for the method.

Signed-off-by: Poojita Raj <[email protected]>
Signed-off-by: Poojita Raj <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocationWithSegRepFailure
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@andrross andrross added the backport 2.x Backport to 2.x branch label Apr 4, 2023
@andrross andrross merged commit c334bbd into opensearch-project:main Apr 5, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 5, 2023
…versions (#6730)

* compatCheck

Signed-off-by: Poojita Raj <[email protected]>

* refactor

Signed-off-by: Poojita Raj <[email protected]>

---------

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit c334bbd)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this pull request Apr 5, 2023
…versions (opensearch-project#6730)

* compatCheck

Signed-off-by: Poojita Raj <[email protected]>

* refactor

Signed-off-by: Poojita Raj <[email protected]>

---------

Signed-off-by: Poojita Raj <[email protected]>
Signed-off-by: Valentin Mitrofanov <[email protected]>
tlfeng pushed a commit that referenced this pull request Apr 6, 2023
…versions (#6730) (#6991)

This change aims to fail segment replications between the primary and replica if they are utilizing differing lucene codec versions. This is to avoid the current behavior of failing the replica shard in such situations.

(cherry picked from commit c334bbd)

Signed-off-by: Poojita Raj <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SegmentReplication] RollingUpgrade - Add a compatibility check to avoid shard failure
5 participants