-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Compatibility check for differing lucene codec versions #6730
Conversation
server/src/test/java/org/opensearch/gateway/PrimaryShardAllocatorTests.java
Outdated
Show resolved
Hide resolved
Gradle Check (Jenkins) Run Completed with:
|
server/src/test/java/org/opensearch/indices/replication/OngoingSegmentReplicationsTests.java
Outdated
Show resolved
Hide resolved
@@ -74,7 +74,8 @@ public void setUp() throws Exception { | |||
ShardId testShardId = primary.shardId(); | |||
|
|||
// This mirrors the creation of the ReplicationCheckpoint inside CopyState | |||
testCheckpoint = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L); | |||
testCheckpoint = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L, "Lucene95"); | |||
testCheckpoint2 = new ReplicationCheckpoint(testShardId, primary.getOperationPrimaryTerm(), 0L, 0L, "Lucene94"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we generate these constants from actual codecs (or helper classes). Is there any utility which provides an older version of codec
76f6d82
to
6b4b64b
Compare
Gradle Check (Jenkins) Run Completed with:
|
@Poojita-Raj : Looks like a legit failure.
|
@@ -61,6 +78,7 @@ public ReplicationCheckpoint(StreamInput in) throws IOException { | |||
segmentsGen = in.readLong(); | |||
segmentInfosVersion = in.readLong(); | |||
length = in.readLong(); | |||
latestSupportedCodec = in.readString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be put behind engine version check else it will fail
- On target (on newer version 2.7 here) while deserializing request from node on older version. The target would expect more data and thus would fail with error "expecting more bytes but got EOF" error
- On target (on older version < 2.7) while deserializing request from node on higher version. The target wouldn't expect more data and thus would fail with error "more bytes" designating more data to consume for extra fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to hide both length
(missed previously) and latestSupportedCodec
behind version check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm adding it along with the test fix
if (copyState.getCheckpoint() | ||
.getLatestSupportedCodec() | ||
.equalsIgnoreCase(request.getCheckpoint().getLatestSupportedCodec()) == false) { | ||
// cancel(request.getTargetAllocationId(), "Requested unsupported codec version"); | ||
logger.trace("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec()); | ||
throw new CancellableThreads.ExecutionCancelledException( | ||
new ParameterizedMessage("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec()) | ||
.toString() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Poojita-Raj : I think we need integration tests here verifying behavior
- Both nodes running on same OS version
- Source node running on higher lucene codec
- Target node running on higher lucene codec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am verifying 2 and 3 with the unit tests added in OngoingSegmentReplicationTests. Currently with this release, I'm unable to add in integration tests since it requires having this change in 2 different distributions of OS with differing codecs. But they can definitely be included in further releases.
.equalsIgnoreCase(request.getCheckpoint().getLatestSupportedCodec()) == false) { | ||
// cancel(request.getTargetAllocationId(), "Requested unsupported codec version"); | ||
logger.trace("Requested unsupported codec version {}", request.getCheckpoint().getLatestSupportedCodec()); | ||
throw new CancellableThreads.ExecutionCancelledException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be good to have one integration test to verify behavior on target around handling this exception and verifying shard is not failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
519e251
to
ec7cbc0
Compare
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #6730 +/- ##
============================================
- Coverage 70.78% 70.69% -0.09%
+ Complexity 59305 59260 -45
============================================
Files 4813 4823 +10
Lines 283781 284031 +250
Branches 40924 40954 +30
============================================
- Hits 200864 200789 -75
- Misses 66420 66701 +281
- Partials 16497 16541 +44
... and 513 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick look I'm not sure how compatible this will be w/ Plugins overriding CodecService
|
||
public static ReplicationCheckpoint empty(ShardId shardId) { | ||
return new ReplicationCheckpoint(shardId); | ||
} | ||
|
||
private ReplicationCheckpoint(ShardId shardId) { | ||
codecService = new CodecService(null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that plugins can implement a custom CodecService provided through getCustomCodecServiceFactory
. When you create a new ReplicationCheckpoint I think you'll need to use the codecService defined by the indexSettings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the constructor for ReplicationCheckpoint to pick up the codecService set up for the shard. This would use the custom codec service if it's overridden by plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @nknize! Could you take a second look to see if the changes address your concerns?
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
@@ -30,37 +32,59 @@ public class ReplicationCheckpoint implements Writeable, Comparable<ReplicationC | |||
private final long segmentsGen; | |||
private final long segmentInfosVersion; | |||
private final long length; | |||
private final String latestSupportedCodec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This variabled name seems not appropriate as we are using the codec used in engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you @Poojita-Raj for this change.
Let's wait for one more approval on this.
Gradle Check (Jenkins) Run Completed with:
|
@@ -147,6 +148,12 @@ void startSegmentCopy(GetSegmentFilesRequest request, ActionListener<GetSegmentF | |||
*/ | |||
CopyState prepareForReplication(CheckpointInfoRequest request, FileChunkWriter fileChunkWriter) throws IOException { | |||
final CopyState copyState = getCachedCopyState(request.getCheckpoint()); | |||
if (copyState.getCheckpoint().getCodec().equalsIgnoreCase(request.getCheckpoint().getCodec()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the case-insensitive comparison here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, that was a remnant of a previous version where we passed in a string of codec name instead of codec service. I'll make the change.
} | ||
|
||
public ReplicationCheckpoint(StreamInput in) throws IOException { | ||
shardId = new ShardId(in); | ||
primaryTerm = in.readLong(); | ||
segmentsGen = in.readLong(); | ||
segmentInfosVersion = in.readLong(); | ||
length = in.readLong(); | ||
if (in.getVersion().onOrAfter(Version.V_2_7_0)) { | ||
length = in.readLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is length
inside this version guard where previously it was not?
Also, how does this work? 2.7 doesn't understand this new field (yet). I'm guessing there is no bwc coverage for this functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length was a field introduced by somebody else recently but seems like they forgot to keep it in the version guard which is why it's included here.
No, there's no bwc coverage for this but there will be rolling upgrade coverage for this moving forward.
this.shardId = shardId; | ||
primaryTerm = SequenceNumbers.UNASSIGNED_PRIMARY_TERM; | ||
segmentsGen = SequenceNumbers.NO_OPS_PERFORMED; | ||
segmentInfosVersion = SequenceNumbers.NO_OPS_PERFORMED; | ||
length = 0L; | ||
codec = codecService.codec("default").getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all you need here is the name, I would recommend passing in a String
instead of a CodecService
. Let the caller figure out how to get the name from the CodecService. I think this means you can probably avoid exposing the CodecService from IndexShard and instead only expose the codec service name.
Gradle Check (Jenkins) Run Completed with:
|
@@ -494,6 +494,10 @@ public boolean isSystem() { | |||
return indexSettings.getIndexMetadata().isSystem(); | |||
} | |||
|
|||
public CodecService codecService() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this become getCodecName()
since it looks like the only thing this is used for is to get the name?
Gradle Check (Jenkins) Run Completed with:
|
@@ -494,6 +494,10 @@ public boolean isSystem() { | |||
return indexSettings.getIndexMetadata().isSystem(); | |||
} | |||
|
|||
public String getCodecName() { | |||
return codecService.codec("default").getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a CodecService.DEFAULT_CODEC
constant that can be used here. Also, I'm wondering if this should be getDefaultCodecName()
or getCodecName(String codec)
? Currently it makes the decision to look up the "default" codec but it is not apparent in the method signature. At a minimum you should add documentation to this public method. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this naming makes sense. Added in the change + documentation for the method.
Signed-off-by: Poojita Raj <[email protected]>
Signed-off-by: Poojita Raj <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
…versions (#6730) * compatCheck Signed-off-by: Poojita Raj <[email protected]> * refactor Signed-off-by: Poojita Raj <[email protected]> --------- Signed-off-by: Poojita Raj <[email protected]> (cherry picked from commit c334bbd) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…versions (opensearch-project#6730) * compatCheck Signed-off-by: Poojita Raj <[email protected]> * refactor Signed-off-by: Poojita Raj <[email protected]> --------- Signed-off-by: Poojita Raj <[email protected]> Signed-off-by: Valentin Mitrofanov <[email protected]>
…versions (#6730) (#6991) This change aims to fail segment replications between the primary and replica if they are utilizing differing lucene codec versions. This is to avoid the current behavior of failing the replica shard in such situations. (cherry picked from commit c334bbd) Signed-off-by: Poojita Raj <[email protected]>
Description
This change aims to fail segment replications between the primary and replica if they are utilizing differing lucene codec versions. This is to avoid the current behavior of failing the replica shard in such situations.
Issues Resolved
Resolves #6616
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.