
Revert uploading of remote cluster state manifest using min codec version #16403

Merged

Conversation

@soosinha (Member) commented Oct 21, 2024

Description

When a cluster is upgraded from 2.16 or lower to 2.17, the new cluster manager nodes are not able to join the cluster because of a deserialization failure while downloading the manifest. The stack trace is below:

FailedToCommitClusterStateException[publishing failed]; nested: XContentParseException[[-1:616] [cluster_metadata_manifest] failed to parse field [indices]]; nested: XContentParseException[[-1:599] [uploaded_index_metadata] unknown field [component_prefix]];
        at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1397)
        at org.opensearch.cluster.service.MasterService.publish(MasterService.java:384)
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:366)
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:228)
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:215)
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:257)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:891)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.opensearch.core.xcontent.XContentParseException: [-1:616] [cluster_metadata_manifest] failed to parse field [indices]
        at org.opensearch.core.xcontent.ObjectParser.parseValue(ObjectParser.java:594)
        at org.opensearch.core.xcontent.ObjectParser.parseArray(ObjectParser.java:585)
        at org.opensearch.core.xcontent.ObjectParser.parseSub(ObjectParser.java:620)
        at org.opensearch.core.xcontent.ObjectParser.parse(ObjectParser.java:356)
        at org.opensearch.core.xcontent.ConstructingObjectParser.parse(ConstructingObjectParser.java:188)
        at org.opensearch.gateway.remote.ClusterMetadataManifest.fromXContentV0(ClusterMetadataManifest.java:902)
        at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.deserialize(ChecksumBlobStoreFormat.java:163)
        at org.opensearch.gateway.remote.model.RemoteClusterMetadataManifest.deserialize(RemoteClusterMetadataManifest.java:137)
        at org.opensearch.gateway.remote.model.RemoteClusterMetadataManifest.deserialize(RemoteClusterMetadataManifest.java:32)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
        at org.opensearch.gateway.remote.RemoteManifestManager.fetchRemoteClusterMetadataManifest(RemoteManifestManager.java:240)
        at org.opensearch.gateway.remote.RemoteManifestManager.lambda$getLatestClusterMetadataManifest$3(RemoteManifestManager.java:194)
        at java.base/java.util.Optional.map(Optional.java:260)
        at org.opensearch.gateway.remote.RemoteManifestManager.getLatestClusterMetadataManifest(RemoteManifestManager.java:194)
        at org.opensearch.gateway.remote.RemoteClusterStateService.getLatestClusterMetadataManifest(RemoteClusterStateService.java:1045)
        at org.opensearch.gateway.GatewayMetaState$RemotePersistedState.setLastAcceptedState(GatewayMetaState.java:763)
        at org.opensearch.cluster.coordination.CoordinationState.handlePrePublish(CoordinationState.java:586)
        at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1392)
        ... 11 more
Caused by: org.opensearch.core.xcontent.XContentParseException: [-1:599] [uploaded_index_metadata] unknown field [component_prefix]
        at org.opensearch.core.xcontent.ObjectParser.lambda$errorOnUnknown$2(ObjectParser.java:129)

The above stack trace shows that the field component_prefix is present in the serialized entity, but the parser does not recognize it during deserialization. The component_prefix field was added under ClusterMetadataManifest.UploadedIndexMetadata in version 2.15. In version 2.17 we added an enhancement to support version upgrades: #15216. As part of that change, we upload the manifest using the codec version corresponding to the minimum node version and download it using the same codec version. However, for the UploadedIndexMetadata entity inside the manifest we still use the latest codec version when uploading, so the component_prefix field is always serialized, and the older-codec parser cannot deserialize it.
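
For illustration, here is a minimal, self-contained Java sketch of the failure mode; the field names other than component_prefix are hypothetical stand-ins, and the real OpenSearch code uses ObjectParser/ConstructingObjectParser rather than a hand-rolled check. The point is that a strict parser declared against the old schema throws on any field it has not declared, so an entity written with the latest schema but read with the V0 parser fails exactly as in the stack trace above.

    import java.util.Map;
    import java.util.Set;

    public class StrictParserSkew {

        // Fields the old (CODEC_V0-era) parser knows about; names other than
        // component_prefix are illustrative, not the exact manifest schema.
        static final Set<String> KNOWN_FIELDS_OLD = Set.of("index_name", "index_uuid", "uploaded_filename");

        // Like ObjectParser configured to error on unknown fields, a strict parser
        // throws on any field it has not declared instead of silently skipping it.
        static void parseStrict(Map<String, String> serialized, Set<String> knownFields) {
            for (String field : serialized.keySet()) {
                if (!knownFields.contains(field)) {
                    throw new IllegalArgumentException("[uploaded_index_metadata] unknown field [" + field + "]");
                }
            }
            // ...assign the recognized fields to the target object here...
        }

        public static void main(String[] args) {
            // A 2.17 node always serializes UploadedIndexMetadata with the latest
            // schema, so component_prefix is present even when the manifest is
            // labeled with the old codec version.
            Map<String, String> writtenByNewNode = Map.of(
                "index_name", "idx-1",
                "index_uuid", "uuid-1",
                "uploaded_filename", "metadata__1__2",
                "component_prefix", "index--"
            );
            parseStrict(writtenByNewNode, KNOWN_FIELDS_OLD); // throws: unknown field [component_prefix]
        }
    }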

Fix
Revert the logic that uploads the manifest using the minimum codec version.
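
As a rough sketch of what the revert changes (hypothetical names and codec numbers, not the actual RemoteClusterStateService code): before this PR the upload codec was derived from the minimum node version in the cluster, while after it the manifest is always uploaded with the latest codec version.

    // Hypothetical sketch of the codec selection before and after this revert.
    final class ManifestUploadCodec {
        static final int CODEC_LATEST = 4; // placeholder value

        // Before the revert: codec chosen from the minimum node version, while
        // nested entities such as UploadedIndexMetadata were still serialized with
        // the latest schema -- the mismatch behind the unknown-field failure above.
        static int codecBeforeRevert(int minNodeCodecVersion) {
            return minNodeCodecVersion;
        }

        // After the revert: always upload with the latest codec; the codec version
        // recorded alongside the manifest tells readers how to parse it.
        static int codecAfterRevert() {
            return CODEC_LATEST;
        }
    }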

Related Issues

NA

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for 4ea63a5: FAILURE
❌ Gradle check result for 850f6ae: FAILURE
❌ Gradle check result for 2349d48: FAILURE
❌ Gradle check result for 5137b75: FAILURE

@soosinha changed the title from "Serialize using codec version" to "Revert uploading of remote cluster state manifest using min codec version" on Oct 23, 2024
❌ Gradle check result for 9835547: FAILURE
❌ Gradle check result for fdc4e15: FAILURE
✅ Gradle check result for e9288b4: SUCCESS

@shwetathareja added the skip-changelog, bug (Something isn't working), and backport 2.x (Backport to 2.x branch) labels on Oct 23, 2024
@Bukhtawar (Collaborator) left a comment
Sorry, I am just catching up, but based on my limited understanding we need to align codec version bumps with engine versions so that we can be sure we don't go back to an older codec after a bump.
We also need to maintain all older codec versions for compatibility; we can remove that logic once there is a migration path for data on older codec versions.
While reading the file, we read the codec version from the file name, so we don't need to guess which codec version was used for writing, with the guarantee that writes were done in a version not greater than what the reader node can understand.

@soosinha (Member, Author) commented

@Bukhtawar We are maintaining the older codec versions for reading data written with older codecs. The codec version is already in the file name to determine how to read it.
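
For illustration, a minimal Java sketch of that read path, assuming a simplified blob name in which the codec version is the last double-underscore-separated token (the real manifest file name in OpenSearch carries more fields): the reader extracts the codec version from the name and dispatches to the matching parser, so it never has to guess how the manifest was written.

    public class ManifestReaderSketch {

        // Extract the codec version from a simplified, hypothetical blob name
        // such as "manifest__9__2__1" (last token = codec version).
        static int codecVersionFromBlobName(String blobName) {
            String[] tokens = blobName.split("__");
            return Integer.parseInt(tokens[tokens.length - 1]);
        }

        // Dispatch to the parser matching the codec the manifest was written with.
        static String deserialize(String blobName) {
            switch (codecVersionFromBlobName(blobName)) {
                case 0: return "parsed with fromXContentV0";
                case 1: return "parsed with fromXContentV1";
                default: return "parsed with the latest fromXContent";
            }
        }

        public static void main(String[] args) {
            System.out.println(deserialize("manifest__9__2__1")); // parsed with fromXContentV1
        }
    }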

❌ Gradle check result for bba8f04: FAILURE
❌ Gradle check result for d534f17: FAILURE
✅ Gradle check result for 4a6797c: SUCCESS

@shwetathareja (Member) commented

@Bukhtawar Merging this PR. This only addresses the manifest file; the rest of the metadata is also written using the latest codec version only. It may need more changes and more thought if we want to provide BWC for write operations as well.

@shwetathareja merged commit 4ad1be3 into opensearch-project:main on Oct 25, 2024
38 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 25, 2024
Signed-off-by: Sooraj Sinha <[email protected]>
(cherry picked from commit 4ad1be3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shwetathareja pushed a commit that referenced this pull request Oct 25, 2024
(cherry picked from commit 4ad1be3)

Signed-off-by: Sooraj Sinha <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>