Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX] Fix missing fields to resolve Strict Dynamic Mapping issue when saving task result #16201

Merged
merged 2 commits into from
Oct 21, 2024

Conversation

inpink
Copy link
Contributor

@inpink inpink commented Oct 5, 2024

Related Issues

Resolves #16060

Description

[The result of the execution]

  • Click the image to view it in full size.
Before After
before after
StrictDynamicMappingException occurred not occurred
  • When applying and then deleting the auto follow rule in the CCR Plugin(cross-cluster-replication), the .tasks index is properly updated without StrictDynamicMappingException.

[Background]

  • OpenSearch allows a Follower Cluster to replicate indexes from a Leader Cluster through the Opensearch CCR Plugin.
  • However, when cancelling the auto follow rule in the CCR Plugin, a StrictDynamicMappingException was previously encountered:
[2024-09-28T12:01:16,819][WARN ][o.o.r.t.a.AutoFollowTask ] [5460424aac92][my-connection-alias] Error storing result StrictDynamicMappingException[mapping set to strict, dynamic introduction of [cancellation_time_millis] within [task] is not allowed]
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:876)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:722)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:461)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:524)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:545)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:447)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:138)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:93)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:256)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.prepareIndex(IndexShard.java:1178)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1135)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:1052)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:625)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:471)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:535)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:416)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:125)
2024-09-28 21:01:16     at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:275)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
2024-09-28 21:01:16     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2024-09-28 21:01:16     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
2024-09-28 21:01:16     at java.base/java.lang.Thread.run(Thread.java:1583)
  • When a task is stored, OpenSearch’s TaskResultsService.storeResult(TaskResult taskResult, ActionListener<Void> listener) is called.
  • This method extracts a TaskInfo object from theTaskResult parameter. TaskInfo defines the necessary fields for a task, which are mapped to the .tasks index via task-index-mapping.json.
  • So, when a task is cancelled, the .tasks index should be updated
  • Also, the .tasks index has strict dynamic mapping enabled.
  • However, cancellation_time_millis and resource_stats were present in TaskInfo but missing from the mapping JSON.
  • As a result, this was causing a StrictDynamicMappingException.

[PR contents]

  1. Added the cancellation_time_millis and resource_stats fields to task-index-mapping.json:
...
          "cancellation_time_millis": {
            "type": "long"
          },
          "resource_stats": {
            "type" : "object",
            "enabled" : false
          }
...
  1. Updated the mapping version in both task-index-mapping.json and TaskResultsService from 4 to 5.
    These versions must match for the mapping to apply correctly. When updating a mapping, increment the version so future operations can detect and apply changes.

[Test]

I conducted two types of tests:

  • Writing a test in OpenSearch
  • Running local end-to-end (e2e) tests
Writing a test in OpenSearch

TaskInfo has two constructors: one that includes cancellation_time_millis (a) and one that does not (b).
In the existing TasksIT, the resultsService.storeResult() test did not include the appropriate resource_stats and cancellation_time_millis in the TaskInfo.
So, I added testStoreResultWithAllFields, using constructor (b) to create a test that only passes with the updated task-index-mapping.json.

Local End-to-End (e2e) Testing
  • To perform the e2e test, I needed a setup with two custom OpenSearch 3.0.0v clusters and the CCR Plugin 3.0.0v installed.
  1. I set up two clusters with the CCR Plugin installed. I cloned the OpenSearch GitHub repository and modified the task-index-mapping.json.
  2. After modifying OpenSearch, I assemble it, generating the opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz file.
  3. I then cloned and assembled the OpenSearch CCR GitHub repository, which generated opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip.
  4. Using a Dockerfile, I built a Docker image. In the same directory as the Dockerfile, I included the following files:
    opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz, opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip, opensearch.yml, opensearch-docker-entrypoint.sh, opensearch-onetime-setup.sh
image
  1. I used Docker Compose to create two clusters.
  2. Finally, I installed the CCR plugin on each cluster:
docker exec -it [container] /bin/bash
/usr/share/opensearch/bin/opensearch-plugin install file:/usr/share/opensearch/opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip
  1. I registered an auto follow rule and then canceled it to verify the behavior. Set up a cross-cluster connection, get-started-with-auto-follow:
curl -XPUT -k -H 'Content-Type: application/json'  'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-connection-alias": {
          "seeds": ["localhost:9300"]
        }
      }
    }
  }
}'
curl -XPUT -k -H 'Content-Type: application/json'  'http://localhost:9201/leader-01?pretty'
curl -XPOST -k -H 'Content-Type: application/json' -u 'admin:Yhj99!009' 'https://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
   "leader_alias" : "my-connection-alias",
   "name": "my-replication-rule",
   "pattern": "movies*",
   "use_roles":{
      "leader_cluster_role": "all_access",
      "follower_cluster_role": "all_access"
   }
}'
curl -XDELETE -k -H 'Content-Type: application/json'  'http://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
   "leader_alias" : "my-connection-alias",
   "name": "my-replication-rule"
}'
  1. I checked the logs to confirm if StrictDynamicMappingException occurred. After modifying the task-index-mapping.json correctly, I observed that instead of the previous StrictDynamicMappingException, the task status was successfully updated.
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,764Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Going to mark AutoFollowTask:97 task as completed", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,766Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Completed the task with id:97", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,770Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Successfully persisted task status", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }

Below are the files I used for testing, along with their sources:

�Name File Source
DockerFile link "opensearch-build" al2023.dockerfile
docker-compose link "opensearch" distribution docker-compose
log4j2.properties link "opensearch-build" log4j2.properties
opensearch-docker-entrypoint-default.x.sh link "opensearch-build" entrypoint
opensearch-onetime-setup.sh link "opensearch-build" onetime
opensearch.yml link "opensearch" distribution yml
opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz link Generated by cloning the OpenSearch repository and assembling the project.
opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip link Generated by cloning the OpenSearch CCR Plugin repository and assembling the project.

My test environment was Mac OS M2. If you are using a different operating system, replace the .tar.gz file with the version that matches your system.

If needed, I am happy to provide the Docker image used in my test. Please feel free to request it if required.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Oct 5, 2024

❌ Gradle check result for 839a1bc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Oct 7, 2024

❌ Gradle check result for 7946ad9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@inpink inpink force-pushed the 16060-solve branch 2 times, most recently from 5e60e37 to 5469e96 Compare October 19, 2024 09:04
Copy link
Contributor

❌ Gradle check result for 5469e96: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for f09fe85: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@inpink
Copy link
Contributor Author

inpink commented Oct 19, 2024

image image

The local tests succeed, but they fail on “Gradle Check (Jenkins) / gradle-check (pull_request_target)” even though I haven’t modified them.
What actions should be taken?

@dbwiddis
Copy link
Member

What actions should be taken?

Looks like two test failures:

  1. There's no index test. Investigate why that is.
  2. Your max hits expected and actual mismatch. This is a change in Lucene 9 where search hits gives a minimum bound (to save time) if an explicit value isn't requested. See Fix AbstractStringFieldDataTestCase tests to account for TotalHits lower bound #4270 for example on how to properly test this.

Copy link
Contributor

❕ Gradle check result for 0b367cc: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Oct 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.13%. Comparing base (ad7f9e7) to head (ace68c7).
Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16201      +/-   ##
============================================
+ Coverage     72.05%   72.13%   +0.07%     
- Complexity    64861    64885      +24     
============================================
  Files          5309     5309              
  Lines        302734   302785      +51     
  Branches      43733    43750      +17     
============================================
+ Hits         218134   218404     +270     
+ Misses        66731    66516     -215     
+ Partials      17869    17865       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

❌ Gradle check result for 797821f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…tasks index (opensearch-project#16060)

- Fixed issue where `.tasks` index failed to update due to StrictDynamicMappingException when a task was cancelled.
- Added missing `cancellation_time_millis` and `resource_stats` fields to `task-index-mapping.json`.
- Ensured proper task result storage by updating the mappings.
- Changed the version in the meta field from 4 to 5 to reflect the updated mappings.

Signed-off-by: inpink <[email protected]>
Copy link
Contributor

✅ Gradle check result for 6a5e60f: SUCCESS

@inpink
Copy link
Contributor Author

inpink commented Oct 21, 2024

The Gradle check has been completed. 😄🎉
Thank you, @dbwiddis !
When you have time, I’d appreciate it if you could merge it!

Copy link
Contributor

✅ Gradle check result for ace68c7: SUCCESS

@dbwiddis dbwiddis merged commit 322bdc4 into opensearch-project:main Oct 21, 2024
37 of 38 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16201-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 322bdc42dab1d6d4fa021529057453afd5cb898e
# Push it to GitHub
git push --set-upstream origin backport/backport-16201-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16201-to-2.x.

dbwiddis added a commit to dbwiddis/OpenSearch that referenced this pull request Oct 21, 2024
…tasks index (opensearch-project#16060) (opensearch-project#16201)

- Fixed issue where `.tasks` index failed to update due to StrictDynamicMappingException when a task was cancelled.
- Added missing `cancellation_time_millis` and `resource_stats` fields to `task-index-mapping.json`.
- Ensured proper task result storage by updating the mappings.
- Changed the version in the meta field from 4 to 5 to reflect the updated mappings.

Signed-off-by: inpink <[email protected]>
Signed-off-by: Daniel Widdis <[email protected]>
Co-authored-by: Daniel Widdis <[email protected]>
dbwiddis added a commit that referenced this pull request Oct 22, 2024
…tasks index (#16060) (#16201) (#16414)

- Fixed issue where `.tasks` index failed to update due to StrictDynamicMappingException when a task was cancelled.
- Added missing `cancellation_time_millis` and `resource_stats` fields to `task-index-mapping.json`.
- Ensured proper task result storage by updating the mappings.
- Changed the version in the meta field from 4 to 5 to reflect the updated mappings.

Signed-off-by: inpink <[email protected]>
Signed-off-by: Daniel Widdis <[email protected]>
Co-authored-by: inpink <[email protected]>
@inpink inpink changed the title [BUGFIX] Fix missing fields to resolve Strict Dynamic Mapping issue when saving task result [BUGFIX] Fix missing fields to resolve Strict Dynamic Mapping issue when saving task result Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed bug Something isn't working Other
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cancellation_time_millis is not getting added in .tasks index
2 participants