Create a clone of local segments size map used for Remote Segment Stats until sync to remote completes #11896

linuxpi
Collaborator

@linuxpi linuxpi commented Jan 16, 2024

Description

  • Clones the local segment file size map stored in the segment tracker at the start of each upload operation, so that the map used by that upload is effectively immutable for its whole scope.
  • Without the clone, the map in the segment tracker could be updated by another refresh while segments are still being uploaded to the remote store.
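
A minimal sketch of the idea described above, assuming the tracker keeps file sizes in a shared concurrent map. The class, field, and method names (`SegmentSizeSnapshotSketch`, `liveSizes`, `snapshotSizes`, `totalBytes`) are hypothetical and are not the actual OpenSearch code changed in this PR. The sketch only illustrates why an upload that reads from a copy taken at its start is insulated from a concurrent refresh.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SegmentSizeSnapshotSketch {

    // Hypothetical stand-in for the tracker's shared map: segment file name -> size in bytes.
    // A refresh running on another thread may add or remove entries at any time.
    static final Map<String, Long> liveSizes = new ConcurrentHashMap<>();

    // Takes an immutable copy once, at the start of an upload.
    // Later changes to liveSizes are not visible through the returned map.
    static Map<String, Long> snapshotSizes() {
        return Map.copyOf(liveSizes);
    }

    // Sums the sizes of the files being uploaded, reading only from the given map.
    static long totalBytes(Set<String> filesToUpload, Map<String, Long> sizes) {
        long total = 0;
        for (String file : filesToUpload) {
            Long size = sizes.get(file);
            if (size == null) {
                // Reading the live map instead of a snapshot could end up here
                // (or in an NPE on unboxing) if a refresh removed the entry mid-upload.
                throw new IllegalStateException("no size recorded for " + file);
            }
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        liveSizes.put("_0.cfs", 100L);
        liveSizes.put("_0.si", 50L);
        Set<String> filesToUpload = Set.of("_0.cfs", "_0.si");

        // The upload starts: freeze the sizes it will report stats against.
        Map<String, Long> snapshot = snapshotSizes();

        // Simulate a refresh that mutates the shared map while the upload is in flight.
        liveSizes.remove("_0.cfs");
        liveSizes.put("_1.cfs", 200L);

        System.out.println(totalBytes(filesToUpload, snapshot)); // prints 150
        System.out.println(liveSizes.containsKey("_0.cfs"));     // prints false
    }
}
```

The copy costs one extra map allocation per upload, which is the trade-off for not having to coordinate the upload and refresh paths with a lock.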

Related Issues

Resolves #11025 and #9774

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ats until sync to remote completes

Signed-off-by: bansvaru <[email protected]>
Contributor

github-actions bot commented Jan 16, 2024

Compatibility status:

Checks if related components are compatible with change db96d34

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/security.git]

Contributor

❕ Gradle check result for 8d6c34b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Jan 16, 2024

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (6012504) 71.28% compared to head (db96d34) 71.36%.
Report is 13 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...ava/org/opensearch/index/mapper/IpFieldMapper.java | 65.71% | 10 Missing and 2 partials ⚠️ |
| ...search/index/shard/RemoteStoreRefreshListener.java | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11896      +/-   ##
============================================
+ Coverage     71.28%   71.36%   +0.08%     
- Complexity    59414    59505      +91     
============================================
  Files          4925     4925              
  Lines        279479   279508      +29     
  Branches      40635    40641       +6     
============================================
+ Hits         199226   199476     +250     
+ Misses        63731    63485     -246     
- Partials      16522    16547      +25     


Member

@ashking94 ashking94 left a comment

Please update on how many iterations we have run for asserting the fix in flakiness of this test.

@linuxpi
Collaborator Author

linuxpi commented Jan 17, 2024

Please update on how many iterations we have run for asserting the fix in flakiness of this test.

1000 iterations. Earlier, the issue used to pop up within 300 iterations.

Contributor

❌ Gradle check result for c257994: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: bansvaru <[email protected]>
Contributor

❌ Gradle check result for 7875161: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@linuxpi
Collaborator Author

linuxpi commented Jan 17, 2024

❌ Gradle check result for 7875161: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock #10006
org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} #11875

Both failures are flaky tests, tracked in the issues linked above.

@gbbafna
Collaborator

gbbafna commented Jan 17, 2024

Should we give #11908 a try? This is getting too complex.

@ashking94
Member

Should we give #11908 a try? This is getting too complex.

Let's fix this flaky test first, and then I (or whoever can) can attempt the simplification of the listener.

@gbbafna
Collaborator

gbbafna commented Jan 18, 2024

Should we give #11908 a try? This is getting too complex.

Let's fix this flaky test first, and then I (or whoever can) can attempt the simplification of the listener.

Can we attempt it now (just POCs and finalizing the approach) and then come back to this PR depending on the complexity? If we can achieve it now, we won't need this PR at all. If it is a burning problem, we can always mute the test, and it can be resolved once the simplification is done.

Also, we would need to get this fixed by 2.12, as the above NPE can cause refreshes to get stuck forever.

@linuxpi
Collaborator Author

linuxpi commented Jan 18, 2024

Should we give #11908 a try? This is getting too complex.

Let's fix this flaky test first, and then I (or whoever can) can attempt the simplification of the listener.

Can we attempt it now (just POCs and finalizing the approach) and then come back to this PR depending on the complexity? If we can achieve it now, we won't need this PR at all. If it is a burning problem, we can always mute the test, and it can be resolved once the simplification is done.

Sure @gbbafna. Let me try it out quickly and see what we can do here to simplify the design.

Contributor

❌ Gradle check result for 4d0de06: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❕ Gradle check result for db96d34: UNSTABLE

  • TEST FAILURES:
      3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@gbbafna gbbafna self-requested a review February 2, 2024 09:47
@gbbafna gbbafna merged commit 57cc0dd into opensearch-project:main Feb 2, 2024
33 checks passed
@gbbafna gbbafna deleted the protect-ongoing-files-segments-tracker branch February 2, 2024 09:48
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Feb 2, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 2, 2024
…ats until sync to remote completes (#11896)

Signed-off-by: bansvaru <[email protected]>
(cherry picked from commit 57cc0dd)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Feb 5, 2024
…ats until sync to remote completes (#11896) (#12143)

(cherry picked from commit 57cc0dd)

Signed-off-by: bansvaru <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ats until sync to remote completes (opensearch-project#11896)

Signed-off-by: bansvaru <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
backport 2.x, bug, skip-changelog, Storage:Durability, Storage:Remote
Development

Successfully merging this pull request may close these issues.

[Remote Store] Root cause deleted segment files during remote uploads
3 participants