[Meta] [Segment Replication] Run all integration tests with segment replication enabled #6761

dreamer-89 · 2023-03-20T20:30:45Z

Leading up to GA #5147 of segment replication, we need to do a round of sanity testing with existing integration tests. This is to ensure segment replication is compatible with existing features.

This issue tracks the effort of running these tests and identifying the root cause of failures. Fixes for individual failures can be tracked in separate issues. These tests should be run against the 2.x branch, as we are targeting this exercise for SegRep GA (going in 2.7). We can start with the server module and run all integration tests with internalClusterTest. Once these failures are resolved, we can run the remaining integration tests.

General steps:

When running these tests, we need to make Segment Replication the default replication strategy (this avoids having to set the replication type in each integration test). The changes needed are:
1 - Update INDEX_REPLICATION_TYPE_SETTING in IndexMetadata to return ReplicationType.SEGMENT.
2 - Update ReplicationType.java to return SEGMENT if an NPE is caught.
3 - Update FeatureFlags to set REPLICATION_TYPE_SETTING’s default to true.
4 - Some tests use MockEngineFactory; change OpenSearchIntegTestCase#addMockInternalEngine to return false by default.
5 - Some tests use MockEngineFactory as a node-level plugin; override this by also updating the class:

    @Override
    public Engine newReadWriteEngine(EngineConfig config) {
        // Read-only replicas must use the segment replication engine, not the mock engine.
        if (config.isReadOnlyReplica()) {
            return new NRTReplicationEngine(config);
        }
        return new MockInternalEngine(config, wrapper);
    }

mch2 · 2023-06-09T16:23:16Z

We are still working on enabling SR tests with randomization across our entire suite.

The issue with randomly running SR for all tests is that assertions do not wait for replication to complete. So we consistently see doc count assertions break.

To fix this we would need to update every test and wrap these assertions in assertBusy. Most tests are performing a query and then invoking assertHitCount that takes the response, so we can't wrap this at a higher level.
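The retry-until-pass idea behind wrapping an assertion can be sketched in plain Java as follows. This is a simplified stand-in written for illustration, not the OpenSearch test framework's assertBusy; the class name EventualAssert, the method name assertEventually, and the backoff values are assumptions.

```java
import java.time.Duration;

/** Simplified stand-in for an assertBusy-style helper (hypothetical, for illustration only). */
final class EventualAssert {
    private EventualAssert() {}

    /** Re-runs the assertion until it passes or the timeout elapses, then rethrows the last failure. */
    static void assertEventually(Runnable assertion, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the assertion passed, e.g. the replica has caught up
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and surface the most recent failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // capped exponential backoff
        }
    }
}
```

The search has to be issued inside the Runnable so that each retry sees fresh results; asserting on a response captured before the retry loop would fail the same way every time, which is why a test that passes a response object to assertHitCount cannot simply be wrapped from the outside.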

An alternative is to have RandomizingClient use Preference.PRIMARY_FIRST for searches when segrep is enabled on the index. Assertions would then be made against a primary unless a replica node is specified. To ensure there was no issue during replication, we can assert replica segment state before/during teardown.
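As a toy illustration of what a primary-first preference means for copy selection, the sketch below picks an active primary when one exists and otherwise falls back to an active replica. The CopySelector and ShardCopy types are hypothetical and do not reflect the actual OpenSearch routing code.

```java
import java.util.List;
import java.util.Optional;

/** Toy model of shard-copy selection (hypothetical types, not OpenSearch routing code). */
final class CopySelector {
    static final class ShardCopy {
        final String node;
        final boolean primary;
        final boolean active;

        ShardCopy(String node, boolean primary, boolean active) {
            this.node = node;
            this.primary = primary;
            this.active = active;
        }
    }

    private CopySelector() {}

    /** Prefers an active primary and falls back to the first active replica. */
    static Optional<ShardCopy> primaryFirst(List<ShardCopy> copies) {
        Optional<ShardCopy> primary = copies.stream()
            .filter(c -> c.active && c.primary)
            .findFirst();
        if (primary.isPresent()) {
            return primary;
        }
        return copies.stream().filter(c -> c.active).findFirst();
    }
}
```

With this selection, a search issued right after a refresh reads from the copy that already has the new segments, which is why it hides replication lag instead of testing it.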

Bukhtawar · 2023-06-09T22:35:50Z

+1 to the overall issue
The primary node, however, might not be a good test for "replication", and my concern is we might get a false sense of victory. However, we can at least start from here and work on truly building "replication" tests.

mch2 · 2023-06-16T16:32:13Z

The primary node, however, might not be a good test for "replication", and my concern is we might get a false sense of victory.

Agreed, we aren't loving the solution of forcing primary first. We also aren't loving the alternatives we've come up with, which are:

Wrapping all tests in assertBusy where required (oof).
Adding a new base method for assertHitCount that takes the client and request so the search can be repeated, essentially making it assertEventualCount. This isn't terrible but still requires a lot of tests to be updated. It also will not work for tests that index concurrently, and assertions other than hit count would still leave us with flakiness.
Hacking the client to refresh and assert replicas are caught up before performing a search / after a refresh. This introduces side effects to the tests that could be dangerous/misleading. It also doesn't work for concurrent cases.

I think the safest approach here is to outline the most critical subset of tests and create separate SR versions of them. We've essentially done that with SegmentReplication based ITs, but we will need to audit that list for coverage.

Rishikesh1159 · 2023-06-20T15:54:43Z

As @mch2 mentioned in this issue #8109 we have outlined the most critical test packages that we need to run with Segment Replication in server module:

cluster
gateway
index
indexing
indices
ingest
recovery
remotestore
update

Rishikesh1159 · 2023-06-20T16:06:50Z

A few options for enabling segment replication with these critical test packages:

Previously pointed out options:
-> Wrapping all tests in assertBusy where required.
-> Adding a new base method for assertHitCount that takes the client and request so the search can be repeated, essentially making it assertEventualCount. This isn't terrible but still requires a lot of tests to be updated. It also will not work for tests that index concurrently, and assertions other than hit count would still leave us with flakiness.

Other options:
-> In every test, after every index operation, we can call a method to verify that the store contents of the primary and replica are the same, as we do here. The problem is that we would need to modify every test, and it might not work well for tests using background indexing.

-> Duplicating all test packages by creating separate Segment Replication versions of them. The problem here is duplicating a lot of tests with very minor changes.
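The store-comparison option above amounts to diffing the segment files held by the primary and the replica. A minimal sketch, assuming each store can be summarized as a map from file name to checksum (the StoreDiff class and this representation are illustrative assumptions, not the helper the comment links to):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative diff of two stores, each summarized as file name -> checksum (hypothetical). */
final class StoreDiff {
    private StoreDiff() {}

    /** Returns the names of files that are missing on either side or whose checksums differ. */
    static Set<String> mismatchedFiles(Map<String, Long> primary, Map<String, Long> replica) {
        Set<String> mismatched = new TreeSet<>();
        for (Map.Entry<String, Long> entry : primary.entrySet()) {
            // A null lookup (file absent on the replica) never equals the primary's checksum.
            if (!entry.getValue().equals(replica.get(entry.getKey()))) {
                mismatched.add(entry.getKey());
            }
        }
        for (String name : replica.keySet()) {
            if (!primary.containsKey(name)) {
                mismatched.add(name); // stale file that only the replica still holds
            }
        }
        return mismatched;
    }
}
```

An empty result means the two copies hold the same files. With background indexing the primary keeps producing new segments, so a check like this can legitimately report differences at any instant, which is the weakness noted above.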

Rishikesh1159 · 2023-07-26T22:27:33Z

First we need a detailed design of the problem and to explore a few options/solutions. Once we have a detailed plan and a working solution, we need a plan of action to verify that all integ tests pass.

Detailed Design of problem and proposed solution:

Background:

→ First let’s do a quick recap of how segment replication works.

When a document is indexed into OpenSearch, it is first indexed on the primary shard (stored as segments on the primary). Then, depending on the refresh interval, a refresh happens on the primary, which makes the document searchable by the user/client. On a refresh, the primary also publishes a checkpoint to all replica shards, and each replica then copies the new segments from the primary. Until this copy is completed (which usually takes some time), the document is not searchable on the replica shard.
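The flow described above can be modelled with a small in-memory sketch. This is a deliberately simplified toy (documents stand in for segments, and the checkpoint is a plain counter); the SegRepModel, Primary, and Replica names are hypothetical and are not OpenSearch classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the index -> refresh -> publish checkpoint -> copy flow (hypothetical). */
final class SegRepModel {
    private SegRepModel() {}

    static final class Primary {
        private final List<String> buffered = new ArrayList<>();   // indexed but not yet refreshed
        private final List<String> searchable = new ArrayList<>(); // visible after a refresh
        private long checkpoint = 0;

        void index(String doc) {
            buffered.add(doc);
        }

        /** Makes buffered docs searchable on the primary and returns the new checkpoint version. */
        long refresh() {
            searchable.addAll(buffered);
            buffered.clear();
            return ++checkpoint;
        }

        int searchableCount() {
            return searchable.size();
        }
    }

    static final class Replica {
        private final List<String> searchable = new ArrayList<>();
        private long checkpoint = 0;

        /** Copies the primary's searchable state if the published checkpoint is newer. */
        void onCheckpoint(Primary primary, long published) {
            if (published > checkpoint) {
                searchable.clear();
                searchable.addAll(primary.searchable);
                checkpoint = published;
            }
        }

        int searchableCount() {
            return searchable.size();
        }
    }
}
```

Between `refresh()` on the primary and `onCheckpoint(...)` on the replica, the two copies report different searchable counts. In the real system that window has a non-zero duration, and an assertion that lands inside it is the failure mode this issue is about.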

Problem:

The main problem with running all integ tests with segrep enabled is that tests fail when they assert against a replica shard before it has caught up with the primary.

Brute Force Solution:

→ As the problem above states, we need a way to wait until the replica shard has caught up with the primary before making assertions in tests.
→ The brute force approach would be to go to every integration test and add a manual step that waits until the replica has caught up with the primary before every assertion.
→ The problem with the brute force approach is that there are a few thousand integ tests in our current codebase, and all of them would need to be manually updated with a wait before their assertions. Every new integ test added to the OpenSearch codebase would also need these wait checks before its assertions to be compatible with segment replication.

Proposed Solution:

→ Usually, before asserting on a replica shard, an integ test performs a refresh or a search through the client.
→ The approach is to create a new SegRep client, similar to the Randomizing Client, and override its search or refresh methods to wait until the replica shard has caught up (using the _cat/segment_replication API) before returning to the integ test.
→ All the integ tests in the OpenSearch codebase inherit, in one way or another, from a single parent class, OpenSearchIntegTestCase. So we can use this new SegRep client in OpenSearchIntegTestCase, and all integ tests inheriting from this class can then be run against segment replication (with the wait-until-replica-catches-up behaviour).
→ I have created a branch: https://github.com/Rishikesh1159/OpenSearch/tree/test-all-integtests-segrep as a POC to test this solution. I have modified that client’s search request method to wait until the replica has caught up before returning to the integ test.
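The shape of such a client can be sketched as a decorator that blocks until a caught-up check passes and only then delegates the search. This is a minimal sketch, not the code in the POC branch: the SearchClient interface is hypothetical, and the catch-up check is abstracted as a BooleanSupplier where the proposal would query the _cat/segment_replication API.

```java
import java.util.function.BooleanSupplier;

/** Decorator sketch: wait for replicas, then delegate the search (hypothetical types). */
final class WaitingSearch {
    /** Hypothetical minimal client interface used only for this sketch. */
    interface SearchClient {
        long count(String index);
    }

    private WaitingSearch() {}

    static SearchClient waitingClient(SearchClient delegate, BooleanSupplier replicasCaughtUp, long timeoutMillis) {
        return index -> {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            // Poll the catch-up check; fail loudly instead of searching a stale replica.
            while (!replicasCaughtUp.getAsBoolean()) {
                if (System.currentTimeMillis() >= deadline) {
                    throw new IllegalStateException("replicas did not catch up within " + timeoutMillis + " ms");
                }
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for replicas", e);
                }
            }
            return delegate.count(index);
        };
    }
}
```

Because the wait lives in the client, existing tests would not need to change, but every search through this client silently takes on the extra latency and the timeout failure mode.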

Downside with proposed solution:

→ Not every integ test uses the client’s search requests to read docs. Some integ tests use GETs and MGETs, which read from the translog. Integ tests using GET and MGET requests will therefore still fail with the proposed solution when segment replication is enabled. We already have an issue cut for GETs and MGETs here: #8536; this is being worked on independently.

→ When an integ test indexes continuously/concurrently while searching (asserting), it might occasionally fail because the replica can be behind the primary due to the ongoing ingestion. Our proposed solution might fail in this case. The best way to handle these integ tests is individually, test by test, by adding the wait-until behaviour manually. As there are only very few tests of this kind, we can update them manually if needed.

Plan of Action:

-> We need to verify that all integration tests in the OpenSearch repo pass with both segment replication and remote store. So we can divide our testing plan into two phases:

Phase-1 : Segment Replication Only [Node to Node case].
Phase-2: Remote Store Using Segment Replication.

-> All tests in the OpenSearch repo inherit from one of the base classes here. We can ignore all tests inheriting from OpenSearchTestCase (unit tests) and OpenSearchSingleNodeTestCase (no replica case) for the segment replication and remote store features. The most important base testing class for us is OpenSearchIntegTestCase, as all integ tests inherit from it. Finally, we should also make sure that segment replication and remote store haven't broken any modules by verifying the base classes OpenSearchRestTestCase and OpenSearchClientYamlSuiteTestCase. Our target for the plan of action is to verify that all test classes inheriting from the following base classes pass:

OpenSearchIntegTestCase
OpenSearchRestTestCase
OpenSearchClientYamlSuiteTestCase

-> Initially we will execute our plan of action on the feature branch segment-replication. Once we have enough confidence, we can merge this feature branch into main.

### Phase - 1 (Segment Replication - Node to Node):

Setup: Making necessary setup for the proposed solution of creating a new segment replication client.
Support realtime reads with GET and MGET requests.
Verification: Verify all integ tests passing with segment replication (node to node) enabled.
Enabling Randomization, so that we can randomly enable/disable segment replication on integ tests.

### Phase - 2 (Remote Store With Segment Replication):

Setup: Enable remote store setting and register repository on base class.
Verification: Verify all integ tests passing for Remote Store with Segment Replication enabled.
Enabling Randomization, so that we can randomly enable/disable remote store on integ tests.

Rishikesh1159 · 2023-07-26T22:33:27Z

Each point in Plan of Action of above comment can be a separate sub task.

dblock · 2023-07-27T20:43:57Z

Each point in Plan of Action of above comment can be a separate sub task.

You can use - [ ] in the issue summary and GitHub will tell you what's left :)

anasalkouz · 2023-08-08T17:05:51Z

Thanks @Rishikesh1159 for the detailed plan. Will the proposed solution work for other modules as well (not only the server module)? Are you going to cover this as part of your POC?

Rishikesh1159 · 2024-01-08T18:33:10Z

Plan for running integration tests with segment replication.

Run all the existing integ tests in OpenSearch with segment replication enabled. Initially this was the end goal, but we soon realized it is not a good idea: many integ tests will obviously fail with segrep, and another set of tests is completely unrelated to segment replication. Running these tests with segment replication enabled wouldn’t add any value and might instead add more flaky tests. So we are targeting only specific modules to run with segrep initially, and we can extend this to other modules later if needed.

Goal:

We are targeting only specific modules that are related to indexing/replication to run with segrep initially and then later we can extend this to other modules if needed.

The following 4 steps are necessary:

Step 1 (waiting until replica):

→ Come up with the logic for waiting until the replica has caught up. This logic can be found here

Step 2 (Mechanism to implement waiting until replica):

→ Next we need a mechanism/approach for using the logic from step 1, so that it can be plugged in at one place and used by multiple tests.
→ Possible Approaches:

Approach 1: Modify the existing client to override search requests, similar to what we discussed before here about the segrep client.

Problem with Approach 1: In this approach we block search requests on the client side until the replica has caught up. We are basically modifying existing client behaviour based on segrep settings. People writing integ tests usually don’t expect the client to wait, so we would be adding unintended client behaviour to all segrep tests.

Approach 2: Modify the indexing helper methods to add the step 1 logic. After an indexing request and refresh finish, if segrep is enabled on the index, we wait until the replica has caught up (step 1). Integ tests usually index docs in one of the following ways:
-> using the helper method indexRandom(), example. Many integ tests use indexRandom() to index docs, so just modifying indexRandom() in the parent class OpenSearchIntegTestCase.java would cover a lot of tests.
-> using any of the index() helper methods, example.
-> directly using the client, client().prepareIndex("test"), example.

Problem with Approach 2: Although many tests use indexRandom(), some tests still use client().prepareIndex("test") or index(). For tests not using indexRandom(), we either change the test to use indexRandom() or add the wait-until (step 1) logic in the test itself after a refresh.
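Approach 2 can be sketched as an indexing helper that performs the wait itself, so callers get it for free. This is a minimal sketch under assumed names: the Cluster interface and the indexAndRefresh method are hypothetical stand-ins for the test framework, not the real indexRandom() implementation.

```java
import java.util.List;

/** Sketch of an indexing helper that waits for replicas when segrep is enabled (hypothetical). */
final class IndexingHelper {
    /** Hypothetical view of the test cluster used only for this sketch. */
    interface Cluster {
        void index(String index, String doc);

        void refresh(String index);

        boolean segrepEnabled(String index);

        void waitForReplicas(String index); // the "step 1" wait-until-caught-up logic
    }

    private IndexingHelper() {}

    static void indexAndRefresh(Cluster cluster, String index, List<String> docs) {
        for (String doc : docs) {
            cluster.index(index, doc);
        }
        cluster.refresh(index);
        // Only segrep indices need the extra wait; docrep replicas index each doc themselves.
        if (cluster.segrepEnabled(index)) {
            cluster.waitForReplicas(index);
        }
    }
}
```

Unlike the client-side wait, the blocking happens at a point where the test author already expects work to be done, but tests that bypass the helper and index through the client directly get no wait at all.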

Step 3 (Framework/Mechanism for running a test with both default (docrep) and segrep enabled) :

→ Next we need an approach for running existing tests with segrep both enabled and disabled.
→ Possible Approaches:

Approach 1: Parameterize the test class, similar to what exists for concurrent search. We need to modify the base parameterization class to support final index settings, as the segment replication setting is final.

Problem with Approach 1: Tests with the default scope (Scope.SUITE) that use the same index across all tests in the suite will fail to run with segrep both enabled and disabled. As the segrep setting is final, it can be set only once, when the index is created. We would also need to go through every test suite and make changes to support parameterization for those classes.
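The constraint can be illustrated with a toy index whose replication type is fixed at creation. The ToyIndex class is hypothetical; the setting key index.replication.type and the values SEGMENT and DOCUMENT are used here for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy index that rejects updates to final settings (hypothetical, for illustration). */
final class ToyIndex {
    private static final Set<String> FINAL_SETTINGS = Set.of("index.replication.type");
    private final Map<String, String> settings;

    ToyIndex(Map<String, String> creationSettings) {
        this.settings = new HashMap<>(creationSettings);
    }

    /** Dynamic settings can change on a live index; final ones can only be set at creation. */
    void updateSetting(String key, String value) {
        if (FINAL_SETTINGS.contains(key)) {
            throw new IllegalArgumentException("final setting [" + key + "] cannot be updated");
        }
        settings.put(key, value);
    }

    String get(String key) {
        return settings.get(key);
    }
}
```

A suite-scoped test that creates its index once cannot flip it between DOCUMENT and SEGMENT per parameter; it would have to delete and recreate the index for each parameter value.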

Approach 2: Create a new segrep version of every test by extending the original test and modifying its settings.

Problem with Approach 2: Lots of extra classes will be added to the existing test modules, as we are basically creating another class for every test to run with segrep.
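The inheritance approach can be sketched as a subclass that overrides only the index settings and inherits every test method unchanged. The class names and the indexSettings() hook below are simplified, hypothetical stand-ins rather than the actual test framework API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for an existing docrep-based test class (hypothetical). */
class DocRepSearchIT {
    /** Settings applied when the test creates its index. */
    protected Map<String, String> indexSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("index.number_of_replicas", "1");
        return settings;
    }

    /** Stands in for the inherited test methods, which only see the settings hook. */
    String replicationType() {
        return indexSettings().getOrDefault("index.replication.type", "DOCUMENT");
    }
}

/** Segrep variant: same tests, different creation-time settings. */
class SegRepSearchIT extends DocRepSearchIT {
    @Override
    protected Map<String, String> indexSettings() {
        Map<String, String> settings = super.indexSettings();
        settings.put("index.replication.type", "SEGMENT");
        return settings;
    }
}
```

Each class creates its own index with its own creation-time settings, so the final-setting restriction never comes into play. The cost is one extra class per test suite.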

Step 4:

→ Identify and list all the tests/modules that definitely need to be run with segment replication.

→ Implement all 4 steps above.

andrross · 2024-01-10T19:33:03Z

@reta @sohami I would appreciate your feedback on this, as I know you were involved in the parameterization of the concurrent search integration tests. Specifically, I'd like your feedback on "Step 3" of the previous comment, which is the mechanism for parameterizing the test cases. You can see the implementation in #11773. Basically, because segment replication is a non-dynamic setting for an index, we're going with the inheritance approach in order to accommodate suite scope tests.

reta · 2024-01-10T20:04:30Z

Basically, because segment replication is a non-dynamic setting for an index, we're going with the inheritance approach in order to accommodate suite scope tests.

@andrross I think we could make parameterized tests work with the ClusterScope annotation (you are very right that it does not work right now as one would expect), but as of today, going with inheritance seems to be the most straightforward way to start (I will try to look into parameterized tests later this week and will update you).

reta · 2024-01-12T19:53:47Z

@andrross I have opened this draft, #11877. At a glance it seems feasible, but the pull request still needs some work.

dreamer-89 added untriaged distributed framework and removed untriaged labels Mar 20, 2023

dreamer-89 changed the title ~~[Segment Replication] Run all integration tests with segment replication enabled~~ [Meta] [Segment Replication] Run all integration tests with segment replication enabled Mar 20, 2023

dreamer-89 added >test-failure Test failure from CI, local build, etc. v2.7.0 labels Mar 21, 2023

This was referenced Mar 23, 2023

[Segment Replication] Remove deprecated feature flag check preventing SegmentReplicationSnapshotIT tests run #6810

Merged

[Meta] Promote Segment Replication out of experimental. #5147

Closed

dreamer-89 added the v2.8.0 'Issues and PRs related to version v2.8.0' label Apr 18, 2023

sohami mentioned this issue Apr 20, 2023

Moving concurrent-search out of sandbox plugin to core behind feature flag #7203

Merged

mch2 removed the v2.7.0 label May 2, 2023

cwperks mentioned this issue May 9, 2023

Adds ExtensionsManager.lookupExtensionSettingsById #7466

Merged

kotwanikunal assigned Rishikesh1159 May 9, 2023

Rishikesh1159 mentioned this issue May 11, 2023

[segment replication] Test failures with Segment Replication as Replication strategy #7524

Closed

nknize mentioned this issue May 22, 2023

[Refactor] OpenSearchException streamables to a registry #7646

Merged

reta mentioned this issue May 23, 2023

[AUTOCUT] Gradle Check Failure on push to 2.x #7701

Closed

jed326 mentioned this issue May 25, 2023

[Backport 2.x] Change INDEX_SEARCHER threadpool to auto queue to support task resource tracking #7765

Merged

reta mentioned this issue Jun 2, 2023

[AUTOCUT] Gradle Check Failure on push to 2.x #7894

Closed

navneet1v mentioned this issue Jun 8, 2023

Adding the SearchPhaseResultsProcessor interface in Search Pipeline #7283

Merged

opensearch-project deleted a comment from DarshitChanpura Jun 9, 2023

mch2 mentioned this issue Jun 13, 2023

SegRep with Remote: Update components of segrep backpressure to suppo… #8020

Merged

mch2 mentioned this issue Jun 16, 2023

[Segment Replication + Remote Store] GA test planning #8109

Closed

dreamer-89 added the v2.9.0 'Issues and PRs related to version v2.9.0' label Jul 5, 2023

dreamer-89 added v2.10.0 and removed v2.9.0 'Issues and PRs related to version v2.9.0' v2.8.0 'Issues and PRs related to version v2.8.0' labels Jul 19, 2023

Rishikesh1159 mentioned this issue Jul 27, 2023

[Segment Replication] Create new Segment Replication client to support running Integ Tests with Segment Replication enabled #8925

Closed

Bukhtawar added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Jul 27, 2023

Rishikesh1159 mentioned this issue Jul 27, 2023

[Segment Replication] Verify All Integ Tests are Passing with Segment Replication Enabled #8927

Open

This was referenced Aug 9, 2023

[Remote Store] Verify All Integ Tests are Passing with Remote Store Enabled #9196

Closed

[Remote Store] Enable remote store setting and register repository on base OpenSearchIntegTestCase class #9198

Closed

dreamer-89 mentioned this issue Sep 12, 2023

[Segment Replication] Sequence number based recoveries #10003

Open

anasalkouz removed the distributed framework label Sep 19, 2023

Rishikesh1159 added v2.11.0 Issues and PRs related to version 2.11.0 and removed v2.10.0 labels Sep 25, 2023

Rishikesh1159 mentioned this issue Jan 8, 2024

[Segment Replication] Identify/List out all Integ Tests that must be ran with segment replication enabled #11804

Closed

This was referenced Jan 9, 2024

[Segment Replication] Add Segment Replication Specific Integration Tests #11773

Merged

[Segment Replication] Support Parameterizing the segment replication based tests #11621

Closed

Rishikesh1159 closed this as completed in #11773 Feb 9, 2024
