
[BUG] Flaky org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica test #7556

Closed
ashking94 opened this issue May 15, 2023 · 4 comments · Fixed by #9111
Labels: bug (Something isn't working), distributed framework, Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

@ashking94 (Member)

Describe the bug
Flaky org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica

To Reproduce
https://build.ci.opensearch.org/job/gradle-check/15383

Expected behavior
The test should always pass.

Plugins
NA

@ashking94 ashking94 added bug Something isn't working untriaged labels May 15, 2023
@ashking94 (Member, Author)

@dreamer-89, please take a look at this.

@andrross (Member) commented Jun 2, 2023

Another failure here: #7892

@tlfeng (Collaborator) commented Jun 13, 2023

Detailed messages from the above test runs:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica" -Dtests.seed=5E8BE4AC6E063D80 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-ES -Dtests.timezone=Asia/Yekaterinburg -Druntime.java=19

org.opensearch.indices.replication.SegmentReplicationIT > testScrollCreatedOnReplica FAILED
    java.lang.AssertionError: Files should be preserved
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica(SegmentReplicationIT.java:948)

    java.lang.AssertionError: There are still [1] in-flight contexts. The first one's creation site is listed as the cause of this exception.
        at org.opensearch.search.MockSearchService.assertNoInFlightContext(MockSearchService.java:71)
        at org.opensearch.test.OpenSearchTestCase.lambda$ensureAllSearchContextsReleased$10(OpenSearchTestCase.java:656)
...
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica" -Dtests.seed=72E0A6C2FC7C8D23 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-AR -Dtests.timezone=Portugal -Druntime.java=17

org.opensearch.indices.replication.SegmentReplicationIT > testScrollCreatedOnReplica FAILED
    java.lang.AssertionError: Files should be preserved
        at __randomizedtesting.SeedInfo.seed([72E0A6C2FC7C8D23:6F337545EAD3C256]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica(SegmentReplicationIT.java:951)
> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica" -Dtests.seed=E856C16957444948 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=mk-MK -Dtests.timezone=Africa/Asmara -Druntime.java=20

org.opensearch.indices.replication.SegmentReplicationIT > testScrollCreatedOnReplica FAILED
    java.lang.AssertionError: Files should be preserved
        at __randomizedtesting.SeedInfo.seed([E856C16957444948:F58512EE41EB063D]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica(SegmentReplicationIT.java:950)

The test was disabled when using remote store as the segment replication source in commit 2a3683e, so I don't think the flakiness in this issue is related to the remote store integration.

After looking into the procedure of the test, I don't have a clear idea of what may cause the assertion failure.
The assertion aims to validate that the replica shard still contains all of its segments after a scroll search request (https://www.elastic.co/guide/en/elasticsearch/reference/7.10/paginate-search-results.html#scroll-search-results) and before the search context used by the scroll is cleared, while some documents are deleted from the primary shard after the scroll search request has started.
The assertion failure generally means that some segments were removed from the replica shard during the scroll search. This is surprising, because by design the search context should be preserved for the duration of the scroll search request, and the time to keep the search context is set to 1 day in this test.
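For illustration, the failing assertion essentially checks that every segment file captured when the scroll was opened is still present in the replica's store. A minimal sketch of that invariant in plain Python (not OpenSearch code; the function name and data are made up):

```python
# Hypothetical model of the "Files should be preserved" check: every segment
# file snapshotted when the scroll was opened must still exist in the
# replica's current file listing.

def files_preserved(snapshotted_segments, current_files):
    """Return the snapshotted files that are missing from the current store."""
    return set(snapshotted_segments) - set(current_files)

# Example: the scroll pinned _2j.si and _2j.fdt, but _2j.fdt was deleted,
# so the returned set is non-empty and the assertion would fail.
missing = files_preserved(["_2j.si", "_2j.fdt"], ["_2j.si", "_30.si"])
print(sorted(missing))  # ['_2j.fdt']
```

An empty result means the invariant holds; the flaky failures correspond to this set unexpectedly containing files.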

Probably we could add more logging around the flaky assertion to list the segment files and figure out which file was not "preserved". Here is a sample of the logs that could be added: 614c57c.
@dreamer-89 pointed out that the code that deletes segment files is located in the inner class StoreDirectory of the class Store. A log line is already recorded at trace level for every file deletion: https://github.com/opensearch-project/OpenSearch/blob/2.8.0/server/src/main/java/org/opensearch/index/store/Store.java#L979, and that log will help identify the cause of the file cleanup.

I suggest adding the above-mentioned logs at info level to help with troubleshooting. Specifically, the logs need to show the contents of the snapshotted segments list and of currentFiles, along with the file deletion records on the node hosting the replica shard (node_t1). Because I found the failure to be reproducible almost only when running the whole integration test class rather than the single test alone, a possible action is to run all the tests in SegmentReplicationIT together after enabling the extra logs in the code.
Sample code and command:
b0a49a6

for i in $(seq 0 5000) ; do echo "Iteration: $i" && ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT" >> ~/test-output_7556.txt 2>&1 ; done

Sample of the logs:

[2023-06-15T16:20:42,308][INFO ][o.o.i.r.SegmentReplicationIT] [testScrollCreatedOnReplica] snapshotted segments: 
[_2j.si, _30.si, _2y.si, _2j.fdx, _2t.fdt, _2v.si, _2z.cfe, _2t.fdx, _2t.fdm, _2y.cfe, _2z.cfs, _2y.cfs, _2x.si, _2u.si, _2j.fdt, _31.cfe, _2j.fdm, _31.cfs, _2v.cfe, _2j_Lucene90_0.doc, _30.cfe, _2t_Lucene90_0.doc, _2w.cfs, _30.cfs, _2t.si, _2x.cfs, _2t.kdi, _2t.kdm, _2u.cfe, _2j.fnm, _2j_Lucene90_0.tim, _2j_Lucene90_0.tip, _2j_Lucene90_0.dvm, _2t_Lucene90_0.tim, _2t.fnm, _2t_Lucene90_0.tip, _2t_Lucene90_0.dvm, _2t.kdd, _31.si, _2z.si, _2x.cfe, _2j_Lucene90_0.dvd, _2u.cfs, _2j_Lucene90_0.tmd, _2v.cfs, _2j.kdm, _2w.si, _2t_Lucene90_0.tmd, _2j.kdi, _2j.kdd, _2t_Lucene90_0.dvd, _2w.cfe]

[2023-06-15T16:20:53,745][INFO ][o.o.i.r.SegmentReplicationIT] [testScrollCreatedOnReplica] currentFiles: 
[_2j.fdm, _2j.fdt, _2j.fdx, _2j.fnm, _2j.kdd, _2j.kdi, _2j.kdm, _2j.si, _2j_Lucene90_0.doc, _2j_Lucene90_0.dvd, _2j_Lucene90_0.dvm, _2j_Lucene90_0.tim, _2j_Lucene90_0.tip, _2j_Lucene90_0.tmd, _2t.fdm, _2t.fdt, _2t.fdx, _2t.fnm, _2t.kdd, _2t.kdi, _2t.kdm, _2t.si, _2t_Lucene90_0.doc, _2t_Lucene90_0.dvd, _2t_Lucene90_0.dvm, _2t_Lucene90_0.tim, _2t_Lucene90_0.tip, _2t_Lucene90_0.tmd, _2u.cfe, _2u.cfs, _2u.si, _2v.cfe, _2v.cfs, _2v.si, _2w.cfe, _2w.cfs, _2w.si, _2x.cfe, _2x.cfs, _2x.si, _2y.cfe, _2y.cfs, _2y.si, _2z.cfe, _2z.cfs, _2z.si, _30.cfe, _30.cfs, _30.si, _31.cfe, _31.cfs, _31.si, _51.fdm, _51.fdt, _51.fdx, _51.fnm, _51.kdd, _51.kdi, _51.kdm, _51.si, _51_Lucene90_0.doc, _51_Lucene90_0.dvd, _51_Lucene90_0.dvm, _51_Lucene90_0.tim, _51_Lucene90_0.tip, _51_Lucene90_0.tmd, segments_t, write.lock]
[2023-06-15T16:20:53,453][INFO ][i.s.deletes              ] [node_t0][test-idx-1][0] StoreDirectory.deleteFile: delete file _57.fdx
[2023-06-15T16:20:53,454][INFO ][i.s.deletes              ] [node_t0][test-idx-1][0] StoreDirectory.deleteFile: delete file _57.fdt
[2023-06-15T16:20:53,456][INFO ][i.s.deletes              ] [node_t0][test-idx-1][0] StoreDirectory.deleteFile: delete file _57_Lucene90_0.dvm
...
[2023-06-15T16:20:53,679][INFO ][i.s.deletes              ] [node_t1][test-idx-1][0] On reader closed: delete file _57.cfs
[2023-06-15T16:20:53,680][INFO ][i.s.deletes              ] [node_t1][test-idx-1][0] On reader closed: delete file _57.si
[2023-06-15T16:20:53,692][INFO ][i.s.deletes              ] [node_t1][test-idx-1][0] After commit: delete file _56.fdm
[2023-06-15T16:20:53,693][INFO ][i.s.deletes              ] [node_t1][test-idx-1][0] After commit: delete file _56.fdt

@tlfeng (Collaborator) commented Aug 7, 2023

I added the above logs in the code and ran the test locally.
I ran the whole integration test suite SegmentReplicationIT in a loop of 5,000 iterations; there were 3 failures with the error message java.lang.AssertionError: Files should be preserved. I will post the details for further investigation.
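To count how many looped runs hit the flaky assertion, the captured output can be scanned for the error message. A small hypothetical Python helper (in practice the input would be the ~/test-output_7556.txt file produced by the loop command above; a sample log stands in here so the snippet is self-contained):

```python
# Count iterations that failed with the flaky assertion by scanning the
# combined gradle output. Sample text stands in for the real log file.
sample_log = """\
Iteration: 0
BUILD SUCCESSFUL
Iteration: 1
java.lang.AssertionError: Files should be preserved
Iteration: 2
BUILD SUCCESSFUL
"""

failures = sample_log.count("java.lang.AssertionError: Files should be preserved")
print(failures)  # 1 failing iteration in the sample
```

The same one-liner applied to the full 5,000-iteration log would surface the 3 failures mentioned above.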

@dreamer-89 dreamer-89 assigned mch2 and unassigned tlfeng Aug 8, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Segment Replication Aug 10, 2023
Projects
Status: Done
6 participants