Don't rely on test code execution time span for RemoteSegmentTransferTrackerTests #15187

lukas-vlcek · 2024-08-09T18:41:00Z

Description

The current implementation of the RemoteSegmentTransferTrackerTests.testComputeTimeLagOnUpdate() test relies on assumptions about how fast the test code finishes in the JVM. Moreover, it does not precisely control the boundaries of the time span, specifically the start of the span, because the start is determined by the internal implementation of RemoteSegmentTransferTracker.getTimeMsLag(), which indirectly calls System.nanoTime().

This commit loosens the assumption that the test code execution will finish within +/-20ms. Instead, it only assumes that the execution time span won't be shorter than a predefined (and controlled) thread sleep interval; any larger value is considered a success.

The point of this test is not to verify execution speed with a defined precision. The point is that the getTimeMsLag() method returns either 0 (for specific conditions) or a positive number (assuming that remoteRefreshStartTimeMs is not greater than System.nanoTime()).
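
The relaxed assertion can be illustrated with a minimal sketch. It uses a hypothetical stand-in class (`LagTracker`) rather than the real `RemoteSegmentTransferTracker`; the class, its method names, and the 30 ms sleep interval are assumptions made for illustration only. The check asserts a lower bound on the reported lag and never an upper bound.

```java
import java.util.concurrent.TimeUnit;

public class LowerBoundLagSketch {

    /** Hypothetical stand-in for a tracker whose lag is derived from System.nanoTime(). */
    public static final class LagTracker {
        private boolean refreshInFlight = false;
        private long refreshStartNanos;

        public void markRefreshStarted() {
            refreshStartNanos = System.nanoTime();
            refreshInFlight = true;
        }

        public void markRefreshFinished() {
            refreshInFlight = false;
        }

        /** Returns 0 when no refresh is in flight, otherwise the elapsed time in milliseconds. */
        public long getTimeMsLag() {
            if (!refreshInFlight) {
                return 0;
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - refreshStartNanos);
        }
    }

    /** Starts a refresh, sleeps for sleepMs, and returns the lag the tracker reports. */
    public static long lagAfterSleep(LagTracker tracker, long sleepMs) {
        tracker.markRefreshStarted();
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sleeping", e);
        }
        return tracker.getTimeMsLag();
    }

    public static void main(String[] args) {
        LagTracker tracker = new LagTracker();
        long sleepMs = 30; // assumed interval, chosen only for this sketch

        // No refresh in flight: the lag must be exactly 0.
        if (tracker.getTimeMsLag() != 0) {
            throw new AssertionError("expected 0 lag while idle");
        }

        // Refresh in flight: only a lower bound is asserted, so a slow or
        // paused JVM cannot make the check fail.
        long lag = lagAfterSleep(tracker, sleepMs);
        if (lag < sleepMs) {
            throw new AssertionError("lag " + lag + " ms is below " + sleepMs + " ms");
        }

        tracker.markRefreshFinished();
        if (tracker.getTimeMsLag() != 0) {
            throw new AssertionError("expected 0 lag after the refresh finished");
        }
        System.out.println("lower-bound checks passed");
    }
}
```

The trade-off is that the check no longer detects a lag that is far too large; it only guards against a lag that is zero or smaller than the time known to have elapsed.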

Related Issues

Closes: #14325

Check List

Functionality includes testing.
~~[ ] API changes companion pull request created, if applicable.~~
~~[ ] Public documentation issue/PR created, if applicable.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

lukas-vlcek · 2024-08-09T18:41:24Z

Please add skip-changelog label.

lukas-vlcek · 2024-08-09T18:49:49Z

Idea for possible future improvement:

We could remove all direct calls to System.nanoTime() and similar System time methods from the RemoteSegmentTransferTracker class and delegate them to a "TimeProvider" object. Then we could implement tests that make more specific assumptions about the code execution time span, because the test would be able to control the TimeProvider precisely.

If there is an agreement that this would be beneficial then we can open a new ticket.
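
A rough sketch of what such a "TimeProvider" could look like, under stated assumptions: the provider is modelled here as a plain `java.util.function.LongSupplier` returning nanoseconds, and `LagTracker` and `ManualClock` are hypothetical names that do not exist in OpenSearch. With the clock injected, a test can assert an exact lag without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class TimeProviderSketch {

    /** Hypothetical tracker that reads time only through an injected nanosecond source. */
    public static final class LagTracker {
        private final LongSupplier nanoTimeSource;
        private boolean refreshInFlight = false;
        private long refreshStartNanos;

        public LagTracker(LongSupplier nanoTimeSource) {
            this.nanoTimeSource = nanoTimeSource;
        }

        public void markRefreshStarted() {
            refreshStartNanos = nanoTimeSource.getAsLong();
            refreshInFlight = true;
        }

        /** Returns 0 when no refresh is in flight, otherwise the elapsed time in milliseconds. */
        public long getTimeMsLag() {
            if (!refreshInFlight) {
                return 0;
            }
            return TimeUnit.NANOSECONDS.toMillis(nanoTimeSource.getAsLong() - refreshStartNanos);
        }
    }

    /** Test double: time moves only when the test advances it. */
    public static final class ManualClock implements LongSupplier {
        private long nanos = 0;

        public void advanceMillis(long millis) {
            nanos += TimeUnit.MILLISECONDS.toNanos(millis);
        }

        @Override
        public long getAsLong() {
            return nanos;
        }
    }

    public static void main(String[] args) {
        ManualClock clock = new ManualClock();
        LagTracker tracker = new LagTracker(clock);

        tracker.markRefreshStarted();
        clock.advanceMillis(100);

        // The clock is fully controlled, so an exact value can be asserted.
        if (tracker.getTimeMsLag() != 100) {
            throw new AssertionError("expected exactly 100 ms of lag");
        }

        // Production code would pass the real clock instead of the test double.
        LagTracker production = new LagTracker(System::nanoTime);
        if (production.getTimeMsLag() != 0) {
            throw new AssertionError("expected 0 lag while idle");
        }
        System.out.println("time provider checks passed");
    }
}
```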

github-actions · 2024-08-09T19:34:01Z

✅ Gradle check result for 025c303: SUCCESS

codecov · 2024-08-09T19:36:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.92%. Comparing base (b6c80b1) to head (6e191a3).
Report is 6 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #15187      +/-   ##
============================================
+ Coverage     71.90%   71.92%   +0.01%     
- Complexity    63033    63114      +81     
============================================
  Files          5197     5197              
  Lines        295313   295313              
  Branches      42677    42677              
============================================
+ Hits         212354   212390      +36     
- Misses        65552    65607      +55     
+ Partials      17407    17316      -91

server/src/test/java/org/opensearch/index/remote/RemoteSegmentTransferTrackerTests.java

linuxpi · 2024-08-13T08:50:48Z

Thanks for raising a fix for this flaky test @lukas-vlcek, seems like a hot one.

> Idea for possible future improvement:
>
> We could remove all direct calls to System.nanoTime() and similar System time methods from the RemoteSegmentTransferTracker class and delegate them to a "TimeProvider" object. Then we could implement tests that make more specific assumptions about the code execution time span, because the test would be able to control the TimeProvider precisely.
>
> If there is an agreement that this would be beneficial then we can open a new ticket.

As of now we have always relied on using System.nanoTime() directly, but I think it would be good to have an abstraction like TimeProvider. To reach agreement, we could open a small RFC and gather thoughts from others.

github-actions · 2024-08-13T11:26:38Z

❌ Gradle check result for 510c03a: UNSTABLE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-08-13T11:45:42Z

❌ Gradle check result for 18dbf31: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-08-13T15:25:41Z

❕ Gradle check result for d1cc5c7: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

server/src/test/java/org/opensearch/index/remote/RemoteSegmentTransferTrackerTests.java

linuxpi · 2024-08-13T17:27:25Z

The changes mostly look good. One thing: we should run multiple iterations locally to make sure we have not regressed.

The current implementation of the [`RemoteSegmentTransferTrackerTests.testComputeTimeLagOnUpdate()`](https://github.com/opensearch-project/OpenSearch/blob/2b17902643738f0d2a75ade7c85cbca94d18ce49/server/src/test/java/org/opensearch/index/remote/RemoteSegmentTransferTrackerTests.java#L139) test relies on assumptions about how fast the test code finishes in the JVM. Moreover, it does not precisely control the boundaries of the time span, specifically the start of the span, because the start is determined by the internal implementation of [`RemoteSegmentTransferTracker.getTimeMsLag()`](https://github.com/opensearch-project/OpenSearch/blob/2b17902643738f0d2a75ade7c85cbca94d18ce49/server/src/main/java/org/opensearch/index/remote/RemoteSegmentTransferTracker.java#L262), which indirectly calls `System.nanoTime()`.

This commit loosens the assumption that the test code execution will finish within +/-20ms. Instead, it only assumes that the execution time span won't be shorter than a predefined (and controlled) thread sleep interval; any larger value is considered a success.

The point of this test is not to verify execution speed with a defined precision. The point is that the [`getTimeMsLag()`](https://github.com/opensearch-project/OpenSearch/blob/2b17902643738f0d2a75ade7c85cbca94d18ce49/server/src/main/java/org/opensearch/index/remote/RemoteSegmentTransferTracker.java#L262) method returns either 0 (for specific conditions) or a positive number (assuming that `remoteRefreshStartTimeMs` is not greater than `System.nanoTime()`).

Closes: opensearch-project#14325

Signed-off-by: Lukáš Vlček <[email protected]>

lukas-vlcek · 2024-08-14T06:45:02Z

@linuxpi As for TimeProvider RFC, I will check other uses of System.nanoTime() in the code base (once I'm back from 🌴) and I will let you know what I think.

github-actions · 2024-08-14T07:26:15Z

❕ Gradle check result for 6e191a3: UNSTABLE

TEST FAILURES:

      1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

…TrackerTests (#15187) Signed-off-by: Lukáš Vlček <[email protected]> (cherry picked from commit ef1a79f) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…TrackerTests (#15187) (#15244) (cherry picked from commit ef1a79f) Signed-off-by: Lukáš Vlček <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…TrackerTests (opensearch-project#15187) Signed-off-by: Lukáš Vlček <[email protected]>

lukas-vlcek requested review from anasalkouz, andrross, ashking94, Bukhtawar, CEHENKLE, dblock, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami and VachaShah as code owners August 9, 2024 18:41

github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run Storage:Remote labels Aug 9, 2024

skumawat2025 added the skip-changelog label Aug 12, 2024

linuxpi assigned lukas-vlcek and unassigned linuxpi Aug 13, 2024

linuxpi reviewed Aug 13, 2024

lukas-vlcek force-pushed the 14325 branch 2 times, most recently from 510c03a to 18dbf31 Compare August 13, 2024 11:09

lukas-vlcek force-pushed the 14325 branch from 18dbf31 to d1cc5c7 Compare August 13, 2024 14:33

jed326 approved these changes Aug 13, 2024

linuxpi reviewed Aug 13, 2024

lukas-vlcek force-pushed the 14325 branch from d1cc5c7 to 6e191a3 Compare August 14, 2024 06:38

linuxpi approved these changes Aug 14, 2024

linuxpi changed the title ~~Don't rely on test code execution time span~~ Don't rely on test code execution time span for RemoteSegmentTransferTrackerTests Aug 14, 2024

linuxpi merged commit ef1a79f into opensearch-project:main Aug 14, 2024
36 checks passed

linuxpi added the backport 2.x Backport to 2.x branch label Aug 14, 2024

opensearch-trigger-bot bot mentioned this pull request Aug 14, 2024

[Backport 2.x] Don't rely on test code execution time span for RemoteSegmentTransferTrackerTests #15244

Merged

lukas-vlcek deleted the 14325 branch August 15, 2024 06:49

wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024

Don't rely on test code execution time span for RemoteSegmentTransfer…

1128e77

…TrackerTests (opensearch-project#15187) Signed-off-by: Lukáš Vlček <[email protected]>

This was referenced Sep 6, 2024

[AUTOCUT] Gradle Check Flaky Test Report for MasterServiceTests #15809

Open

[AUTOCUT] Gradle Check Flaky Test Report for ShardIndexingPressureIT #15830

Open

akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024

Don't rely on test code execution time span for RemoteSegmentTransfer…

8086bb8

…TrackerTests (opensearch-project#15187) Signed-off-by: Lukáš Vlček <[email protected]>

opensearch-ci-bot mentioned this pull request Sep 11, 2024

[AUTOCUT] Gradle Check Flaky Test Report for SearchRestCancellationIT #14311

Open
