Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky test and log level/msgs and enable auto-expand replication #202

Merged
merged 5 commits into from
Sep 16, 2021

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented Sep 2, 2021

Description

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:

  1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.

Issues Resolved

Flaky test run: https://github.com/opensearch-project/anomaly-detection/pull/196/checks?check_run_id=3487905317

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaituo kaituo requested review from ylwu-amzn and ohltyler September 2, 2021 22:42
ohltyler
ohltyler previously approved these changes Sep 2, 2021
@codecov-commenter
Copy link

codecov-commenter commented Sep 2, 2021

Codecov Report

Merging #202 (e603205) into main (26e1efe) will increase coverage by 0.42%.
The diff coverage is 63.14%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #202      +/-   ##
============================================
+ Coverage     73.28%   73.70%   +0.42%     
- Complexity     3395     3463      +68     
============================================
  Files           276      276              
  Lines         15327    15509     +182     
  Branches       1560     1585      +25     
============================================
+ Hits          11232    11431     +199     
+ Misses         3372     3356      -16     
+ Partials        723      722       -1     
Flag Coverage Δ
plugin 73.70% <63.14%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../main/java/org/opensearch/ad/cluster/HashRing.java 81.30% <ø> (-0.09%) ⬇️
.../org/opensearch/ad/feature/CompositeRetriever.java 84.11% <0.00%> (ø)
src/main/java/org/opensearch/ad/model/ADTask.java 90.40% <0.00%> (-0.23%) ⬇️
...search/ad/rest/RestDeleteAnomalyResultsAction.java 19.04% <ø> (+0.86%) ⬆️
...est/handler/IndexAnomalyDetectorActionHandler.java 43.85% <0.00%> (ø)
...java/org/opensearch/ad/task/ADBatchTaskRunner.java 64.81% <ø> (-1.31%) ⬇️
...ava/org/opensearch/ad/task/ADHCBatchTaskCache.java 71.91% <ø> (+14.60%) ⬆️
...d/transport/AnomalyDetectorJobTransportAction.java 89.74% <ø> (ø)
...rch/ad/transport/AnomalyResultTransportAction.java 79.57% <0.00%> (-0.16%) ⬇️
.../handler/IndexAnomalyDetectorJobActionHandler.java 51.65% <8.33%> (-3.35%) ⬇️
... and 19 more

@@ -874,7 +874,7 @@ public void stopDetector(

consumer.accept(detector);
} catch (Exception e) {
String message = "Failed to start anomaly detector " + detectorId;
String message = "Failed to get anomaly detector " + detectorId;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about change the error message as "Failed to parse anomaly detector "

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Settings
.builder()
// AD job index is small. 1 primary shard is enough
.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we or is it possible to change the old AD job index setting to this new setting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You meant changing the number of primary shards? if yes, OpenSearch does not support it.

Copy link
Collaborator

@ylwu-amzn ylwu-amzn Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean the auto expand replica, so we can leverage more data nodes to run realtime job. I'm ok to address this in next PR if that needs more time to research, change code and test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

…of AD job index

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:
1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
@kaituo
Copy link
Collaborator Author

kaituo commented Sep 9, 2021

@ohltyler The build failed due to flaky test org.opensearch.ad.rest.HistoricalAnalysisRestApiIT. Yaliang fixed it in another PR. Will leave it as it is.

@kaituo kaituo requested a review from ohltyler September 9, 2021 22:43
ohltyler
ohltyler previously approved these changes Sep 9, 2021
ylwu-amzn added a commit to ylwu-amzn/anomaly-detection-2 that referenced this pull request Sep 10, 2021
@@ -786,8 +897,96 @@ public int getSchemaVersion(ADIndex index) {
* @param index Index metadata
* @return Whether the given index's mapping is up-to-date
*/
public Boolean isUpdated(ADIndex index) {
public Boolean isMappingUpdated(ADIndex index) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this method is not being used now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, removed

@kaituo kaituo merged commit d9a122d into opensearch-project:main Sep 16, 2021
ylwu-amzn added a commit that referenced this pull request Sep 23, 2021
* clean up realtime AD cache on old coordinating node

Signed-off-by: Yaliang Wu <[email protected]>

* clean up cache in entity cold starter and priority cache

* clean up model when node removed

* remove unused method

* address comments

* add more ut

* address comments

* fix failed ut; add more ut

* comment out flaky test

* merge fix of remove wrong link from PR #202

* cleanup historical task cache when stop historical analysis

* remove top entity check when start historical analysis

Test with 1.8 billion docs in data index, it took 23 seconds to return
and it exceeds timeout 10s. After removing the top entity check, it took
260ms to return.

Signed-off-by: Yaliang Wu <[email protected]>

* fix stop historical analysis issue when no entity task running

Signed-off-by: Yaliang Wu <[email protected]>

* OpenSearch core Ref from 1.x to 1.1 in CI

* fix entity filter query; tune task cache; support stop HC before entity task starts

Signed-off-by: Yaliang Wu <[email protected]>

* fix running entity bug

* add more UT

* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest

* add more comments

* address comments

* fix flaky test
ohltyler pushed a commit to ohltyler/anomaly-detection-2 that referenced this pull request Sep 23, 2021
…pensearch-project#202)

* Fix flaky test and log level/messages and enable auto-expand replica of AD job index

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:
1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
ohltyler pushed a commit to ohltyler/anomaly-detection-2 that referenced this pull request Sep 23, 2021
…ct#209)

* clean up realtime AD cache on old coordinating node

Signed-off-by: Yaliang Wu <[email protected]>

* clean up cache in entity cold starter and priority cache

* clean up model when node removed

* remove unused method

* address comments

* add more ut

* address comments

* fix failed ut; add more ut

* comment out flaky test

* merge fix of remove wrong link from PR opensearch-project#202

* cleanup historical task cache when stop historical analysis

* remove top entity check when start historical analysis

Test with 1.8 billion docs in data index, it took 23 seconds to return
and it exceeds timeout 10s. After removing the top entity check, it took
260ms to return.

Signed-off-by: Yaliang Wu <[email protected]>

* fix stop historical analysis issue when no entity task running

Signed-off-by: Yaliang Wu <[email protected]>

* OpenSearch core Ref from 1.x to 1.1 in CI

* fix entity filter query; tune task cache; support stop HC before entity task starts

Signed-off-by: Yaliang Wu <[email protected]>

* fix running entity bug

* add more UT

* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest

* add more comments

* address comments

* fix flaky test
ohltyler pushed a commit to ohltyler/anomaly-detection-2 that referenced this pull request Sep 23, 2021
…pensearch-project#202)

* Fix flaky test and log level/messages and enable auto-expand replica of AD job index

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:
1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
ohltyler pushed a commit to ohltyler/anomaly-detection-2 that referenced this pull request Sep 23, 2021
…ct#209)

* clean up realtime AD cache on old coordinating node

Signed-off-by: Yaliang Wu <[email protected]>

* clean up cache in entity cold starter and priority cache

* clean up model when node removed

* remove unused method

* address comments

* add more ut

* address comments

* fix failed ut; add more ut

* comment out flaky test

* merge fix of remove wrong link from PR opensearch-project#202

* cleanup historical task cache when stop historical analysis

* remove top entity check when start historical analysis

Test with 1.8 billion docs in data index, it took 23 seconds to return
and it exceeds timeout 10s. After removing the top entity check, it took
260ms to return.

Signed-off-by: Yaliang Wu <[email protected]>

* fix stop historical analysis issue when no entity task running

Signed-off-by: Yaliang Wu <[email protected]>

* OpenSearch core Ref from 1.x to 1.1 in CI

* fix entity filter query; tune task cache; support stop HC before entity task starts

Signed-off-by: Yaliang Wu <[email protected]>

* fix running entity bug

* add more UT

* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest

* add more comments

* address comments

* fix flaky test
ohltyler pushed a commit that referenced this pull request Sep 23, 2021
…202)

* Fix flaky test and log level/messages and enable auto-expand replica of AD job index

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:
1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
ohltyler pushed a commit that referenced this pull request Sep 23, 2021
* clean up realtime AD cache on old coordinating node

Signed-off-by: Yaliang Wu <[email protected]>

* clean up cache in entity cold starter and priority cache

* clean up model when node removed

* remove unused method

* address comments

* add more ut

* address comments

* fix failed ut; add more ut

* comment out flaky test

* merge fix of remove wrong link from PR #202

* cleanup historical task cache when stop historical analysis

* remove top entity check when start historical analysis

Test with 1.8 billion docs in data index, it took 23 seconds to return
and it exceeds timeout 10s. After removing the top entity check, it took
260ms to return.

Signed-off-by: Yaliang Wu <[email protected]>

* fix stop historical analysis issue when no entity task running

Signed-off-by: Yaliang Wu <[email protected]>

* OpenSearch core Ref from 1.x to 1.1 in CI

* fix entity filter query; tune task cache; support stop HC before entity task starts

Signed-off-by: Yaliang Wu <[email protected]>

* fix running entity bug

* add more UT

* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest

* add more comments

* address comments

* fix flaky test
ohltyler pushed a commit that referenced this pull request Sep 23, 2021
…202)

* Fix flaky test and log level/messages and enable auto-expand replica of AD job index

This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.

This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.

Testing done:
1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
ohltyler pushed a commit that referenced this pull request Sep 23, 2021
* clean up realtime AD cache on old coordinating node

Signed-off-by: Yaliang Wu <[email protected]>

* clean up cache in entity cold starter and priority cache

* clean up model when node removed

* remove unused method

* address comments

* add more ut

* address comments

* fix failed ut; add more ut

* comment out flaky test

* merge fix of remove wrong link from PR #202

* cleanup historical task cache when stop historical analysis

* remove top entity check when start historical analysis

Test with 1.8 billion docs in data index, it took 23 seconds to return
and it exceeds timeout 10s. After removing the top entity check, it took
260ms to return.

Signed-off-by: Yaliang Wu <[email protected]>

* fix stop historical analysis issue when no entity task running

Signed-off-by: Yaliang Wu <[email protected]>

* OpenSearch core Ref from 1.x to 1.1 in CI

* fix entity filter query; tune task cache; support stop HC before entity task starts

Signed-off-by: Yaliang Wu <[email protected]>

* fix running entity bug

* add more UT

* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest

* add more comments

* address comments

* fix flaky test
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request Sep 25, 2021
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request Sep 25, 2021
kaituo added a commit that referenced this pull request Sep 27, 2021
* [1.1] Bump OpenSearch core to 1.1 in CI (#212)

Signed-off-by: Tyler Ohlsen <[email protected]>

* add thresholded rcf (#215)

Signed-off-by: lai <[email protected]>

* Integrating with ThresholdedRandomCutForest

This PR contains:

PRs (all approved)  related to integrating ThresholdedRandomCutForest:
#221
#222
#223
#224
#226
#227
#228

rebased PR:
#201
#202
#216

* pass the correct shingleSize to ThresholdedRandomCutForest

Previously, I used shingleSize 1 for externally shingled ThresholdedRandomCutForest because of the double multiplication with shingle size in RCF.  Now RCF has fixed the issue. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest.

This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest.

Co-authored-by: Tyler Ohlsen <[email protected]>
Co-authored-by: Lai <[email protected]>
hamersu9t added a commit to hamersu9t/anomaly-detection that referenced this pull request Aug 10, 2024
* [1.1] Bump OpenSearch core to 1.1 in CI (#212)

Signed-off-by: Tyler Ohlsen <[email protected]>

* add thresholded rcf (#215)

Signed-off-by: lai <[email protected]>

* Integrating with ThresholdedRandomCutForest

This PR contains:

PRs (all approved)  related to integrating ThresholdedRandomCutForest:
opensearch-project/anomaly-detection#221
opensearch-project/anomaly-detection#222
opensearch-project/anomaly-detection#223
opensearch-project/anomaly-detection#224
opensearch-project/anomaly-detection#226
opensearch-project/anomaly-detection#227
opensearch-project/anomaly-detection#228

rebased PR:
opensearch-project/anomaly-detection#201
opensearch-project/anomaly-detection#202
opensearch-project/anomaly-detection#216

* pass the correct shingleSize to ThresholdedRandomCutForest

Previously, I used shingleSize 1 for externally shingled ThresholdedRandomCutForest because of the double multiplication with shingle size in RCF.  Now RCF has fixed the issue. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest.

This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest.

Co-authored-by: Tyler Ohlsen <[email protected]>
Co-authored-by: Lai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants