Fix flaky test and log level/msgs and enable auto-expand replication #202

kaituo · 2021-09-02T22:41:57Z

Description

This PR fixes flaky tests in MultiEntityResultTests (hopefully; I cannot reproduce the failure locally). The tests are probably flaky because they expect two pages from pagination, while a race condition can produce more than two. See the comments in MultiEntityResultTests for details.

This PR also lowers the log level of the "updating real-time task" message from info to debug. Info is unnecessary because OpenSearch prints the message again in every interval. I also changed the log message in ADTaskManager to match what the relevant code does.

This PR also enables auto-expand replicas for the AD job index. The job scheduler puts both primary and replica shards in the hash ring. With auto-expand enabled, the number of replicas follows the number of data nodes in the cluster (up to 20), so that each node can become a coordinating node. This is useful when customers scale out their cluster, because we can scale adaptively with it. This PR also changes the number of primary shards of the AD job index to 1, as the AD job index is small.
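To illustrate the replica arithmetic described above, here is a minimal sketch in plain Java. It is not OpenSearch code: the class and method names are hypothetical, the lower bound of 0 is an assumption (the description only states the upper bound of 20), and the resolution rule is a simplified model of a bounded auto-expand range.

```java
import java.util.OptionalInt;

// Hypothetical illustration only; none of these names exist in OpenSearch or the AD plugin.
class AutoExpandReplicaSketch {

    /**
     * Simplified model of a bounded auto-expand range such as "0-20".
     * A replica cannot live on the same node as its primary, so at most
     * (dataNodes - 1) replicas can be allocated.
     */
    static OptionalInt desiredReplicas(int dataNodes, int minReplicas, int maxReplicas) {
        int cap = Math.min(maxReplicas, dataNodes - 1);
        if (cap < minReplicas) {
            // The cluster is too small to satisfy the lower bound.
            return OptionalInt.empty();
        }
        return OptionalInt.of(cap);
    }

    public static void main(String[] args) {
        int primaries = 1;     // from the PR: the AD job index uses 1 primary shard
        int minReplicas = 0;   // assumption: lower bound is not stated in the PR description
        int maxReplicas = 20;  // from the PR: "up to 20"

        for (int dataNodes : new int[] { 1, 3, 21, 30 }) {
            OptionalInt replicas = desiredReplicas(dataNodes, minReplicas, maxReplicas);
            if (replicas.isPresent()) {
                // Each shard copy (primary or replica) sits on a distinct data node.
                int nodesWithCopy = primaries + replicas.getAsInt();
                System.out.println(dataNodes + " data nodes: " + replicas.getAsInt()
                    + " replicas, " + nodesWithCopy + " nodes hold a copy");
            } else {
                System.out.println(dataNodes + " data nodes: range cannot be satisfied");
            }
        }
    }
}
```

Under these assumptions, a 3-node cluster resolves to 2 replicas, so all 3 data nodes hold a copy of the job index and can appear in the hash ring. Beyond 21 data nodes the replica count stays at 20, so some nodes hold no copy.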

Testing done:

Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.

Issues Resolved

Flaky test run: https://github.com/opensearch-project/anomaly-detection/pull/196/checks?check_run_id=3487905317

Check List

Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov-commenter · 2021-09-02T23:08:16Z

Codecov Report

Merging #202 (e603205) into main (26e1efe) will increase coverage by 0.42%.
The diff coverage is 63.14%.

@@             Coverage Diff              @@
##               main     #202      +/-   ##
============================================
+ Coverage     73.28%   73.70%   +0.42%     
- Complexity     3395     3463      +68     
============================================
  Files           276      276              
  Lines         15327    15509     +182     
  Branches       1560     1585      +25     
============================================
+ Hits          11232    11431     +199     
+ Misses         3372     3356      -16     
+ Partials        723      722       -1

Flag	Coverage Δ
plugin	`73.70% <63.14%> (+0.42%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
.../main/java/org/opensearch/ad/cluster/HashRing.java	`81.30% <ø> (-0.09%)`	⬇️
.../org/opensearch/ad/feature/CompositeRetriever.java	`84.11% <0.00%> (ø)`
src/main/java/org/opensearch/ad/model/ADTask.java	`90.40% <0.00%> (-0.23%)`	⬇️
...search/ad/rest/RestDeleteAnomalyResultsAction.java	`19.04% <ø> (+0.86%)`	⬆️
...est/handler/IndexAnomalyDetectorActionHandler.java	`43.85% <0.00%> (ø)`
...java/org/opensearch/ad/task/ADBatchTaskRunner.java	`64.81% <ø> (-1.31%)`	⬇️
...ava/org/opensearch/ad/task/ADHCBatchTaskCache.java	`71.91% <ø> (+14.60%)`	⬆️
...d/transport/AnomalyDetectorJobTransportAction.java	`89.74% <ø> (ø)`
...rch/ad/transport/AnomalyResultTransportAction.java	`79.57% <0.00%> (-0.16%)`	⬇️
.../handler/IndexAnomalyDetectorJobActionHandler.java	`51.65% <8.33%> (-3.35%)`	⬇️
... and 19 more

ylwu-amzn · 2021-09-03T00:08:56Z

src/main/java/org/opensearch/ad/task/ADTaskManager.java

@@ -874,7 +874,7 @@ public void stopDetector(

                consumer.accept(detector);
            } catch (Exception e) {
-                String message = "Failed to start anomaly detector " + detectorId;
+                String message = "Failed to get anomaly detector " + detectorId;


How about changing the error message to "Failed to parse anomaly detector "?

ylwu-amzn · 2021-09-03T00:11:47Z

src/main/java/org/opensearch/ad/indices/AnomalyDetectionIndices.java

+                    Settings
+                        .builder()
+                        // AD job index is small. 1 primary shard is enough
+                        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)


Should we, or is it even possible to, change the existing AD job index's settings to this new setting?

Do you mean changing the number of primary shards? If yes, OpenSearch does not support it.

Oh, I mean the auto-expand replica setting, so we can leverage more data nodes to run realtime jobs. I'm OK with addressing this in the next PR if it needs more time to research, change code, and test.

kaituo · 2021-09-09T22:42:55Z

@ohltyler The build failed due to the flaky test org.opensearch.ad.rest.HistoricalAnalysisRestApiIT. Yaliang fixed it in another PR, so I will leave it as is.

ylwu-amzn · 2021-09-10T23:48:32Z

src/main/java/org/opensearch/ad/indices/AnomalyDetectionIndices.java

@@ -786,8 +897,96 @@ public int getSchemaVersion(ADIndex index) {
     * @param index Index metadata
     * @return Whether the given index's mapping is up-to-date
     */
-    public Boolean isUpdated(ADIndex index) {
+    public Boolean isMappingUpdated(ADIndex index) {


Seems this method is not being used now?

yeah, removed

* clean up realtime AD cache on old coordinating node
  Signed-off-by: Yaliang Wu <[email protected]>
* clean up cache in entity cold starter and priority cache
* clean up model when node removed
* remove unused method
* address comments
* add more ut
* address comments
* fix failed ut; add more ut
* comment out flaky test
* merge fix of remove wrong link from PR #202
* cleanup historical task cache when stop historical analysis
* remove top entity check when start historical analysis
  Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return.
  Signed-off-by: Yaliang Wu <[email protected]>
* fix stop historical analysis issue when no entity task running
  Signed-off-by: Yaliang Wu <[email protected]>
* OpenSearch core Ref from 1.x to 1.1 in CI
* fix entity filter query; tune task cache; support stop HC before entity task starts
  Signed-off-by: Yaliang Wu <[email protected]>
* fix running entity bug
* add more UT
* catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest
* add more comments
* address comments
* fix flaky test

This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: opensearch-project#221 opensearch-project#222 opensearch-project#223 opensearch-project#224 opensearch-project#226 opensearch-project#227 opensearch-project#228 rebased PR: opensearch-project#201 opensearch-project#202 opensearch-project#216

* [1.1] Bump OpenSearch core to 1.1 in CI (#212)
  Signed-off-by: Tyler Ohlsen <[email protected]>
* add thresholded rcf (#215)
  Signed-off-by: lai <[email protected]>
* Integrating with ThresholdedRandomCutForest
  This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: #221 #222 #223 #224 #226 #227 #228 rebased PR: #201 #202 #216
* pass the correct shingleSize to ThresholdedRandomCutForest
  Previously, I used shingleSize 1 for externally shingled ThresholdedRandomCutForest because of the double multiplication with shingle size in RCF. Now RCF has fixed the issue. This commit adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest.
  Co-authored-by: Tyler Ohlsen <[email protected]>
  Co-authored-by: Lai <[email protected]>

kaituo requested review from ylwu-amzn and ohltyler September 2, 2021 22:42

ohltyler previously approved these changes Sep 2, 2021

ylwu-amzn reviewed Sep 3, 2021

kaituo requested a review from ylwu-amzn September 3, 2021 18:55

ohltyler mentioned this pull request Sep 3, 2021

Use AD 1.1 branch for OpenSearch 1.1 opensearch-project/opensearch-build#387

Merged

1 task

peternied mentioned this pull request Sep 3, 2021

[BACKPORT] Pull Request #202 -> Branch 1.1 #206

Closed

kaituo added 2 commits September 6, 2021 13:51

address Yaliang's comments; honour max in CompositeRetriever

778e71e

kaituo dismissed ohltyler’s stale review via 778e71e September 9, 2021 21:25

kaituo force-pushed the flakyTest branch from 67bc1d8 to 778e71e Compare September 9, 2021 21:25

remove wrong link

385c41f

kaituo requested a review from ohltyler September 9, 2021 22:43

ohltyler previously approved these changes Sep 9, 2021

ylwu-amzn reviewed Sep 10, 2021

src/main/java/org/opensearch/ad/indices/AnomalyDetectionIndices.java

ylwu-amzn reviewed Sep 10, 2021

src/main/java/org/opensearch/ad/ratelimit/CheckpointReadWorker.java

ylwu-amzn mentioned this pull request Sep 10, 2021

clean up realtime AD cache on old coordinating node #209

Merged

1 task

ylwu-amzn added a commit to ylwu-amzn/anomaly-detection-2 that referenced this pull request Sep 10, 2021

merge fix of remove wrong link from PR opensearch-project#202

8ac1eb8

ylwu-amzn reviewed Sep 10, 2021

remove unused method

e7d5535

kaituo dismissed ohltyler’s stale review via e7d5535 September 14, 2021 21:43

use opensearch 1.1 instead of 1.x

e603205

kaituo requested review from ylwu-amzn and ohltyler September 14, 2021 23:49

ohltyler approved these changes Sep 15, 2021

ylwu-amzn approved these changes Sep 15, 2021

kaituo merged commit d9a122d into opensearch-project:main Sep 16, 2021

This was referenced Sep 23, 2021

[BACKPORT] [1.1] Backport remaining bug fixes #213

Merged

[BACKPORT] [1.x] Backport remaining bug fixes #207

Merged

kaituo mentioned this pull request Sep 25, 2021

Integrating with ThresholdedRandomCutForest #237

Closed

1 task
