-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky test and log level/msgs and enable auto-expand replication #202
Conversation
Codecov Report
@@ Coverage Diff @@
## main #202 +/- ##
============================================
+ Coverage 73.28% 73.70% +0.42%
- Complexity 3395 3463 +68
============================================
Files 276 276
Lines 15327 15509 +182
Branches 1560 1585 +25
============================================
+ Hits 11232 11431 +199
+ Misses 3372 3356 -16
+ Partials 723 722 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@@ -874,7 +874,7 @@ public void stopDetector( | |||
|
|||
consumer.accept(detector); | |||
} catch (Exception e) { | |||
String message = "Failed to start anomaly detector " + detectorId; | |||
String message = "Failed to get anomaly detector " + detectorId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about change the error message as "Failed to parse anomaly detector "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Settings | ||
.builder() | ||
// AD job index is small. 1 primary shard is enough | ||
.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we or is it possible to change the old AD job index setting to this new setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You meant changing the number of primary shards? if yes, OpenSearch does not support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I mean the auto expand replica, so we can leverage more data nodes to run realtime job. I'm ok to address this in next PR if that needs more time to research, change code and test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
…of AD job index This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail. This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does. This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small. Testing done: 1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
@ohltyler The build failed due to flaky test org.opensearch.ad.rest.HistoricalAnalysisRestApiIT. Yaliang fixed it in another PR. Will leave it as it is. |
@@ -786,8 +897,96 @@ public int getSchemaVersion(ADIndex index) { | |||
* @param index Index metadata | |||
* @return Whether the given index's mapping is up-to-date | |||
*/ | |||
public Boolean isUpdated(ADIndex index) { | |||
public Boolean isMappingUpdated(ADIndex index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this method is not being used now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, removed
* clean up realtime AD cache on old coordinating node Signed-off-by: Yaliang Wu <[email protected]> * clean up cache in entity cold starter and priority cache * clean up model when node removed * remove unused method * address comments * add more ut * address comments * fix failed ut; add more ut * comment out flaky test * merge fix of remove wrong link from PR #202 * cleanup historical task cache when stop historical analysis * remove top entity check when start historical analysis Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return. Signed-off-by: Yaliang Wu <[email protected]> * fix stop historical analysis issue when no entity task running Signed-off-by: Yaliang Wu <[email protected]> * OpenSearch core Ref from 1.x to 1.1 in CI * fix entity filter query; tune task cache; support stop HC before entity task starts Signed-off-by: Yaliang Wu <[email protected]> * fix running entity bug * add more UT * catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest * add more comments * address comments * fix flaky test
…pensearch-project#202) * Fix flaky test and log level/messages and enable auto-expand replica of AD job index This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail. This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does. This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small. Testing done: 1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
…ct#209) * clean up realtime AD cache on old coordinating node Signed-off-by: Yaliang Wu <[email protected]> * clean up cache in entity cold starter and priority cache * clean up model when node removed * remove unused method * address comments * add more ut * address comments * fix failed ut; add more ut * comment out flaky test * merge fix of remove wrong link from PR opensearch-project#202 * cleanup historical task cache when stop historical analysis * remove top entity check when start historical analysis Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return. Signed-off-by: Yaliang Wu <[email protected]> * fix stop historical analysis issue when no entity task running Signed-off-by: Yaliang Wu <[email protected]> * OpenSearch core Ref from 1.x to 1.1 in CI * fix entity filter query; tune task cache; support stop HC before entity task starts Signed-off-by: Yaliang Wu <[email protected]> * fix running entity bug * add more UT * catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest * add more comments * address comments * fix flaky test
…pensearch-project#202) * Fix flaky test and log level/messages and enable auto-expand replica of AD job index This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail. This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does. This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small. Testing done: 1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
…ct#209) * clean up realtime AD cache on old coordinating node Signed-off-by: Yaliang Wu <[email protected]> * clean up cache in entity cold starter and priority cache * clean up model when node removed * remove unused method * address comments * add more ut * address comments * fix failed ut; add more ut * comment out flaky test * merge fix of remove wrong link from PR opensearch-project#202 * cleanup historical task cache when stop historical analysis * remove top entity check when start historical analysis Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return. Signed-off-by: Yaliang Wu <[email protected]> * fix stop historical analysis issue when no entity task running Signed-off-by: Yaliang Wu <[email protected]> * OpenSearch core Ref from 1.x to 1.1 in CI * fix entity filter query; tune task cache; support stop HC before entity task starts Signed-off-by: Yaliang Wu <[email protected]> * fix running entity bug * add more UT * catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest * add more comments * address comments * fix flaky test
…202) * Fix flaky test and log level/messages and enable auto-expand replica of AD job index This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail. This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does. This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small. Testing done: 1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
* clean up realtime AD cache on old coordinating node Signed-off-by: Yaliang Wu <[email protected]> * clean up cache in entity cold starter and priority cache * clean up model when node removed * remove unused method * address comments * add more ut * address comments * fix failed ut; add more ut * comment out flaky test * merge fix of remove wrong link from PR #202 * cleanup historical task cache when stop historical analysis * remove top entity check when start historical analysis Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return. Signed-off-by: Yaliang Wu <[email protected]> * fix stop historical analysis issue when no entity task running Signed-off-by: Yaliang Wu <[email protected]> * OpenSearch core Ref from 1.x to 1.1 in CI * fix entity filter query; tune task cache; support stop HC before entity task starts Signed-off-by: Yaliang Wu <[email protected]> * fix running entity bug * add more UT * catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest * add more comments * address comments * fix flaky test
…202) * Fix flaky test and log level/messages and enable auto-expand replica of AD job index This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail. This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does. This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small. Testing done: 1. Checked that the AD job index setting change is effective and won't negatively impact normal e2e workflow.
* clean up realtime AD cache on old coordinating node Signed-off-by: Yaliang Wu <[email protected]> * clean up cache in entity cold starter and priority cache * clean up model when node removed * remove unused method * address comments * add more ut * address comments * fix failed ut; add more ut * comment out flaky test * merge fix of remove wrong link from PR #202 * cleanup historical task cache when stop historical analysis * remove top entity check when start historical analysis Test with 1.8 billion docs in data index, it took 23 seconds to return and it exceeds timeout 10s. After removing the top entity check, it took 260ms to return. Signed-off-by: Yaliang Wu <[email protected]> * fix stop historical analysis issue when no entity task running Signed-off-by: Yaliang Wu <[email protected]> * OpenSearch core Ref from 1.x to 1.1 in CI * fix entity filter query; tune task cache; support stop HC before entity task starts Signed-off-by: Yaliang Wu <[email protected]> * fix running entity bug * add more UT * catch exception explicitly in AD batch task runner; add more unit test for ForwardADTaskRequest * add more comments * address comments * fix flaky test
This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: opensearch-project#221 opensearch-project#222 opensearch-project#223 opensearch-project#224 opensearch-project#226 opensearch-project#227 opensearch-project#228 rebased PR: opensearch-project#201 opensearch-project#202 opensearch-project#216
This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: opensearch-project#221 opensearch-project#222 opensearch-project#223 opensearch-project#224 opensearch-project#226 opensearch-project#227 opensearch-project#228 rebased PR: opensearch-project#201 opensearch-project#202 opensearch-project#216
* [1.1] Bump OpenSearch core to 1.1 in CI (#212) Signed-off-by: Tyler Ohlsen <[email protected]> * add thresholded rcf (#215) Signed-off-by: lai <[email protected]> * Integrating with ThresholdedRandomCutForest This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: #221 #222 #223 #224 #226 #227 #228 rebased PR: #201 #202 #216 * pass the correct shingleSize to ThresholdedRandomCutForest Previously, I used shingleSize 1 for externally shingled ThresholdedRandomCutForest because of the double multiplication with shingle size in RCF. Now RCF has fixed the issue. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest. Co-authored-by: Tyler Ohlsen <[email protected]> Co-authored-by: Lai <[email protected]>
* [1.1] Bump OpenSearch core to 1.1 in CI (#212) Signed-off-by: Tyler Ohlsen <[email protected]> * add thresholded rcf (#215) Signed-off-by: lai <[email protected]> * Integrating with ThresholdedRandomCutForest This PR contains: PRs (all approved) related to integrating ThresholdedRandomCutForest: opensearch-project/anomaly-detection#221 opensearch-project/anomaly-detection#222 opensearch-project/anomaly-detection#223 opensearch-project/anomaly-detection#224 opensearch-project/anomaly-detection#226 opensearch-project/anomaly-detection#227 opensearch-project/anomaly-detection#228 rebased PR: opensearch-project/anomaly-detection#201 opensearch-project/anomaly-detection#202 opensearch-project/anomaly-detection#216 * pass the correct shingleSize to ThresholdedRandomCutForest Previously, I used shingleSize 1 for externally shingled ThresholdedRandomCutForest because of the double multiplication with shingle size in RCF. Now RCF has fixed the issue. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest. This commits adds new RCF libraries from aws/random-cut-forest-by-aws#278 and passes the correct shingleSize to ThresholdedRandomCutForest. Co-authored-by: Tyler Ohlsen <[email protected]> Co-authored-by: Lai <[email protected]>
Description
This PR (hopefully, as I cannot reproduce the failure locally) fixed flaky tests in MultiEntityResultTests. The tests are flaky, maybe because we expect two pages in our pagination, but we may create more than two pages due to a race condition. Please read comments in MultiEntityResultTests for detail.
This PR also changes the log level of the updating real-time task log from info to debug. We don't need info as Opensearch prints the log repeatedly in each interval. Also, I changed the log message in ADTaskManager to match what the relevant code does.
This PR also enables auto-expand replication for AD job indexes. The job scheduler puts both primary and replica shards in the hash ring. Enabling auto-expand the number of replicas based on the number of data nodes (up to 20) in the cluster so that each node can become a coordinating node. Enabling auto-expanding is useful when customers scale out their cluster so that we can do adaptive scaling accordingly. Also, this PR changed the primary number of shards of the AD job index to 1 as the AD job index is small.
Testing done:
Issues Resolved
Flaky test run: https://github.com/opensearch-project/anomaly-detection/pull/196/checks?check_run_id=3487905317
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.