Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

1.9 backport #220

Conversation

yizheliu-amazon
Copy link
Contributor

Issue #, if available:

Description of changes:
Backport changes from PR #178 (c062263, commit after ODFE 1.9 upgrade) to PR #214 (942a10a, commit before ODFE 1.10 upgrade)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yizheliu-amazon and others added 30 commits June 25, 2020 14:45
* Add result indices retention period

Currently we never delete the result index, even though customers have deleted the detector.   An increasing amount of result indices can use significant disk space, as well as memory pressure due to the creation of rolled over indices. This PR adds retention period to anomaly results.  We delete result indices when they are older than a retention period, which is 90 days by default. We use 90 days because that's the maximum days we allow users to view results on Kibana.  Users can configure the retention period via the setting opendistro.anomaly_detection.ad_result_history_retention_period dynamically.

Also, previously we roll over empty result indices.  This PR fixes that by removing the max age condition of result indices.  So we only roll over the result index when the maximum number of documents in the index is reached.

Testing done:
* manually tested result indices would be deleted after passing retention period.
…h#175)

* Add dyanmic setting for memory threshold

We have the following two memory threshold:
* a detector's model size per node cannot exceed 10% of heap memory
* the total model size per node cannot exceed 10% of heap memory.
If users use the ES cluster mainly for anomaly detection, they should be able to increase the memory for AD.This PR makes the threshold dynamically adjustable via the setting opendistro.anomaly_detection.model_max_size_percent.

Testing done:
* Manually verified I can dynamically adjust the memory threshold setting.  
…search#182)

* Support security plugin enabled testing in CI

This change is to support integration testing against remote security-enabled clustering in our CI. Currently, we only support the integration testing without a security plugin and have to remove the security plugin in CI.

The critical change is to create an https client with basic auth in ODFERestTestCase and let our rest test case inherit from ODFERestTestCase. With default security configuration, the password for the admin user is also “admin.” For ssl certification, we choose to disable it since it’s the default demo cert used here.

Testing done:
* ran the scripts on CI/CD workflow on Mac and tested it works.
* gradle build still passes.
* Create RFC for AD CLI

This is the RFC for the Anomaly Detection CLI feature to be built on AD.
…arch#164)

* Adds initialization progress to profile API

This PR adds init_progress to profile API. init_progress helps users track initialization percentage, needed shingles, and estimated time to go if the future data stream is continuous (no missing data).

Initialization percentage measures how far away we can observe RCF models emitting scores.  The implementation fetches the RCF model's total updates while the AD job is running and materializes the value to the newly added index .opendistro-anomaly-info. Total updates record the number of times this RCF model has been updated. 

Initialization percent is computed as x/128: 
* if total updates > 128, x = 128. Otherwise, x is the total updates 
* 128 is our output after the number in RCF. After observing 128 samples, RCF starts emitting scores.

Needed shingles are computed as 128 -x.  Estimated minutes to go is computed as needed shingles * detector interval

This PR also materializes the error message in the most recent run to speed up profile API's error fetching.

During each AD execution, we also check if a checkpoint is there (the result is saved and maintained as other AD states), if yes, we cold start immediately.

Testing done:
1. adds unit tests
2. run e2e tests to verify init_progress number makes sense.
…a full shingle (opendistro-for-elasticsearch#176)

Adds feature to query data from the index when insufficient data in buffer to form a full shingle. Query results are cached.
ESAD Version 0.1 to interact with AD Plugin through cli.
* Workflow changes for AD release

Update release workflow to be triggered only,
when a new tag is pushed ( starts with v )

* Add badge for testing and code coverage
Added flag: cli for cli project
Added flag: plugin for java
* Include threshold
* Include flag in code covereage
…or-elasticsearch/ad-multinode

Adding multinode integration test support
Change table from right aligned to left aligned.
Remove esad.
Instead of displaying headers with empty list, show message with hints.
For readability, move constants based on alphabetically.
…rch#203)

Allows the model shingle size to be set per detector.
…pendistro-for-elasticsearch#210)

* Change profile API to not return estimated minutes remaining until cold start is finished

Currently, the progress bar on AD Kibana will show the estimated time remaining to initialize a detector. This can be confusing if this message is displayed before cold start is finished, where the actual initialization time may be much shorter if sufficient historical data exists. This PR changes the profile api to not return any estimated time left until the cold start is finished to prevent this.

Testing done:
1. Manually verified the problem is fixed.
2. added unit test for the issue.
* Change to use callbacks in cold start

This PR changes the code path of cold start in the transport layer to use callbacks.

Previously, I created AD’s ExecutorService that has one thread for cold starts in ColdStartRunner. When we need to trigger a cold start, we can submit a task in the ExecutorService, and consult a hash map (keyed by detector Id) that cached the results of recent cold start results. Since I have to invoke the cold start thread in various callbacks, I created a cold start thread pool and put the cold start result in the transport state.

This PR also handles new exceptions like invalid queries introduced by recent changes on ModelManager and FeatureManager.

This PR lowers the severity of a couple of log messages in HashRing and RCFPollingTransportAction to avoid overwhelming readers of log files. These log messages are common.

This PR corrects typos and updates known causes of EndRunException in comments.

Testing done:
1. Simulated cold start failures: Exceptions of cold starts can be seen by the transport layer.  EndRunException can cause AD jobs to be terminated.
2. Happy case of a cold start still works.
Every exported go method should have comments for readability and usage.
Add appropriate commands and sample data is copied from
https://opendistro.github.io/for-elasticsearch-docs/docs/ad/api
Added function and test to check get response
and regenerated mocks as well
Used by GetDetectorController to deserialize
response from gateway to type.
Mapper to convert gateway response to structure.
GetDetector will call gateway to get configuration
GetDetectorByName will first call SearchDetector to get ID from Name pattern.
Later, it will call GetDetector to get details.
Handler for cat command, this will get list of detector
based on id and name pattern
VijayanB and others added 2 commits August 20, 2020 12:04
Created command to accept list of detectors to concatenate and
print detectors. This is similar to cat command in unix.
goimports fixes imports order and also formats like gofmt.
@codecov
Copy link

codecov bot commented Aug 27, 2020

Codecov Report

❗ No coverage uploaded for pull request base (opendistro-1.9@4712ede). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@                Coverage Diff                @@
##             opendistro-1.9     #220   +/-   ##
=================================================
  Coverage                  ?   72.36%           
  Complexity                ?     1288           
=================================================
  Files                     ?      139           
  Lines                     ?     6073           
  Branches                  ?      469           
=================================================
  Hits                      ?     4395           
  Misses                    ?     1465           
  Partials                  ?      213           
Flag Coverage Δ Complexity Δ
#cli 80.30% <0.00%> (?) 0.00% <0.00%> (?%)
#plugin 71.30% <0.00%> (?) 1288.00% <0.00%> (?%)

Flags with carried forward coverage won't be shown. Click here to find out more.

@kaituo
Copy link
Member

kaituo commented Aug 27, 2020

CLI workflow fails.

@VijayanB
Copy link
Member

@yizheliu-amazon can you include this commit(9df7b59) as well to fix workflow failure?

@yizheliu-amazon
Copy link
Contributor Author

@yizheliu-amazon can you include this commit(9df7b59) as well to fix workflow failure?

Thanks for pointing it out. Cherry-picked it.

@yizheliu-amazon
Copy link
Contributor Author

CLI workflow fails.

After cherry-picking new changes, all workflows pass.

Copy link
Contributor

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the change.

@yizheliu-amazon yizheliu-amazon merged commit 0b3807c into opendistro-for-elasticsearch:opendistro-1.9 Aug 28, 2020
@ohltyler ohltyler added the infra label Oct 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants