Add state and error to profile API #84

kaituo · 2020-04-10T23:08:07Z

Issue #, if available:

Description of changes:

We want to make it easy for customers and on-call engineers to identify a detector’s state and its error, if any. This PR adds that information to our new profile API.

We expect three kinds of states:
- Disabled: if the get AD job API reports that the job is disabled;
- Init: if the job is enabled but no anomaly result with an anomaly score larger than 0 exists after the last update time of the detector;
- Running: if neither of the above applies and there are no exceptions.

Error is populated if the error of the latest anomaly result is not empty.

Example:

curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/t6gGRXEBNjeafFFiEvhk/_profile"
{"state":"INIT","error":"No full shingle in current detection window"}

curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/t6gGRXEBNjeafFFiEvhk/_profile/state"
{"state":"INIT"}
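The state rules in the description can be sketched in plain Java. This is only an illustration, not the plugin's code: the class name, the method names, and the two inputs (`jobEnabled`, `positiveScoreHitCount`) are hypothetical, and it assumes that Init means no anomaly result with a positive score has been seen yet, as the review discussion suggests.

```java
import java.util.Optional;

/** Hypothetical sketch of the state rules; names are illustrative, not the plugin's API. */
final class DetectorStateSketch {
    enum DetectorState { DISABLED, INIT, RUNNING }

    /**
     * @param jobEnabled            whether the AD job is reported as enabled
     * @param positiveScoreHitCount number of anomaly results with score > 0 in the checked window
     */
    static DetectorState deriveState(boolean jobEnabled, long positiveScoreHitCount) {
        if (!jobEnabled) {
            return DetectorState.DISABLED;
        }
        // Assumption: no result with a positive score means the model is still initializing.
        return positiveScoreHitCount == 0L ? DetectorState.INIT : DetectorState.RUNNING;
    }

    /** The error mirrors the latest result's error and is absent when that error is empty. */
    static Optional<String> deriveError(String latestResultError) {
        return (latestResultError == null || latestResultError.isEmpty())
            ? Optional.empty()
            : Optional.of(latestResultError);
    }

    public static void main(String[] args) {
        System.out.println(deriveState(true, 0L));
        System.out.println(deriveError("No full shingle in current detection window").orElse("<none>"));
    }
}
```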

Testing done:
- Manual testing across a detector’s life cycle: not created, created but not started, started, during initialization, after initialization, stopped, restarted.
- Added unit tests to cover the above scenarios.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

ylwu-amzn · 2020-04-14T06:04:16Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+    private SearchRequest createInittedEverRequest(String detectorId, long lastUpdateTimeEpochMs) {
+        BoolQueryBuilder filterQuery = new BoolQueryBuilder();
+        filterQuery.filter(QueryBuilders.termQuery(AnomalyResult.DETECTOR_ID_FIELD, detectorId));
+        filterQuery.filter(QueryBuilders.rangeQuery(AnomalyResult.EXECUTION_END_TIME_FIELD).gte(lastUpdateTimeEpochMs));


Use the job's enabled_time here. Consider this case: the detector's last update time has not changed, but the job has been disabled and restarted multiple times. We may then find AD results with a non-zero anomaly score that were generated before the latest job enabled time, even though the latest AD job is still initializing.

Good point. Done.
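The scenario raised in this thread can be reproduced with a small in-memory sketch. The `Result` class and `hasPositiveScoreSince` are hypothetical stand-ins for the anomaly result documents and the range query; only the choice of cutoff timestamp is the point here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of why the cutoff should be the job enabled time. */
final class InitCutoffSketch {
    /** Minimal stand-in for an anomaly result: when the run ended and its score. */
    static final class Result {
        final long executionEndTimeMs;
        final double anomalyScore;

        Result(long executionEndTimeMs, double anomalyScore) {
            this.executionEndTimeMs = executionEndTimeMs;
            this.anomalyScore = anomalyScore;
        }
    }

    /** True if any result at or after the cutoff has a positive anomaly score. */
    static boolean hasPositiveScoreSince(List<Result> results, long cutoffEpochMs) {
        for (Result r : results) {
            if (r.executionEndTimeMs >= cutoffEpochMs && r.anomalyScore > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long lastUpdateTimeMs = 1_000L;
        long jobEnabledTimeMs = 5_000L; // job was stopped and restarted after the last update
        List<Result> results = Arrays.asList(
            new Result(2_000L, 0.7),  // produced by the previous run of the job
            new Result(6_000L, 0.0)); // current run, still initializing
        System.out.println(hasPositiveScoreSince(results, lastUpdateTimeMs)); // true
        System.out.println(hasPositiveScoreSince(results, jobEnabledTimeMs)); // false
    }
}
```

With the last update time as the cutoff, the stale result from the earlier run makes the detector look initialized; with the job enabled time it does not.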

ylwu-amzn · 2020-04-14T06:07:07Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+    private SearchRequest createLatestAnomalyResultRequest(String detectorId, long lastUpdateTimeEpochMs) {
+        BoolQueryBuilder filterQuery = new BoolQueryBuilder();
+        filterQuery.filter(QueryBuilders.termQuery(AnomalyResult.DETECTOR_ID_FIELD, detectorId));
+        filterQuery.filter(QueryBuilders.rangeQuery(AnomalyResult.EXECUTION_END_TIME_FIELD).gte(lastUpdateTimeEpochMs));


Similar to line 260: we should use the AD job "enabled_time" here.

ylwu-amzn · 2020-04-14T06:11:01Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                    }
+                } catch (IOException | XContentParseException e) {
+                    String error = "Fail to parse detector with id: " + detectorId;
+                    logger.error(error);


Should we log the exception stack trace to make operations easier, as in other places?

The catch block changed after addressing other comments. The new code logs the stack trace.

ylwu-amzn · 2020-04-14T06:45:26Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/DelegateActionListener.java

+    private final ActionListener<T> delegate;
+    private final AtomicInteger collectedResponseCount;
+    private final int expectedResponseCount;
+    private final List<T> savedResponses;


Does "saved" mean the responses come from results stored in ES indices, or that we cache these responses?

The latter. Added a comment.

ylwu-amzn · 2020-04-14T06:55:36Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/rest/RestGetAnomalyDetectorAction.java

+        controller
+            .registerHandler(
+                RestRequest.Method.GET,
+                String.format(Locale.ROOT, "%s/{%s}/%s/{%s}", AnomalyDetectorPlugin.AD_BASE_DETECTORS_URI, DETECTOR_ID, PROFILE, TYPE),


How about adding a comment that explains what TYPE means and which values are supported?

ylwu-amzn · 2020-04-14T06:56:38Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/rest/RestGetAnomalyDetectorAction.java

+        return new BytesRestResponse(RestStatus.INTERNAL_SERVER_ERROR, errorMsg);
+    }
+
+    private Set<String> getProfilesToCollect(String typesStr) {


How about we validate the types here and return Set<ProfileName>?

ylwu-amzn · 2020-04-14T07:11:31Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/DelegateActionListener.java

+
+import com.amazon.opendistroforelasticsearch.ad.model.Mergeable;
+
+public class DelegateActionListener<T extends Mergeable> implements ActionListener<T> {


It seems this general delegate listener is designed for more than the profile API. Can you add more comments? I suggest a more specific name, such as MultiResponsesDelegateActionListener.

ylwu-amzn · 2020-04-14T07:16:25Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/DelegateActionListener.java

+        try {
+            this.exceptions.add(e.getMessage());
+        } finally {
+            if (collectedResponseCount.incrementAndGet() == expectedResponseCount) {


If expectedResponseCount == 0, collectedResponseCount.incrementAndGet() will always be greater than expectedResponseCount, so the equality check never succeeds. Please add validation for expectedResponseCount, or change the check to collectedResponseCount.incrementAndGet() >= expectedResponseCount.

Good point. Used the latter.
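The difference between the two checks is easy to see in isolation. The helper names below are hypothetical; the sketch only replays a number of callbacks against a counter and reports whether the completion condition was ever met.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical comparison of the == and >= completion checks. */
final class CompletionCheckSketch {
    /** Returns true if the equality check fires during the given number of callbacks. */
    static boolean completesWithEquals(int expected, int callbacks) {
        AtomicInteger collected = new AtomicInteger();
        boolean finished = false;
        for (int i = 0; i < callbacks; i++) {
            if (collected.incrementAndGet() == expected) {
                finished = true;
            }
        }
        return finished;
    }

    /** Returns true if the greater-or-equal check fires during the given number of callbacks. */
    static boolean completesWithAtLeast(int expected, int callbacks) {
        AtomicInteger collected = new AtomicInteger();
        boolean finished = false;
        for (int i = 0; i < callbacks; i++) {
            if (collected.incrementAndGet() >= expected) {
                finished = true;
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        // With an expected count of 0, the first increment already yields 1.
        System.out.println(completesWithEquals(0, 1));  // false
        System.out.println(completesWithAtLeast(0, 1)); // true
    }
}
```

Note that with >= the condition also holds for every callback beyond the expected count, so a real implementation may want to guard against finishing more than once.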

ylwu-amzn · 2020-04-14T07:20:11Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/DelegateActionListener.java

+                this.delegate.onFailure(new RuntimeException(String.format("Unexpected exceptions")));
+            } else {
+                T response0 = savedResponses.get(0);
+                LOG.info(response0);


Why log response0 here? The same applies to line 84.

It was used for debugging. Removed.

ylwu-amzn · 2020-04-14T07:28:42Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+        DelegateActionListener<DetectorProfile> delegateListener = new DelegateActionListener<DetectorProfile>(
+            listener,
+            profiles.size(),
+            "Fail to fetch profile for " + detectorId


Here, finalErrorMsg is "Fail to fetch profile for " + detectorId.
From line 89 of class DelegateActionListener: this.delegate.onFailure(new RuntimeException(String.format(Locale.ROOT, finalErrorMsg, exceptions)));. Since finalErrorMsg contains no format specifier, String.format(...) will not include exceptions. Is this by design?

String.format(...) would include the exception messages. Could you explain your question?

Fixed as we discussed offline.
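The behavior behind this question is standard Java: String.format ignores arguments that have no matching format specifier. The sketch below shows the effect and one possible way to include the collected messages; the method names are hypothetical, and the fix agreed offline is not recorded here, so the second method is an assumption rather than the actual change.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical demonstration of String.format with and without a format specifier. */
final class FormatArgsSketch {
    /** The prefix has no specifier, so the extra argument is silently dropped. */
    static String withoutPlaceholder(String prefix, List<String> exceptions) {
        return String.format(Locale.ROOT, prefix, exceptions);
    }

    /** One possible fix (an assumption): format both parts through explicit %s specifiers. */
    static String withPlaceholder(String prefix, List<String> exceptions) {
        return String.format(Locale.ROOT, "%s: %s", prefix, exceptions);
    }

    public static void main(String[] args) {
        List<String> exceptions = Arrays.asList("timeout");
        String prefix = "Fail to fetch profile for abc";
        System.out.println(withoutPlaceholder(prefix, exceptions)); // Fail to fetch profile for abc
        System.out.println(withPlaceholder(prefix, exceptions));    // Fail to fetch profile for abc: [timeout]
    }
}
```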

ylwu-amzn · 2020-04-14T07:35:21Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+        );
+
+        if (profiles.isEmpty()) {
+            listener.onFailure(new RuntimeException("Unsupported profile types."));


How about changing this to "Must set at least one profile type", to avoid confusing empty profile types with wrong profile types that we don't support?

RestGetAnomalyDetectorAction.getProfilesToCollect returns the intersection of the valid types and the provided types. If the result is empty, all of the types supplied by the user are unsupported. So the error is not that the customer failed to set at least one profile type; it is that all of the profile types are invalid.
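A minimal sketch of the intersection described in this reply, using only java.util. It is not the code of getProfilesToCollect: the class and method names are hypothetical, and the comma-separated input format is an assumption. The two supported names, "state" and "error", are the ones that appear in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: keep only the requested profile types that are supported. */
final class ProfileTypesSketch {
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("state", "error"));

    /** Assumes typesStr is a comma-separated list such as "state,error". */
    static Set<String> profilesToCollect(String typesStr) {
        Set<String> requested = new HashSet<>(Arrays.asList(typesStr.split(",")));
        requested.retainAll(SUPPORTED); // set intersection
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(profilesToCollect("state,unknown")); // [state]
        System.out.println(profilesToCollect("foo,bar"));       // []
    }
}
```

An empty result therefore means that none of the requested names is supported, which is the case the "Unsupported profile types." message covers.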

ylwu-amzn · 2020-04-14T07:43:24Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+
+                } catch (IOException | XContentParseException | NullPointerException e) {
+                    logger.error(e);
+                    listener.failImmediately(new RuntimeException(FAIL_TO_FIND_DETECTOR_MSG + detectorId, e));


Minor: you can use this method: listener.failImmediately(String errMsg, Exception e)

Good catch. Fixed.

ylwu-amzn · 2020-04-14T07:52:33Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                        profile.setState(DetectorState.DISABLED);
+                        listener.onResponse(profile);
+                    }
+                } catch (IOException | XContentParseException e) {


If an uncaught exception is thrown, listener.onFailure will not be executed and collectedResponseCount will not increase, so finish will never run.
I suggest catching Exception here to cover such cases. The same applies to line 236.

If an uncaught exception is thrown, control flow is redirected to the exception branch and listener.onFailure is called. Please see the implementation of ActionListener.
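The redirection described in this reply can be illustrated with a small stand-in for a wrapping listener. The `Listener`, `CheckedConsumer`, and `wrap` definitions below are hypothetical and written only for this sketch; they mimic the pattern where an exception thrown by the response handler is passed to the failure handler instead of escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in showing how a wrapped listener routes handler exceptions to onFailure. */
final class WrapSketch {
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    interface CheckedConsumer<T> {
        void accept(T value) throws Exception;
    }

    static <T> Listener<T> wrap(CheckedConsumer<T> responseHandler, Consumer<Exception> failureHandler) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    responseHandler.accept(response);
                } catch (Exception e) {
                    // Anything thrown while handling the response ends up in the failure branch.
                    onFailure(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                failureHandler.accept(e);
            }
        };
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        Listener<String> listener = wrap(
            response -> { throw new IllegalStateException("cannot parse " + response); },
            e -> failures.add(e.getMessage()));
        listener.onResponse("doc");
        System.out.println(failures); // [cannot parse doc]
    }
}
```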

ylwu-amzn · 2020-04-14T08:01:11Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                listener.onResponse(profile);
+            }
+        }, exception -> {
+            logger.warn(exception);


Can we add a custom error message here?

This line was removed after addressing other comments.

ylwu-amzn · 2020-04-14T08:09:27Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                profile.setState(DetectorState.INIT);
+                listener.onResponse(profile);
+            } else {
+                logger.error("Fail to find latest anomaly result of id: {}", detectorId);


Minor: make the error message more accurate, e.g. "Fail to find latest anomaly result with anomalyScore>0 from XXX for detector XXX".

ylwu-amzn · 2020-04-14T08:15:44Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+            SearchHits hits = searchResponse.getHits();
+            if (hits.getTotalHits().value == 0L) {
+                logger.error("We should not get empty result: {}", detectorId);
+                listener.onFailure(new RuntimeException("Unexpected error while looking for detector state:  " + detectorId));


Why throw an exception if we can't find an AD result? If there is no AD result, the AD job is initializing and there is no error. But per DelegateActionListener line 89, if any exception occurs, this.delegate.onFailure(...) is executed instead of returning the Init state and a null error.

Good catch. Fixed.

yizheliu-amazon · 2020-04-15T17:30:22Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+            "Fail to fetch profile for " + detectorId
+        );
+
+        if (profiles.isEmpty()) {


Not a blocker: you can move this isEmpty() check to the entry of this method (line 67), and then skip the check on line 78.

Good catch. Fixed.

yizheliu-amazon · 2020-04-15T17:43:47Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/model/ProfileName.java

+            case "error":
+                return ERROR;
+            default:
+                throw new IllegalArgumentException("Unsupported prof");


Unsupported profile

thanks for the catch. Fixed.

yizheliu-amazon · 2020-04-15T17:50:20Z

...java/com/amazon/opendistroforelasticsearch/ad/util/MultiResponsesDelegateActionListener.java

+
+    @Override
+    public void onFailure(Exception e) {
+        LOG.info(e);


LOG.error()

yizheliu-amazon · 2020-04-15T17:54:46Z

...java/com/amazon/opendistroforelasticsearch/ad/util/MultiResponsesDelegateActionListener.java

+        try {
+            this.exceptions.add(e.getMessage());
+        } finally {
+            if (collectedResponseCount.incrementAndGet() >= expectedResponseCount) {
+                finish();
+            }
+        }


This looks like a duplicate of line 60 above. Can we remove the finally here? Also, the only scenario where an exception could be thrown is when Exception e is null, and I don't think that is possible.

The purpose of this class is to collect the results of async requests, whether each one fails or succeeds, and increment the count for each. Once the count is equal to or larger than the expected count, it sends a final success or failure response. We need the finally here to increment the count when there is a failure. This is not about e being null; onFailure being called means an async request failed.

After another look, I guess an exception can be thrown if the thread is interrupted.

Yes, any exception can be thrown by an asynchronous request.
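The collecting behavior described in this thread can be sketched without any Elasticsearch types. Everything below is hypothetical and simplified: the class name, the two callback parameters, and the choice to report only the error messages are assumptions, and the merge step of the real class is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch: count every async outcome and report once all have arrived. */
final class CollectingListenerSketch<T> {
    private final int expectedResponseCount;
    private final AtomicInteger collected = new AtomicInteger();
    private final List<T> responses = Collections.synchronizedList(new ArrayList<T>());
    private final List<String> errors = Collections.synchronizedList(new ArrayList<String>());
    private final Consumer<List<T>> onAllSucceeded;
    private final Consumer<List<String>> onAnyFailed;

    CollectingListenerSketch(int expectedResponseCount,
                             Consumer<List<T>> onAllSucceeded,
                             Consumer<List<String>> onAnyFailed) {
        this.expectedResponseCount = expectedResponseCount;
        this.onAllSucceeded = onAllSucceeded;
        this.onAnyFailed = onAnyFailed;
    }

    void onResponse(T response) {
        try {
            responses.add(response);
        } finally {
            countAndMaybeFinish();
        }
    }

    void onFailure(Exception e) {
        try {
            errors.add(String.valueOf(e.getMessage()));
        } finally {
            // A failure still counts as a collected outcome.
            countAndMaybeFinish();
        }
    }

    private void countAndMaybeFinish() {
        if (collected.incrementAndGet() >= expectedResponseCount) {
            if (errors.isEmpty()) {
                onAllSucceeded.accept(new ArrayList<T>(responses));
            } else {
                onAnyFailed.accept(new ArrayList<String>(errors));
            }
        }
    }

    public static void main(String[] args) {
        CollectingListenerSketch<String> listener = new CollectingListenerSketch<String>(
            2,
            ok -> System.out.println("success: " + ok),
            failed -> System.out.println("failure: " + failed));
        listener.onResponse("state=INIT");
        listener.onFailure(new RuntimeException("timeout")); // prints "failure: [timeout]"
    }
}
```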

yizheliu-amazon · 2020-04-15T18:23:35Z

...java/com/amazon/opendistroforelasticsearch/ad/util/MultiResponsesDelegateActionListener.java

+        } catch (Exception e) {
+            onFailure(e);
+        } finally {
+            if (collectedResponseCount.incrementAndGet() >= expectedResponseCount) {


The name "expected" implies that the total collected count must be more than expectedResponseCount, otherwise it is a failure. Based on my understanding of how this class is used, maxResponseCount might be a better name.

yizheliu-amazon · 2020-04-15T18:32:04Z

...java/com/amazon/opendistroforelasticsearch/ad/util/MultiResponsesDelegateActionListener.java

+    private void finish() {
+        if (this.exceptions.size() == 0) {
+            if (savedResponses.size() == 0) {
+                this.delegate.onFailure(new RuntimeException(String.format("Unexpected exceptions")));


String.format may not be needed when there is only a static string. Also, when both exceptions and savedResponses are empty, it may be better to throw an exception with a message like "No response collected", which makes more sense to me.

Good catch. Removed String.format and changed the message to "No response collected".

yizheliu-amazon · 2020-04-15T18:48:32Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/AnomalyDetectorProfileRunner.java

+                client.get(getDetectorRequest, onGetDetectorResponse(listener, detectorId, profiles));
+            }
+        }, exception -> {
+            if (exception instanceof IndexNotFoundException) {


Can you log the exception here as well?

yizheliu-amazon

A few minor comments. Please feel free to check in after addressing them.

Author: Kaituo Li <[email protected]>
Date: Wed Apr 15 15:45:13 2020 -0700

Add state and error to profile API (opendistro-for-elasticsearch#84)

* Add state and error to profile API

We want to make it easy for customers and on-call engineers to identify a detector’s state and its error, if any. This PR adds that information to our new profile API.

We expect three kinds of states:
- Disabled: if the get AD job API reports that the job is disabled;
- Init: if the job is enabled but no anomaly result with an anomaly score larger than 0 exists after the last update time of the detector;
- Running: if neither of the above applies and there are no exceptions.

Error is populated if the error of the latest anomaly result is not empty.

Testing done:
- Manual testing across a detector’s life cycle: not created, created but not started, started, during initialization, after initialization, stopped, restarted.
- Added unit tests to cover the above scenarios.

commit 0c33050
Author: Kaituo Li <[email protected]>
Date: Tue Apr 14 11:52:20 2020 -0700

Use callbacks and bug fix (opendistro-for-elasticsearch#83)

* Use callbacks and bug fix

This PR includes the following changes:
1. Remove classes that are not needed in jacocoExclusions since we have enough coverage for those classes.
2. Use ClientUtil instead of Elasticsearch’s client in AD job runner.
3. Use one function to get the number of partitioned forests. Previously, we had redundant code in both ModelManager and ADStateManager.
4. Change ADStateManager.getAnomalyDetector to use callback.
5. Change AnomalyResultTransportAction to use callback to get features.
6. Add handling in AnomalyResultTransportAction for the case where all features have been disabled and the users' index does not exist.
7. Change get RCF and threshold result methods to use callback and add exception handling of IndexNotFoundException due to the change. Previously, the get RCF and threshold result methods would not throw IndexNotFoundException.
8. Remove unused fields in StopDetectorTransportAction and AnomalyResultTransportAction.
9. Unwrap EsRejectedExecutionException as it can be nested inside RemoteTransportException. Previously, we would not recognize EsRejectedExecutionException and thus miss retrying anomaly result writes.
10. Add error in anomaly result schema.
11. Fix broken tests due to my changes.

Testing done:
1. Unit/integration tests pass.
2. Did end-to-end testing to make sure the fix achieves its purpose:
* The timeout issue is gone.
* When all features have been disabled or the index does not exist, we retry a few more times and then disable AD jobs.

kaituo requested review from ylwu-amzn and yizheliu-amazon April 10, 2020 23:08

ylwu-amzn reviewed Apr 14, 2020

amirmuminovic and others added 5 commits April 14, 2020 18:37

Added URL for jb_scheduler-plugin_zip instead of local file path (ope…

6a1a304

…ndistro-for-elasticsearch#82) * Added URL for jb_scheduler-plugin_zip instead of local file path * Fixed windows path by adding additional /

Addresss various comments from Yaliang

b445fb8

Merge branch 'development' into profile3

a476ed4

ylwu-amzn approved these changes Apr 15, 2020

yizheliu-amazon reviewed Apr 15, 2020

yizheliu-amazon approved these changes Apr 15, 2020

Address comments from Yizhe

3d6e9bb

kaituo merged commit e5b6ce5 into opendistro-for-elasticsearch:development Apr 15, 2020

