This repository has been archived by the owner on Aug 2, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 36
Fix that AD job cannot be terminated due to missing training data #102
Merged
kaituo
merged 1 commit into
opendistro-for-elasticsearch:opendistro-1.4
from
kaituo:profileTest
May 4, 2020
Merged
Fix that AD job cannot be terminated due to missing training data #102
kaituo
merged 1 commit into
opendistro-for-elasticsearch:opendistro-1.4
from
kaituo:profileTest
May 4, 2020
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ohltyler
approved these changes
May 4, 2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -80,8 +83,10 @@ public void shutDown() { | |||
checkResult(); | |||
|
|||
if (currentExceptions.containsKey(adID)) { | |||
LOG.error("Found matching exception for {}", adID); | |||
LOG.info("Found a matching exception for {}", adID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you log the matching exception as well?
nit: my personal preference is to log as ERROR, since it is in the scope of 1 particular exception log, same as below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logged
changes this to error. For the else case, it is not an error since we don't expect an exception to happen during cold start.
yizheliu-amazon
approved these changes
May 4, 2020
We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug. Testing done: 1. Manually reproduced all possible EndRunExceptions, check AD job is terminated, and check profile API status is correct. 2. Added unit tests to expose the bug.
kaituo
added a commit
to kaituo/anomaly-detection
that referenced
this pull request
May 19, 2020
…endistro-for-elasticsearch#102) We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug. Testing done: 1. Manually reproduced all possible EndRunExceptions, check AD job is terminated, and check profile API status is correct. 2. Added unit tests to expose the bug.
kaituo
added a commit
to kaituo/anomaly-detection
that referenced
this pull request
May 19, 2020
…endistro-for-elasticsearch#102) We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug. Testing done: 1. Manually reproduced all possible EndRunExceptions, check AD job is terminated, and check profile API status is correct. 2. Added unit tests to expose the bug.
kaituo
added a commit
that referenced
this pull request
May 19, 2020
…training data" (#126) * Fix that AD job cannot be terminated due to missing training data (#102) We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug. This PR also fixes test to use clusterService that integ test uses Testing done: 1. Manually reproduced all possible EndRunExceptions, check AD job is terminated, and check profile API status is correct. 2. Added unit tests to expose the bug.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Issue #, if available:
Description of changes:
We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug.
Testing done:
Profile output for init failures:
[2020-05-02 14:17:31 PDT] [2020-05-02 21:17:31 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/fu8z13EBn03yvIf6cLBm/_profile"
{"state":"DISABLED","error":"Stopped detector as job failed consecutively for more than 3 times: Cannot get training data"}
[2020-05-02 15:13:30 PDT] [2020-05-02 22:13:30 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/cgdm13EBVQAb62q7oAbR/_profile"
{"state":"DISABLED","error":"Stopped detector as job failed consecutively for more than 3 times: Error while cold start"}
[2020-05-02 15:32:04 PDT] [2020-05-02 22:32:04 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/4j1313EBhPlEUyl3nsX-/_profile"
{"state":"DISABLED","error":"Stopped detector: AD models memory usage exceeds our limit."}
[2020-05-02 15:45:50 PDT] [2020-05-02 22:45:50 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/DoOI13EBxc51Fte-eAry/_profile"
{"state":"DISABLED","error":"Stopped detector: limit_exceeded_exception: Exceeded memory limit. New size is 4456448 and max limit is 51897958.400000"
[2020-05-02 15:52:19 PDT] [2020-05-02 22:52:19 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/5z-V13EBskgfiUUv6dRA/_profile"
{"state":"DISABLED","error":"Stopped detector: Having trouble querying data: no such index [server-metrics]"}
[2020-05-02 16:01:09 PDT] [2020-05-02 23:01:09 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/xnuc13EB6HcqWQxXA0ot/_profile"
{"state":"DISABLED","error":"Stopped detector: Having trouble querying data because all of your features have been disabled."}
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.