Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Fix that AD job cannot be terminated due to missing training data #102

Merged

Conversation

kaituo
Copy link
Member

@kaituo kaituo commented May 4, 2020

Issue #, if available:

Description of changes:
We expect an EndRunException to be thrown due to missing training data. But the exception is not appropriately propagated back to AD job and results in InternalFailure instead. This PR fixes the bug.

Testing done:

  1. Manually reproduced all possible EndRunExceptions, check AD job is terminated, and check profile API status is correct.
  2. Added unit tests to expose the bug.

Profile output for init failures:
[2020-05-02 14:17:31 PDT] [2020-05-02 21:17:31 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/fu8z13EBn03yvIf6cLBm/_profile"
{"state":"DISABLED","error":"Stopped detector as job failed consecutively for more than 3 times: Cannot get training data"}

[2020-05-02 15:13:30 PDT] [2020-05-02 22:13:30 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/cgdm13EBVQAb62q7oAbR/_profile"
{"state":"DISABLED","error":"Stopped detector as job failed consecutively for more than 3 times: Error while cold start"}

[2020-05-02 15:32:04 PDT] [2020-05-02 22:32:04 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/4j1313EBhPlEUyl3nsX-/_profile"
{"state":"DISABLED","error":"Stopped detector: AD models memory usage exceeds our limit."}

[2020-05-02 15:45:50 PDT] [2020-05-02 22:45:50 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/DoOI13EBxc51Fte-eAry/_profile"
{"state":"DISABLED","error":"Stopped detector: limit_exceeded_exception: Exceeded memory limit. New size is 4456448 and max limit is 51897958.400000"

[2020-05-02 15:52:19 PDT] [2020-05-02 22:52:19 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/5z-V13EBskgfiUUv6dRA/_profile"
{"state":"DISABLED","error":"Stopped detector: Having trouble querying data: no such index [server-metrics]"}

[2020-05-02 16:01:09 PDT] [2020-05-02 23:01:09 UTC]
[email protected]: ~/code/github/anomaly-detection
% curl -X GET "localhost:9200/_opendistro/_anomaly_detection/detectors/xnuc13EB6HcqWQxXA0ot/_profile"
{"state":"DISABLED","error":"Stopped detector: Having trouble querying data because all of your features have been disabled."}

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@kaituo kaituo requested review from yizheliu-amazon and ohltyler May 4, 2020 15:53
Copy link
Contributor

@ohltyler ohltyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -80,8 +83,10 @@ public void shutDown() {
checkResult();

if (currentExceptions.containsKey(adID)) {
LOG.error("Found matching exception for {}", adID);
LOG.info("Found a matching exception for {}", adID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you log the matching exception as well?

nit: my personal preference is to log as ERROR, since it is in the scope of 1 particular exception log, same as below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logged

changes this to error. For the else case, it is not an error since we don't expect an exception to happen during cold start.

We expect an EndRunException to be thrown due to missing training data.  But the exception is not appropriately propagated back to AD job and results in InternalFailure instead.  This PR fixes the bug.

Testing done:
1. Manually reproduced all  possible EndRunExceptions, check AD job is terminated, and check profile API status is correct.
2. Added unit tests to expose the bug.
@kaituo kaituo merged commit b351d40 into opendistro-for-elasticsearch:opendistro-1.4 May 4, 2020
kaituo added a commit to kaituo/anomaly-detection that referenced this pull request May 19, 2020
…endistro-for-elasticsearch#102)

We expect an EndRunException to be thrown due to missing training data.  But the exception is not appropriately propagated back to AD job and results in InternalFailure instead.  This PR fixes the bug.

Testing done:
1. Manually reproduced all  possible EndRunExceptions, check AD job is terminated, and check profile API status is correct.
2. Added unit tests to expose the bug.
kaituo added a commit to kaituo/anomaly-detection that referenced this pull request May 19, 2020
…endistro-for-elasticsearch#102)

We expect an EndRunException to be thrown due to missing training data.  But the exception is not appropriately propagated back to AD job and results in InternalFailure instead.  This PR fixes the bug.

Testing done:
1. Manually reproduced all  possible EndRunExceptions, check AD job is terminated, and check profile API status is correct.
2. Added unit tests to expose the bug.
kaituo added a commit that referenced this pull request May 19, 2020
…training data" (#126)

* Fix that AD job cannot be terminated due to missing training data (#102)

We expect an EndRunException to be thrown due to missing training data.  But the exception is not appropriately propagated back to AD job and results in InternalFailure instead.  This PR fixes the bug.

This PR also fixes test to use clusterService that integ test uses

Testing done:
1. Manually reproduced all  possible EndRunExceptions, check AD job is terminated, and check profile API status is correct.
2. Added unit tests to expose the bug.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants