Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

not return estimated minutes remaining until cold start is finished #210

Merged
merged 3 commits into from
Aug 18, 2020

Conversation

kaituo
Copy link
Member

@kaituo kaituo commented Aug 12, 2020

Issue #, if available:
#198

Description of changes:
Currently, the progress bar on AD Kibana will show the estimated time remaining to initialize a detector. This can be confusing if this message is displayed before cold start is finished, where the actual initialization time may be much shorter if sufficient historical data exists. This PR changes the profile api to not return any estimated time left until the cold start is finished to prevent this.

Testing done:

  1. Manually verified the problem is fixed.
  2. added unit test for the issue.

Before the PR:
[email protected]: ~
% curl -XGET "http://localhost:9200/_opendistro/_anomaly_detection/detectors/F5nQ4HMBALosA6jhE7Ni/_profile?_all=true&pretty"
{
"state" : "INIT",
"shingle_size" : 0,
"coordinating_node" : "",
"total_size_in_bytes" : 0,
"init_progress" : {
"percentage" : "0%",
"estimated_minutes_left" : 128,
"needed_shingles" : 128
}
}

After the PR:
[email protected]: ~
% curl -XGET "http://localhost:9200/_opendistro/_anomaly_detection/detectors/FO7a4HMBTu_4lKyQCdgs/_profile?_all=true&pretty"
{
"state" : "INIT",
"shingle_size" : 0,
"coordinating_node" : "",
"total_size_in_bytes" : 0,
"init_progress" : {
"percentage" : "0%",
"needed_shingles" : 128
}
}

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…ld start is finished

Currently, the progress bar on AD Kibana will show the estimated time remaining to initialize a detector. This can be confusing if this message is displayed before cold start is finished, where the actual initialization time may be much shorter if sufficient historical data exists. This PR changes the profile api to not return any estimated time left until the cold start is finished to prevent this.

Testing done:
1. Manually verified the problem is fixed.
2. added unit test for the issue.
@codecov
Copy link

codecov bot commented Aug 12, 2020

Codecov Report

Merging #210 into master will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #210      +/-   ##
============================================
- Coverage     72.18%   72.16%   -0.03%     
- Complexity     1275     1282       +7     
============================================
  Files           140      140              
  Lines          5914     5938      +24     
  Branches        463      468       +5     
============================================
+ Hits           4269     4285      +16     
- Misses         1441     1444       +3     
- Partials        204      209       +5     
Flag Coverage Δ Complexity Δ
#cli 79.74% <ø> (ø) 0.00 <ø> (ø)
#plugin 71.27% <100.00%> (-0.03%) 1282.00 <1.00> (+7.00) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...elasticsearch/ad/AnomalyDetectorProfileRunner.java 69.94% <100.00%> (+1.26%) 38.00 <1.00> (+3.00)
...istroforelasticsearch/ad/util/ColdStartRunner.java 80.00% <0.00%> (-16.67%) 7.00% <0.00%> (-2.00%)
...asticsearch/ad/cluster/ADClusterEventListener.java 88.00% <0.00%> (-4.00%) 13.00% <0.00%> (-1.00%)
...search/ad/rest/RestIndexAnomalyDetectorAction.java 54.05% <0.00%> (-1.51%) 3.00% <0.00%> (ø%)
...sticsearch/ad/indices/AnomalyDetectionIndices.java 61.59% <0.00%> (-0.73%) 23.00% <0.00%> (-1.00%)
...stroforelasticsearch/ad/AnomalyDetectorPlugin.java 93.61% <0.00%> (ø) 10.00% <0.00%> (ø%)
...ticsearch/ad/settings/AnomalyDetectorSettings.java 100.00% <0.00%> (ø) 1.00% <0.00%> (ø%)
...est/handler/IndexAnomalyDetectorActionHandler.java 0.00% <0.00%> (ø) 0.00% <0.00%> (ø%)
...opendistroforelasticsearch/ad/ml/ModelManager.java 92.05% <0.00%> (+0.01%) 112.00% <0.00%> (ø%)
...stroforelasticsearch/ad/model/AnomalyDetector.java 88.88% <0.00%> (+0.75%) 46.00% <0.00%> (+5.00%)
... and 1 more

@kaituo kaituo added the enhancement New feature or request label Aug 12, 2020
Copy link
Contributor

@wnbts wnbts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted comment

@@ -391,19 +399,26 @@ private void processInitResponse(
String detectorId,
Set<ProfileName> profilesToCollect,
MultiResponsesDelegateActionListener<DetectorProfile> listener,
long totalUpdates
long totalUpdates,
boolean hideMinutesLeft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be hide estimate, which affects minutes left and data points needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you meant the variable name should be hideEstimate? It only affect minutes left, not data points needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the data needed and time needed together. Before model training is done, they should be hidden together by a message such as "Model training is starting soon". Once that is done, the detailed information on needed data and time is now shown, such as "need x data and y time".

It'd be confusing to see "need x data and no time". What does the ux look like in this case?

Copy link
Contributor

@ohltyler ohltyler Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as ux goes, we will probably add an intermediate callout saying something like "attempting to initialize with existing/historical data", and show that callout until we can retrieve minutesLeft from the backend profile call. Then we show the regular progress bar callout, which includes the "need x data and y time" info. See Kibana issue here.

I guess it's kind of arbitrary on if we hide minutes left and/or data points left, but need some way to indicate on the frontend that the cold start process isn't finished.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial temporal call out is informative enough. Before a model is trained, both data and time that the (non-existent) model will need to wait for should be equally unknown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a separate callout, both of these would be ignored as far as frontend is concerned. Wouldn't we want a different callout in this case? Seems that we wouldn't want to show 0% progress either, when that may immediately go to 100% and disappear if there is sufficient historical data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call out for the case before model training may just be "attempting to initialize with existing/historical data"/"Model training is starting soon", no data/time/progress needed to cause confusion or panic (yet).

Copy link
Member Author

@kaituo kaituo Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, UX does not show needed shingles. They only look at progress and estimated time. Second, as an API user, I don't feel it causes confusion when we show needed shingles. For example, one minute ago, we need 128 shingles. After 1 minute, we need 0. It means we find enough training data. At that particular moment, the needed shingles are accurate. If cold start is stuck or retrying, the number also gives clarification on the state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not apply a more consistent logic to make code more maintainable and understandable? that is before a model is created, the time and data it needs to wait is either both unknown or both guessed. Or, what's the benefit of having mixed results from different logic for each field?

this is a question not a blocker

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data it needs to wait is not guessed. The time is. Without a checkpoint, it is accurate to say we need 128 shingles.

The benefit is provide transparency instead of guessing.

Thanks for not blocking.

Comment on lines +79 to +80
if (requiredSamples <= 0) {
throw new IllegalArgumentException("required samples should be a positive number, but was " + requiredSamples);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing. the testing of this condition is missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Contributor

@yizheliu-amazon yizheliu-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@kaituo kaituo merged commit fed3b78 into opendistro-for-elasticsearch:master Aug 18, 2020
yizheliu-amazon pushed a commit that referenced this pull request Aug 28, 2020
…210)

* Change profile API to not return estimated minutes remaining until cold start is finished

Currently, the progress bar on AD Kibana will show the estimated time remaining to initialize a detector. This can be confusing if this message is displayed before cold start is finished, where the actual initialization time may be much shorter if sufficient historical data exists. This PR changes the profile api to not return any estimated time left until the cold start is finished to prevent this.

Testing done:
1. Manually verified the problem is fixed.
2. added unit test for the issue.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants