This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add entity model training #247

Closed

Conversation

@kaituo (Member) commented Oct 13, 2020

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use this PR for review only and will not merge it. I will open one big PR at the end and merge it once all review PRs are approved.

Issue #, if available:

Description of changes:
This PR adds entity model training functions. When the models are missing from the cache and checkpoint, we need to run queries to get training data and then train models. To collect training data, we sample historical data and linearly interpolate data points between the samples.

Specifically, we first note the maximum and minimum timestamps, and sample at most 24 points between them (with two neighboring samples 60 points apart). Samples can be missing. We only interpolate points between present neighboring samples. We then transform the samples and interpolated points into shingles. Finally, the full shingles are used for cold start.
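
For illustration, here is a minimal sketch of the linear interpolation step between two present neighboring samples (illustrative only, not the PR's exact code; `left`, `right`, and `gap` are assumed names):

```java
// Linearly interpolate the points between two present neighboring samples.
// Returns gap + 1 points: both endpoints plus gap - 1 interpolated points.
double[][] interpolate(double[] left, double[] right, int gap) {
    int dim = left.length;
    double[][] points = new double[gap + 1][dim];
    for (int i = 0; i <= gap; i++) {
        double weight = (double) i / gap;
        for (int d = 0; d < dim; d++) {
            points[i][d] = (1 - weight) * left[d] + weight * right[d];
        }
    }
    return points;
}
```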

Testing done:

  1. add unit tests
  2. end-to-end testing

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov bot commented Oct 13, 2020

Codecov Report

Merging #247 into master will increase coverage by 1.10%.
The diff coverage is 68.04%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     71.70%   72.81%   +1.10%     
- Complexity     1367     1464      +97     
============================================
  Files           157      164       +7     
  Lines          6513     6867     +354     
  Branches        493      533      +40     
============================================
+ Hits           4670     5000     +330     
- Misses         1610     1615       +5     
- Partials        233      252      +19     
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| #cli | 79.27% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...distroforelasticsearch/ad/constant/CommonName.java | 66.66% <ø> (ø) | 1.00 <0.00> (ø) |
| ...troforelasticsearch/ad/feature/FeatureManager.java | 96.68% <ø> (ø) | 96.00 <0.00> (ø) |
| .../handler/IndexAnomalyDetectorJobActionHandler.java | 11.44% <8.92%> (+11.44%) | 4.00 <1.00> (+4.00) |
| ...icsearch/ad/rest/RestAnomalyDetectorJobAction.java | 50.00% <14.28%> (+10.00%) | 3.00 <1.00> (ø) |
| ...oforelasticsearch/ad/model/AnomalyDetectorJob.java | 58.97% <42.85%> (-2.20%) ⬇️ | 24.00 <1.00> (+2.00) |
| ...stroforelasticsearch/ad/model/AnomalyDetector.java | 62.06% <50.00%> (-4.21%) ⬇️ | 52.00 <6.00> (+6.00) |
| ...on/opendistroforelasticsearch/ad/model/Entity.java | 50.00% <50.00%> (ø) | 4.00 <4.00> (?) |
| ...arch/ad/transport/IndexAnomalyDetectorRequest.java | 46.80% <50.00%> (+1.64%) | 11.00 <4.00> (+4.00) |
| ...est/handler/IndexAnomalyDetectorActionHandler.java | 51.17% <72.22%> (+35.68%) | 26.00 <14.00> (+24.00) |
| ...elasticsearch/ad/transport/ADResultBulkAction.java | 75.00% <75.00%> (ø) | 2.00 <2.00> (?) |

... and 31 more

* @param interpolator Used to generate data points between samples.
* @param searchFeatureDao Used to issue ES queries.
* @param shingleSize The size of a data point window that appears consecutively.
* @param thresholdMinPvalue min P-value for thresholding
Contributor:

Minor: indentation is off.

Member Author:

fixed

}

double[] scores = new double[dataPoints.length];
Arrays.fill(scores, 0.);
Contributor:

Minor: this line is not needed.

Member Author:

removed
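
(Context on why the fill was redundant: Java zero-initializes numeric array elements on allocation.)

```java
double[] scores = new double[dataPoints.length]; // elements already default to 0.0
```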

}

EntityModel model = entityState.getModel();
assert (model != null);
Contributor:

Why not set a new model in entityState?

Member Author:

The model might already contain samples passed from upstream callers.

Contributor:

I mean, if the model is null, why not set a new one?

Member Author:

Good idea. Changed.

Comment on lines +272 to +273
double[] scores = trainRCFModel(continuousDataPoints, modelId, rcf);
allScores.add(scores);
Contributor:

Suggestion: having the train method return a list and using addAll on allScores saves much work below.

Member Author:

I tried. I ended up with a List<Double> and had to convert it to Double[] and then to double[]. I am not sure the amount of code is smaller, and boxing/unboxing back and forth is not efficient.
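
For reference, a sketch of one way to flatten a List<double[]> of per-segment scores into a single double[] without per-element boxing (assuming allScores holds primitive arrays):

```java
double[] combined = allScores.stream()          // Stream<double[]>
    .flatMapToDouble(java.util.Arrays::stream)  // DoubleStream, no boxing
    .toArray();
```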

entityState.setLastUsedTime(clock.instant());

// save to checkpoint
checkpointDao.write(entityState, modelId, true);
Contributor:

Question: rather than saving each model individually, is it more efficient to do batch indexing?

Member Author:

I did. The write just writes to a buffer.
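
For illustration, a sketch of the buffered-write pattern described here; the names indexName, maxBufferSize, and toSource are hypothetical, not the actual CheckpointDao API:

```java
// Hypothetical sketch: individual checkpoint writes fill a buffer, and the
// buffer is flushed to the index as a single bulk request.
private final List<IndexRequest> buffer = new ArrayList<>();

public void write(ModelState<EntityModel> state, String modelId) {
    buffer.add(new IndexRequest(indexName).id(modelId).source(toSource(state)));
    if (buffer.size() >= maxBufferSize) {
        flush();
    }
}

public void flush() {
    BulkRequest bulkRequest = new BulkRequest();
    buffer.forEach(bulkRequest::add);
    buffer.clear();
    client.bulk(bulkRequest, ActionListener.wrap(response -> {}, e -> logger.error("Bulk write failed", e)));
}
```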

}, listener::onFailure);

searchFeatureDao
.getColdStartSamplesForPeriods(
Contributor:

Are missing values from count/sum/etc. aggregations filtered out?

Member Author:

I am reusing SearchFeatureDao.parseAggregations (https://github.com/opendistro-for-elasticsearch/anomaly-detection/blob/master/src/main/java/com/amazon/opendistroforelasticsearch/ad/feature/SearchFeatureDao.java#L561-L571) to parse each bucket. Missing values result in an empty bucket, which should give Optional.empty(), right? If yes, then those will be filtered out.
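
For reference, a minimal sketch of the filtering being described: buckets that parse to Optional.empty() are dropped before training.

```java
// Keep only the buckets that actually produced feature values.
List<double[]> present = featureSamples.stream()
    .filter(Optional::isPresent)
    .map(Optional::get)
    .collect(Collectors.toList());
```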

Contributor:

For count/sum/etc. aggregations on missing data, the value of the bucket will be 0. A model trained on 0s from missing data will treat new non-zero data points as anomalies.

Member Author:

Yeah, we should have some aggregation to differentiate the two cases: count with a default of 0 and count with missing values removed.
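
One possible way to differentiate the two cases (a sketch, not the agreed-upon fix; parseBucket and featureIds are assumed names) is to treat a bucket with doc_count == 0 as missing rather than as a literal 0:

```java
// A bucket with no documents is missing data, not a 0-valued sample.
Optional<double[]> parseBucket(Histogram.Bucket bucket, List<String> featureIds) {
    if (bucket.getDocCount() == 0) {
        return Optional.empty();
    }
    double[] values = new double[featureIds.size()];
    for (int i = 0; i < featureIds.size(); i++) {
        Aggregation agg = bucket.getAggregations().get(featureIds.get(i));
        if (!(agg instanceof NumericMetricsAggregation.SingleValue)) {
            return Optional.empty();
        }
        values[i] = ((NumericMetricsAggregation.SingleValue) agg).value();
    }
    return Optional.of(values);
}
```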

ActionListener<List<Optional<double[]>>> getFeaturelistener = ActionListener.wrap(featureSamples -> {
ArrayList<double[]> continuousSampledFeatures = new ArrayList<>(maxTrainSamples);

// featureSamples are in ascending order of time.
Contributor:

The time ranges in the feature query are in descending order. When does the reversing happen?


}

/**
* TODO: make it work for shingle
Contributor:

Isn't the given data already shingled data?

Member Author:

The current implementation only gives unshingled data.

Contributor:

Isn't line 390/440, coldStartData.add(featureManager.batchShingle(points, entityShingleSize)), already producing shingled data?

Member Author:

Yes, but the entity state's samples are not shingled data.
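
For context, a sketch of what batch shingling does, assuming points are evenly spaced and in ascending time order (illustrative, not FeatureManager's exact code):

```java
// Concatenate each window of shingleSize consecutive points into one row.
double[][] batchShingle(double[][] points, int shingleSize) {
    int dim = points[0].length;
    double[][] shingles = new double[points.length - shingleSize + 1][shingleSize * dim];
    for (int i = 0; i < shingles.length; i++) {
        for (int j = 0; j < shingleSize; j++) {
            System.arraycopy(points[i + j], 0, shingles[i], j * dim, dim);
        }
    }
    return shingles;
}
```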

*/
private void combineTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
EntityModel model = entityState.getModel();
if (model != null) {
Contributor:

If this is null, why not set the model and keep the data?

Member Author:

Good point. Changed.
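
A sketch of the change agreed on here; the EntityModel constructor arguments are assumed, not the project's exact signature:

```java
EntityModel model = entityState.getModel();
if (model == null) {
    // Hypothetical constructor: start with an empty model so the incoming
    // cold start samples are kept instead of dropped.
    model = new EntityModel(modelId, new ArrayDeque<double[]>(), null, null);
    entityState.setModel(model);
}
```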

}

/**
* TODO: make it work for shingle.
Contributor:

Which part doesn't work for shingles?

Member Author:

We have to consider the timestamp of each data point for shingling.


/**
* TODO: make it work for shingle
* Precondition: we don't have enough training data.
Contributor:

Question: is there an indicator that the historical data for an entity has been retrieved, or that retrieval has been attempted? If an entity gets only 10 samples, will it get the same 10 samples at a later time? And if there is no data for an entity, will the system retry repeatedly?

Member Author:

We won't retry within one hour for an entity. Do you think this is long enough?

Contributor:

There will be some issues. In a case where samples are scarce, say just one data point, the repeated process means the training data contains only that same data point. If there is no data at all, this process will not end.

Member Author:

Yes. If the value is too large, we have a long-initialization issue. If it is too small, we can waste resources. Given that our cold start is rate limited, the latter's impact is not that bad.

Contributor:

The main issue is incorrect training data. When there are 2 data points for an entity in the index, they will be added to the training data on the first run. On a later run, the same data points will be added again, and again, until the required number of samples is reached, but all the samples are just repeats.

Member Author:

I have a bloom filter to filter such cases. If a point only appears once or twice, it won't trigger cold start.
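
For illustration, a sketch of that gating idea using a Guava BloomFilter (the plugin's actual filter may differ): an entity triggers cold start only once it has probably been seen before.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

BloomFilter<CharSequence> seen = BloomFilter.create(
    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

boolean shouldTriggerColdStart(String entityId) {
    if (seen.mightContain(entityId)) {
        return true;  // probably appeared before, so worth training
    }
    seen.put(entityId); // record the first sighting and skip for now
    return false;
}
```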

Contributor @wnbts left a comment:

There are three main issues:

  1. The feature query in cold start should not return 0 for count/sum/additive aggregations on missing data. The changes will be in a separate PR.
  2. The cold start process might result in overlapping/duplicate samples. The issue is noted.
  3. Shingling support is not implemented. The issue is noted.

@kaituo (Member Author) commented Oct 16, 2020

> There are three main issues:
>
> 1. The feature query in cold start should not return 0 for count/sum/additive aggregations on missing data. The changes will be in a separate PR.
> 2. The cold start process might result in overlapping/duplicate samples. The issue is noted.
> 3. Shingling support is not implemented. The issue is noted.

Thanks Lai. Will put those things on the to-do list.

kaituo added a commit that referenced this pull request Oct 16, 2020
* Add support for filtering the data by one categorical variable

This PR is a conglomerate of the following PRs.

#247
#249
#250
#252
#253
#256
#257
#258
#259
#260
#261
#262
#263
#264
#265
#266
#267
#268
#269

This spreadsheet contains the mappings from files to PR number: https://quip-amazon.com/DiHkAmz9oSLu/HC-PR

Testing done:
1. Add unit tests except for four classes (excluded in build.gradle). Will add them in a later PR.
2. Manual testing passes.
@kaituo kaituo closed this Oct 16, 2020