This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add entity model training #247

Closed

Conversation

@kaituo (Member) commented Oct 13, 2020

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use this PR for review only and will not merge it. I will open one big PR at the end and merge it once all review PRs are approved.

Issue #, if available:

Description of changes:
This PR adds entity model training functions. When the models are missing from the cache and checkpoint, we need to run queries to get training data and then train models. To collect training data, we sample historical data and linearly interpolate data points between the samples.

Specifically, we first note the maximum and minimum timestamps, and sample at most 24 points between them (with two neighboring samples 60 points apart). Samples can be missing. We only interpolate points between present neighboring samples. We then transform the samples and interpolated points into shingles. Finally, the full shingles are used for cold start.
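
For illustration, here is a minimal sketch of the linear interpolation step between two present neighboring samples (illustrative only, not the PR's exact code; `left`, `right`, and `gap` are assumed names):

```java
// Linearly interpolate the points between two present neighboring samples.
// Returns gap + 1 points: both endpoints plus gap - 1 interpolated points.
double[][] interpolate(double[] left, double[] right, int gap) {
    int dim = left.length;
    double[][] points = new double[gap + 1][dim];
    for (int i = 0; i <= gap; i++) {
        double weight = (double) i / gap;
        for (int d = 0; d < dim; d++) {
            points[i][d] = (1 - weight) * left[d] + weight * right[d];
        }
    }
    return points;
}
```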

Testing done:

  1. add unit tests
  2. end-to-end testing

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov bot commented Oct 13, 2020

Codecov Report

Merging #247 into master will increase coverage by 1.10%.
The diff coverage is 68.04%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     71.70%   72.81%   +1.10%     
- Complexity     1367     1464      +97     
============================================
  Files           157      164       +7     
  Lines          6513     6867     +354     
  Branches        493      533      +40     
============================================
+ Hits           4670     5000     +330     
- Misses         1610     1615       +5     
- Partials        233      252      +19     
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| #cli | 79.27% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...distroforelasticsearch/ad/constant/CommonName.java | 66.66% <ø> (ø) | 1.00 <0.00> (ø) |
| ...troforelasticsearch/ad/feature/FeatureManager.java | 96.68% <ø> (ø) | 96.00 <0.00> (ø) |
| .../handler/IndexAnomalyDetectorJobActionHandler.java | 11.44% <8.92%> (+11.44%) | 4.00 <1.00> (+4.00) |
| ...icsearch/ad/rest/RestAnomalyDetectorJobAction.java | 50.00% <14.28%> (+10.00%) | 3.00 <1.00> (ø) |
| ...oforelasticsearch/ad/model/AnomalyDetectorJob.java | 58.97% <42.85%> (-2.20%) ⬇️ | 24.00 <1.00> (+2.00) |
| ...stroforelasticsearch/ad/model/AnomalyDetector.java | 62.06% <50.00%> (-4.21%) ⬇️ | 52.00 <6.00> (+6.00) |
| ...on/opendistroforelasticsearch/ad/model/Entity.java | 50.00% <50.00%> (ø) | 4.00 <4.00> (?) |
| ...arch/ad/transport/IndexAnomalyDetectorRequest.java | 46.80% <50.00%> (+1.64%) | 11.00 <4.00> (+4.00) |
| ...est/handler/IndexAnomalyDetectorActionHandler.java | 51.17% <72.22%> (+35.68%) | 26.00 <14.00> (+24.00) |
| ...elasticsearch/ad/transport/ADResultBulkAction.java | 75.00% <75.00%> (ø) | 2.00 <2.00> (?) |

... and 31 more

* @param interpolator Used to generate data points between samples.
* @param searchFeatureDao Used to issue ES queries.
* @param shingleSize The size of a data point window that appears consecutively.
* @param thresholdMinPvalue min P-value for thresholding
Contributor:

Minor: indentation is off.

Member Author:

fixed

}

double[] scores = new double[dataPoints.length];
Arrays.fill(scores, 0.);
Contributor:

Minor: this line is not needed.

Member Author:

removed
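
(Context on why the fill was redundant: Java zero-initializes numeric array elements on allocation.)

```java
double[] scores = new double[dataPoints.length]; // elements already default to 0.0
```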

}

EntityModel model = entityState.getModel();
assert (model != null);
Contributor:

Why not set a new model in entityState?

Member Author:

The model might already contain samples passed from upstream callers.

Contributor:

I mean, if the model is null, why not set a new one?

Member Author:

Good idea. Changed.

Comment on lines +272 to +273
double[] scores = trainRCFModel(continuousDataPoints, modelId, rcf);
allScores.add(scores);
Contributor:

Suggestion: having the train method return a list and using addAll on allScores saves much work below.

Member Author:

I tried. I ended up with a List<Double> and had to convert it to Double[] and then to double[]. I am not sure the amount of code is smaller, and boxing/unboxing back and forth is not efficient.
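
For reference, a sketch of one way to flatten a List<double[]> of per-segment scores into a single double[] without per-element boxing (assuming allScores holds primitive arrays):

```java
double[] combined = allScores.stream()          // Stream<double[]>
    .flatMapToDouble(java.util.Arrays::stream)  // DoubleStream, no boxing
    .toArray();
```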

entityState.setLastUsedTime(clock.instant());

// save to checkpoint
checkpointDao.write(entityState, modelId, true);
Contributor:

Question: rather than saving each model individually, is it more efficient to do batch indexing?

Member Author:

I did. The write just writes to a buffer.
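
For illustration, a sketch of the buffered-write pattern described here; the names indexName, maxBufferSize, and toSource are hypothetical, not the actual CheckpointDao API:

```java
// Hypothetical sketch: individual checkpoint writes fill a buffer, and the
// buffer is flushed to the index as a single bulk request.
private final List<IndexRequest> buffer = new ArrayList<>();

public void write(ModelState<EntityModel> state, String modelId) {
    buffer.add(new IndexRequest(indexName).id(modelId).source(toSource(state)));
    if (buffer.size() >= maxBufferSize) {
        flush();
    }
}

public void flush() {
    BulkRequest bulkRequest = new BulkRequest();
    buffer.forEach(bulkRequest::add);
    buffer.clear();
    client.bulk(bulkRequest, ActionListener.wrap(response -> {}, e -> logger.error("Bulk write failed", e)));
}
```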

}, listener::onFailure);

searchFeatureDao
.getColdStartSamplesForPeriods(
Contributor:

Are missing values from count/sum/etc. aggregations filtered out?

Member Author:

I am reusing SearchFeatureDao.parseAggregations (https://github.com/opendistro-for-elasticsearch/anomaly-detection/blob/master/src/main/java/com/amazon/opendistroforelasticsearch/ad/feature/SearchFeatureDao.java#L561-L571) to parse each bucket. Missing values result in an empty bucket, which should give Optional.empty(), right? If yes, then those will be filtered out.
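
For reference, a minimal sketch of the filtering being described: buckets that parse to Optional.empty() are dropped before training.

```java
// Keep only the buckets that actually produced feature values.
List<double[]> present = featureSamples.stream()
    .filter(Optional::isPresent)
    .map(Optional::get)
    .collect(Collectors.toList());
```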

Contributor:

For count/sum/etc. aggregations on missing data, the value of the bucket will be 0. A model trained on 0s from missing data will treat new non-zero data points as anomalies.

Member Author:

Yeah, we should have some aggregation to differentiate the two cases: count with a default of 0 and count with missing values removed.
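
One possible way to differentiate the two cases (a sketch, not the agreed-upon fix; parseBucket and featureIds are assumed names) is to treat a bucket with doc_count == 0 as missing rather than as a literal 0:

```java
// A bucket with no documents is missing data, not a 0-valued sample.
Optional<double[]> parseBucket(Histogram.Bucket bucket, List<String> featureIds) {
    if (bucket.getDocCount() == 0) {
        return Optional.empty();
    }
    double[] values = new double[featureIds.size()];
    for (int i = 0; i < featureIds.size(); i++) {
        Aggregation agg = bucket.getAggregations().get(featureIds.get(i));
        if (!(agg instanceof NumericMetricsAggregation.SingleValue)) {
            return Optional.empty();
        }
        values[i] = ((NumericMetricsAggregation.SingleValue) agg).value();
    }
    return Optional.of(values);
}
```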

ActionListener<List<Optional<double[]>>> getFeaturelistener = ActionListener.wrap(featureSamples -> {
ArrayList<double[]> continuousSampledFeatures = new ArrayList<>(maxTrainSamples);

// featureSamples are in ascending order of time.
Contributor:

The time ranges in the feature query are in descending order. When does the reversing happen?


}

/**
* TODO: make it work for shingle
Contributor:

Isn't the given data already shingled data?

Member Author:

The current implementation only gives unshingled data.

Contributor:

Isn't line 390/440, coldStartData.add(featureManager.batchShingle(points, entityShingleSize)), already producing shingled data?

Member Author:

Yes, but the entity state's samples are not shingled data.
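
For context, a sketch of what batch shingling does, assuming points are evenly spaced and in ascending time order (illustrative, not FeatureManager's exact code):

```java
// Concatenate each window of shingleSize consecutive points into one row.
double[][] batchShingle(double[][] points, int shingleSize) {
    int dim = points[0].length;
    double[][] shingles = new double[points.length - shingleSize + 1][shingleSize * dim];
    for (int i = 0; i < shingles.length; i++) {
        for (int j = 0; j < shingleSize; j++) {
            System.arraycopy(points[i + j], 0, shingles[i], j * dim, dim);
        }
    }
    return shingles;
}
```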

*/
private void combineTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
EntityModel model = entityState.getModel();
if (model != null) {
Contributor:

If this is null, why not set the model and keep the data?

Member Author:

Good point. Changed.
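
A sketch of the change agreed on here; the EntityModel constructor arguments are assumed, not the project's exact signature:

```java
EntityModel model = entityState.getModel();
if (model == null) {
    // Hypothetical constructor: start with an empty model so the incoming
    // cold start samples are kept instead of dropped.
    model = new EntityModel(modelId, new ArrayDeque<double[]>(), null, null);
    entityState.setModel(model);
}
```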

}

/**
* TODO: make it work for shingle.
Contributor:

Which part doesn't work for shingles?

Member Author:

We have to consider the timestamp of each data point for shingling.


/**
* TODO: make it work for shingle
* Precondition: we don't have enough training data.
Contributor:

Question: is there an indicator that the historical data for an entity has been retrieved, or that retrieval has been attempted? If an entity gets only 10 samples, will it get the same 10 samples at a later time? And if there is no data for an entity, will the system retry repeatedly?

Member Author:

We won't retry within one hour for an entity. Do you think this is long enough?

Contributor:

There will be some issues. In a case where samples are scarce, say just one data point, the repeated process means the training data contains only that same data point. If there is no data at all, this process will not end.

Member Author:

Yes. If the value is too large, we have a long-initialization issue. If it is too small, we can waste resources. Given that our cold start is rate limited, the latter's impact is not that bad.

Contributor:

The main issue is incorrect training data. When there are 2 data points for an entity in the index, they will be added to the training data on the first run. On a later run, the same data points will be added again, and again, until the required number of samples is reached, but all the samples are just repeats.

Member Author:

I have a bloom filter to filter such cases. If a point only appears once or twice, it won't trigger cold start.
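
For illustration, a sketch of that gating idea using a Guava BloomFilter (the plugin's actual filter may differ): an entity triggers cold start only once it has probably been seen before.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

BloomFilter<CharSequence> seen = BloomFilter.create(
    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

boolean shouldTriggerColdStart(String entityId) {
    if (seen.mightContain(entityId)) {
        return true;  // probably appeared before, so worth training
    }
    seen.put(entityId); // record the first sighting and skip for now
    return false;
}
```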

Contributor @wnbts left a comment:

There are three main issues:

  1. The feature query in cold start should not return 0 for count/sum/additive aggregations on missing data. The changes will be in a separate PR.
  2. The cold start process might result in overlapping/duplicate samples. The issue is noted.
  3. Shingling support is not implemented. The issue is noted.

@kaituo (Member Author) commented Oct 16, 2020

> There are three main issues:
>
> 1. The feature query in cold start should not return 0 for count/sum/additive aggregations on missing data. The changes will be in a separate PR.
> 2. The cold start process might result in overlapping/duplicate samples. The issue is noted.
> 3. Shingling support is not implemented. The issue is noted.

Thanks Lai. Will put those things on the to-do list.

kaituo added a commit that referenced this pull request Oct 16, 2020
* Add support for filtering the data by one categorical variable

This PR is a conglomerate of the following PRs.

#247
#249
#250
#252
#253
#256
#257
#258
#259
#260
#261
#262
#263
#264
#265
#266
#267
#268
#269

This spreadsheet contains the mappings from files to PR number: https://quip-amazon.com/DiHkAmz9oSLu/HC-PR

Testing done:
1. Add unit tests except for four classes (excluded in build.gradle). Will add them in a later PR.
2. Manual testing passes.
@kaituo kaituo closed this Oct 16, 2020