This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Get feature data #250

Closed

Member

@kaituo kaituo commented Oct 14, 2020

Note: because this change has many dependencies, I list only the main class and its test code to save reviewers' time. The build will fail due to the missing dependencies. This PR is for review only and will not be merged. Once all the review PRs are approved, I will open one combined PR and merge that.

Issue #, if available:

Description of changes:
To get feature data, the coordinating node regularly aggregates log entries into multiple entity keys and their corresponding value vectors. An entity key can be an IP address, while the value vector can contain the total bytes sent to the IP address over an interval (say, 1 minute).

In this PR, we use a terms aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) to calculate features. A single query can match a very large number of entities (e.g., millions). By default, we return at most 1000 entities per query. Customers have the option to increase the limit.
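The per-entity aggregation described above can be illustrated with a small in-memory sketch. This is not the code in this PR, which runs the aggregation inside Elasticsearch. The class `EntityFeatureSketch`, the `LogEntry` shape, and the `DEFAULT_MAX_ENTITIES` constant are hypothetical names introduced only for illustration. The sketch groups log entries by entity key, sums the bytes per entity into a one-element value vector, and keeps the entities with the most entries, assuming the terms aggregation's default ordering by document count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EntityFeatureSketch {

    // Hypothetical constant mirroring the "1000 entities per query" default in the description.
    static final int DEFAULT_MAX_ENTITIES = 1000;

    /** One log entry: the entity it belongs to (e.g. an IP address) and the bytes sent. */
    static final class LogEntry {
        final String entityKey;
        final long bytes;

        LogEntry(String entityKey, long bytes) {
            this.entityKey = entityKey;
            this.bytes = bytes;
        }
    }

    /** Running totals for one entity, comparable to one bucket of a terms aggregation. */
    private static final class Bucket {
        long docCount;
        long totalBytes;
    }

    /**
     * Groups entries by entity key and returns at most maxEntities entities,
     * each mapped to a value vector holding the total bytes for that entity.
     */
    static Map<String, double[]> aggregate(List<LogEntry> entries, int maxEntities) {
        Map<String, Bucket> buckets = new LinkedHashMap<>();
        for (LogEntry entry : entries) {
            Bucket bucket = buckets.computeIfAbsent(entry.entityKey, key -> new Bucket());
            bucket.docCount++;
            bucket.totalBytes += entry.bytes;
        }

        // Order buckets by number of entries, largest first (assumed default ordering).
        List<Map.Entry<String, Bucket>> sorted = new ArrayList<>(buckets.entrySet());
        sorted.sort(Comparator.comparingLong(
            (Map.Entry<String, Bucket> e) -> e.getValue().docCount).reversed());

        // Keep only the first maxEntities buckets, like the size limit of the query.
        Map<String, double[]> features = new LinkedHashMap<>();
        for (Map.Entry<String, Bucket> e : sorted) {
            if (features.size() >= maxEntities) {
                break;
            }
            features.put(e.getKey(), new double[] { e.getValue().totalBytes });
        }
        return features;
    }

    public static void main(String[] args) {
        // Entries from a single interval (say, 1 minute).
        List<LogEntry> oneInterval = Arrays.asList(
            new LogEntry("10.0.0.1", 100),
            new LogEntry("10.0.0.2", 50),
            new LogEntry("10.0.0.1", 25));
        aggregate(oneInterval, DEFAULT_MAX_ENTITIES)
            .forEach((ip, vector) -> System.out.println(ip + " -> " + Arrays.toString(vector)));
    }
}
```

Passing a larger `maxEntities` corresponds to a customer raising the limit. In the real query the cap is applied by Elasticsearch rather than on the coordinating node.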

This PR also adds functions to get cold start samples within a range.

Testing done:

  1. Added unit tests.
  2. Performed end-to-end testing.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


codecov bot commented Oct 14, 2020

Codecov Report

Merging #250 into master will not change coverage.
The diff coverage is 82.60%.


@@            Coverage Diff            @@
##             master     #250   +/-   ##
=========================================
  Coverage     73.01%   73.01%           
  Complexity     1461     1461           
=========================================
  Files           164      164           
  Lines          6834     6834           
  Branches        527      527           
=========================================
  Hits           4990     4990           
  Misses         1594     1594           
  Partials        250      250           
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| #cli | 79.27% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...opendistroforelasticsearch/ad/util/ParseUtils.java | 52.29% <ø> (ø) | 16.00 <0.00> (ø) |
| ...oforelasticsearch/ad/feature/SearchFeatureDao.java | 84.12% <82.60%> (ø) | 63.00 <12.00> (ø) |

Comment on lines 660 to 663
logger.debug(() -> "getColdStartSamplesForPeriods: " + request.toString());

client.search(request, ActionListener.wrap(response -> {
logger.debug(() -> "getColdStartSamplesForPeriods: " + response.toString());
Contributor

small: remove these debugging logs?

Member Author

removed

kaituo added a commit that referenced this pull request Oct 16, 2020
* Add support filtering the data by one categorical variable

This PR is a conglomerate of the following PRs.

#247
#249
#250
#252
#253
#256
#257
#258
#259
#260
#261
#262
#263
#264
#265
#266
#267
#268
#269

This spreadsheet contains the mappings from files to PR number: https://quip-amazon.com/DiHkAmz9oSLu/HC-PR

Testing done:
1. Added unit tests for all but four classes (excluded in build.gradle). Tests for those will be added in a later PR.
2. Manual testing passes.
@kaituo kaituo closed this Oct 16, 2020
3 participants