Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add queue for cold entities #64

Closed
wants to merge 3 commits into from

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented May 24, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use that PR just for review. will not merge it. Will have a big one in the end and merge once after all review PRs get approved. Now the code is missing unit tests. Posting PRs now to meet the cutoff date (June 1). Will add unit tests, run performance tests, and fix bugs before the official release.

Description

In HCAD v1, only a subset of hot entities can use their models to predict if we are short of memory. In v2, we allocate a queue to absorb cold entities. Like hot entities, we load a cold entity's model checkpoint from disk, train models if the checkpoint is not found, query for missed features to complete a shingle, use the models to check whether the incoming feature is normal, update models, and save the detection results to disks.

This PR adds the implementation of cold entity queue. Implementation-wise, we reuse the queues we have developed for hot entities. The differences are we release cold entity requests in a controlled pace. Also, cold entity requests’ priority is low. So only when there are no hot entities requests to process we are gonna process cold entity requests.

Testing done:

  1. Manual tests using 10 HCAD detectors and 12,000 entities in a 3 node cluster.

Issues Resolved

[List any issues this PR will resolve]

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

In HCAD v1, only a subset of hot entities can use their models to predict if we are short of memory. In v2, we allocate a queue to absorb cold entities. Like hot entities, we load a cold entity's model checkpoint from disk, train models if the checkpoint is not found, query for missed features to complete a shingle, use the models to check whether the incoming feature is normal, update models, and save the detection results to disks.

This PR adds the implementation of cold entity queue. Implementation-wise, we reuse the queues we have developed for hot entities.  The differences are we release cold entity requests in a controlled pace.  Also, cold entity requests’ priority is low.  So only when there are no hot entities requests to process we are gonna process cold entity requests.

Testing done:
1. Manual tests using 10 HCAD detectors and 12,000 entities in a 3 node cluster.
public class ColdEntityQueue extends RateLimitedQueue<EntityFeatureRequest> {
private static final Logger LOG = LogManager.getLogger(ColdEntityQueue.class);

private int batchSize;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add volatile?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

public ColdEntityQueue(
long heapSizeInBytes,
int singleRequestSizeInBytes,
Setting<Float> maxHeapPercentForQueueSetting,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not get this setting like line 109 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it in the super class to reduce redundant code.


private int batchSize;
private final CheckpointReadQueue checkpointReadQueue;
private boolean scheduled;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does scheduled mean? Add some comments?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment "// indicate whether a future pull over cold entity queues is scheduled"

try {
threadPool.schedule(this::pullRequests, delay, AnomalyDetectorPlugin.AD_THREAD_POOL_NAME);
} catch (Exception e) {
LOG.error("Fail to schedule cold entity pulling", e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if fail to schedule? Seems we will just poll part of requests and put into checkpointReadQueue and ignore other requests if fail to schedule.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am logging failures. threadpool.schedule is the basic API provided by Opensearch. If it fails, I don't have fallback as sth is fundamentally wrong.

@kaituo kaituo requested a review from jngz-es June 1, 2021 18:36
@kaituo kaituo closed this Jun 2, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants