-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add queue for cold entities #64
Conversation
In HCAD v1, only a subset of hot entities can use their models to predict if we are short of memory. In v2, we allocate a queue to absorb cold entities. Like hot entities, we load a cold entity's model checkpoint from disk, train models if the checkpoint is not found, query for missed features to complete a shingle, use the models to check whether the incoming feature is normal, update models, and save the detection results to disks. This PR adds the implementation of cold entity queue. Implementation-wise, we reuse the queues we have developed for hot entities. The differences are we release cold entity requests in a controlled pace. Also, cold entity requests’ priority is low. So only when there are no hot entities requests to process we are gonna process cold entity requests. Testing done: 1. Manual tests using 10 HCAD detectors and 12,000 entities in a 3 node cluster.
public class ColdEntityQueue extends RateLimitedQueue<EntityFeatureRequest> { | ||
private static final Logger LOG = LogManager.getLogger(ColdEntityQueue.class); | ||
|
||
private int batchSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add volatile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
public ColdEntityQueue( | ||
long heapSizeInBytes, | ||
int singleRequestSizeInBytes, | ||
Setting<Float> maxHeapPercentForQueueSetting, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not get this setting like line 109 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it in the super class to reduce redundant code.
|
||
private int batchSize; | ||
private final CheckpointReadQueue checkpointReadQueue; | ||
private boolean scheduled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does scheduled
mean? Add some comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment "// indicate whether a future pull over cold entity queues is scheduled"
try { | ||
threadPool.schedule(this::pullRequests, delay, AnomalyDetectorPlugin.AD_THREAD_POOL_NAME); | ||
} catch (Exception e) { | ||
LOG.error("Fail to schedule cold entity pulling", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if fail to schedule? Seems we will just poll part of requests and put into checkpointReadQueue
and ignore other requests if fail to schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am logging failures. threadpool.schedule is the basic API provided by Opensearch. If it fails, I don't have fallback as sth is fundamentally wrong.
This PR is a conglomerate of the following PRs. #60 #64 #65 #67 #68 #69 #70 #71 #74 #75 #76 #77 #78 #79 #82 #83 #84 #92 #94 #93 #95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
…ject#121) This PR is a conglomerate of the following PRs. opensearch-project#60 opensearch-project#64 opensearch-project#65 opensearch-project#67 opensearch-project#68 opensearch-project#69 opensearch-project#70 opensearch-project#71 opensearch-project#74 opensearch-project#75 opensearch-project#76 opensearch-project#77 opensearch-project#78 opensearch-project#79 opensearch-project#82 opensearch-project#83 opensearch-project#84 opensearch-project#92 opensearch-project#94 opensearch-project#93 opensearch-project#95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
…ject#121) This PR is a conglomerate of the following PRs. opensearch-project#60 opensearch-project#64 opensearch-project#65 opensearch-project#67 opensearch-project#68 opensearch-project#69 opensearch-project#70 opensearch-project#71 opensearch-project#74 opensearch-project#75 opensearch-project#76 opensearch-project#77 opensearch-project#78 opensearch-project#79 opensearch-project#82 opensearch-project#83 opensearch-project#84 opensearch-project#92 opensearch-project#94 opensearch-project#93 opensearch-project#95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
This PR is a conglomerate of the following PRs. #60 #64 #65 #67 #68 #69 #70 #71 #74 #75 #76 #77 #78 #79 #82 #83 #84 #92 #94 #93 #95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use that PR just for review. will not merge it. Will have a big one in the end and merge once after all review PRs get approved. Now the code is missing unit tests. Posting PRs now to meet the cutoff date (June 1). Will add unit tests, run performance tests, and fix bugs before the official release.
Description
In HCAD v1, only a subset of hot entities can use their models to predict if we are short of memory. In v2, we allocate a queue to absorb cold entities. Like hot entities, we load a cold entity's model checkpoint from disk, train models if the checkpoint is not found, query for missed features to complete a shingle, use the models to check whether the incoming feature is normal, update models, and save the detection results to disks.
This PR adds the implementation of cold entity queue. Implementation-wise, we reuse the queues we have developed for hot entities. The differences are we release cold entity requests in a controlled pace. Also, cold entity requests’ priority is low. So only when there are no hot entities requests to process we are gonna process cold entity requests.
Testing done:
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.