[ML] disallow autoscaling downscaling in two trained model assignment scenarios #88623
Conversation
Pinging @elastic/ml-core (Team:ML)
I have added the cloud-deploy label. We should re-run @wwang500's test that found the autoscaling loop using the image created by that label once the minor nits I mentioned are addressed.
@@ -409,6 +410,17 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
            .filter(e -> e.getValue().getAssignmentState().equals(AssignmentState.STARTING) && e.getValue().getNodeRoutingTable().isEmpty())
            .map(Map.Entry::getKey)
            .toList();
        final List<String> notFullyAllocatedModels = modelAssignments.entrySet()
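The hunk is truncated above; purely as a sketch, the not-fully-allocated filter might look something like the following (the AssignmentState value and the predicate are assumptions for illustration, not the PR's actual code):

        // Hypothetical sketch, not the PR's exact predicate: treat any assignment
        // that has not reached the fully-started state as "not fully allocated".
        final List<String> notFullyAllocatedModels = modelAssignments.entrySet()
            .stream()
            .filter(e -> e.getValue().getAssignmentState().equals(AssignmentState.STARTED) == false)
            .map(Map.Entry::getKey)
            .toList();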
Point 2 is a place holder. Fix 1 will be a requirement even in the future with vCPU autoscaling.
It's true we'll need something that achieves the same as fix 1, but since our current autoscaling decider is already incredibly complex it would be nice to turn it into an ML memory autoscaling decider and have a separate ML CPU autoscaling decider. If we do that then this logic will live in the new CPU autoscaling decider.
So please add a TODO that this is a condition based on CPU, and should move to the CPU autoscaling decider when it's written.
@@ -654,6 +669,9 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
        if (capacity == null) {
            return null;
        }
        if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
            return null;
Please add a debug message here so that, if this ever blocks a cluster from scaling down and we need to confirm that is what's really happening, we can ask to switch on debug logging for this class.
Also, please add another TODO here saying this condition should move to the CPU autoscaling decider when it's written.
LGTM apart from a couple of nits, but please deploy the Cloud image before merging and leave it with autoscaling enabled overnight with some models similar to the ones @wwang500 ran to see if it works.
        // TODO we should remove this when we can auto-scale (down and up) via a new CPU auto-scaling decider
        if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
            logger.debug(
                () -> format("not down-scaling; model assignments require more than half of the ML tier's allocated processors")
There's no need to call format here as there are no parameters.
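For illustration, since there is nothing to interpolate, the call could simply pass the string directly (a sketch of the suggested simplification, not a quoted commit):

        // No parameters to interpolate, so the message can be passed as-is.
        logger.debug("not down-scaling; model assignments require more than half of the ML tier's allocated processors");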
@@ -553,11 +568,13 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
                Locale.ROOT,
                "Passing currently perceived capacity as there are [%d] model snapshot upgrades, "
                    + "[%d] analytics and [%d] anomaly detection jobs in the queue, "
                    + " [%d] trained models not fully-allocated, "
+ " [%d] trained models not fully-allocated, " | |
+ "[%d] trained models not fully-allocated, " |
Hi @benwtrent and @droberts195, I have tried the autoscaling using this PR's docker build. There is one strange thing: when I tried starting models with autoscaling ON, the models failed to start, which was expected in
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...
With 8.4, trained model assignments now take CPU into consideration. This changes our calculation for scaling down until we fully support autoscaling according to CPU requirements.
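To make the new scale-down guard concrete, here is a minimal sketch of what a "more than half of the ML tier's CPU" check could look like. The method shape, the allocatedProcessors helper, and the task-parameter accessors are assumptions for illustration, not necessarily the PR's exact implementation:

    // Hypothetical sketch of the CPU guard, not the PR's exact implementation.
    static boolean modelAssignmentsRequireMoreThanHalfCpu(
        Collection<TrainedModelAssignment> assignments,
        List<DiscoveryNode> mlNodes
    ) {
        // Processors the current model assignments demand in total.
        int requiredProcessors = assignments.stream()
            .mapToInt(a -> a.getTaskParams().getNumberOfAllocations() * a.getTaskParams().getThreadsPerAllocation())
            .sum();
        // Processors allocated across the ML tier (allocatedProcessors is an assumed helper).
        int totalMlProcessors = mlNodes.stream().mapToInt(node -> allocatedProcessors(node)).sum();
        // Block scale-down when assignments need more than half of the tier's CPU.
        return requiredProcessors * 2 > totalMlProcessors;
    }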