
[ML] disallow autoscaling downscaling in two trained model assignment scenarios #88623

Merged

Conversation

benwtrent (Member)

With 8.4, trained model assignments now take CPU into consideration. This changes our calculation for scaling down until we fully support autoscaling according to CPU requirements.

  1. We shouldn't allow scaling down if there is ANY model assignment that isn't fully allocated (meaning, not enough CPUs)
  2. We don't allow scaling down unless model assignments require less than half of the current scale's CPU count.

Point 2 is a placeholder. Fix 1 will remain a requirement even in the future with vCPU autoscaling.
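The two guard conditions above can be sketched roughly as follows. This is a simplified illustration with hypothetical stand-in types, not the real decider, which works on `TrainedModelAssignment` state and node attributes:

```java
import java.util.List;

// Sketch of the two down-scale guards described above:
// 1) any not-fully-allocated model assignment blocks scale-down;
// 2) assignments needing more than half of the current CPU count block scale-down.
public class DownscaleGuards {

    // Hypothetical stand-in for a trained model assignment.
    record Assignment(int targetAllocations, int currentAllocations, int threadsPerAllocation) {}

    static boolean anyNotFullyAllocated(List<Assignment> assignments) {
        return assignments.stream().anyMatch(a -> a.currentAllocations() < a.targetAllocations());
    }

    static boolean requireMoreThanHalfCpu(List<Assignment> assignments, int totalMlProcessors) {
        int required = assignments.stream()
            .mapToInt(a -> a.targetAllocations() * a.threadsPerAllocation())
            .sum();
        return required * 2 > totalMlProcessors;
    }

    static boolean mayScaleDown(List<Assignment> assignments, int totalMlProcessors) {
        return anyNotFullyAllocated(assignments) == false
            && requireMoreThanHalfCpu(assignments, totalMlProcessors) == false;
    }

    public static void main(String[] args) {
        List<Assignment> ok = List.of(new Assignment(2, 2, 1));
        List<Assignment> starved = List.of(new Assignment(4, 2, 1));
        System.out.println(mayScaleDown(ok, 8));      // true: fully allocated, 2 of 8 CPUs
        System.out.println(mayScaleDown(starved, 8)); // false: not fully allocated
    }
}
```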

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 19, 2022
elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent changed the title [ML] disallow autoscaling in two trained model assignment scenarios [ML] disallow autoscaling downscaling in two trained model assignment scenarios Jul 19, 2022
@droberts195 droberts195 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jul 20, 2022
droberts195 (Contributor) left a comment

I have added the cloud-deploy label. Once the minor nits I mentioned are addressed, we should re-run @wwang500's test that found the autoscaling loop, using the image created by that label.

@@ -409,6 +410,17 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
.filter(e -> e.getValue().getAssignmentState().equals(AssignmentState.STARTING) && e.getValue().getNodeRoutingTable().isEmpty())
.map(Map.Entry::getKey)
.toList();
final List<String> notFullyAllocatedModels = modelAssignments.entrySet()
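The diff excerpt above is truncated mid-statement. As a rough sketch of the same stream pattern (with a simplified stand-in type, not the real `TrainedModelAssignment` API), collecting the IDs of assignments whose routing does not yet satisfy the target allocation count might look like:

```java
import java.util.List;
import java.util.Map;

// Sketch: filter a map of model assignments down to the IDs of those
// that are not yet fully allocated. Types are hypothetical stand-ins.
public class NotFullyAllocated {

    record Assignment(int targetAllocations, int currentAllocations) {
        boolean satisfied() {
            return currentAllocations >= targetAllocations;
        }
    }

    static List<String> notFullyAllocatedModels(Map<String, Assignment> assignments) {
        return assignments.entrySet()
            .stream()
            .filter(e -> e.getValue().satisfied() == false)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Assignment> m = Map.of(
            "model-a", new Assignment(2, 2),
            "model-b", new Assignment(4, 1)
        );
        System.out.println(notFullyAllocatedModels(m)); // [model-b]
    }
}
```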
Contributor:

Point 2 is a place holder. Fix 1 will be a requirement even in the future with vCPU autoscaling.

It's true we'll need something that achieves the same as fix 1, but since our current autoscaling decider is already incredibly complex it would be nice to turn it into an ML memory autoscaling decider and have a separate ML CPU autoscaling decider. If we do that then this logic will live in the new CPU autoscaling decider.

So please add a TODO that this is a condition based on CPU, and should move to the CPU autoscaling decider when it's written.

@@ -654,6 +669,9 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
if (capacity == null) {
return null;
}
if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
return null;
droberts195 (Contributor) · Jul 20, 2022

Please add a debug message here so that if this ever blocks a cluster from scaling down and we need to confirm this is what's really happening we can ask to switch on debug logging for this class.

Also, please add another TODO here saying this condition should move to the CPU autoscaling decider when it's written.

droberts195 (Contributor) left a comment

LGTM apart from a couple of nits, but please deploy the Cloud image before merging and leave it with autoscaling enabled overnight with some models similar to the ones @wwang500 ran to see if it works.

// TODO we should remove this when we can auto-scale (down and up) via a new CPU auto-scaling decider
if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
logger.debug(
() -> format("not down-scaling; model assignments require more than half of the ML tier's allocated processors")
Contributor:

There's no need to call format here as there are no parameters.
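The reviewer's point can be illustrated with a minimal stand-in for the logger (the real class uses a log4j `Logger`, whose `debug` overloads accept either a `Supplier` or a plain `String`): lazy formatting via a supplier only pays off when there are parameters to substitute; a constant message can be passed directly.

```java
import java.util.function.Supplier;

public class DebugLogDemo {
    // Hypothetical minimal logger; real code uses org.apache.logging.log4j.Logger.
    static void debug(Supplier<String> msg) { System.out.println(msg.get()); }
    static void debug(String msg) { System.out.println(msg); }

    public static void main(String[] args) {
        int required = 6, available = 8;
        // With parameters, a supplier defers formatting until debug logging is enabled:
        debug(() -> String.format("assignments require [%d] of [%d] processors", required, available));
        // With no parameters, a plain constant string suffices (the reviewer's point):
        debug("not down-scaling; model assignments require more than half of the ML tier's allocated processors");
    }
}
```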

@@ -553,11 +568,13 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
Locale.ROOT,
"Passing currently perceived capacity as there are [%d] model snapshot upgrades, "
+ "[%d] analytics and [%d] anomaly detection jobs in the queue, "
+ " [%d] trained models not fully-allocated, "
Contributor:

Suggested change
+ " [%d] trained models not fully-allocated, "
+ "[%d] trained models not fully-allocated, "
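Why the one-character suggestion matters: Java concatenates the adjacent string literals, so the stray leading space in the original fragment produces a double space in the rendered message. A minimal illustration:

```java
import java.util.Locale;

public class MessageFormatDemo {
    public static void main(String[] args) {
        // Adjacent string literals are concatenated at compile time; the
        // leading space in the second fragment doubles the space before "[4]".
        String withNit = String.format(
            Locale.ROOT,
            "[%d] anomaly detection jobs in the queue, "
                + " [%d] trained models not fully-allocated, ",
            3, 4
        );
        System.out.println(withNit.contains("  [4]")); // true: double space before [4]
    }
}
```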

wwang500

I have added the cloud-deploy label. We should re-run @wwang500's test that found the autoscaling loop using the image created by that label when the minor nits I mentioned are added.

Hi @benwtrent and @droberts195, I have tried autoscaling using this PR's docker build. There is one strange thing:

When I tried starting models with autoscaling ON, the models failed to start even though they were expected to be in the starting state. Later on, the scale-up event was still triggered; however, the models were still in a failed state.

[Three screenshots attached, dated Jul 21, 2022]

@benwtrent benwtrent merged commit 303f9f1 into elastic:master Jul 21, 2022
@benwtrent benwtrent deleted the feature/ml-trained-model-scale-down branch July 21, 2022 19:27
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0