
[ML] disallow autoscaling downscaling in two trained model assignment scenarios #88623

Merged

Conversation

benwtrent (Member)

With 8.4, trained model assignments now take CPU into consideration. This changes our calculation for scaling down until we fully support autoscaling according to CPU requirements.

  1. We shouldn't allow scaling down if there is ANY model assignment that isn't fully allocated (meaning, not enough CPUs)
  2. We don't allow scaling down unless model assignments require less than half of the current scale's CPU count.

Point 2 is a placeholder. Fix 1 will remain a requirement even in the future with vCPU autoscaling.
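The two guard conditions above can be sketched roughly as follows. This is a simplified illustration with hypothetical stand-in types, not the real decider, which works on `TrainedModelAssignment` state and node attributes:

```java
import java.util.List;

// Sketch of the two down-scale guards described above:
// 1) any not-fully-allocated model assignment blocks scale-down;
// 2) assignments needing more than half of the current CPU count block scale-down.
public class DownscaleGuards {

    // Hypothetical stand-in for a trained model assignment.
    record Assignment(int targetAllocations, int currentAllocations, int threadsPerAllocation) {}

    static boolean anyNotFullyAllocated(List<Assignment> assignments) {
        return assignments.stream().anyMatch(a -> a.currentAllocations() < a.targetAllocations());
    }

    static boolean requireMoreThanHalfCpu(List<Assignment> assignments, int totalMlProcessors) {
        int required = assignments.stream()
            .mapToInt(a -> a.targetAllocations() * a.threadsPerAllocation())
            .sum();
        return required * 2 > totalMlProcessors;
    }

    static boolean mayScaleDown(List<Assignment> assignments, int totalMlProcessors) {
        return anyNotFullyAllocated(assignments) == false
            && requireMoreThanHalfCpu(assignments, totalMlProcessors) == false;
    }

    public static void main(String[] args) {
        List<Assignment> ok = List.of(new Assignment(2, 2, 1));
        List<Assignment> starved = List.of(new Assignment(4, 2, 1));
        System.out.println(mayScaleDown(ok, 8));      // true: fully allocated, 2 of 8 CPUs
        System.out.println(mayScaleDown(starved, 8)); // false: not fully allocated
    }
}
```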

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 19, 2022
elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent changed the title [ML] disallow autoscaling in two trained model assignment scenarios [ML] disallow autoscaling downscaling in two trained model assignment scenarios Jul 19, 2022
@droberts195 droberts195 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jul 20, 2022
droberts195 (Contributor) left a comment

I have added the cloud-deploy label. Once the minor nits I mentioned are addressed, we should re-run @wwang500's test that found the autoscaling loop, using the image created by that label.

@@ -409,6 +410,17 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
.filter(e -> e.getValue().getAssignmentState().equals(AssignmentState.STARTING) && e.getValue().getNodeRoutingTable().isEmpty())
.map(Map.Entry::getKey)
.toList();
final List<String> notFullyAllocatedModels = modelAssignments.entrySet()
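The diff excerpt above is truncated mid-statement. As a rough sketch of the same stream pattern (with a simplified stand-in type, not the real `TrainedModelAssignment` API), collecting the IDs of assignments whose routing does not yet satisfy the target allocation count might look like:

```java
import java.util.List;
import java.util.Map;

// Sketch: filter a map of model assignments down to the IDs of those
// that are not yet fully allocated. Types are hypothetical stand-ins.
public class NotFullyAllocated {

    record Assignment(int targetAllocations, int currentAllocations) {
        boolean satisfied() {
            return currentAllocations >= targetAllocations;
        }
    }

    static List<String> notFullyAllocatedModels(Map<String, Assignment> assignments) {
        return assignments.entrySet()
            .stream()
            .filter(e -> e.getValue().satisfied() == false)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Assignment> m = Map.of(
            "model-a", new Assignment(2, 2),
            "model-b", new Assignment(4, 1)
        );
        System.out.println(notFullyAllocatedModels(m)); // [model-b]
    }
}
```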
Contributor:

Point 2 is a place holder. Fix 1 will be a requirement even in the future with vCPU autoscaling.

It's true we'll need something that achieves the same as fix 1, but since our current autoscaling decider is already incredibly complex it would be nice to turn it into an ML memory autoscaling decider and have a separate ML CPU autoscaling decider. If we do that then this logic will live in the new CPU autoscaling decider.

So please add a TODO that this is a condition based on CPU, and should move to the CPU autoscaling decider when it's written.

@@ -654,6 +669,9 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
if (capacity == null) {
return null;
}
if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
return null;
droberts195 (Contributor) · Jul 20, 2022

Please add a debug message here so that if this ever blocks a cluster from scaling down and we need to confirm this is what's really happening we can ask to switch on debug logging for this class.

Also, please add another TODO here saying this condition should move to the CPU autoscaling decider when it's written.

droberts195 (Contributor) left a comment

LGTM apart from a couple of nits, but please deploy the Cloud image before merging and leave it with autoscaling enabled overnight with some models similar to the ones @wwang500 ran to see if it works.

// TODO we should remove this when we can auto-scale (down and up) via a new CPU auto-scaling decider
if (modelAssignmentsRequireMoreThanHalfCpu(modelAssignments.values(), mlNodes)) {
logger.debug(
() -> format("not down-scaling; model assignments require more than half of the ML tier's allocated processors")
Contributor:

There's no need to call format here as there are no parameters.
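The reviewer's point can be illustrated with a minimal stand-in for the logger (the real class uses a log4j `Logger`, whose `debug` overloads accept either a `Supplier` or a plain `String`): lazy formatting via a supplier only pays off when there are parameters to substitute; a constant message can be passed directly.

```java
import java.util.function.Supplier;

public class DebugLogDemo {
    // Hypothetical minimal logger; real code uses org.apache.logging.log4j.Logger.
    static void debug(Supplier<String> msg) { System.out.println(msg.get()); }
    static void debug(String msg) { System.out.println(msg); }

    public static void main(String[] args) {
        int required = 6, available = 8;
        // With parameters, a supplier defers formatting until debug logging is enabled:
        debug(() -> String.format("assignments require [%d] of [%d] processors", required, available));
        // With no parameters, a plain constant string suffices (the reviewer's point):
        debug("not down-scaling; model assignments require more than half of the ML tier's allocated processors");
    }
}
```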

@@ -553,11 +568,13 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
Locale.ROOT,
"Passing currently perceived capacity as there are [%d] model snapshot upgrades, "
+ "[%d] analytics and [%d] anomaly detection jobs in the queue, "
+ " [%d] trained models not fully-allocated, "
Contributor:

Suggested change
+ " [%d] trained models not fully-allocated, "
+ "[%d] trained models not fully-allocated, "
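Why the one-character suggestion matters: Java concatenates the adjacent string literals, so the stray leading space in the original fragment produces a double space in the rendered message. A minimal illustration:

```java
import java.util.Locale;

public class MessageFormatDemo {
    public static void main(String[] args) {
        // Adjacent string literals are concatenated at compile time; the
        // leading space in the second fragment doubles the space before "[4]".
        String withNit = String.format(
            Locale.ROOT,
            "[%d] anomaly detection jobs in the queue, "
                + " [%d] trained models not fully-allocated, ",
            3, 4
        );
        System.out.println(withNit.contains("  [4]")); // true: double space before [4]
    }
}
```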

wwang500

I have added the cloud-deploy label. We should re-run @wwang500's test that found the autoscaling loop using the image created by that label when the minor nits I mentioned are added.

Hi @benwtrent and @droberts195, I have tried autoscaling using this PR's docker build. There is one strange thing:

When I tried starting models with autoscaling ON, the models failed to start even though they were expected to be in the starting state. Later on, the scale-up event was still triggered; however, the models were still in a failed state.

[Three screenshots attached, dated Jul 21, 2022]

@benwtrent benwtrent merged commit 303f9f1 into elastic:master Jul 21, 2022
@benwtrent benwtrent deleted the feature/ml-trained-model-scale-down branch July 21, 2022 19:27
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0