Autoscaling during shrink #88292
Conversation
Fix autoscaling during shrink to disregard pinning to nodes. Instead, signal that we need a minimum node size to hold the entire shrink operation. This avoids scaling far higher than necessary when cluster balancing does not allow a shrink to proceed. It is considered a (separate) balancing issue when a shrink cannot complete with enough space in the tier. This changes autoscaling in general for node pinning filters (based on `_id`, `_name` or `name` filters). Clone and split also pin to the shards they clone or split; similarly, this is changed to ignore that pinning during autoscaling. Closes elastic#85480
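To illustrate the sizing signal described above (a hand-wavy sketch, not the decider's actual code in ReactiveStorageDeciderService): since the shrink target must be built on a single node, the meaningful requirement is a node-level minimum on the order of the source index's total size, rather than an inflated tier-level total derived from whichever node the shards happen to be pinned to. All names below are hypothetical.

```java
import java.util.List;

// Hypothetical sketch only; the real logic lives in ReactiveStorageDeciderService.
final class ShrinkSizingSketch {

    // The whole source index must fit on one node for the shrink to proceed,
    // so autoscaling should ask for a node at least this large (ignoring overhead).
    static long requiredNodeSizeForShrink(List<Long> sourceShardSizesInBytes) {
        return sourceShardSizesInBytes.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L;
        // Three 10 GiB shards: ask for a ~30 GiB node rather than growing the whole tier
        // just because the currently pinned node lacks headroom.
        System.out.println(requiredNodeSizeForShrink(List.of(10 * gib, 10 * gib, 10 * gib)));
    }
}
```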
Pinging @elastic/es-distributed (Team:Distributed)
Hi @henningandersen, I've created a changelog YAML for you.
Looks mostly good, I left a few comments 👍
```java
// For resize shards only allow autoscaling if there is no other node where the shard could fit had it not been
// a resize shard. Notice that we already removed any initial_recovery filters.
diskOnly = nodesInTier(allocation.routingNodes()).map(node -> allocationDeciders.canAllocate(shard, node, allocation))
    .anyMatch(ReactiveStorageDeciderService::isResizeOnlyNoDecision) == false;
```
Can we add a test where there's enough space to hold the resize shard so we verify that we don't request more capacity in that case?
I find double negations a bit trappy 😅
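For reference, the same condition can be phrased without the double negation by switching from `anyMatch(...) == false` to `noneMatch(...)`. This is only a sketch reusing the identifiers from the snippet quoted above, not necessarily what was committed:

```java
// Equivalent to `anyMatch(...) == false`: diskOnly is true only when no node in the tier
// rejects the allocation solely because the shard is a resize (shrink/clone/split) target.
diskOnly = nodesInTier(allocation.routingNodes())
    .map(node -> allocationDeciders.canAllocate(shard, node, allocation))
    .noneMatch(ReactiveStorageDeciderService::isResizeOnlyNoDecision);
```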
I think this is covered by the assertion here:
Sorry, I missed that.
```diff
@@ -222,7 +298,8 @@ public static class AllocationState {
         Set<DiscoveryNode> nodes,
         Set<DiscoveryNodeRole> roles
     ) {
-        this.state = state;
+        this.state = removeNodeLockFilters(state);
```
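For context on what `removeNodeLockFilters` is getting rid of: the node-lock filters are allocation filters that pin an index to concrete nodes, i.e. `index.routing.allocation.{require,include,exclude}` on the `_id`, `_name` or `name` attributes. Below is a rough, purely illustrative sketch over a flat settings map; the real implementation works on the cluster state / index metadata, and `initial_recovery` filters are already stripped separately (see the code comment quoted earlier).

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only; not the actual removeNodeLockFilters implementation.
final class NodeLockFilterSketch {

    // Attributes that pin shards to concrete nodes rather than to tiers or custom attributes.
    private static final Set<String> NODE_LOCK_ATTRIBUTES = Set.of("_id", "_name", "name");
    private static final Set<String> FILTER_TYPES = Set.of("require", "include", "exclude");

    static Map<String, String> removeNodeLockFilters(Map<String, String> indexSettings) {
        return indexSettings.entrySet()
            .stream()
            .filter(e -> isNodeLockFilter(e.getKey()) == false)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    private static boolean isNodeLockFilter(String settingKey) {
        for (String type : FILTER_TYPES) {
            String prefix = "index.routing.allocation." + type + ".";
            if (settingKey.startsWith(prefix) && NODE_LOCK_ATTRIBUTES.contains(settingKey.substring(prefix.length()))) {
                return true;
            }
        }
        return false;
    }
}
```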
I wonder if removing the node lock filter in all cases couldn't lead to a different allocation outcome? Maybe we should only remove the `initial_recovery` setting here? I'm not 100% sure though.
The idea I followed was that we should ensure that we have a node that is large enough to hold the node-locked data, hence the addition to ensure we deliver a proper node-level size.
With that done, we can assume that it is an allocation problem if it cannot fit. Hence it seems fair to remove the node locking here. I am aware that our allocation system is not yet sophisticated enough, but I'd rather not autoscale to a setup that is too large (since that may be multiple steps too large) in that case. For ILM-controlled shrink, it will eventually fail and retry. Manual intervention may be necessary until our allocation system can handle this.
That makes sense, thanks for clarifying.
Thanks @fcofdez for the review, and sorry about the left-over code from previous generations of this PR. This is ready for another round (commented on the two other outstanding comments).
@elasticmachine update branch
LGTM 👍
Thanks Francisco!
Fix autoscaling during shrink to disregard pinning to nodes for the total tier size.
Instead signal that we need a minimum node size to hold the entire shrink
operation. This avoids scaling far higher than necessary when cluster
balancing does not allow a shrink to proceed. It is considered a
(separate) balancing issue when a shrink cannot complete with enough
space in the tier.
This changes autoscaling in general for node pinning filters (based on
`_id`, `_name` or `name` filters).
Clone and split also pin to the shards they clone or split; similarly,
this is changed to ignore that pinning when calculating the total tier size.
Closes #85480