Make the TransportRolloverAction execute in one cluster state update #50388

andreidan · 2019-12-19T17:28:21Z

This commit makes the TransportRolloverAction more resilient by having it
execute a single cluster state update that creates the new (rollover) index,
rolls the alias over from the source index to the target index, and sets the
RolloverInfo on the source index. Previously, these three steps were three
chained cluster state updates, so a user would have had to intervene manually
if, say, the alias rollover update (second in the chain) failed after the
creation of the rollover index (first in the chain) had succeeded.
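
To illustrate the difference described above, here is a minimal, self-contained sketch. It is not the Elasticsearch implementation: `State`, `SingleUpdateRollover`, and the `failAliasSwap` flag are hypothetical names invented for the example. It models the cluster state as an immutable value and applies all three steps inside one function, so a failure in any step leaves the previous state untouched instead of leaving a half-finished rollover behind.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a cluster state; not an Elasticsearch class.
final class State {
    final Set<String> indices;         // existing index names
    final Map<String, String> aliases; // alias -> index it currently points to
    final Set<String> rolledOver;      // indices that carry rollover info

    State(Set<String> indices, Map<String, String> aliases, Set<String> rolledOver) {
        this.indices = Collections.unmodifiableSet(new HashSet<>(indices));
        this.aliases = Collections.unmodifiableMap(new HashMap<>(aliases));
        this.rolledOver = Collections.unmodifiableSet(new HashSet<>(rolledOver));
    }
}

final class SingleUpdateRollover {
    // Applies all three steps to working copies and returns a new State only if
    // every step succeeds. If a step throws, the caller still holds the original,
    // unmodified State, so there is nothing to clean up manually.
    static State rollover(State current, String alias, String source, String target,
                          boolean failAliasSwap) {
        Set<String> indices = new HashSet<>(current.indices);
        Map<String, String> aliases = new HashMap<>(current.aliases);
        Set<String> rolledOver = new HashSet<>(current.rolledOver);

        // Step 1: create the rollover index.
        if (!indices.add(target)) {
            throw new IllegalStateException("index [" + target + "] already exists");
        }
        // Step 2: move the alias from source to target (failure is simulated here).
        if (failAliasSwap) {
            throw new IllegalStateException("simulated alias rollover failure");
        }
        aliases.put(alias, target);
        // Step 3: record rollover info on the source index.
        rolledOver.add(source);

        return new State(indices, aliases, rolledOver);
    }

    public static void main(String[] args) {
        State before = new State(
            Collections.singleton("logs-000001"),
            Collections.singletonMap("logs", "logs-000001"),
            Collections.<String>emptySet());

        State after = rollover(before, "logs", "logs-000001", "logs-000002", false);
        System.out.println("alias now points to " + after.aliases.get("logs"));

        try {
            rollover(before, "logs", "logs-000001", "logs-000002", true);
        } catch (IllegalStateException e) {
            // The failed attempt did not create the target index in 'before'.
            System.out.println("target created after failure? "
                + before.indices.contains("logs-000002"));
        }
    }
}
```

With three chained updates, the simulated failure in step 2 would have happened after step 1 was already published, which is the situation the description says required manual intervention.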

elasticmachine · 2019-12-19T17:29:22Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

dakrone

Thanks @andreidan, I left two minor comments and one about moving part of this to a separate PR.

dakrone · 2019-12-19T18:22:25Z

...r/src/main/java/org/elasticsearch/action/admin/indices/rollover/TransportRolloverAction.java

-                                    rolloverIndexName, rolloverRequest);
+                        CreateIndexClusterStateUpdateRequest createIndexRequest = prepareCreateIndexRequest(unresolvedName,
+                            rolloverIndexName, rolloverRequest);
+                        clusterService.submitStateUpdateTask("rollover_index", new ClusterStateUpdateTask() {


Can you add the index name into the task string?

dakrone · 2019-12-19T18:24:13Z

...r/src/main/java/org/elasticsearch/action/admin/indices/rollover/TransportRolloverAction.java

-                        }, listener::onFailure));
+                            @Override
+                            public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
+                                activeShardsObserver.waitForActiveShards(new String[]{rolloverIndexName},


I don't expect it to happen (it would be a bad situation if it did), but it might be a good idea to wrap this in:

if (newState.equals(oldState) == false) { ... }
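
A minimal sketch of the suggested guard, assuming a hypothetical callback shape rather than the real `ClusterStateUpdateTask` API (`GuardedCallback` and its `waitedFor` list are invented for illustration, and plain strings stand in for cluster states). The follow-up work, here a stand-in for waiting on active shards, runs only when the update actually produced a different state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a processed-state callback; not the Elasticsearch API.
final class GuardedCallback {
    // Records the indices whose shards we would have waited for.
    final List<String> waitedFor = new ArrayList<>();

    // Called after a state update has been processed.
    void clusterStateProcessed(String oldState, String newState, String rolloverIndexName) {
        if (newState.equals(oldState) == false) {
            // The update changed the state, so the new index exists and can be awaited.
            waitedFor.add(rolloverIndexName);
        }
        // Otherwise nothing changed and there is no new index to wait for.
    }

    public static void main(String[] args) {
        GuardedCallback callback = new GuardedCallback();
        callback.clusterStateProcessed("v1", "v1", "logs-000002"); // no-op update: skipped
        callback.clusterStateProcessed("v1", "v2", "logs-000002"); // real change: recorded
        System.out.println(callback.waitedFor);
    }
}
```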

dakrone · 2019-12-19T18:28:08Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/RolloverStep.java

@@ -30,6 +30,11 @@ public RolloverStep(StepKey key, StepKey nextStepKey, Client client) {
        super(key, nextStepKey, client);
    }

+    @Override
+    public boolean isRetryable() {


I think we should do this in a separate PR. We still need to solve the complexity around retrying AsyncAction steps, which will require slightly different execution than the other step types, due to their exactly-once re-invocation.

andreidan · 2019-12-20

I think we satisfy the exactly-once re-invocation already. A failed AsyncActionStep will be moved into the ErrorStep on failure, and on the next ILM periodic loop we'll move the index policy back into the failed step (the AsyncActionStep), which will be executed on the next async steps execution cycle (on a master change event or the subsequent periodic ILM loop). Am I missing something here? I'm happy to separate the PRs either way.

dakrone · 2019-12-20

AsyncActionSteps don't get executed by the periodic loop invocation, meaning they will never be executed except on a master change event (which should be exceedingly rare). I was saying that we need to figure out the best way to make sure we can execute AsyncAction steps when they do an automatic retry.

To figure out this logic, I was thinking a different PR would be better, since it would likely mean extra testing for however we decide we want to invoke AsyncAction steps (it may be as simple as adding the invocation in the onProcessed method of the ilm-retry-failed-step cluster state update, but we'll have to test).

andreidan · 2019-12-20

Ah, you're right Lee, I mixed up the AsyncWaitStep with the AsyncActionStep (async and step was all my brain read).

andreidan · 2019-12-20T14:25:17Z

@elasticmachine update branch

dakrone

LGTM, thanks Andrei

shwetathareja · 2019-12-24T06:30:38Z

Hi @andreidan,
Do you have plans to backport this to 6.8? We are facing a similar problem in our cluster, as pointed out in this issue, and the failure to roll over is causing a single index to grow huge.
If it is OK, I can raise a PR to backport this to 6.8.

andreidan · 2019-12-30T09:45:09Z

Hi @shwetathareja, thank you for your interest and I'm sorry to hear you're running into issues with the rollover action.
Unfortunately, we will not backport this to 6.8, as that release line will only receive bug fixes going forward. We are working extensively on making ILM more reliable (this particular enhancement has already been backported to 7.6, and more work is coming to make ILM and the rollover action more resilient). Would an upgrade to 7.6 be possible once it's out?

andreidan added 2 commits December 19, 2019 17:24

Rename innerExecute to applyAliasActions

156b556

andreidan added :Data Management/ILM+SLM Index and Snapshot lifecycle management v7.6.0 v8.0.0 labels Dec 19, 2019

dakrone requested changes Dec 19, 2019

andreidan added 3 commits December 20, 2019 14:07

Add the rollover index names in cluster update task name

aba5acd

Guard active shards observer by newState != oldState

d899a10

RolloverStep is not retryable

78aca58

Merge branch 'master' into ilm-retryable-rollover-step

71f8c43

andreidan changed the title ~~ILM: Make the rollover step retryable~~ Make the TransportRolloverAction execute in one cluster state update Dec 20, 2019

dakrone approved these changes Dec 20, 2019

andreidan merged commit 1ba4339 into elastic:master Dec 20, 2019

andreidan added the backport pending label Dec 20, 2019

andreidan mentioned this pull request Dec 20, 2019

Make the TransportRolloverAction execute in one cluster state update (#50388) #50442

Merged

andreidan removed the backport pending label Dec 20, 2019

shwetathareja mentioned this pull request Jan 9, 2020

index rollover running with "NORMAL" priority #50778

Closed

dakrone mentioned this pull request Jan 14, 2020

Rollover sometimes does not attach Rollover Info to index metadata #49413

Closed

polyfractal added the >enhancement label Jan 15, 2020

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

shwetathareja mentioned this pull request Mar 11, 2020

Rollover API failing with 400 in case settings name don't have "index." prefix or "index" section in json #53388

Closed

jakelandis removed the v8.0.0 label Jul 26, 2021

jakelandis added the v8.0.0-alpha1 label Jul 26, 2021
