Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the TransportRolloverAction execute in one cluster state update #50388

Merged
merged 6 commits into from
Dec 20, 2019

Conversation

andreidan
Copy link
Contributor

@andreidan andreidan commented Dec 19, 2019

This commit makes the TransportRolloverAction more resilient, by having it
execute only one cluster state update that creates the new (rollover index), rolls
over the alias from the source to the target index and set the RolloverInfo on the
source index. Before, these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded

This commit makes the rollover more resilient, by having it execute only
one cluster state update that creates the new (rollover index), rolls over
the alias from the source to the target index and set the RolloverInfo on the
source index. Before these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded
@andreidan andreidan added :Data Management/ILM+SLM Index and Snapshot lifecycle management v7.6.0 v8.0.0 labels Dec 19, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andreidan, I left two minor comments and one about moving part of this to a separate PR.

rolloverIndexName, rolloverRequest);
CreateIndexClusterStateUpdateRequest createIndexRequest = prepareCreateIndexRequest(unresolvedName,
rolloverIndexName, rolloverRequest);
clusterService.submitStateUpdateTask("rollover_index", new ClusterStateUpdateTask() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the index name into the task string?

}, listener::onFailure));
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
activeShardsObserver.waitForActiveShards(new String[]{rolloverIndexName},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect it to happen (it would be a bad situation if it did), but it might be a good idea to wrap this in:

if (newState.equals(oldState) == false) {
    ...
}

@@ -30,6 +30,11 @@ public RolloverStep(StepKey key, StepKey nextStepKey, Client client) {
super(key, nextStepKey, client);
}

@Override
public boolean isRetryable() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this in a separate PR. We still need to solve the complexity around retrying AsyncAction steps, which will require slightly different execution than the other step types, due to their exactly-once re-invocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we satisfy the exactly-once re-invocation already. A failed AsyncActionStep will be moved into the ErrorStep on failure and on the next ILM periodic loop we'll move the index policy back into the failed step (the AsyncActionStep) which will be executed on the next async steps execution cycle (on a master change event or the subsequent periodic ILM loop). Am I missing something here? I'm happy to separate the PRs either way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AsyncActionSteps don't get executed by the periodic loop invocation, meaning they will never be executed except on a master change event (which should be exceedingly rare). I was saying that we need to figure out the best way to make sure we can execute AsyncAction steps when they do an automatic retry.

To figure out this logic I was thinking a different PR would be better since it would likely mean extra testing for however we decide we want invoke AsyncAction steps (it may be as simple as adding the invocation in the onProcessed method of the ilm-retry-failed-step cluster state update, but we'll have to test).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right Lee, I mixed the AsyncWaitStep with the AsyncActionStep (async and step was all my brain read)

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan andreidan changed the title ILM: Make the rollover step retryable Make the TransportRolloverAction execute in one cluster state update Dec 20, 2019
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Andrei

@andreidan andreidan merged commit 1ba4339 into elastic:master Dec 20, 2019
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Dec 20, 2019
…lastic#50388)

This commit makes the TransportRolloverAction more resilient, by having it execute
only one cluster state update that creates the new (rollover index), rolls over
the alias from the source to the target index and set the RolloverInfo on the
source index. Before these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded

* Rename innerExecute to applyAliasActions

(cherry picked from commit 1ba4339)
Signed-off-by: Andrei Dan <[email protected]>
andreidan added a commit that referenced this pull request Dec 20, 2019
…50388) (#50442)

This commit makes the TransportRolloverAction more resilient, by having it execute
only one cluster state update that creates the new (rollover index), rolls over
the alias from the source to the target index and set the RolloverInfo on the
source index. Before these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded

* Rename innerExecute to applyAliasActions

(cherry picked from commit 1ba4339)
Signed-off-by: Andrei Dan <[email protected]>
@shwetathareja
Copy link

Hi @andreidan,
Do you have plans to backport this to 6.8? We are facing similar problem in our cluster as pointed out here issue and failing to rollover is causing single index to grow huge.
If it is ok, I can raise the PR for backporting to 6.8.

@andreidan
Copy link
Contributor Author

Hi @shwetathareja, thank you for your interest and I'm sorry to hear you're running into issues with the rollover action.
Unfortunately, we will not backport this to 6.8 as that release line will only receive bug fixes going forward. We are working extensively on making ILM more reliable (with this particular enhancement backported already to 7.6 and more to come with regards to making ILM and the rollover action more resilient). Would an upgrade to 7.6 be possible once it's out?

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…lastic#50388)

This commit makes the TransportRolloverAction more resilient, by having it execute
only one cluster state update that creates the new (rollover index), rolls over
the alias from the source to the target index and set the RolloverInfo on the
source index. Before these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded

* Rename innerExecute to applyAliasActions

Co-authored-by: Elastic Machine <[email protected]>
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants