Fix CheckTargetShardsCountStep (#78460) #89176

Merged
merged 12 commits into elastic:main from test-fix-78460 on Aug 25, 2022

Conversation

@gmarouli (Contributor) commented on Aug 8, 2022:

The issue
The flaky test simulates the following:

  • Create a shrink policy with an invalid target shard count
  • Then change the policy to have a valid target shard count
  • Expectation: the check-target-shards-count will return true and the shrink operation will be successful.

What was happening in some cases in the background:

  • Create the shrink policy with an invalid target shard count
  • The check-target-shards-count step gets created and queued for execution with the invalid target shard count.
  • The task does not get enough priority to be executed yet
  • We change the policy to have a valid target shard count
  • We execute the queued task which still has the outdated target shard count.

Proof
We enriched the code with some extra logging to verify that the scenario above actually happens:

## Adding the check-target-shards-count task to the executingTasks

[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] #78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before the condition is even evaluated

[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated

[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] #78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]

Impact
We do not think the impact is significant for production clusters, because they see many more cluster state updates. However, it might inconvenience users who fix a policy and do not see the effect as soon as they could.

The fix
Our proposed fix is to not provide the target shard count upon task creation, but to retrieve it from the cluster state when the step executes. This way we ensure it uses the latest value.
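For illustration, here is a minimal sketch of what resolving the target shard count from the live cluster state could look like inside CheckTargetShardsCountStep. It is only a sketch: the accessor names (getPolicyMetadatas, getPolicy, getPhases, getActions, getNumberOfShards) and the hardcoded warm-phase lookup are assumptions for the example, not necessarily the exact code in this PR.

```java
// Sketch: read the target shard count from the current cluster state instead of the
// value captured when the step was created, so a concurrent policy update is picked up.
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata ilmMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    if (ilmMetadata == null) {
        return null; // ILM metadata missing from the cluster state
    }
    LifecyclePolicyMetadata policyMetadata = ilmMetadata.getPolicyMetadatas().get(policyName);
    if (policyMetadata == null) {
        return null; // the policy was deleted while the step was queued
    }
    // The real implementation would look up the phase the index is currently in;
    // "warm" is hardcoded here only to keep the sketch short.
    Phase warmPhase = policyMetadata.getPolicy().getPhases().get("warm");
    if (warmPhase == null) {
        return null;
    }
    ShrinkAction shrink = (ShrinkAction) warmPhase.getActions().get(ShrinkAction.NAME);
    return shrink == null ? null : shrink.getNumberOfShards();
}
```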

Future work
Currently, for every cluster state update we go through all the indices and check whether any step needs to be executed. This does not scale well. We would like to switch to a more efficient model, potentially using a cluster state observer. An issue will be created soon.

Resolves #78460

@elasticsearchmachine elasticsearchmachine added v8.5.0 needs:triage Requires assignment of a team area label labels Aug 8, 2022
@gmarouli gmarouli added >test Issues or PRs that are addressing/adding tests :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Aug 8, 2022
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels Aug 8, 2022
@gmarouli (Contributor, Author) commented on Aug 9, 2022:

@elasticmachine update branch

@original-brownbear (Member) left a comment:

Looks just fine to me in principle, just one point on the seemingly redundant interface that was added here.


```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
Member:

More of a NIT: Should we be a little more careful here to avoid NPEs if the policy got concurrently modified/deleted?

@gmarouli (Contributor, Author):

Hm, good point. We should, but then we need to see how to handle it. I do not have a clear picture yet of how we handle the deletion, or of what happens if the step got removed. Do I remember correctly that we keep a cached version of the step being executed in the index metadata? Should we fall back to that?

Member:

I think throwing an appropriate exception that explains what happened is just fine here. No need to add additional logic, just figured it'd be nice to avoid a possible NPE and replace it with an easy to interpret exception.
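As a hedged illustration, the suggestion amounts to something like the following guard; the exception message is made up for the example rather than taken from the actual change:

```java
// Illustrative only: fail with a readable message instead of an NPE if the policy
// disappeared between queuing the step and executing it.
IndexLifecycleMetadata ilmMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
if (ilmMetadata == null || ilmMetadata.getPolicyMetadatas().containsKey(policyName) == false) {
    throw new IllegalStateException(
        "policy [" + policyName + "] no longer exists in the cluster state, "
            + "cannot determine the target number of shards for the shrink action"
    );
}
```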

@gmarouli (Contributor, Author):

I see, but I am still doubting whether that is sufficient. What concerns me about this approach is that the step will throw errors and get stuck there permanently. That will also change the behaviour compared to what the code does now.
If this had happened in the current set-up, the check would yield either true or false and it would continue. That's what I am trying to achieve by looking for a graceful fallback. Don't you think that's worth it?

Member:

That's a fair point if we get stuck in an error state permanently. I thought other functionality would take care of clearing things up if the policy disappears? (Maybe that's something to fix separately if not?)
That said, that's a different issue from what we're working on here IMO. Here I was just aiming at getting an easy-to-understand exception.

> the check would yield either true or false and it would continue.

Right, but then we'd throw on the next step anyway (at least we should), wouldn't we? So it's likely not really a change from the user's perspective.

@gmarouli (Contributor, Author) commented on Aug 10, 2022:

@andreidan if I understand correctly, then the safest option that also fixes the outdated target shard count issue is to read it from the cached phase, right?
Assuming that if you update only the target shard count, the cached phase will be updated because the number of steps remains the same.

@andreidan (Contributor) commented on Aug 10, 2022:

++ that's a safe option, however it might incur a performance penalty (parsing the cached JSON on every step execution).
@original-brownbear how do you feel about that?

@gmarouli (Contributor, Author):

What if we combine the two approaches to be a bit more efficient? We could first compare whether the number of shards in the cluster state and in the step are the same (probably the most common case). If they are different, we try to parse the cached phase in an effort to improve the user experience, and we accept the performance hit since this will probably not be a very common situation.
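A rough sketch of the hybrid approach being described, assuming a numberOfShards field captured at step creation and a hypothetical parseCachedPhaseTargetShards helper standing in for the cached-phase JSON parsing (neither is the actual implementation):

```java
// Sketch of the hybrid idea: use the cheap comparison in the common case and only pay
// the JSON-parsing cost when the policy appears to have changed concurrently.
// Requires java.util.Objects.
Integer clusterStateShards = getTargetNumberOfShards(policyName, clusterState); // from the live policy
Integer stepShards = this.numberOfShards;                                       // captured at step creation
Integer effectiveShards;
if (Objects.equals(clusterStateShards, stepShards)) {
    // Common case: nothing changed since the step was queued.
    effectiveShards = stepShards;
} else {
    // Rare case: a concurrent policy update; fall back to the cached phase definition.
    // parseCachedPhaseTargetShards is a hypothetical helper for this sketch.
    effectiveShards = parseCachedPhaseTargetShards(indexMetadata);
}
```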

@andreidan (Contributor) commented on Aug 11, 2022:

@gmarouli That's an interesting idea.

Do you think this is worth doing though? In a production environment the ClusterStateWaitStep will eventually get the correct value and this affects concurrent updates only.

IMO we should fix the test and trigger a cluster state update if the test fails, just to make sure it has not run into this race condition.

@gmarouli (Contributor, Author):

You are probably right, it's probably not worth the trouble. I wanted to explore all the options before changing the test to work around the race condition instead.

@original-brownbear (Member) left a comment:

LGTM, but maybe wait for Andrei's review as well :)
Also, I still wouldn't mind a nicer exception instead of an NPE for a missing policy below, but that's optional from my end.


```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
@andreidan (Contributor) commented on Aug 10, 2022:

Thanks for looking into this failure Mary

I think this will potentially break our contract towards the phase we cache for execution.

For example, let's say an index is in the warm phase, which currently has the actions shrink (to 3 shards) and forcemerge, and our index is in the shrink action.
A user updates the warm phase to shrink to 1 shard (also removing forcemerge). Since we removed an action, all the indices currently in the warm phase MUST execute the previously defined (and cached) phase, in order to honour the started flow: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/PhaseCacheManagement.java#L243
So currently (before this PR) an index in the warm phase will continue to execute shrink to 3 shards and forcemerge.

With this change, the index would pick up the shrink action with a different target (1 shard) while forcemerge would also still be executed (as we didn't update the cached phase). This leads to executing a mix of the two phase definitions.

I don't think we should make this change, but rather rework the test.

@gmarouli (Contributor, Author):

@elasticmachine update branch

@gmarouli gmarouli requested a review from andreidan August 22, 2022 12:57
@gmarouli (Contributor, Author):

@elasticmachine update branch

@andreidan (Contributor) left a comment:

LGTM thanks for fixing this

```java
@@ -396,6 +398,18 @@ public static String getSnapshotState(RestClient client, String snapshot) throws
        return (String) snapResponse.get("state");
    }

    @Nullable
    public static String waitAndGetShrinkIndexNameWithExtraClusterStateChange(RestClient client, String originalIndex)
```
Contributor:

Can you please document the why behind this? (the cluster state batching and such)

@gmarouli (Contributor, Author):

Ah very good point! On it.
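The documentation that was asked for could read roughly like the Javadoc sketch below; the wording is reconstructed from this discussion, not the text that was actually committed:

```java
/**
 * Variant of the shrink-index wait helper that also triggers an extra, otherwise unrelated,
 * cluster state update while waiting.
 *
 * check-target-shards-count is a cluster-state wait step, so it is only re-evaluated when a
 * new cluster state is published. In this test the policy update can be batched together with
 * the already-queued evaluation of the step, which then runs against the outdated target shard
 * count. Production clusters publish cluster states frequently enough for the step to be
 * re-evaluated soon after, but a quiet test cluster might not, so we force one more cluster
 * state change to make sure the step is re-checked against the updated policy.
 */
```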

@gmarouli (Contributor, Author):

@elasticmachine update branch

@gmarouli gmarouli added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 25, 2022
@elasticsearchmachine elasticsearchmachine merged commit 862c885 into elastic:main Aug 25, 2022
@gmarouli gmarouli deleted the test-fix-78460 branch August 25, 2022 09:41

Successfully merging this pull request may close these issues.

[CI] ShrinkActionIT testAutomaticRetryFailedShrinkAction failing