Fix CheckTargetShardsCountStep (#78460) #89176

Merged
merged 12 commits into elastic:main from test-fix-78460 on Aug 25, 2022

Conversation

@gmarouli (Contributor) commented on Aug 8, 2022:

The issue
The flaky test simulates the following:

  • Create a shrink policy with an invalid target shard count
  • Then change the policy to have a valid target shard count
  • Expectation: the check-target-shards-count will return true and the shrink operation will be successful.

What was happening in some cases in the background:

  • Create the shrink policy with an invalid target shard count
  • The check-target-shards-count step gets created and queued for execution with the invalid target shard count.
  • The task does not get enough priority to be executed yet
  • We change the policy to have a valid target shard count
  • We execute the queued task which still has the outdated target shard count.

Proof
We enriched the code with some extra logging to verify that the scenario above actually happens:

## Adding the check-target-shards-count task to the executingTasks

[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] #78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before the condition is even evaluated

[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated

[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] #78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]

Impact
We do not think the impact is significant for production clusters, because they see many more cluster state updates. However, it might inconvenience users who fix a policy and do not see the effect as soon as they could.

The fix
Our proposed fix is to not provide the target shard count upon task creation, but to retrieve it from the cluster state when the step executes. This way we ensure it uses the latest value.
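For illustration, here is a minimal sketch of what resolving the target shard count from the live cluster state could look like inside CheckTargetShardsCountStep. It is only a sketch: the accessor names (getPolicyMetadatas, getPolicy, getPhases, getActions, getNumberOfShards) and the hardcoded warm-phase lookup are assumptions for the example, not necessarily the exact code in this PR.

```java
// Sketch: read the target shard count from the current cluster state instead of the
// value captured when the step was created, so a concurrent policy update is picked up.
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata ilmMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    if (ilmMetadata == null) {
        return null; // ILM metadata missing from the cluster state
    }
    LifecyclePolicyMetadata policyMetadata = ilmMetadata.getPolicyMetadatas().get(policyName);
    if (policyMetadata == null) {
        return null; // the policy was deleted while the step was queued
    }
    // The real implementation would look up the phase the index is currently in;
    // "warm" is hardcoded here only to keep the sketch short.
    Phase warmPhase = policyMetadata.getPolicy().getPhases().get("warm");
    if (warmPhase == null) {
        return null;
    }
    ShrinkAction shrink = (ShrinkAction) warmPhase.getActions().get(ShrinkAction.NAME);
    return shrink == null ? null : shrink.getNumberOfShards();
}
```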

Future work
Currently, for every cluster state update we go through all the indices and check whether any step needs to be executed. This does not scale well. We would like to switch to a more efficient model, potentially using a cluster state observer. An issue will be created soon.

Resolves #78460

@elasticsearchmachine elasticsearchmachine added v8.5.0 needs:triage Requires assignment of a team area label labels Aug 8, 2022
@gmarouli gmarouli added >test Issues or PRs that are addressing/adding tests :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Aug 8, 2022
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels Aug 8, 2022
@gmarouli (Contributor, Author) commented on Aug 9, 2022:

@elasticmachine update branch

@original-brownbear (Member) left a comment:

Looks just fine to me in principle, just one point on the seemingly redundant interface that was added here.


```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
Member:

More of a NIT: Should we be a little more careful here to avoid NPEs if the policy got concurrently modified/deleted?

@gmarouli (Contributor, Author):

Hm, good point. We should, but then we need to see how to handle it. I do not have a clear picture yet of how we handle the deletion, or of what happens if the step got removed. Do I remember correctly that we keep a cached version of the step being executed in the index metadata? Should we fall back to that?

Member:

I think throwing an appropriate exception that explains what happened is just fine here. No need to add additional logic, just figured it'd be nice to avoid a possible NPE and replace it with an easy to interpret exception.
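As a hedged illustration, the suggestion amounts to something like the following guard; the exception message is made up for the example rather than taken from the actual change:

```java
// Illustrative only: fail with a readable message instead of an NPE if the policy
// disappeared between queuing the step and executing it.
IndexLifecycleMetadata ilmMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
if (ilmMetadata == null || ilmMetadata.getPolicyMetadatas().containsKey(policyName) == false) {
    throw new IllegalStateException(
        "policy [" + policyName + "] no longer exists in the cluster state, "
            + "cannot determine the target number of shards for the shrink action"
    );
}
```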

@gmarouli (Contributor, Author):

I see, but I am still doubting whether that is sufficient. What concerns me about this approach is that the step will throw errors and get stuck there permanently. That will also change the behaviour compared to what the code does now.
If this had happened in the current set-up, the check would yield either true or false and it would continue. That's what I am trying to achieve by looking for a graceful fallback. Don't you think that's worth it?

Member:

That's a fair point if we get stuck in an error state permanently. I thought other functionality would take care of clearing things up if the policy disappears? (Maybe that's something to fix separately if not?)
That said, that's a different issue from what we're working on here IMO. Here I was just aiming at getting an easy-to-understand exception.

> the check would yield either true or false and it would continue.

Right, but then we'd throw on the next step anyway (at least we should), wouldn't we? So it's likely not really a change from the user's perspective.

@gmarouli (Contributor, Author) commented on Aug 10, 2022:

@andreidan if I understand correctly, then the safest option that also fixes the outdated target shard count issue is to read it from the cached phase, right?
Assuming that if you update only the target shard count, the cached phase will be updated because the number of steps remains the same.

@andreidan (Contributor) commented on Aug 10, 2022:

++ that's a safe option, however it might incur a performance penalty (parsing the cached JSON on every step execution).
@original-brownbear how do you feel about that?

@gmarouli (Contributor, Author):

What if we combine the two approaches to be a bit more efficient? We could first compare whether the number of shards in the cluster state and in the step are the same (probably the most common case). If they are different, we try to parse the cached phase in an effort to improve the user experience, and we accept the performance hit since this will probably not be a very common situation.
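A rough sketch of the hybrid approach being described, assuming a numberOfShards field captured at step creation and a hypothetical parseCachedPhaseTargetShards helper standing in for the cached-phase JSON parsing (neither is the actual implementation):

```java
// Sketch of the hybrid idea: use the cheap comparison in the common case and only pay
// the JSON-parsing cost when the policy appears to have changed concurrently.
// Requires java.util.Objects.
Integer clusterStateShards = getTargetNumberOfShards(policyName, clusterState); // from the live policy
Integer stepShards = this.numberOfShards;                                       // captured at step creation
Integer effectiveShards;
if (Objects.equals(clusterStateShards, stepShards)) {
    // Common case: nothing changed since the step was queued.
    effectiveShards = stepShards;
} else {
    // Rare case: a concurrent policy update; fall back to the cached phase definition.
    // parseCachedPhaseTargetShards is a hypothetical helper for this sketch.
    effectiveShards = parseCachedPhaseTargetShards(indexMetadata);
}
```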

@andreidan (Contributor) commented on Aug 11, 2022:

@gmarouli That's an interesting idea.

Do you think this is worth doing though? In a production environment the ClusterStateWaitStep will eventually get the correct value and this affects concurrent updates only.

IMO we should fix the test and trigger a cluster state update if the test fails, just to make sure it has not run into this race condition.

@gmarouli (Contributor, Author):

You are probably right, it's probably not worth the trouble. I wanted to explore all the options before changing the test to work around the race condition instead.

@original-brownbear (Member) left a comment:

LGTM, but maybe wait for Andrei's review as well :)
Also, I still wouldn't mind a nicer exception instead of an NPE for a missing policy below, but that's optional from my end.


```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
@andreidan (Contributor) commented on Aug 10, 2022:

Thanks for looking into this failure Mary

I think this will potentially break our contract towards the phase we cache for execution.

For example, let's say an index is in the warm phase, which currently has the actions shrink (to 3 shards) and forcemerge, and our index is in the shrink action.
A user updates the warm phase to shrink to 1 shard (also removing forcemerge). Since we removed an action, all the indices currently in the warm phase MUST execute the previously defined (and cached) phase, in order to honour the started flow: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/PhaseCacheManagement.java#L243
So currently (before this PR) an index in the warm phase will continue to execute shrink to 3 shards and forcemerge.

With this change, the index would pick up the shrink action with a different target (1 shard) while forcemerge would also still be executed (as we didn't update the cached phase). This leads to executing a mix of the two phase definitions.

I don't think we should make this change, but rather rework the test.

@gmarouli (Contributor, Author):

@elasticmachine update branch

@gmarouli gmarouli requested a review from andreidan August 22, 2022 12:57
@gmarouli (Contributor, Author):

@elasticmachine update branch

@andreidan (Contributor) left a comment:

LGTM thanks for fixing this

```java
@@ -396,6 +398,18 @@ public static String getSnapshotState(RestClient client, String snapshot) throws
        return (String) snapResponse.get("state");
    }

    @Nullable
    public static String waitAndGetShrinkIndexNameWithExtraClusterStateChange(RestClient client, String originalIndex)
```
Contributor:

Can you please document the why behind this? (the cluster state batching and such)

@gmarouli (Contributor, Author):

Ah very good point! On it.
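The documentation that was asked for could read roughly like the Javadoc sketch below; the wording is reconstructed from this discussion, not the text that was actually committed:

```java
/**
 * Variant of the shrink-index wait helper that also triggers an extra, otherwise unrelated,
 * cluster state update while waiting.
 *
 * check-target-shards-count is a cluster-state wait step, so it is only re-evaluated when a
 * new cluster state is published. In this test the policy update can be batched together with
 * the already-queued evaluation of the step, which then runs against the outdated target shard
 * count. Production clusters publish cluster states frequently enough for the step to be
 * re-evaluated soon after, but a quiet test cluster might not, so we force one more cluster
 * state change to make sure the step is re-checked against the updated policy.
 */
```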

@gmarouli (Contributor, Author):

@elasticmachine update branch

@gmarouli gmarouli added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 25, 2022
@elasticsearchmachine elasticsearchmachine merged commit 862c885 into elastic:main Aug 25, 2022
@gmarouli gmarouli deleted the test-fix-78460 branch August 25, 2022 09:41

Successfully merging this pull request may close these issues.

[CI] ShrinkActionIT testAutomaticRetryFailedShrinkAction failing