Fix CheckTargetShardsCountStep (#78460) #89176
Conversation
Pinging @elastic/es-data-management (Team:Data Management)

@elasticmachine update branch
Looks just fine to me in principle, just one point on the seemingly redundant interface that was added here.
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WithTargetNumberOfShards.java
```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
More of a NIT: Should we be a little more careful here to avoid NPEs if the policy got concurrently modified/deleted?
Hm, good point. We should, but then we need to see how to handle it. I do not yet have a clear picture of how we handle the deletion, or of what happens if the step got removed. Do I remember correctly that we keep a cached version of the step being executed in the index metadata? Should we fall back to that?
I think throwing an appropriate exception that explains what happened is just fine here. No need to add additional logic; I just figured it'd be nice to avoid a possible NPE and replace it with an easy-to-interpret exception.
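As a rough illustration of what such a guard could look like, here is a minimal, simplified sketch. `PolicyStore` and `putPolicy` are hypothetical stand-ins for the real `IndexLifecycleMetadata` lookup, not the Elasticsearch API; the point is only the null check that turns a would-be NPE into a descriptive exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ILM policy metadata lookup discussed above.
class PolicyStore {
    private final Map<String, Integer> targetShardsByPolicy = new HashMap<>();

    void putPolicy(String name, int targetShards) {
        targetShardsByPolicy.put(name, targetShards);
    }

    int getTargetNumberOfShards(String policyName) {
        Integer target = targetShardsByPolicy.get(policyName);
        if (target == null) {
            // Explain what happened instead of blowing up with an NPE later.
            throw new IllegalStateException(
                "policy [" + policyName + "] does not exist, it may have been deleted concurrently");
        }
        return target;
    }
}
```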
I see; I am still doubting whether that is sufficient, though. What concerns me about this approach is that it will throw errors and get stuck there permanently. It would also change the code's behavior compared to what it does now.
If this had happened in the current set-up, the check would yield either true or false and execution would continue. That's what I am trying to achieve by looking for a graceful fallback. Don't you think that's worth it?
That's a fair point if we get stuck in error permanently. I thought other functionality would take care of cleaning things up if the policy disappears? (Maybe that's something to fix separately if not?)
That said, that's a different issue from what we're working on here IMO. Here I was just aiming at getting an easy-to-understand exception.

> the check would yield either true or false and it would continue.

Right, but then we'd throw on the next step anyway (at least we should), wouldn't we? So it's likely not really a change from the user's perspective.
@andreidan if I understand correctly, then the safest option, which also fixes the outdated step count issue, is to read it from the cached phase, right?
Assuming that if you update only the target shard count, the cached phase will be updated because the number of steps remained the same.
++ that's a safe option, however it might incur a performance penalty (parsing the cached JSON on every step execution).
@original-brownbear how do you feel about that?
What if we combine the two approaches to be a bit more efficient? We could compare whether the number of shards in the cluster state and in the step are the same (probably the most common case). If they are different, we try to parse the cached phase in an effort to improve the user experience, and we accept the performance hit, since this will probably not be a very common situation.
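The combined lookup being proposed could be sketched roughly as follows. This is a hypothetical illustration with made-up names, not the real step code: the fast path returns immediately when the step's configured value matches the live cluster state, and only falls back to the expensive cached-phase parse on a mismatch.

```java
import java.util.function.IntSupplier;

class CombinedTargetShardsLookup {
    // parseCachedPhase stands in for parsing the cached phase JSON (expensive).
    static int resolveTargetShards(int configuredInStep, int fromClusterState, IntSupplier parseCachedPhase) {
        if (configuredInStep == fromClusterState) {
            return configuredInStep; // common case: values agree, no JSON parsing needed
        }
        // rare case: the policy was modified concurrently; accept the parsing cost
        return parseCachedPhase.getAsInt();
    }
}
```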
@gmarouli That's an interesting idea.
Do you think this is worth doing, though? In a production environment the ClusterStateWaitStep will eventually get the correct value, and this affects concurrent updates only.
IMO we should fix the test and trigger a cluster state update if the test fails, just to make sure it hasn't run into this race condition.
You are probably right... It's probably not worth the trouble. I wanted to explore all the options before changing the test to account for the race condition.
LGTM, but maybe wait for Andrei's review as well :)
Also, I still wouldn't mind a nicer exception instead of an NPE for a missing policy below, but that's optional too from my end.
...gin/core/src/test/java/org/elasticsearch/xpack/core/ilm/CheckTargetShardsCountStepTests.java
```java
private Integer getTargetNumberOfShards(String policyName, ClusterState clusterState) {
    IndexLifecycleMetadata indexLifecycleMetadata = clusterState.metadata().custom(IndexLifecycleMetadata.TYPE);
    LifecycleAction lifecycleAction = indexLifecycleMetadata.getPolicyMetadatas()
```
Thanks for looking into this failure Mary

I think this will potentially break our contract towards the phase we cache for execution.

For e.g. let's say an index is in the `warm` phase that currently has the actions shrink to 3 shards and `forcemerge`. Our index is in the `shrink` action.

A user updates the `warm` phase to shrink to 1 shard (also removing `forcemerge`). Since we removed an action, all the indices currently in the `warm` phase MUST execute the previously defined (and cached) phase - to honour the started flow https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/PhaseCacheManagement.java#L243

So currently (before this PR) an index in the `warm` phase will continue to execute shrink to 3 and `forcemerge`.

With this change, the user will pick up a `shrink` action to a different target (1 shard) and the `forcemerge` will also be executed (as we didn't update the cached phase). This leads to executing a mix of phases.

I don't think we should make this change, but rather rework the test.
@elasticmachine update branch

@elasticmachine update branch
LGTM thanks for fixing this
```diff
@@ -396,6 +398,18 @@ public static String getSnapshotState(RestClient client, String snapshot) throws
         return (String) snapResponse.get("state");
     }
 
+    @Nullable
+    public static String waitAndGetShrinkIndexNameWithExtraClusterStateChange(RestClient client, String originalIndex)
```
Can you please document the "why" behind this? (the cluster state batching and such)
Ah very good point! On it.
@elasticmachine update branch
The issue

The flaky test simulates the following: the `check-target-shards-count` step will return true and the shrink operation will be successful.

What was happening in some cases in the background: the `check-target-shards-count` step gets created and queued to be executed with the invalid target shards count.

Proof
We enriched the code with some extra logging to verify that the above scenario is actually what happens.
Impact

We do not think the impact is significant for production clusters, because there are many more cluster state updates there. However, it might cause some inconvenience to users who fixed a policy and do not see the effect as soon as they could have.
The fix

Our proposed fix is to not provide the target shard count upon task creation, but to retrieve it from the cluster state at execution time. This way we ensure the step sees the newest value.
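The shape of the fix can be sketched like this. It is a simplified, hypothetical illustration, not the real `CheckTargetShardsCountStep`: the task holds a supplier that reads the target shard count from shared state at execution time instead of a value snapshotted at creation time, and the divisibility condition stands in for the real validation of the shrink target.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: read the target at execution time, not creation time.
class TargetShardsCheck {
    private final IntSupplier targetShardsFromClusterState;

    TargetShardsCheck(IntSupplier targetShardsFromClusterState) {
        this.targetShardsFromClusterState = targetShardsFromClusterState;
    }

    // true when the source index can be shrunk to the *current* target
    boolean isValidTarget(int sourceNumberOfShards) {
        int target = targetShardsFromClusterState.getAsInt(); // newest value wins
        return target > 0 && sourceNumberOfShards % target == 0;
    }
}
```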
Future work

Currently, for every cluster state update we go through all the indices and check whether any step needs to be executed. This doesn't scale well. We would like to switch to a more efficient model, potentially with a cluster state observer. An issue will be created soon.
Resolves #78460