Shrink sometimes fails with no obvious cause #56062
Comments
Pinging @elastic/es-core-features (:Core/Features/Indices APIs)
I did a little testing on a two-node hot/warm 7.1.1 ESS cluster. One thing I found immediately is that shrink will not happen if all primaries cannot be allocated to a single node. Granted, this is a two-node cluster and the index is configured with 2 primaries and 1 replica. The replicas have to be dropped to shrink:
Shrink can happen now.
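The exact request used in that test is not shown in this thread. As a rough sketch only, dropping replicas is normally done through the update index settings API by setting `index.number_of_replicas` to 0. The helper name `replica_settings_request` and the index name `my-index` below are hypothetical and only illustrate the shape of the request; nothing is sent to a cluster.

```python
import json
from typing import Dict, Tuple


def replica_settings_request(index_name: str, replicas: int = 0) -> Tuple[str, str, Dict]:
    """Build (method, path, body) for an update-index-settings call.

    Hypothetical helper for illustration: it only assembles the request,
    it does not contact Elasticsearch.
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index_name}/_settings", body


if __name__ == "__main__":
    # "my-index" is a placeholder index name, not one from this issue.
    method, path, body = replica_settings_request("my-index")
    print(method, path)
    print(json.dumps(body, indent=2))
```

The resulting method, path, and body can be pasted into Kibana Dev Tools or sent with any HTTP client.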
So while the condition may be expected and valid for the Shrink API (we'll only shrink if all primaries are on the same node), ILM should be able to reconcile it and follow the policy.
It shouldn't be required that all primary shards are on the same node, just that at least one copy of each shard is on a single node. This is a hard requirement for how shrinking indices works - it involves manipulating the shard files on the filesystem in a way that can only be done if one node has a copy of each shard. There's no way we could work around this for ILM, although we do try to do it intelligently - at least in later versions. I haven't run a test yet, but I believe the issue you hit was resolved in #43300 (6.8.2+ or 7.2.1+). That said, I think there's still a separate issue as described in the original ticket - note that the issue originally hit broke on step
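To make the requirement above concrete, here is a minimal sketch of the check "does some node hold a copy of every shard?". It is not how Elasticsearch or ILM implements it; the function name `nodes_holding_all_shards` and the `(shard_id, node_name)` input format are assumptions made for illustration. Primaries and replicas are treated the same, since any copy counts.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def nodes_holding_all_shards(copies: Iterable[Tuple[int, str]]) -> List[str]:
    """Return the nodes that hold a copy (primary or replica) of every shard.

    `copies` is an iterable of (shard_id, node_name) pairs, one per
    started shard copy. This input format is a hypothetical simplification.
    """
    shards_by_node: Dict[str, Set[int]] = defaultdict(set)
    all_shards: Set[int] = set()
    for shard_id, node in copies:
        shards_by_node[node].add(shard_id)
        all_shards.add(shard_id)
    # A node qualifies only if its set of shards covers every shard id seen.
    return sorted(n for n, held in shards_by_node.items() if held == all_shards)


if __name__ == "__main__":
    # 2 primaries + 1 replica each on two nodes: each node has every shard.
    with_replicas = [(0, "node-a"), (0, "node-b"), (1, "node-b"), (1, "node-a")]
    print(nodes_holding_all_shards(with_replicas))

    # Replicas dropped, primaries on different nodes: no node qualifies
    # until one shard is relocated.
    without_replicas = [(0, "node-a"), (1, "node-b")]
    print(nodes_holding_all_shards(without_replicas))
```

Under this model, the 2-primary, 1-replica index on two nodes already satisfies the requirement, which is consistent with the point that the primaries themselves do not all need to sit on one node.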
@gwbrown, thanks for sharing this. Let me share here a potential workaround to try again the
I have seen the exact problem you have shared in one environment, but I haven't been able to determine why the shrink never occurred... there was no trace at all of the shrunk index... I moved the original index back to the
Shrink can sometimes fail with no obvious cause, leading to trouble with ILM (and particularly stopping ILM).
I've only seen this occur a few times, and in each case the relevant logs had aged out by the time I got to see the cluster with the problem. This issue is intended to track failures like this to see if we can spot any patterns.
One example is an ILM explain output from a v7.1.1 cluster that has a step_info like this:
The index in question did not proceed from that step for roughly 10 days, with no obvious cause. The situation was fixed by removing ILM from the index. In this case, no shrunken index had been created, but I've seen cases where the shrunken index was created.