Combine ILM shrink and force merge #73499
Comments
Pinging @elastic/es-core-features (Team:Core/Features)
@dakrone, can I work on this issue? I'm a heavy user of ILM and want to make more contributions to the feature.
@gaobinlong I appreciate the interest! For this one though, I think we should hold off. I'm not yet sure of the best way to implement this, whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps to be optimized.
@dakrone thanks for your reply. I will keep track of this issue and follow the development of ILM.
Maybe one argument for the latter is that we would likely want to also optimize the forcemerge + shrink + searchable_snapshot workflow to replace the step that increases the number of replicas of the shrunken index with taking a snapshot and doing a snapshot recovery?
@jpountz yes, with a logical plan we could re-order, elide, or enhance actions to make more combinations of actions efficient.
In addition to the DTS costs, there is another aspect of this proposal that I like a lot, which is the fact that we would reduce the CPU cost of the forcemerge operation by 2x since it would run on a single shard copy. This would be a win on its own, plus we could then have more discussions about shifting some of the CPU cost from natural merges to forced merges, e.g.
There is related discussion in "Can we avoid force-merging all shard copies?"
It's a common use case for an ILM policy to have a shrink action as well as a forcemerge action in the warm phase. However, in order to reduce DTS costs, we should investigate combining these actions.
Currently when performing a shrink, the following actions are taken by ILM (this is a subset):
The forcemerge action performs a simple forcemerge of the index, but this means the forcemerge is duplicated on every shard copy, and because merging is non-deterministic, the segments will likely differ between the nodes, leading to replication of segments.
There are at least two things we can do to help reduce DTS costs related to this:
Shrink into an index with zero replicas
When we shrink, ILM currently creates the shrunken index with the same replica count as the original. But since this happens transparently in the background, the shrunken index does not need any replicas when it is created. Instead, we can create the index with zero replicas, and increase the number of replicas to the original index's count prior to deletion of the original index.
Since shrink now has ILM resiliency, it means that in the event that something goes wrong, no data loss occurs, and ILM can retry.
By itself, this doesn't reduce DTS costs, because the data will still have to be replicated across the zone boundary regardless. It does help, however, when combined with the next enhancement:
Perform forcemerge prior to increasing the replica count
Forcemerge also ends up leading to replication across zone boundaries. However, if we perform the forcemerge at a point where the index has no replicas, then it only needs to be performed once, and the data will be replicated to a different zone only a single time.
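To make the "performed once" point concrete, here is a tiny back-of-the-envelope sketch. The function name and the simple model (one forcemerge execution per shard copy that exists at merge time) are my own assumptions for illustration, not anything ILM exposes.

```python
def forcemerge_executions(replicas: int, merge_before_replication: bool) -> int:
    """Count how many shard copies run the forcemerge.

    Assumed model: a forcemerge runs independently on every copy of a shard
    that exists when the merge is requested (the primary plus its replicas).
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    if merge_before_replication:
        # The index has zero replicas at merge time, so only the primary merges.
        return 1
    # Today's ordering: the primary and every replica each run their own merge.
    return 1 + replicas


if __name__ == "__main__":
    for r in (1, 2):
        before = forcemerge_executions(r, merge_before_replication=False)
        after = forcemerge_executions(r, merge_before_replication=True)
        print(f"replicas={r}: {before} merge(s) today, {after} with the proposal")
```

With a single replica this gives two merges today versus one with the proposed ordering, which is where the 2x CPU figure mentioned in the comments comes from.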
If we combine both of these behaviors, the new behavior looks like:
Here is a before picture:
And here is an after picture:
In both examples I treated the single node allocation rule (where ILM has to get a copy of each shard on the same node) as "smart" and not sending any data across zones. Still, this step is tedious, and it would be nice if we could skip it.
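For anyone who wants to try the combined ordering by hand, here is a rough sketch of the equivalent REST calls, built as plain data so nothing is actually sent. This is not how ILM implements its steps; the `Request` type, the function name, and the index and node names in the usage line are hypothetical, and alias swapping and waiting for shard allocation are left out. The endpoints themselves (`_settings`, `_shrink`, `_forcemerge`) are the public index APIs.

```python
from typing import List, NamedTuple, Optional


class Request(NamedTuple):
    """A REST call described as data (hypothetical helper type)."""
    method: str
    path: str
    body: Optional[dict]


def combined_shrink_plan(source: str, target: str, node_name: str,
                         target_shards: int, original_replicas: int,
                         max_num_segments: int = 1) -> List[Request]:
    """Order the calls so the forcemerge runs while the target has no replicas."""
    return [
        # 1. Prepare the source: a copy of every shard on one node, writes blocked.
        Request("PUT", f"/{source}/_settings", {
            "index.routing.allocation.require._name": node_name,
            "index.blocks.write": True,
        }),
        # 2. Shrink into a target with zero replicas, clearing the node requirement.
        Request("POST", f"/{source}/_shrink/{target}", {
            "settings": {
                "index.number_of_shards": target_shards,
                "index.number_of_replicas": 0,
                "index.routing.allocation.require._name": None,
            },
        }),
        # 3. Forcemerge once, on the only existing copy of each shard.
        Request("POST", f"/{target}/_forcemerge?max_num_segments={max_num_segments}", None),
        # 4. Only now restore the replica count, so merged segments are copied once.
        Request("PUT", f"/{target}/_settings", {
            "index.number_of_replicas": original_replicas,
        }),
        # 5. Delete the original index after the replicas are in place.
        Request("DELETE", f"/{source}", None),
    ]


if __name__ == "__main__":
    for req in combined_shrink_plan("logs-000001", "shrink-logs-000001",
                                    "warm-node-1", target_shards=1,
                                    original_replicas=1):
        print(req.method, req.path, req.body)
```

The only point of the sketch is the ordering: step 3 comes before step 4, so the merge happens once and the already-merged segments cross the zone boundary a single time.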