Revisit ILM retry strategy for additional conditions #42824
Comments
Pinging @elastic/es-core-features
This seems to be the same issue we are facing in our prod environment.
It's possible for force merges kicked off by ILM to silently stop (due to a node relocating, for example), in which case the segment count may not reach what the user configured and the subsequent `SegmentCountStep` may wait indefinitely for the expected segment count. Because of this, this commit makes force merges "best effort" and changes the `SegmentCountStep` to simply report (at INFO level) if the merge was not successful. Relates to #42824. Resolves #43245.
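For anyone diagnosing an index that appears stuck here, a quick sketch of how to check which ILM step it is waiting on and whether the merge actually completed (the index name `my-index-000001` is just a placeholder):

```
# Show which ILM step the index is currently waiting on
GET my-index-000001/_ilm/explain

# Show per-shard segment counts to verify whether the force merge completed
GET _cat/segments/my-index-000001?v
```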
I ran into this when running out of disk space on our ECE instance. While it was easy to expand the nodes, it was very user-hostile to make me manually trigger a retry on my 28 failed indexes that have the same ILM policy configured.
We are sorry about the poor experience that you had here. We recognize that problems like this are serious usability issues. We have been making a concerted effort to make Elasticsearch more resilient in the face of errors in a way that requires less intervention from a human: we think that when the system can recover on its own, it should. ILM in particular is an area where we are investing heavily in making the system more resilient to errors so that it recovers automatically.
Thank you! I will make sure to forward this information to my team.
Here's an issue that you can use to track our progress on this work specific to ILM: #48183
Closing this in favor of #48183, where we will track the work for this. |
Regarding 'Make ILM force merging best effort (#43246)': on a cluster with 3 shards across 3 data nodes, a force merge with max_num_segments=1 against an Elasticsearch 7.0.1 cluster takes twice as long as against an Elasticsearch 6.8.13 cluster.
@gaocx2000cn force merging should take roughly the same amount of time; there is no functional difference in force merging between those versions. The only difference would be the Lucene version.
Currently, ILM does not retry on most step errors other than `SnapshotInProgressException`.
The following are a few scenarios users have run into in the field where having a retry strategy for other errors or conditions would be helpful:
ILM will leave an index at the forcemerge action's segment-count step, waiting for the shards to merge.
However, the segment-count step does not have any knowledge of whether there is still an outstanding force merge operation running against the index. ILM does not currently retry forcemerge, so it will just keep waiting in segment-count until either 1) the user runs force merge outside of ILM to complete the merge, or 2) the user instructs ILM to re-run force merge by manually moving the step back to forcemerge.
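As a rough sketch of those two workarounds (the index name and the `warm` phase are assumptions; adjust them to the policy in use):

```
# Workaround 1: run the force merge outside of ILM so the configured segment count is reached
POST my-index-000001/_forcemerge?max_num_segments=1

# Workaround 2: instruct ILM to re-run force merge by moving the index from the
# segment-count step back to the forcemerge step
POST _ilm/move/my-index-000001
{
  "current_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "segment-count"
  },
  "next_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "forcemerge"
  }
}
```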
If the node has previously hit the flood-stage watermark, then after the admin has addressed the disk usage and removed the read-only/allow-delete block from the affected indices, it may not occur to them that they also have to manually issue an ILM retry against the index that couldn't roll over before due to the block. If the admin removes the block against the index but does not manually reissue an ILM retry, indexing will keep writing to the latest rollover index beyond max_size. As a result, the cluster can end up with an index that is hundreds of GB, with shards well over 100 GB each, causing other issues.
It would be helpful to add a note to https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html#disk-allocator, as part of the example that removes the block, reminding admins to check ILM and issue a manual retry if needed. Though it would be better if ILM could periodically retry so that it resets itself once the block is cleared from the index.
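For reference, a minimal sketch of the manual recovery path described above (index name is a placeholder):

```
# After freeing disk space, clear the flood-stage block from the affected index
PUT my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}

# ILM will not pick the index back up on its own, so the failed step has to be retried explicitly
POST my-index-000001/_ilm/retry
```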