ILM shrink causes cluster to turn red #67957
Labels: >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), Team:Data Management (Meta label for data/management team)
Elasticsearch version (bin/elasticsearch --version): Version: 7.7.1, Build: default/tar/ad56dce891c901a492bb1ee393f12dfff473a423/2020-05-28T16:30:01.040088Z, JVM: 14.0.1
Plugins installed: [repository-s3]
JVM version (java -version): openjdk version "14.0.1" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 14.0.1+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14.0.1+7, mixed mode, sharing)
OS version (uname -a if on a Unix-like system): RHEL 7.9 / 3.10.0-1160.11.1.el7.x86_64
Description of the problem including expected versus actual behavior:
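According to https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-shrink-index.html, the node handling the shrink process must have enough free disk space to accommodate a second copy of the existing index.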
However, when ILM chooses eligible nodes for the shrink process, it only considers nodes that have enough free space for one copy of all shards, not two:
https://github.com/elastic/elasticsearch/blob/v7.10.2/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/SetSingleNodeAllocateStep.java#L94
I consider this a major bug because the allocator routinely picks a node close to the low watermark and moves all shards to that node. It then isn't able to allocate the new shrink-* index because it would put the node above the watermark. That results in the cluster turning red and requires manual intervention to remediate.
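To make the gap concrete, here is a minimal Python sketch with made-up numbers. The function names, the node names, and the simplification that the watermark is modeled as a number of bytes that must stay free are my own assumptions for illustration. This is not the actual ILM code, and it assumes the chosen node holds no shards of the index beforehand.

```python
GIB = 1024 ** 3


def passes_one_copy_check(free_bytes, index_bytes):
    """Roughly the check described above: room for one copy of all shards."""
    return free_bytes >= index_bytes


def can_complete_shrink(free_bytes, index_bytes, watermark_reserve_bytes=0):
    """Room for the relocated copy plus the new shrunken index,
    while keeping watermark_reserve_bytes free (hypothetical watermark model)."""
    return free_bytes - 2 * index_bytes >= watermark_reserve_bytes


index_bytes = 100 * GIB  # total size of the index's primary shards (made up)
free_by_node = {"node-a": 150 * GIB, "node-b": 400 * GIB, "node-c": 90 * GIB}

for name, free in free_by_node.items():
    print(
        name,
        "eligible under one-copy check:", passes_one_copy_check(free, index_bytes),
        "| can finish shrink:", can_complete_shrink(free, index_bytes, 20 * GIB),
    )
```

With these numbers, node-a is the problematic case: it passes the one-copy check, but it cannot hold both the relocated shards and the new shrink-* index. If it is the node that gets picked, the shrink stalls.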
I know this issue would likely be solved by #63519, but that is not intended to be worked on "in the foreseeable future." I think my issue warrants a bug fix in the meantime. Having ILM routinely turn the cluster red is a major problem. This also seems like a much quicker fix than #63519, even if that work gets re-prioritized.
It looks like curator handles this scenario correctly. It adds up the size of all the primaries, multiplies by two, and adds a small amount of padding. See https://github.com/elastic/curator/blob/v5.8.2/curator/actions.py#L2252-L2253.
I really don't want to switch back to curator when ILM seems poised to replace it.
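For reference, the Curator-style calculation mentioned above (sum of the primaries, times two, plus padding) could be sketched like this in Python. The function names, the padding value, and the sample sizes are hypothetical and only illustrate the arithmetic. This is not Curator's code, and it is not a proposed patch for SetSingleNodeAllocateStep.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def required_free_bytes(primary_shard_bytes, padding_bytes=0):
    """Twice the total primary size, plus some padding."""
    return 2 * sum(primary_shard_bytes) + padding_bytes


def eligible_nodes(free_bytes_by_node, primary_shard_bytes, padding_bytes=0):
    """Names of nodes with enough free space for two copies plus padding."""
    needed = required_free_bytes(primary_shard_bytes, padding_bytes)
    return sorted(
        name for name, free in free_bytes_by_node.items() if free >= needed
    )


# Made-up example: four 25 GiB primaries and 100 MiB of padding.
primaries = [25 * GIB] * 4
free_by_node = {"node-a": 150 * GIB, "node-b": 400 * GIB, "node-c": 90 * GIB}

print(required_free_bytes(primaries, 100 * MIB))
print(eligible_nodes(free_by_node, primaries, 100 * MIB))
```

Filtering the candidate list this way before the random pick would leave only nodes that can hold both the relocated shards and the shrunken index. A real fix would still have to account for the disk watermarks.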
Steps to reproduce:
This is difficult to reproduce since the allocator picks a random node after building the list of eligible nodes.