
Shrink sometimes fails with no obvious cause #56062

Open
gwbrown opened this issue Apr 30, 2020 · 4 comments
Labels
>bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), :Data Management/Indices APIs (APIs to create and manage indices and templates), Team:Data Management (Meta label for data/management team)

Comments

@gwbrown
Contributor

gwbrown commented Apr 30, 2020

Shrink can sometimes fail with no obvious cause, leading to trouble with ILM (and, in particular, with stopping ILM).

I've only seen this occur a few times, and in each case the relevant logs had aged out by the time I got to see the cluster with the problem. This issue is intended to track failures like this to see if we can spot any patterns.


One example is an ILM explain output from a v7.1.1 cluster with a step_info like this:

        "phase": "warm",
        "action": "shrink",
		"step": "shrunk-shards-allocated",
        "step_info": {
            "message": "Waiting for shrunk index to be created",
            "shrunk_index_exists": false,
            "actual_shards": -1,
            "all_shards_active": false
        },

The index in question did not proceed from that step for roughly 10 days, with no obvious cause. The situation was fixed by removing ILM from the index. In this case, no shrunken index had been created, but I've seen cases where the shrunken index was created.
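
For reference, taking an index out of ILM management is done with the remove policy API; a minimal sketch, with the index name as a placeholder:

POST /my-index-000001/_ilm/remove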

@gwbrown added the >bug, :Data Management/Indices APIs, and :Data Management/ILM+SLM labels on Apr 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Indices APIs)

@elasticmachine added the Team:Data Management label on Apr 30, 2020
@inqueue
Member

inqueue commented May 1, 2020

I did a little testing on a two-node hot/warm 7.1.1 ESS cluster. One thing I found immediately is that the shrink will not happen if all primaries cannot be allocated to a single node. Granted, this is a two-node cluster and the index is configured with 2 primaries and 1 replica. The replicas have to be dropped before the shrink can proceed:

GET /metricbeat-7.6.2-2020.05.01-000008/_settings?filter_path=*.settings.index.number_of_*

{
  "metricbeat-7.6.2-2020.05.01-000008" : {
    "settings" : {
      "index" : {
        "number_of_replicas" : "1",
        "number_of_shards" : "2"
      }
    }
  }
}

GET /metricbeat-7.6.2-2020.05.01-000008/_ilm/explain

{
    "metricbeat-7.6.2-2020.05.01-000008" : {
      "index" : "metricbeat-7.6.2-2020.05.01-000008",
      "managed" : true,
      "policy" : "metricbeat-i",
      "lifecycle_date_millis" : 1588347262492,
      "phase" : "warm",
      "phase_time_millis" : 1588347264424,
      "action" : "allocate",
      "action_time_millis" : 1588347269281,
      "step" : "check-allocation",
      "step_time_millis" : 1588347270843,
      "step_info" : {
        "message" : "Waiting for [2] shards to be allocated to nodes matching the given filters",
        "shards_left_to_allocate" : 2,
        "all_shards_active" : true,
        "actual_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "metricbeat-i",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            },
            "shrink" : {
              "number_of_shards" : 1
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1588343480680
      }
    }
}

GET /_cat/shards/metricbeat-7.6.2-2020.05.01-000008

metricbeat-7.6.2-2020.05.01-000008 1 r STARTED 143  89.9kb 172.25.21.164 instance-0000000001
metricbeat-7.6.2-2020.05.01-000008 1 p STARTED 143 357.1kb 172.25.14.234 instance-0000000000
metricbeat-7.6.2-2020.05.01-000008 0 r STARTED 157 105.2kb 172.25.21.164 instance-0000000001
metricbeat-7.6.2-2020.05.01-000008 0 p STARTED 157   371kb 172.25.14.234 instance-0000000000
PUT /metricbeat-7.6.2-2020.05.01-000008/_settings
{
  "index.number_of_replicas": 0
}

Shrink can happen now.

GET /metricbeat-7.6.2-2020.05.01-000008/_ilm/explain

{
  "indices" : {
    "metricbeat-7.6.2-2020.05.01-000008" : {
      "index" : "metricbeat-7.6.2-2020.05.01-000008",
      "managed" : true,
      "policy" : "metricbeat-i",
      "lifecycle_date_millis" : 1588347262492,
      "phase" : "warm",
      "phase_time_millis" : 1588347264424,
      "action" : "shrink",
      "action_time_millis" : 1588347958002,
      "step" : "shrink",
      "step_time_millis" : 1588347965358,
      "phase_execution" : {
        "policy" : "metricbeat-i",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            },
            "shrink" : {
              "number_of_shards" : 1
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1588343480680
      }
    }
  }
}

So while the condition may be expected and valid for the Shrink API (we'll only shrink if all primaries are on the same node), ILM should be able to reconcile it and follow the policy.

@gwbrown
Contributor Author

gwbrown commented May 4, 2020

So while the condition may be expected and valid for the Shrink API (we'll only shrink if all primaries are on the same node), ILM should be able to reconcile it and follow the policy.

It shouldn't be required that all primary shards are on the same node, just that at least one copy of each shard is on a single node. This is a hard requirement of how shrinking indices works: it involves manipulating the shard files on the filesystem in a way that can only be done if one node has a copy of each shard. There's no way we could work around this for ILM, although we do try to handle it intelligently, at least in later versions. I haven't run a test yet, but I believe the issue you hit was resolved in #43300 (6.8.2+ or 7.2.1+).
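
For context, the same precondition applies to a manual shrink: collect a copy of every shard on one node, block writes, then call the shrink API. A rough sketch of the manual equivalent, reusing the index from the example above (the node name instance-0000000000 and the shrink- target name are illustrative only):

PUT /metricbeat-7.6.2-2020.05.01-000008/_settings
{
  "index.routing.allocation.require._name": "instance-0000000000",
  "index.blocks.write": true
}

POST /metricbeat-7.6.2-2020.05.01-000008/_shrink/shrink-metricbeat-7.6.2-2020.05.01-000008
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

ILM's set-single-node-allocation and shrink steps automate roughly this sequence.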

That said, I think there's still a separate issue as described in the original ticket: note that the originally reported failure broke on the shrunk-shards-allocated step, as opposed to the issue you hit, which was on the check-allocation step.

@eedugon
Contributor

eedugon commented Jul 2, 2020

@gwbrown, thanks for sharing this. Let me share a potential workaround to get ILM to retry the shrink operation. This should work as long as the final index (shrink-index_name) doesn't exist:

POST _ilm/move/index_name
{
  "current_step": { 
    "phase": "warm",
    "action": "shrink",
    "name": "shrunk-shards-allocated"
  },
  "next_step": { 
    "phase": "warm",
    "action": "shrink",
    "name": "set-single-node-allocation"
  }
}
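
To double-check that the shrunken index really doesn't exist before moving the step, a simple existence check is enough (a sketch; shrink- is the prefix ILM prepends to the target index here):

HEAD /shrink-index_name

A 404 response means the target index doesn't exist, so the step move above should be safe to try.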

I have seen the exact problem you describe in one environment, but I haven't been able to determine why the shrink never occurred; there was no trace at all of the shrunk index.

I moved the original index back to the set-single-node-allocation step and this time everything worked fine.
