
Shrink sometimes fails with no obvious cause #56062

Open
gwbrown opened this issue Apr 30, 2020 · 4 comments
Labels
>bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), :Data Management/Indices APIs (APIs to create and manage indices and templates), Team:Data Management (Meta label for data/management team)

Comments

@gwbrown
Contributor

gwbrown commented Apr 30, 2020

Shrink can sometimes fail with no obvious cause, leading to trouble with ILM (and, in particular, with stopping ILM).

I've only seen this occur a few times, and in each case the relevant logs had aged out by the time I got to see the cluster with the problem. This issue is intended to track failures like this to see if we can spot any patterns.


One example is an ILM explain output from a v7.1.1 cluster with a step_info like this:

        "phase": "warm",
        "action": "shrink",
		"step": "shrunk-shards-allocated",
        "step_info": {
            "message": "Waiting for shrunk index to be created",
            "shrunk_index_exists": false,
            "actual_shards": -1,
            "all_shards_active": false
        },

The index in question did not proceed from that step for roughly 10 days, with no obvious cause. The situation was fixed by removing ILM from the index. In this case, no shrunken index had been created, but I've seen cases where the shrunken index was created.
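
For reference, taking an index out of ILM management is done with the remove policy API; a minimal sketch, with the index name as a placeholder:

POST /my-index-000001/_ilm/remove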

@gwbrown added the >bug, :Data Management/Indices APIs, and :Data Management/ILM+SLM labels on Apr 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Indices APIs)

@elasticmachine added the Team:Data Management label on Apr 30, 2020
@inqueue
Member

inqueue commented May 1, 2020

I did a little testing on a two-node hot/warm 7.1.1 ESS cluster. One thing I found immediately is that the shrink will not happen if all primaries cannot be allocated to a single node. Granted, this is a two-node cluster and the index is configured with 2 primaries and 1 replica. The replicas have to be dropped before the shrink can proceed:

GET /metricbeat-7.6.2-2020.05.01-000008/_settings?filter_path=*.settings.index.number_of_*

{
  "metricbeat-7.6.2-2020.05.01-000008" : {
    "settings" : {
      "index" : {
        "number_of_replicas" : "1",
        "number_of_shards" : "2"
      }
    }
  }
}

GET /metricbeat-7.6.2-2020.05.01-000008/_ilm/explain

{
    "metricbeat-7.6.2-2020.05.01-000008" : {
      "index" : "metricbeat-7.6.2-2020.05.01-000008",
      "managed" : true,
      "policy" : "metricbeat-i",
      "lifecycle_date_millis" : 1588347262492,
      "phase" : "warm",
      "phase_time_millis" : 1588347264424,
      "action" : "allocate",
      "action_time_millis" : 1588347269281,
      "step" : "check-allocation",
      "step_time_millis" : 1588347270843,
      "step_info" : {
        "message" : "Waiting for [2] shards to be allocated to nodes matching the given filters",
        "shards_left_to_allocate" : 2,
        "all_shards_active" : true,
        "actual_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "metricbeat-i",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            },
            "shrink" : {
              "number_of_shards" : 1
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1588343480680
      }
    }
}

GET /_cat/shards/metricbeat-7.6.2-2020.05.01-000008

metricbeat-7.6.2-2020.05.01-000008 1 r STARTED 143  89.9kb 172.25.21.164 instance-0000000001
metricbeat-7.6.2-2020.05.01-000008 1 p STARTED 143 357.1kb 172.25.14.234 instance-0000000000
metricbeat-7.6.2-2020.05.01-000008 0 r STARTED 157 105.2kb 172.25.21.164 instance-0000000001
metricbeat-7.6.2-2020.05.01-000008 0 p STARTED 157   371kb 172.25.14.234 instance-0000000000
PUT /metricbeat-7.6.2-2020.05.01-000008/_settings
{
  "index.number_of_replicas": 0
}

Shrink can happen now.

GET /metricbeat-7.6.2-2020.05.01-000008/_ilm/explain

{
  "indices" : {
    "metricbeat-7.6.2-2020.05.01-000008" : {
      "index" : "metricbeat-7.6.2-2020.05.01-000008",
      "managed" : true,
      "policy" : "metricbeat-i",
      "lifecycle_date_millis" : 1588347262492,
      "phase" : "warm",
      "phase_time_millis" : 1588347264424,
      "action" : "shrink",
      "action_time_millis" : 1588347958002,
      "step" : "shrink",
      "step_time_millis" : 1588347965358,
      "phase_execution" : {
        "policy" : "metricbeat-i",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            },
            "shrink" : {
              "number_of_shards" : 1
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1588343480680
      }
    }
  }
}

So while the condition may be expected and valid for the Shrink API (we'll only shrink if all primaries are on the same node), ILM should be able to reconcile it and follow the policy.

@gwbrown
Contributor Author

gwbrown commented May 4, 2020

So while the condition may be expected and valid for the Shrink API (we'll only shrink if all primaries are on the same node), ILM should be able to reconcile it and follow the policy.

It shouldn't be required that all primary shards are on the same node, just that at least one copy of each shard is on a single node. This is a hard requirement of how shrinking indices works: it involves manipulating the shard files on the filesystem in a way that can only be done if one node has a copy of each shard. There's no way we could work around this for ILM, although we do try to handle it intelligently, at least in later versions. I haven't run a test yet, but I believe the issue you hit was resolved in #43300 (6.8.2+ or 7.2.1+).
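
For context, the same precondition applies to a manual shrink: collect a copy of every shard on one node, block writes, then call the shrink API. A rough sketch of the manual equivalent, reusing the index from the example above (the node name instance-0000000000 and the shrink- target name are illustrative only):

PUT /metricbeat-7.6.2-2020.05.01-000008/_settings
{
  "index.routing.allocation.require._name": "instance-0000000000",
  "index.blocks.write": true
}

POST /metricbeat-7.6.2-2020.05.01-000008/_shrink/shrink-metricbeat-7.6.2-2020.05.01-000008
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

ILM's set-single-node-allocation and shrink steps automate roughly this sequence.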

That said, I think there's still a separate issue as described in the original ticket: note that the originally reported failure broke on the shrunk-shards-allocated step, as opposed to the issue you hit, which was on the check-allocation step.

@eedugon
Contributor

eedugon commented Jul 2, 2020

@gwbrown, thanks for sharing this. Let me share a potential workaround to get ILM to retry the shrink operation. This should work as long as the final index (shrink-index_name) doesn't exist:

POST _ilm/move/index_name
{
  "current_step": { 
    "phase": "warm",
    "action": "shrink",
    "name": "shrunk-shards-allocated"
  },
  "next_step": { 
    "phase": "warm",
    "action": "shrink",
    "name": "set-single-node-allocation"
  }
}
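
To double-check that the shrunken index really doesn't exist before moving the step, a simple existence check is enough (a sketch; shrink- is the prefix ILM prepends to the target index here):

HEAD /shrink-index_name

A 404 response means the target index doesn't exist, so the step move above should be safe to try.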

I have seen the exact problem you describe in one environment, but I haven't been able to determine why the shrink never occurred; there was no trace at all of the shrunk index.

I moved the original index back to the set-single-node-allocation step and this time everything worked fine.
