ILM migrate data between tiers #61377
Conversation
private static final AllocationDeciders ALLOCATION_DECIDERS = new AllocationDeciders(
    List.of(
        new FilterAllocationDecider(Settings.EMPTY, new ClusterSettings(Settings.EMPTY,
            ClusterSettings.BUILT_IN_CLUSTER_SETTINGS)),
        new DataTierAllocationDecider(new ClusterSettings(Settings.EMPTY, ALL_CLUSTER_SETTINGS))
    )
should we make this configurable and serializable for the step? (i.e. use just the one allocation decider depending on where it is used)
I'd say we want a fully allocated index, so always verifying both is alright. What do you think?
Hmm... I actually think that we may want to split checking the allocation for the migrate action into a separate step. For example, the allocation routed step currently has a pretty generic message (Waiting for [n] shards to be allocated to nodes matching the given filters). I think if we split this into a new MigrationRouted step we could give it a much better explanation, for example, something like:
waiting [23m] for [3] shards to be allocated on nodes with the [data_warm] tier
Additionally, I think we could even throw an error to signal to the user when things are in a bad state, something like:
exception waiting for index to be moved to the [data_cold] tier, there are currently no [data_cold] nodes in the cluster
Then the step could be retryable (so we check every 10 minutes) and it at least gives us a way of signaling to a user (alerting on the ilm-history index, for example) when they are in an irreconcilable position and need to adjust their cluster.
What do you think?
You make a great point on validating if the cluster has any node with a particular role available. I'll create another step for the migrate action (the nicer messages will be a great UX improvement as well)
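As a rough, self-contained illustration of the wait and error messages suggested above (the class and method names here are made up for the sketch, not part of the PR):

import java.util.Locale;

// Illustrative sketch only: class, method, and parameter names are hypothetical.
final class MigrationMessagesSketch {
    static String waitingMessage(String elapsedTime, int pendingShards, String tier) {
        return String.format(Locale.ROOT,
            "waiting [%s] for [%d] shards to be allocated on nodes with the [%s] tier",
            elapsedTime, pendingShards, tier);
    }

    static String noNodesInTierError(String tier) {
        return "exception waiting for index to be moved to the [" + tier
            + "] tier, there are currently no [" + tier + "] nodes in the cluster";
    }

    public static void main(String[] args) {
        System.out.println(waitingMessage("23m", 3, "data_warm"));   // matches the example above
        System.out.println(noNodesInTierError("data_cold"));
    }
}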
public List<Step> toSteps(Client client, String phase, StepKey nextStepKey) {
    if (enabled) {
        Map<String, String> include = Map.of("_tier", "data_" + phase);
        AllocateAction migrateDataAction = new AllocateAction(null, include, null, null);
should we remove the possible _require and _exclude settings (which might've been set manually before) to make sure they don't invalidate the "migrate to the next data tier" goal?
That's a good question, I'm not sure we'd want to remove all of them piecemeal, because they could be set for an additional attribute that we want to preserve between phases.
I will have to think on it a bit; it also makes me wonder whether we should make it configurable whether all other index-level allocation filtering settings are removed or preserved when doing the migration.
It's an interesting one - we do not allow the migrate action to be enabled in a phase that has the allocate action configuring allocation, so maybe it makes sense for the migrate action to invalidate any allocation filtering the index might have (e.g. from a previous phase where the migrate action was disabled and the allocate action was configured)
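A minimal sketch of what clearing those index-level allocation filters could look like (the exact setting keys cleared here, and the attribute name, are assumptions for illustration):

import org.elasticsearch.common.settings.Settings;

// Illustrative sketch: unset index-level allocation filters by writing null values
// for their keys; which attribute gets cleared here is an assumption.
final class ClearAllocationFiltersSketch {
    static Settings clearFiltersFor(String attribute) {
        return Settings.builder()
            .putNull("index.routing.allocation.require." + attribute)
            .putNull("index.routing.allocation.include." + attribute)
            .putNull("index.routing.allocation.exclude." + attribute)
            .build();
    }
}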
Thanks for working on this Andrei! I left some thoughts on the implementation.
I was also thinking/wondering, is this something we think should have an index-level opt-out setting? I'm thinking of our internal indices where we might be using an ILM policy but don't know the topology of the cluster (and honestly don't care about it living on a specific data tier).
I am also wondering if maybe we should make the setting a more generic "opt out of automatic tier management" that would opt out of allocating on hot nodes by default. We could then potentially use this setting for internal indices that we don't want to have to worry about, like .security or .kibana.
What do you think?
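A sketch of what such an opt-out could look like as an index-scoped setting (the setting name, default, and properties are hypothetical, not something this PR adds):

import org.elasticsearch.common.settings.Setting;

// Hypothetical opt-out setting for illustration only.
final class TierManagementSettingsSketch {
    static final Setting<Boolean> OPT_OUT_OF_TIER_MANAGEMENT =
        Setting.boolSetting("index.routing.allocation.opt_out_tier_management", false,
            Setting.Property.Dynamic, Setting.Property.IndexScope);
}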
@dakrone I think this makes sense
We already validate the data tier setting values as part of the index settings validation.
We'll have dedicated data tier integration tests that'll have data migration enabled.
…eseries ITS" This reverts commit 6c05591.
@elasticmachine update branch
@elasticmachine update branch
expected head sha didn’t match current head ref.
@elasticmachine update branch
@elasticmachine update branch
@elasticmachine update branch
Awesome, this is looking really close, I left one comment about the execution order (I think 'allocate' should come before 'migrate') and a few other really minor quibbles, but nothing major
    FreezeAction.NAME, SearchableSnapshotAction.NAME);
static final List<String> ORDERED_VALID_FROZEN_ACTIONS = Arrays.asList(SetPriorityAction.NAME, UnfollowAction.NAME, AllocateAction.NAME,
    FreezeAction.NAME, SearchableSnapshotAction.NAME);
    MigrateAction.NAME, AllocateAction.NAME, ShrinkAction.NAME, ForceMergeAction.NAME);
I realized something about this, I think we should do AllocateAction prior to the migrate action, or else we are liable to get "stuck".
For example, if a user had multiple hot nodes and only a single warm node, we would try to migrate an index with one replica from hot to warm, but the replica could never allocate, since there is only a single warm node. Instead, we should do the allocate action first (where a user could set number_of_replicas to 0) before migrating to the next tier.
What do you think?
I also realize we can get stuck the opposite way too, so either way it's possible to get stuck, but I think that doing allocate first is still the way we should go. Curious about what you think
You make a great point Lee. I think we should go forward this way as it also keeps the "number of replicas is modified before the allocations change" behaviour consistent with how the allocate action works.
Maybe it's a bit more of a future discussion, but I do wonder if it'd be a good time to extract the "change number of replicas" functionality into its own action. I think being conflated with the allocation rules in the allocate action is generally confusing, but even more so now, where we'd expect users to use both the allocate and the migrate actions (the latter admittedly not necessarily explicitly declared) to achieve the "reduce number of replicas and then relocate index" scenario.
I think a change-replicas-number action would be welcome at this stage.
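For reference, the reordered warm-phase list being proposed could look roughly like this (the surrounding action names are taken from the diff above; the real list may contain additional actions):

// Sketch of the proposed ordering only; the complete set of warm-phase actions
// in TimeseriesLifecycleType may differ.
static final List<String> ORDERED_VALID_WARM_ACTIONS = Arrays.asList(
    SetPriorityAction.NAME, UnfollowAction.NAME,
    AllocateAction.NAME,   // replicas can be reduced here first (e.g. number_of_replicas: 0)
    MigrateAction.NAME,    // then the index is migrated to the next tier
    ShrinkAction.NAME, ForceMergeAction.NAME);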
logger.debug("[{}] lifecycle action for index [{}] cannot make progress because not all shards are active", | ||
getKey().getAction(), index.getName()); |
I think in this one we can be more specific in the debug message, so perhaps:
[check-migration] migration for index [foo] to the [data_cold] tier cannot progress, as not all shards are active
(we can pull the tier name from the index setting)
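Something along these lines, reusing the variables from the snippet above (the setting key used to look up the destination tier is an assumption about where the value would come from):

// Sketch only: the "_tier" include setting is assumed to hold the destination tier.
String destinationTier = idxMeta.getSettings().get("index.routing.allocation.include._tier");
logger.debug("[{}] migration for index [{}] to the [{}] tier cannot progress, as not all shards are active",
    getKey().getName(), index.getName(), destinationTier);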
    logger.debug(statusMessage);
    return new Result(false, new AllocationInfo(idxMeta.getNumberOfReplicas(), allocationPendingAllShards, true, statusMessage));
} else {
    logger.debug("{} lifecycle action for [{}] complete", index, getKey().getAction());
same here about log messages, perhaps:
[check-migration] migration of index [foo] to tier [data_cold] complete
It's definitely a lot nicer to grep for in logs :)
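And a matching sketch for the completion branch, with the same assumption about where the tier name comes from:

// Sketch only: destinationTier resolved as in the previous sketch.
logger.debug("[{}] migration of index [{}] to tier [{}] complete",
    getKey().getName(), index.getName(), destinationTier);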
 * Builds the AllocationInfo representing a cluster state with a routing table that has all the shards active for a particular index
 * but there are still {@link #numberShardsLeftToAllocate} left to be allocated.
 */
public static AllocationInfo allShardsActiveAllocationInfo(long actualReplicas, long numberShardsLeftToAllocate) {
Super minor, but it's confusing to call the parameter here (and in the function above) "actualReplicas", it makes it seem like there should be an "allocatedReplicas" parameter somewhere as well. Maybe we can just call it numReplicas?
Totally, my grep skills failed me.
    .collect(Collectors.joining(","));
if (Strings.hasText(phasesWithConflictingMigrationActions)) {
    throw new IllegalArgumentException("phases [" + phasesWithConflictingMigrationActions + "] specify an enabled " +
        MigrateAction.NAME + " action and the " + AllocateAction.NAME + " action. specify only one data migration action in these" +
I think rather than:
phases [warm,cold] specify an enabled migrate action and the allocate action. specify only one data migration action in these phases
We should clarify that it's the actual allocation rules that cause problems (they're free to adjust replica count as much as they'd like):
phase [warm,cold] specifies an enabled migrate action and an allocate action with allocation rules, specify only a single data migration in each phase
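Applied to the snippet above, the throw could look roughly like this (the trailing concatenation is completed here for illustration):

throw new IllegalArgumentException("phases [" + phasesWithConflictingMigrationActions + "] specify an enabled " +
    MigrateAction.NAME + " action and an " + AllocateAction.NAME + " action with allocation rules, specify only a " +
    "single data migration in each phase");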
});

logger.info("starting cold data node");
internalCluster().startNode(coldNode(Settings.EMPTY));
Should we have the check for "cold" node allocation before starting up the frozen node? We have it for the other phases, so probably best to check for it just so it doesn't end up disappearing for only the cold phase some time in the future
Yes, changed to wait for the complete step in the cold phase (removed the frozen node and checks)
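Roughly, the test would then wait for something like the following after starting the cold node (getStepKeyForIndex and indexName are hypothetical stand-ins for however the test reads the index's current ILM step; assertBusy and PhaseCompleteStep are existing test/ILM utilities):

logger.info("starting cold data node");
internalCluster().startNode(coldNode(Settings.EMPTY));

// Wait until ILM reports the index has reached the end of the cold phase.
assertBusy(() -> {
    Step.StepKey currentStep = getStepKeyForIndex(indexName); // hypothetical helper
    assertThat(currentStep.getPhase(), equalTo("cold"));
    assertThat(currentStep.getName(), equalTo(PhaseCompleteStep.NAME));
});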
@elasticmachine update branch
LGTM, thanks for iterating on this Andrei! I left one more minor comment, but it's pretty small
@Override
public int hashCode() {
    return 711;
I think this needs to incorporate super.hashCode() or else the hash code will be the same regardless of step keys
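For example, something along these lines (a minimal sketch using java.util.Objects; the real step may also need to fold in its own fields):

@Override
public int hashCode() {
    // Fold the parent's hash (which covers the step keys) into the result instead of
    // returning a constant, so steps with different keys don't all collide.
    return Objects.hash(super.hashCode());
}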
This adds ILM support for automatically migrating the managed indices between data tiers. This proposal makes use of a MigrateAction that is injected (similar to how the Unfollow action is injected) in phases that don't define index allocation rules using the AllocateAction or don't explicitly define the MigrateAction itself (regardless of whether it's enabled or disabled). (cherry picked from commit c1746af) Signed-off-by: Andrei Dan <[email protected]>
In #61377 we introduced a feature to let ILM migrate data between tiers. We achieved this by injecting a MigrateAction in the `warm` and `cold` phases. However, we also injected this action into phases where it is not supported, such as the `hot`, `frozen` and `delete` phases. It's better to not inject the MigrateAction there, even though a MigrateAction in these phases will be filtered out in `TimeseriesLifecycleType#getOrderedActions(Phase)`. This PR updates `TimeseriesLifecycleType#shouldInjectMigrateStepForPhase(Phase)` to not inject the MigrateAction for phases in which it is not supported.
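A minimal sketch of the described guard (the phase-name check is an assumption about how the supported phases are identified; the real method has additional conditions for when injection should happen, e.g. whether an allocate action with rules or an explicit migrate action is already present):

static boolean shouldInjectMigrateStepForPhase(Phase phase) {
    // automatic tier migration only applies to the warm and cold phases
    return "warm".equals(phase.getName()) || "cold".equals(phase.getName());
}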
This adds ILM support for automatically migrating the managed
indices between data tiers.
This proposal makes use of a MigrateAction that is injected
(similar to how the Unfollow action is injected) in phases that
don't define index allocation rules using the AllocateAction or
don't explicitly define the MigrateAction itself (regardless of whether it's
enabled or disabled).
Relates to #60848