[Transform] Improve robustness of checkpointing #80984

hendrikmuhs · 2021-11-24T09:13:31Z

rewrites checkpointing as internal actions, reducing several sub-calls to
only 1 per data node that has at least 1 primary shard of the indexes of
interest.

Notes

Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.

Performance: The current checkpointing is wasteful, it uses get index and get index stats which results in a lot more calls and executes a lot more code which produces results we are not interested in.

Number of node<->node messages:

old: 1 + shards * indices * (replicas + 1)
new: data_nodes
(super precise: all data nodes with at least 1 primary shard of the requested indices)

e.g.

shards	replicas	indices	data_nodes	old	new
1	1	1	1	4	1
5	1	1	1	11	1
5	1	5	3	51	3
5	1	20	8	201	8
5	1	100	10	1001	10

Fixes #75780

elasticmachine · 2021-12-15T13:01:53Z

Pinging @elastic/ml-core (Team:ML)

...in/core/src/main/java/org/elasticsearch/xpack/core/transform/action/GetCheckpointAction.java

...test/java/org/elasticsearch/xpack/core/transform/action/GetCheckpointActionRequestTests.java

...orm/src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointAction.java

hendrikmuhs · 2022-01-10T13:45:34Z

@elasticmachine update branch

elasticmachine · 2022-01-10T13:45:36Z

merge conflict between base and head

hendrikmuhs · 2022-01-10T13:49:32Z

@elasticmachine update branch

przemekwitek

LGTM

benwtrent

Just to be paranoid, does this actually work with remote clusters? I guess the transport client is used between clusters?

...ore/src/main/java/org/elasticsearch/xpack/core/transform/action/GetCheckpointNodeAction.java

...vileges-tests/src/javaRestTest/java/org/elasticsearch/xpack/security/operator/Constants.java

...orm/src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointAction.java

davidkyle

LGTM

...orm/src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointAction.java

...src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointNodeAction.java

hendrikmuhs · 2022-02-07T15:45:27Z

Just to be paranoid, does this actually work with remote clusters? I guess the transport client is used between clusters?

yes, it should work the same way as the previous approach. One challenge are older remotes, this is handled like "pit", which means it falls back to the old style if the exception caught is "action not found".

To be sure, I will manually test this.

hendrikmuhs · 2022-02-07T16:52:04Z

changing the node action to internal breaks CCS, I am getting an authorization error.

I will have another look tomorrow, I don't want to give up, this might just be a bug.

benwtrent · 2022-02-07T17:07:33Z

changing the node action to internal breaks CCS, I am getting an authorization error.

It may be that since its a separate cluster, it has to travel through a path that is no longer "internal". In that case, I think switching back to admin is ok. But, it seems to me that it should work as it is all caused by an internal source.

…or resolving indices.

hendrikmuhs · 2022-02-15T17:05:58Z

@ywangd

I have changed the implementation, so both actions are index actions now, the node action now carries indices and gets re-authorized. As you suggested I removed both actions from manage_transform. That means they are both fall under view_index_metadata now.

~~FWIW: What confused me: the node action is not part of view_index_metadata, but nevertheless this works as the system integration tests indicate. Is it because I use the transport to call it?~~
It works because of the wildcard: GetCheckpointAction.NAME + "*".

ywangd

LGTM

@hendrikmuhs I appreciate your effort on the iterations and reaching out to us for awareness. Thanks a lot!

It works because of the wildcard: GetCheckpointAction.NAME + "*"

Yes we intentionally use the suffix wildcard pattern so that privileges of child actions are also granted along with the parent action (assuming child actions are named using the suffix pattern).

rewrites checkpointing as internal actions, reducing several sub-calls to only 1 per data node that has at least 1 primary shard of the indexes of interest. Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See elastic#75780 for details. Performance: The current checkpointing is wasteful, it uses get index and get index stats which results in a lot more calls and executes a lot more code which produces results we are not interested in. Number of requests before and after: before: 1 + #shards * #indices * (#replicas + 1) after: #data_nodes_holding_gt1_shard Fixes elastic#75780

In elastic#80984, a new action is added to the "view_index_privilege" index privilege. This PR adds it under "manage" as well and also adds test to ensure "view_index_metadata" is always a subset of "manage".

In #80984, a new GetCheckpointAction is added under the namespace of indices:internal/. This is a new namespace and there it is not automatically covered by the manage index privilege. Since it is explicilty added to view_index_metadata, it means view_index_metadata is no longer a subset of manage. This is unexpected. After discussion, instead of adding this new namespace to manage, we agreed to move the new action under monitor and drop the new namespace. IIUC, the internal is used to indicate that this action is internal to a bigger process and it cannot be called on its own and should be kept as an implementation detail. However, what privilege an action should have is an orthogonal concern to how it should be used. The function of this action is more of monitor (similar to the existing GetGlobalCheckpoint API) This PR also adds a few tests to ensure certain subset relationships between privileges, e.g. "view_index_privilege" is a subset of "manage".

properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash relates #80984 fixes #91550

properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash relates elastic#80984 fixes elastic#91550

) properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash relates #80984 fixes #91550

hendrikmuhs added >refactoring :ml/Transform Transform v8.1.0 labels Nov 24, 2021

hendrikmuhs force-pushed the transform-checkpointing-refactor branch 3 times, most recently from 05f9d7b to b9c47d3 Compare December 1, 2021 07:59

hendrikmuhs force-pushed the transform-checkpointing-refactor branch from c5e4f8e to d9848eb Compare December 14, 2021 15:52

hendrikmuhs marked this pull request as ready for review December 15, 2021 13:01

elasticmachine added the Team:ML Meta label for the ML team label Dec 15, 2021

hendrikmuhs force-pushed the transform-checkpointing-refactor branch from fa00471 to f39c911 Compare December 21, 2021 08:12

przemekwitek self-requested a review December 21, 2021 09:01

przemekwitek reviewed Dec 21, 2021

View reviewed changes

hendrikmuhs force-pushed the transform-checkpointing-refactor branch from d075344 to e48e6a7 Compare December 21, 2021 16:44

przemekwitek approved these changes Jan 12, 2022

View reviewed changes

benwtrent reviewed Jan 12, 2022

View reviewed changes

davidkyle approved these changes Jan 12, 2022

View reviewed changes

...orm/src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointAction.java Outdated Show resolved Hide resolved

...src/main/java/org/elasticsearch/xpack/transform/action/TransportGetCheckpointNodeAction.java Show resolved Hide resolved

mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022

hendrikmuhs force-pushed the transform-checkpointing-refactor branch 3 times, most recently from 702913b to 9df29bf Compare February 7, 2022 15:09

hendrikmuhs added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Feb 7, 2022

benwtrent approved these changes Feb 7, 2022

View reviewed changes

Hendrik Muhs added 8 commits February 15, 2022 10:30

apply review comment

fdb02ad

make checkpoint node operation internal

9b5f577

update doc comments

d23d9a1

overhaul privilige checking

2a35374

make GetCheckpointNodeAction an IndicesRequest

ff8ad32

add logging

4b3d8c9

add code comments

836d3f1

fix mixed version issue

bc6da83

hendrikmuhs force-pushed the transform-checkpointing-refactor branch from 8b4cee4 to bc6da83 Compare February 15, 2022 10:57

disable remote indizes explicitly and explain why, use shorter call f…

0426dbe

…or resolving indices.

document the access control check

1fa096f

ywangd approved these changes Feb 17, 2022

View reviewed changes

hendrikmuhs merged commit 48e562a into elastic:master Feb 17, 2022

hendrikmuhs deleted the transform-checkpointing-refactor branch February 17, 2022 08:38

ywangd mentioned this pull request Mar 1, 2022

Move GetCheckpointAction under the monitor namespace #84473

Merged

This was referenced Nov 14, 2022

[Transform] Transform can fail resolving indices from CCS clusters #91550

Closed

[Transform] Fix failure when resolving indices from CCS #91622

Merged

hendrikmuhs pushed a commit that referenced this pull request Nov 18, 2022

[Transform] Fix failure when resolving indices from CCS (#91622)

4c521d0

properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash relates #80984 fixes #91550

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Transform] Improve robustness of checkpointing #80984

[Transform] Improve robustness of checkpointing #80984

hendrikmuhs commented Nov 24, 2021 •

edited

Loading

elasticmachine commented Dec 15, 2021

hendrikmuhs commented Jan 10, 2022

elasticmachine commented Jan 10, 2022

hendrikmuhs commented Jan 10, 2022

przemekwitek left a comment

benwtrent left a comment

davidkyle left a comment

hendrikmuhs commented Feb 7, 2022

hendrikmuhs commented Feb 7, 2022

benwtrent commented Feb 7, 2022

hendrikmuhs commented Feb 15, 2022 •

edited

Loading

ywangd left a comment

[Transform] Improve robustness of checkpointing #80984

[Transform] Improve robustness of checkpointing #80984

Conversation

hendrikmuhs commented Nov 24, 2021 • edited Loading

Number of node<->node messages:

elasticmachine commented Dec 15, 2021

hendrikmuhs commented Jan 10, 2022

elasticmachine commented Jan 10, 2022

hendrikmuhs commented Jan 10, 2022

przemekwitek left a comment

Choose a reason for hiding this comment

benwtrent left a comment

Choose a reason for hiding this comment

davidkyle left a comment

Choose a reason for hiding this comment

hendrikmuhs commented Feb 7, 2022

hendrikmuhs commented Feb 7, 2022

benwtrent commented Feb 7, 2022

hendrikmuhs commented Feb 15, 2022 • edited Loading

ywangd left a comment

Choose a reason for hiding this comment

hendrikmuhs commented Nov 24, 2021 •

edited

Loading

hendrikmuhs commented Feb 15, 2022 •

edited

Loading