Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Transform] Improve robustness of checkpointing #80984

Merged

Conversation

hendrikmuhs
Copy link

@hendrikmuhs hendrikmuhs commented Nov 24, 2021

rewrites checkpointing as internal actions, reducing several sub-calls to
only 1 per data node that has at least 1 primary shard of the indexes of
interest.

Notes

Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.

Performance: The current checkpointing is wasteful, it uses get index and get index stats which results in a lot more calls and executes a lot more code which produces results we are not interested in.

Number of node<->node messages:

old: 1 + shards * indices * (replicas + 1)
new: data_nodes
(super precise: all data nodes with at least 1 primary shard of the requested indices)

e.g.

shards replicas indices data_nodes old new
1 1 1 1 4 1
5 1 1 1 11 1
5 1 5 3 51 3
5 1 20 8 201 8
5 1 100 10 1001 10

Fixes #75780

@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch 3 times, most recently from 05f9d7b to b9c47d3 Compare December 1, 2021 07:59
@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch from c5e4f8e to d9848eb Compare December 14, 2021 15:52
@hendrikmuhs hendrikmuhs marked this pull request as ready for review December 15, 2021 13:01
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Dec 15, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch from fa00471 to f39c911 Compare December 21, 2021 08:12
@przemekwitek przemekwitek self-requested a review December 21, 2021 09:01
@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch from d075344 to e48e6a7 Compare December 21, 2021 16:44
@hendrikmuhs
Copy link
Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

@hendrikmuhs
Copy link
Author

@elasticmachine update branch

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be paranoid, does this actually work with remote clusters? I guess the transport client is used between clusters?

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch 3 times, most recently from 702913b to 9df29bf Compare February 7, 2022 15:09
@hendrikmuhs hendrikmuhs added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Feb 7, 2022
@hendrikmuhs
Copy link
Author

Just to be paranoid, does this actually work with remote clusters? I guess the transport client is used between clusters?

yes, it should work the same way as the previous approach. One challenge are older remotes, this is handled like "pit", which means it falls back to the old style if the exception caught is "action not found".

To be sure, I will manually test this.

@hendrikmuhs
Copy link
Author

changing the node action to internal breaks CCS, I am getting an authorization error.

I will have another look tomorrow, I don't want to give up, this might just be a bug.

@benwtrent
Copy link
Member

changing the node action to internal breaks CCS, I am getting an authorization error.

It may be that since its a separate cluster, it has to travel through a path that is no longer "internal". In that case, I think switching back to admin is ok. But, it seems to me that it should work as it is all caused by an internal source.

@hendrikmuhs hendrikmuhs force-pushed the transform-checkpointing-refactor branch from 8b4cee4 to bc6da83 Compare February 15, 2022 10:57
@hendrikmuhs
Copy link
Author

hendrikmuhs commented Feb 15, 2022

@ywangd

I have changed the implementation, so both actions are index actions now, the node action now carries indices and gets re-authorized. As you suggested I removed both actions from manage_transform. That means they are both fall under view_index_metadata now.

FWIW: What confused me: the node action is not part of view_index_metadata, but nevertheless this works as the system integration tests indicate. Is it because I use the transport to call it?
It works because of the wildcard: GetCheckpointAction.NAME + "*".

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hendrikmuhs I appreciate your effort on the iterations and reaching out to us for awareness. Thanks a lot!

It works because of the wildcard: GetCheckpointAction.NAME + "*"

Yes we intentionally use the suffix wildcard pattern so that privileges of child actions are also granted along with the parent action (assuming child actions are named using the suffix pattern).

@hendrikmuhs hendrikmuhs merged commit 48e562a into elastic:master Feb 17, 2022
@hendrikmuhs hendrikmuhs deleted the transform-checkpointing-refactor branch February 17, 2022 08:38
probakowski pushed a commit to probakowski/elasticsearch that referenced this pull request Feb 23, 2022
rewrites checkpointing as internal actions, reducing several sub-calls to
only 1 per data node that has at least 1 primary shard of the indexes of
interest.

Robustness: The current checkpointing sends a request to every shard
 - primary and replica - and collects the results. If 1 request fails, even 
for a replica, checkpointing fails. See elastic#75780 for details.

Performance: The current checkpointing is wasteful, it uses get index 
and get index stats which results in a lot more calls and executes a 
lot more code which produces results we are not interested in.

Number of requests before and after:
before: 1 + #shards * #indices * (#replicas + 1)
after: #data_nodes_holding_gt1_shard

Fixes elastic#75780
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Mar 1, 2022
In elastic#80984, a new action is added to the "view_index_privilege" index
privilege. This PR adds it under "manage" as well and also adds test to
ensure "view_index_metadata" is always a subset of "manage".
ywangd added a commit that referenced this pull request Mar 9, 2022
In #80984, a new GetCheckpointAction is added under the namespace of
indices:internal/. This is a new namespace and there it is not automatically
covered by the manage index privilege. Since it is explicilty added to
view_index_metadata, it means view_index_metadata is no longer a
subset of manage. This is unexpected.

After discussion, instead of adding this new namespace to manage, we agreed
to move the new action under monitor and drop the new namespace.
IIUC, the internal is used to indicate that this action is internal to a bigger process
and it cannot be called on its own and should be kept as an implementation detail.
However, what privilege an action should have is an orthogonal concern to how it
should be used. The function of this action is more of monitor (similar to the existing
GetGlobalCheckpoint API)

This PR also adds a few tests to ensure certain subset relationships between
privileges, e.g. "view_index_privilege" is a subset of "manage".
hendrikmuhs pushed a commit that referenced this pull request Nov 18, 2022
properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash

relates #80984
fixes #91550
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Nov 18, 2022
properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash

relates elastic#80984
fixes elastic#91550
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Nov 18, 2022
properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash

relates elastic#80984
fixes elastic#91550
elasticsearchmachine pushed a commit that referenced this pull request Nov 18, 2022
)

properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash

relates #80984
fixes #91550
elasticsearchmachine pushed a commit that referenced this pull request Nov 18, 2022
)

properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash

relates #80984
fixes #91550
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml/Transform Transform >refactoring Team:ML Meta label for the ML team v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Transform] Checkpointing should not fail due to replica unavailability
7 participants