[Transform] Improve robustness of checkpointing #80984
Pinging @elastic/ml-core (Team:ML)
@elasticmachine update branch
merge conflict between base and head
@elasticmachine update branch
LGTM
Just to be paranoid, does this actually work with remote clusters? I guess the transport client is used between clusters?
LGTM
Yes, it should work the same way as the previous approach. One challenge is older remotes: this is handled like "pit", which means it falls back to the old style if the exception caught is "action not found". To be sure, I will test this manually.
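The fallback described in this comment can be pictured with a minimal sketch. It is not the code from this PR: `ActionNotFoundException` and `withFallback` are hypothetical names invented for illustration, and the real implementation works with Elasticsearch's asynchronous listeners rather than blocking suppliers.

```java
import java.util.function.Supplier;

public class CheckpointFallbackSketch {

    /** Hypothetical stand-in for the "action not found" error from an older remote. */
    public static class ActionNotFoundException extends RuntimeException {
        public ActionNotFoundException(String message) {
            super(message);
        }
    }

    /**
     * Tries the new-style call first; only an "action not found" failure
     * triggers the old-style call. Any other exception propagates unchanged.
     */
    public static <T> T withFallback(Supplier<T> newStyle, Supplier<T> oldStyle) {
        try {
            return newStyle.get();
        } catch (ActionNotFoundException e) {
            return oldStyle.get();
        }
    }

    public static void main(String[] args) {
        // A remote that knows the new action answers directly.
        String modern = withFallback(() -> "new-style checkpoint", () -> "old-style checkpoint");

        // An older remote rejects the unknown action, so the old path is used.
        String legacy = withFallback(() -> {
            throw new ActionNotFoundException("unknown action");
        }, () -> "old-style checkpoint");

        System.out.println(modern);
        System.out.println(legacy);
    }
}
```

Catching only the one specific failure keeps genuine errors (timeouts, authorization failures) visible instead of silently retrying them on the old path.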
changing the node action to I will have another look tomorrow, I don't want to give up, this might just be a bug.
It may be that, since it's a separate cluster, it has to travel through a path that is no longer "internal". In that case, I think switching back to
…or resolving indices.
I have changed the implementation so that both actions are index actions now; the node action now carries indices and gets re-authorized. As you suggested, I removed both actions from
LGTM
@hendrikmuhs I appreciate your effort on the iterations and reaching out to us for awareness. Thanks a lot!
It works because of the wildcard: GetCheckpointAction.NAME + "*"
Yes we intentionally use the suffix wildcard pattern so that privileges of child actions are also granted along with the parent action (assuming child actions are named using the suffix pattern).
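To illustrate why a trailing wildcard also covers child actions, here is a minimal sketch of suffix-wildcard matching. It is an assumption-laden simplification: the action names are hypothetical placeholders, not the real value of `GetCheckpointAction.NAME`, and Elasticsearch's actual privilege matching is automaton based rather than a plain prefix check.

```java
public class ActionPatternSketch {

    /**
     * Returns true if the action name matches the pattern. A pattern ending
     * in '*' matches every name that starts with the text before the '*';
     * any other pattern must match exactly.
     */
    public static boolean matches(String pattern, String actionName) {
        if (pattern.endsWith("*")) {
            String prefix = pattern.substring(0, pattern.length() - 1);
            return actionName.startsWith(prefix);
        }
        return pattern.equals(actionName);
    }

    public static void main(String[] args) {
        // Hypothetical names: a parent action and a child named with a suffix.
        String parent = "example:parent/action";
        String child = parent + "[n]";

        System.out.println(matches(parent + "*", parent)); // parent is covered
        System.out.println(matches(parent + "*", child));  // child is covered too
        System.out.println(matches(parent, child));        // exact pattern misses the child
    }
}
```

The scheme only works if child actions keep the parent name as a prefix, which is the naming assumption stated in the comment above.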
rewrites checkpointing as internal actions, reducing several sub-calls to only 1 per data node that has at least 1 primary shard of the indexes of interest.
Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See elastic#75780 for details.
Performance: The current checkpointing is wasteful, it uses get index and get index stats which results in a lot more calls and executes a lot more code which produces results we are not interested in.
Number of requests before and after:
before: 1 + #shards * #indices * (#replicas + 1)
after: #data_nodes_holding_gt1_shard
Fixes elastic#75780
In elastic#80984, a new action is added to the "view_index_metadata" index privilege. This PR adds it under "manage" as well and also adds a test to ensure "view_index_metadata" is always a subset of "manage".
In #80984, a new GetCheckpointAction is added under the namespace of indices:internal/. This is a new namespace and therefore it is not automatically covered by the manage index privilege. Since it is explicitly added to view_index_metadata, it means view_index_metadata is no longer a subset of manage. This is unexpected.
After discussion, instead of adding this new namespace to manage, we agreed to move the new action under monitor and drop the new namespace. IIUC, "internal" is used to indicate that this action is internal to a bigger process, cannot be called on its own, and should be kept as an implementation detail. However, what privilege an action should have is an orthogonal concern to how it should be used. The function of this action is closer to monitor (similar to the existing GetGlobalCheckpoint API).
This PR also adds a few tests to ensure certain subset relationships between privileges, e.g. "view_index_metadata" is a subset of "manage".
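The subset relationship that these tests guard can be sketched as follows. This is a deliberate simplification under stated assumptions: privileges are modelled as finite sets of action names, whereas the real ones are pattern based, and every action name below is a hypothetical placeholder.

```java
import java.util.Set;

public class PrivilegeSubsetSketch {

    /** True if every action granted by 'smaller' is also granted by 'larger'. */
    public static boolean isSubset(Set<String> smaller, Set<String> larger) {
        return larger.containsAll(smaller);
    }

    public static void main(String[] args) {
        // Hypothetical action names, used only to show the shape of the check.
        Set<String> manage = Set.of("example:monitor/a", "example:monitor/b", "example:admin/c");
        Set<String> viewMetadata = Set.of("example:monitor/a", "example:monitor/b");

        // Expected situation: the narrower privilege is fully contained.
        System.out.println(isSubset(viewMetadata, manage));

        // Adding an action from a namespace that 'manage' does not cover
        // breaks the relationship, which is the situation described above.
        Set<String> withNewNamespace = Set.of("example:monitor/a", "example:internal/x");
        System.out.println(isSubset(withNewNamespace, manage));
    }
}
```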
properly prefix remote indices in checkpoints, fixes a failure when more than 1 cluster is used and index names clash
relates elastic#80984
fixes elastic#91550
rewrites checkpointing as internal actions, reducing several sub-calls to
only 1 per data node that has at least 1 primary shard of the indexes of
interest.
Notes
Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.
Performance: The current checkpointing is wasteful: it uses get index and get index stats, which results in many more calls and executes much more code, producing results we are not interested in.
Number of node<->node messages:
old: 1 + shards * indices * (replicas + 1)
new: data_nodes (super precise: all data nodes with at least 1 primary shard of the requested indices)
e.g.
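A worked instance of the two formulas above, as a minimal sketch. The cluster sizes are made-up numbers chosen for illustration, and `shards` is read here as primary shards per index, which is an assumption about the formula rather than something the description states.

```java
public class CheckpointMessageCount {

    /** Old approach: 1 + shards * indices * (replicas + 1). */
    public static long oldStyle(long shards, long indices, long replicas) {
        return 1 + shards * indices * (replicas + 1);
    }

    /** New approach: one message per data node holding at least 1 relevant primary shard. */
    public static long newStyle(long dataNodesWithPrimary) {
        return dataNodesWithPrimary;
    }

    public static void main(String[] args) {
        // Illustrative cluster: 10 indices, 5 primary shards each, 1 replica,
        // with the primaries spread over 3 data nodes.
        long before = oldStyle(5, 10, 1); // 1 + 5 * 10 * 2 = 101
        long after = newStyle(3);         // 3

        System.out.println("old: " + before + " messages");
        System.out.println("new: " + after + " messages");
    }
}
```

Under these assumed numbers the count drops from 101 to 3, and the new count no longer grows with the number of replicas or shards, only with the number of data nodes involved.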
Fixes #75780