
[Transform] Checkpointing should not fail due to replica unavailability #75780

Closed
hendrikmuhs opened this issue Jul 28, 2021 · 3 comments · Fixed by #80984
Labels
>enhancement, :ml/Transform, Team:ML (Meta label for the ML team)

Comments


hendrikmuhs commented Jul 28, 2021

When transform creates a checkpoint, it uses the index stats API to get the global checkpoint of every shard.

By design of the index stats API, this call queries every shard copy, including replicas. Internally the information is de-duplicated by taking max(gcp) per shard.

Fleet implemented very similar functionality with its global checkpoints API. That implementation does not query replica shards and collects only the information that is required.

Transform could benefit from a similar implementation, if possible by re-using code/functionality.

This would make transform checkpointing less likely to fail and reduce the number of network calls by 50% or more.
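
For illustration, a minimal sketch (not part of the issue) of the two calls over plain HTTP using Python's `requests` library; the host, index name, and exact response fields are assumptions based on the documented APIs:

```python
# Sketch only: compare the two ways of reading global checkpoints over HTTP.
# Host and index name are placeholders; error handling is omitted.
import requests

ES = "http://localhost:9200"
INDEX = "my-transform-source"  # placeholder source index

# Today: the index stats API, which fans out to every shard copy (primaries and
# replicas) and returns far more data than the global checkpoints transform needs.
stats = requests.get(f"{ES}/{INDEX}/_stats", params={"level": "shards"}).json()
checkpoints = {}
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    # one entry per shard copy; de-duplicate by taking max(gcp) per shard
    gcps = [c["seq_no"]["global_checkpoint"] for c in copies if "seq_no" in c]
    checkpoints[int(shard_id)] = max(gcps)

# Fleet's get-global-checkpoints API does not query replicas and returns just the
# global checkpoints for one index.
fleet = requests.get(f"{ES}/{INDEX}/_fleet/global_checkpoints").json()
print(checkpoints, fleet["global_checkpoints"])
```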

elasticmachine added the Team:ML label Jul 28, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

benwtrent (Member) commented Aug 13, 2021

Looking more into https://www.elastic.co/guide/en/elasticsearch/reference/7.14/get-global-checkpoints.html

It does seem to fit the bill. We could fairly easily expand the index patterns, send off an individual request for every index, and then collapse the global checkpoints.

This would be slightly more complicated than getting stats, since the stats API supports multiple indices, but not by much.

I have two open questions:

  • Mixed-cluster support: how does this work against nodes that are <7.13?
  • Remote cluster support: does this work against remote clusters at all? I assume it definitely won't work against clusters with data nodes <7.13.

But there doesn't seem to be a way to disable this plugin, so we could move the action classes to XPack core and simply make transport client calls and handle the responses.
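
A rough sketch of that idea (an illustration, not code from the thread), assuming the resolve-index API for pattern expansion; as the next comment points out, the Fleet API's single-index/single-shard limitation may rule this out in practice:

```python
# Sketch only: expand the pattern to concrete indices, call the (single-index)
# global checkpoints API once per index, and collapse the answers.
# Host and pattern are placeholders.
import requests

ES = "http://localhost:9200"
PATTERN = "my-source-*"  # placeholder source pattern

# Resolve the expression to concrete index names first.
resolved = requests.get(f"{ES}/_resolve/index/{PATTERN}").json()
names = [entry["name"] for entry in resolved["indices"]]

# One request per concrete index, collapsed into {index: global checkpoint(s)}.
collapsed = {}
for name in names:
    resp = requests.get(f"{ES}/{name}/_fleet/global_checkpoints").json()
    collapsed[name] = resp["global_checkpoints"]
print(collapsed)
```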

hendrikmuhs (Author)

After some more investigation:

I don't think the Fleet API fits our use case, but it is interesting to learn from. E.g. the limitation to 1 index and 1 shard is a no-go for us.

Checkpointing today makes 2 calls: a get index call and a stats call. Neither is cheap and we throw a lot of the output away. The get index call seems superfluous, as we could resolve the indices ourselves; however, this does not work for CCS, where we need to resolve on the remote cluster.

Proposal

Create a transport action that replaces the 2 calls and provides:

  • resolving index expressions to concrete indices
  • spawning node-level requests to collect checkpoints from primary shards
  • collecting the checkpoints and sending the checkpoint result back
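
A minimal sketch of the node-level fan-out described in the list above, in plain Python rather than the actual internal Java transport action; the routing data and the node_level_request helper are purely illustrative:

```python
# Sketch only: group primary shards by the data node holding them, send one
# request per node, and merge the answers.
from collections import defaultdict

# Illustrative routing data: (index, shard) -> node holding the primary copy.
primaries = {
    ("src-1", 0): "node-a",
    ("src-1", 1): "node-b",
    ("src-2", 0): "node-a",
}

def node_level_request(node, shards):
    """Placeholder for one node-level call returning {(index, shard): global checkpoint}."""
    return {shard: -1 for shard in shards}  # dummy checkpoints

by_node = defaultdict(list)
for shard, node in primaries.items():
    by_node[node].append(shard)

# One call per data node (here 2) instead of one per shard copy (here 6 with 1 replica).
checkpoints = {}
for node, shards in by_node.items():
    checkpoints.update(node_level_request(node, shards))
print(checkpoints)
```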

Expected result

  • combining the 2 calls into 1 removes 1 network hop
  • asking only primary shards reduces network calls by at least 50% (with 1 replica)
  • switching to node-level communication further reduces network calls, depending on the number of indices and shards
  • the custom transport speeds up execution as it only fetches what is required
  • checkpoint creation does not fail if a replica shard fails, and it is less likely to run into a timeout

CCS and BWC

  • the transport is compatible with the old way; we can switch between the 2 if necessary
  • the transport will also be available for CCS; however, we don't know the version of the remote cluster

Options:

  • check whether we are running in a mixed cluster
  • get the remote version as in SourceDestValidator, which uses RemoteClusterLicenseChecker to obtain the version
  • "trial and error", like the PIT change: if PIT creation fails, fall back to the old implementation. However, the optimization then stays disabled until the transform gets restarted or relocated; consider re-checking periodically?

hendrikmuhs pushed a commit that referenced this issue Feb 17, 2022
Rewrites checkpointing as internal actions, reducing several sub-calls to only 1 per data node that holds at least 1 primary shard of the indices of interest.

Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.

Performance: The current checkpointing is wasteful: it uses get index and get index stats, which results in many more calls and executes a lot of code that produces results we are not interested in.

Number of requests before and after:
before: 1 + #shards * #indices * (#replicas + 1)
after: #data_nodes_holding_gt1_shard

Fixes #75780
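
To make those counts concrete, a small worked example with illustrative numbers that are not taken from the issue:

```python
# 10 indices, 5 shards each, 1 replica, spread over 3 data nodes.
indices, shards, replicas, data_nodes = 10, 5, 1, 3

before = 1 + shards * indices * (replicas + 1)  # 1 + 5 * 10 * 2 = 101 requests
after = data_nodes                              # at most one request per data node
print(before, after)                            # 101 vs 3
```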
probakowski pushed a commit to probakowski/elasticsearch that referenced this issue Feb 23, 2022