
[Transform] Checkpointing should not fail due to replica unavailability #75780

Closed
hendrikmuhs opened this issue Jul 28, 2021 · 3 comments · Fixed by #80984
Labels
>enhancement, :ml/Transform, Team:ML (Meta label for the ML team)

Comments


hendrikmuhs commented Jul 28, 2021

When transform creates a checkpoint, it uses the index stats API to get the global checkpoint of every shard.

By design of the index stats API, this call queries every shard copy, including replicas. Internally the information is de-duplicated by taking max(gcp) per shard.

Fleet implemented very similar functionality with its global checkpoints API. That implementation does not query replica shards and collects only the information that is required.

Transform could benefit from a similar implementation, if possible by re-using code/functionality.

This would make transform checkpointing less likely to fail and reduce the number of network calls by 50% or more.
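
For illustration, a minimal sketch (not part of the issue) of the two calls over plain HTTP using Python's `requests` library; the host, index name, and exact response fields are assumptions based on the documented APIs:

```python
# Sketch only: compare the two ways of reading global checkpoints over HTTP.
# Host and index name are placeholders; error handling is omitted.
import requests

ES = "http://localhost:9200"
INDEX = "my-transform-source"  # placeholder source index

# Today: the index stats API, which fans out to every shard copy (primaries and
# replicas) and returns far more data than the global checkpoints transform needs.
stats = requests.get(f"{ES}/{INDEX}/_stats", params={"level": "shards"}).json()
checkpoints = {}
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    # one entry per shard copy; de-duplicate by taking max(gcp) per shard
    gcps = [c["seq_no"]["global_checkpoint"] for c in copies if "seq_no" in c]
    checkpoints[int(shard_id)] = max(gcps)

# Fleet's get-global-checkpoints API does not query replicas and returns just the
# global checkpoints for one index.
fleet = requests.get(f"{ES}/{INDEX}/_fleet/global_checkpoints").json()
print(checkpoints, fleet["global_checkpoints"])
```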

elasticmachine added the Team:ML label Jul 28, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

benwtrent (Member) commented Aug 13, 2021

Looking more into https://www.elastic.co/guide/en/elasticsearch/reference/7.14/get-global-checkpoints.html

It does seem to fit the bill. We could fairly easily expand the index patterns, send off an individual request for every index, and then collapse the global checkpoints.

This would be slightly more complicated than getting stats, since the stats API supports multiple indices, but not by much.

I have two open questions:

  • Mixed-cluster support: how does this work against nodes that are <7.13?
  • Remote cluster support: does this work against remote clusters at all? I assume it definitely won't work against clusters with data nodes <7.13.

But there doesn't seem to be a way to disable this plugin, so we could move the action classes to XPack core and simply make transport client calls and handle the responses.
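
A rough sketch of that idea (an illustration, not code from the thread), assuming the resolve-index API for pattern expansion; as the next comment points out, the Fleet API's single-index/single-shard limitation may rule this out in practice:

```python
# Sketch only: expand the pattern to concrete indices, call the (single-index)
# global checkpoints API once per index, and collapse the answers.
# Host and pattern are placeholders.
import requests

ES = "http://localhost:9200"
PATTERN = "my-source-*"  # placeholder source pattern

# Resolve the expression to concrete index names first.
resolved = requests.get(f"{ES}/_resolve/index/{PATTERN}").json()
names = [entry["name"] for entry in resolved["indices"]]

# One request per concrete index, collapsed into {index: global checkpoint(s)}.
collapsed = {}
for name in names:
    resp = requests.get(f"{ES}/{name}/_fleet/global_checkpoints").json()
    collapsed[name] = resp["global_checkpoints"]
print(collapsed)
```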

hendrikmuhs (Author)

After some more investigation:

I don't think the Fleet API fits our use case, but it is interesting to learn from. E.g. the limitation to 1 index and 1 shard is a no-go for us.

Checkpointing today makes 2 calls: a get index call and a stats call. Neither is cheap and we throw a lot of the output away. The get index call seems superfluous, as we could resolve the indices ourselves; however, this does not work for CCS, where we need to resolve on the remote cluster.

Proposal

Create a transport action that replaces the 2 calls and provides:

  • resolving index expressions to concrete indices
  • spawning node-level requests to collect checkpoints from primary shards
  • collecting the checkpoints and sending the checkpoint result back
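
A minimal sketch of the node-level fan-out described in the list above, in plain Python rather than the actual internal Java transport action; the routing data and the node_level_request helper are purely illustrative:

```python
# Sketch only: group primary shards by the data node holding them, send one
# request per node, and merge the answers.
from collections import defaultdict

# Illustrative routing data: (index, shard) -> node holding the primary copy.
primaries = {
    ("src-1", 0): "node-a",
    ("src-1", 1): "node-b",
    ("src-2", 0): "node-a",
}

def node_level_request(node, shards):
    """Placeholder for one node-level call returning {(index, shard): global checkpoint}."""
    return {shard: -1 for shard in shards}  # dummy checkpoints

by_node = defaultdict(list)
for shard, node in primaries.items():
    by_node[node].append(shard)

# One call per data node (here 2) instead of one per shard copy (here 6 with 1 replica).
checkpoints = {}
for node, shards in by_node.items():
    checkpoints.update(node_level_request(node, shards))
print(checkpoints)
```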

Expected result

  • combining the 2 calls into 1 removes 1 network hop
  • asking only primary shards reduces network calls by at least 50% (with 1 replica)
  • switching to node-level communication further reduces network calls, depending on the number of indices and shards
  • the custom transport speeds up execution as it only fetches what is required
  • checkpoint creation does not fail if a replica shard fails, and it is less likely to run into a timeout

CCS and BWC

  • the transport is compatible with the old way; we can switch between the 2 if necessary
  • the transport will also be available for CCS; however, we don't know the version of the remote cluster

Options:

  • check whether we are running in a mixed cluster
  • get the remote version as in SourceDestValidator, which uses RemoteClusterLicenseChecker to obtain the version
  • "trial and error", like the PIT change: if PIT creation fails, fall back to the old implementation. However, the optimization then stays disabled until the transform gets restarted or relocated; consider re-checking periodically?

hendrikmuhs pushed a commit that referenced this issue Feb 17, 2022
Rewrites checkpointing as internal actions, reducing several sub-calls to only 1 per data node that holds at least 1 primary shard of the indices of interest.

Robustness: The current checkpointing sends a request to every shard - primary and replica - and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.

Performance: The current checkpointing is wasteful: it uses get index and get index stats, which results in many more calls and executes a lot of code that produces results we are not interested in.

Number of requests before and after:
before: 1 + #shards * #indices * (#replicas + 1)
after: #data_nodes_holding_gt1_shard

Fixes #75780
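
To make those counts concrete, a small worked example with illustrative numbers that are not taken from the issue:

```python
# 10 indices, 5 shards each, 1 replica, spread over 3 data nodes.
indices, shards, replicas, data_nodes = 10, 5, 1, 3

before = 1 + shards * indices * (replicas + 1)  # 1 + 5 * 10 * 2 = 101 requests
after = data_nodes                              # at most one request per data node
print(before, after)                            # 101 vs 3
```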
probakowski pushed a commit to probakowski/elasticsearch that referenced this issue Feb 23, 2022