[Transform] Checkpointing should not fail due to replica unavailability #75780
Pinging @elastic/ml-core (Team:ML)
Looking more into https://www.elastic.co/guide/en/elasticsearch/reference/7.14/get-global-checkpoints.html, it does seem to fit the bill. We could fairly easily expand the index patterns, send off individual requests for every index, and then collapse the global checkpoints (see the sketch below). This would be slightly more complicated than getting stats, since the stats API supports multiple indices, but not by much. I have two open questions:
But there doesn't seem to be a way to disable this plugin. So, we could move the action classes to x-pack core and simply make transport client calls and handle the responses.
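A rough sketch of that fan-out, assuming a hypothetical `getGlobalCheckpoints(index, listener)` helper that wraps a transport call to the fleet action; this is an illustration of the idea, not the actual implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.elasticsearch.action.ActionListener;

// One request per concrete index (the fleet API only takes a single index),
// collapsing the per-index checkpoint arrays into one map once all arrive.
// getGlobalCheckpoints(...) is a hypothetical helper standing in for the
// transport call; error handling is simplified (a real implementation would
// likely use something like GroupedActionListener).
void collectGlobalCheckpoints(List<String> concreteIndices,
                              ActionListener<Map<String, long[]>> listener) {
    Map<String, long[]> collapsed = new ConcurrentHashMap<>();
    AtomicInteger pending = new AtomicInteger(concreteIndices.size());
    for (String index : concreteIndices) {
        getGlobalCheckpoints(index, ActionListener.wrap(checkpoints -> {
            collapsed.put(index, checkpoints); // one checkpoint per shard id
            if (pending.decrementAndGet() == 0) {
                listener.onResponse(collapsed);
            }
        }, listener::onFailure));
    }
}
```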
After some more investigation: I don't think the fleet API fits our use case, but it is interesting to learn from. E.g. the limitation to 1 index and 1 shard is a no-go for us. Checkpointing today makes 2 calls, a get index call and a stats call. Neither is cheap, and we throw a lot of the output away. The get index call seems superfluous, as we could resolve the indices ourselves; however, this does not work for CCS, where we need to resolve on the remote.

Proposal

Create a transport action that replaces the 2 calls and provides the following (a sketch of a possible response shape follows this proposal):
Expected result
CCS and BWC
Options:
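Hypothetically, the response of such an action could carry the resolved indices together with their per-shard global checkpoints in a single round trip. A minimal sketch, assuming the usual ActionResponse and stream-serialization conventions; every name here is illustrative, not the final API:

```java
import java.io.IOException;
import java.util.Map;

import org.elasticsearch.action.ActionResponse;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;

// Illustrative response for the proposed action: the resolved indices and
// their per-shard global checkpoints, replacing the separate get-index and
// index-stats calls with one round trip.
public class GetCheckpointResponse extends ActionResponse {

    // index name -> global checkpoints, array position == shard id
    private final Map<String, long[]> globalCheckpoints;

    public GetCheckpointResponse(Map<String, long[]> globalCheckpoints) {
        this.globalCheckpoints = globalCheckpoints;
    }

    public GetCheckpointResponse(StreamInput in) throws IOException {
        this.globalCheckpoints = in.readMap(StreamInput::readString, StreamInput::readLongArray);
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeMap(globalCheckpoints, StreamOutput::writeString, StreamOutput::writeLongArray);
    }

    public Map<String, long[]> getGlobalCheckpoints() {
        return globalCheckpoints;
    }
}
```

Keying by index name with the array position as the shard id keeps the payload compact, and resolving index patterns inside the action means the same action can simply be run on the remote cluster for the CCS case.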
Rewrites checkpointing as internal actions, reducing several sub-calls to only 1 per data node that has at least 1 primary shard of the indices of interest.

Robustness: The current checkpointing sends a request to every shard, primary and replica, and collects the results. If 1 request fails, even for a replica, checkpointing fails. See #75780 for details.

Performance: The current checkpointing is wasteful: it uses get index and get index stats, which issues many more calls and executes a lot more code, producing results we are not interested in.

Number of requests before and after:

before: 1 + #shards * #indices * (#replicas + 1)
after: #data_nodes_holding_gt1_shard

Fixes #75780
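To make the arithmetic concrete (the numbers here are assumed for illustration, not from the PR): a transform over 5 indices, each with 3 primary shards and 1 replica, spread across 4 data nodes, previously needed 1 + 3 * 5 * (1 + 1) = 31 requests per checkpoint; afterwards it needs at most 4, one per data node holding a relevant primary shard.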
When transform creates a checkpoint it uses the index stats API to get the global checkpoints for every shard. By design (of index stats) this call queries every shard, including replica shards. Internally this information is de-duplicated and `max(gcp)` is taken.

Fleet implemented very similar functionality with the global checkpoints API. Its implementation does not query replica shards and only collects the information that is required.
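For illustration, a minimal sketch of that de-duplication step, assuming the standard server-side stats types (`ShardStats`, `SeqNoStats`); the exact calls in the transform code base may differ:

```java
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.admin.indices.stats.IndicesStatsResponse;
import org.elasticsearch.action.admin.indices.stats.ShardStats;
import org.elasticsearch.index.seqno.SeqNoStats;
import org.elasticsearch.index.shard.ShardId;

// Collapse the per-shard-copy stats into one global checkpoint per shard:
// every copy (primary and each replica) reports a value, and max(gcp) wins.
static Map<ShardId, Long> maxGlobalCheckpoints(IndicesStatsResponse response) {
    Map<ShardId, Long> checkpoints = new HashMap<>();
    for (ShardStats shard : response.getShards()) {
        SeqNoStats seqNoStats = shard.getSeqNoStats();
        if (seqNoStats == null) {
            continue; // a copy that did not report sequence number stats
        }
        checkpoints.merge(
            shard.getShardRouting().shardId(),
            seqNoStats.getGlobalCheckpoint(),
            Math::max);
    }
    return checkpoints;
}
```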
Transform can benefit from a similar implementation, if possible by re-using code/functionality.
This will make transform less likely to fail and reduce the number of network calls for checkpointing by 50% or more.