[DOCS]explain how checkpoints works for time buckets #71472
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,50 +5,70 @@ | |
<titleabbrev>How checkpoints work</titleabbrev> | ||
++++ | ||
|
||
Each time a {transform} examines the source indices and creates or updates the | ||
Each time a {transform} examines the source indices and creates or updates the | ||
destination index, it generates a _checkpoint_. | ||
|
||
If your {transform} runs only once, there is logically only one checkpoint. If | ||
your {transform} runs continuously, however, it creates checkpoints as it | ||
ingests and transforms new source data. | ||
If your {transform} runs only once, there is logically only one checkpoint. If | ||
your {transform} runs continuously, however, it creates checkpoints as it | ||
ingests and transforms new source data. The `sync` configuration object in the | ||
{transform} configures checkpointing, e.g. by specifying a time field. | ||
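For illustration, a continuous {transform} configuration with a `sync` object and a `frequency` might look like the following sketch. The index names, field names, and the aggregation are assumptions made up for this example; only the overall shape (`source`, `dest`, `frequency`, `sync.time.field`, `pivot`) follows the create transform API.

```python
import json

# Hypothetical continuous transform configuration. The index and field names
# ("web-logs", "hostname", "@timestamp", "bytes") are illustrative assumptions.
transform_config = {
    "source": {"index": "web-logs"},
    "dest": {"index": "web-logs-by-host"},
    # How often the transform checks the source indices for changes.
    "frequency": "1m",
    # The time field used to synchronize source and destination indices.
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
    "pivot": {
        "group_by": {"host": {"terms": {"field": "hostname"}}},
        "aggregations": {"avg_bytes": {"avg": {"field": "bytes"}}},
    },
}

# Print the request body that would be sent when creating the transform.
print(json.dumps(transform_config, indent=2))
```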
|
||
To create a checkpoint, the {ctransform}: | ||
|
||
. Checks for changes to source indices. | ||
+ | ||
Using a simple periodic timer, the {transform} checks for changes to the source | ||
indices. This check is done based on the interval defined in the transform's | ||
Using a simple periodic timer, the {transform} checks for changes to the source | ||
indices. This check is done based on the interval defined in the transform's | ||
`frequency` property. | ||
+ | ||
If the source indices remain unchanged or if a checkpoint is already in progress | ||
then it waits for the next timer. | ||
|
||
. Identifies which entities have changed. | ||
If changes are found, a checkpoint is created. | ||
|
||
. Identifies which entities or time buckets have changed. | ||
+ | ||
The {transform} searches to see which entities have changed since the last time | ||
it checked. The `sync` configuration object in the {transform} identifies a time | ||
field in the source indices. The {transform} uses the values in that field to | ||
synchronize the source and destination indices. | ||
. Updates the destination index (the {dataframe}) with the changed entities. | ||
The {transform} searches to see which entities or time buckets have changed | ||
between the last and the new checkpoint. The {transform} uses this information | ||
to synchronize the source and destination indices with fewer operations than a | ||
full re-run. | ||
|
||
. Updates the destination index (the {dataframe}) with the changes. | ||
+ | ||
-- | ||
The {transform} applies changes related to either new or changed entities to the | ||
destination index. The set of changed entities is paginated. For each page, the | ||
{transform} performs a composite aggregation using a `terms` query. After all | ||
the pages of changes have been applied, the checkpoint is complete. | ||
The {transform} applies changes related to either new or changed entities or | ||
time buckets to the destination index. The set of changes can be paginated. The | ||
{transform} performs a composite aggregation as in the run-once case, but | ||
injects query filters based on step 2 to reduce the amount of work. After all | ||
changes have been applied, the checkpoint is complete. | ||
-- | ||
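The three steps above can be sketched with a toy in-memory model. This is not the {transform} implementation: the `run_checkpoint` function, the document shape (`host`, `ts`, `bytes`), and the sum aggregation are all assumptions chosen to show how only changed entities are recomputed.

```python
from collections import defaultdict


def run_checkpoint(source_docs, dest, last_checkpoint, new_checkpoint):
    """Toy model of one checkpoint; returns (dest, checkpoint reached)."""
    # Step 1: check for changes between the last and the new checkpoint.
    changed_docs = [
        d for d in source_docs if last_checkpoint < d["ts"] <= new_checkpoint
    ]
    if not changed_docs:
        # Nothing changed: keep the old checkpoint and wait for the next timer.
        return dest, last_checkpoint

    # Step 2: identify which entities (here: hosts) have changed.
    changed_hosts = {d["host"] for d in changed_docs}

    # Step 3: recompute the aggregation only for the changed entities,
    # which mimics injecting a filter into the aggregation query.
    totals = defaultdict(int)
    for d in source_docs:
        if d["host"] in changed_hosts and d["ts"] <= new_checkpoint:
            totals[d["host"]] += d["bytes"]
    dest.update(totals)
    return dest, new_checkpoint


docs = [
    {"host": "A", "ts": 1, "bytes": 10},
    {"host": "B", "ts": 1, "bytes": 5},
    {"host": "A", "ts": 3, "bytes": 7},
]
dest, cp = run_checkpoint(docs, {}, last_checkpoint=0, new_checkpoint=2)
dest, cp = run_checkpoint(docs, dest, last_checkpoint=cp, new_checkpoint=4)
print(dest, cp)
```

In the second call only host `A` is recomputed; the entry for host `B` is left untouched.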
|
||
This checkpoint process involves both search and indexing activity on the | ||
cluster. We have attempted to favor control over performance while developing | ||
{transforms}. We decided it was preferable for the {transform} to take longer to | ||
complete, rather than to finish quickly and take precedence in resource | ||
consumption. That being said, the cluster still requires enough resources to | ||
support both the composite aggregation search and the indexing of its results. | ||
{transforms}. We decided it was preferable for the {transform} to take longer to | ||
complete, rather than to finish quickly and take precedence in resource | ||
consumption. That being said, the cluster still requires enough resources to | ||
support both the composite aggregation search and the indexing of its results. | ||
|
||
TIP: If the cluster experiences unsuitable performance degradation due to the | ||
{transform}, stop the {transform} and refer to <<transform-performance>>. | ||
|
||
[discrete] | ||
[[ml-transform-checkpoint-heuristics]] | ||
== Change detection heuristics | ||
|
||
When a {transform} runs in continuous mode, it updates the documents in the | ||
destination index as new data comes in. The {transform} uses a set of heuristics | ||
called change detection to update the destination index with fewer operations. | ||
|
||
As an example, assume you are grouping on hostnames. Change detection detects | ||
which hostnames have changed, for example hosts `A`, `C`, and `G`, and updates only | ||
the documents for those hosts, not the documents that store information about hosts `B`, `D`, ... | ||
|
||
Another heuristic can be applied to time buckets if you use a `date_histogram` to | ||
group by time buckets. Change detection detects which time buckets have changed | ||
and updates only those. | ||
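As a rough sketch of the time bucket heuristic, the following maps the timestamps of newly arrived documents to the start of the fixed-interval bucket each one falls into. The `changed_buckets` helper and the millisecond values are assumptions for illustration; the {transform} itself may derive the affected range differently.

```python
def changed_buckets(new_timestamps_ms, interval_ms):
    """Return the sorted start times of buckets touched by new documents."""
    # Rounding each timestamp down to a multiple of the interval gives the
    # start of the fixed-interval bucket that contains it.
    return sorted({ts - ts % interval_ms for ts in new_timestamps_ms})


ONE_HOUR_MS = 3_600_000
# Two documents land in the same hourly bucket, one in the next bucket.
buckets = changed_buckets([3_600_500, 3_700_000, 7_300_000], ONE_HOUR_MS)
print(buckets)  # [3600000, 7200000]
```

Only the two buckets returned here would need to be recomputed; all other buckets in the destination index stay as they are.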
|
||
[discrete] | ||
Another improvement idea: It would be good to explain a checkpointing best practice: using ingest timestamps (requires 7.11) instead of timestamps that come from outside. However, I am not sure whether this belongs here or in the put transform documentation (or both).
This is already mentioned in create transform. |
||
[[ml-transform-checkpoint-errors]] | ||
== Error handling | ||
|
@@ -61,18 +81,18 @@ persisted periodically. | |
Checkpoint failures can be categorized as follows: | ||
|
||
* Temporary failures: The checkpoint is retried. If 10 consecutive failures | ||
occur, the {transform} has a failed status. For example, this situation might | ||
occur, the {transform} has a failed status. For example, this situation might | ||
occur when there are shard failures and queries return only partial results. | ||
* Irrecoverable failures: The {transform} immediately fails. For example, this | ||
* Irrecoverable failures: The {transform} immediately fails. For example, this | ||
situation occurs when the source index is not found. | ||
* Adjustment failures: The {transform} retries with adjusted settings. For | ||
example, if parent circuit breaker memory errors occur during the composite | ||
aggregation, the {transform} receives partial results. The aggregated search is | ||
retried with a smaller number of buckets. This retry is performed at the | ||
interval defined in the `frequency` property for the {transform}. If the search | ||
is retried to the point where it reaches a minimal number of buckets, an | ||
* Adjustment failures: The {transform} retries with adjusted settings. For | ||
example, if parent circuit breaker memory errors occur during the composite | ||
aggregation, the {transform} receives partial results. The aggregated search is | ||
retried with a smaller number of buckets. This retry is performed at the | ||
interval defined in the `frequency` property for the {transform}. If the search | ||
is retried to the point where it reaches a minimal number of buckets, an | ||
irrecoverable failure occurs. | ||
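The adjustment behavior can be sketched as a retry loop that shrinks the number of buckets per search. Everything here is an assumption for illustration: the exception classes, the halving strategy, and the `page_size` and `min_page_size` values are hypothetical, and the sketch retries immediately, whereas the {transform} retries at the interval defined in `frequency`.

```python
class CircuitBreakerError(Exception):
    """Stands in for a parent circuit breaker memory error."""


class IrrecoverableFailure(Exception):
    """Raised when the search cannot be shrunk any further."""


def search_with_adjustment(search, page_size=500, min_page_size=10):
    """Retry `search` with fewer buckets until it succeeds or hits the minimum."""
    while True:
        try:
            return search(page_size), page_size
        except CircuitBreakerError:
            if page_size <= min_page_size:
                # Already at the minimal number of buckets: give up.
                raise IrrecoverableFailure("minimal number of buckets reached")
            # Hypothetical adjustment: halve the number of buckets per search.
            page_size = max(min_page_size, page_size // 2)


def fake_search(page_size):
    # Pretend the cluster can only handle up to 100 buckets per search.
    if page_size > 100:
        raise CircuitBreakerError()
    return "ok"


result, used_page_size = search_with_adjustment(fake_search)
print(result, used_page_size)
```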
|
||
If the node running the {transforms} fails, the {transform} restarts from the | ||
most recent persisted cursor position. This recovery process might repeat some | ||
If the node running the {transforms} fails, the {transform} restarts from the | ||
most recent persisted cursor position. This recovery process might repeat some | ||
of the work the {transform} had already done, but it ensures data consistency. |
I created some visuals about change point detection, e.g. in #63315
This is a very technical one, but maybe we can take the ideas from it to create a better visual for this place.