[DOCS] Adds further details and an example to how transform checkpointing works (#71615) (#71816)
szabosteve authored Apr 19, 2021
1 parent 36440f1 commit 46328bb
Showing 1 changed file with 35 additions and 11 deletions.
46 changes: 35 additions & 11 deletions docs/reference/transform/checkpoints.asciidoc
@@ -10,7 +10,8 @@ destination index, it generates a _checkpoint_.

If your {transform} runs only once, there is logically only one checkpoint. If
your {transform} runs continuously, however, it creates checkpoints as it
ingests and transforms new source data.
ingests and transforms new source data. The `sync` property of the {transform}
configures checkpointing by specifying a time field.
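
As a rough illustration, the request body of a {ctransform} with a `sync`
object might look like the Python dictionary below. This is a sketch only: the
index names, field names, and interval values are hypothetical and are not
taken from this change.

```python
import json

# Hypothetical request body for creating a continuous transform.
# All index names, field names, and intervals are illustrative assumptions.
transform_body = {
    "source": {"index": "host-metrics"},
    "dest": {"index": "host-metrics-by-host"},
    "pivot": {
        "group_by": {"host": {"terms": {"field": "host"}}},
        "aggregations": {"avg_cpu": {"avg": {"field": "cpu"}}},
    },
    # How often the transform checks the source indices for changes.
    "frequency": "5m",
    # The time field used to synchronize source and destination indices.
    "sync": {"time": {"field": "timestamp", "delay": "60s"}},
}

print(json.dumps(transform_body, indent=2))
```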

To create a checkpoint, the {ctransform}:

@@ -22,21 +23,25 @@ indices. This check is done based on the interval defined in the transform's
+
If the source indices remain unchanged or if a checkpoint is already in progress
then it waits for the next timer.
+
If changes are found, a checkpoint is created.

. Identifies which entities have changed.
. Identifies which entities and/or time buckets have changed.
+
The {transform} searches to see which entities have changed since the last time
it checked. The `sync` configuration object in the {transform} identifies a time
field in the source indices. The {transform} uses the values in that field to
synchronize the source and destination indices.
The {transform} searches to see which entities or time buckets have changed
between the last and the new checkpoint. The {transform} uses the values to
synchronize the source and destination indices with fewer operations than a
full re-run.

. Updates the destination index (the {dataframe}) with the changed entities.
. Updates the destination index (the {dataframe}) with the changes.
+
--
The {transform} applies changes related to either new or changed entities to the
destination index. The set of changed entities is paginated. For each page, the
{transform} performs a composite aggregation using a `terms` query. After all
the pages of changes have been applied, the checkpoint is complete.
The {transform} applies changes related to either new or changed entities or
time buckets to the destination index. The set of changes can be paginated. The
{transform} performs a composite aggregation similar to the batch {transform}
operation; however, it also injects query filters based on the previous step to
reduce the amount of work. After all changes have been applied, the checkpoint
is complete.
--
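
The idea of injecting query filters can be sketched as follows. This is a
minimal illustration, not the filters the {transform} actually generates: the
helper name `build_change_filter` and the `host` and `timestamp` field names are
assumptions. It restricts a search to the entities found in the previous step
and to documents no newer than the new checkpoint.

```python
def build_change_filter(changed_hosts, new_checkpoint):
    """Build a query that limits a search to changed entities.

    Hypothetical helper: `host` and `timestamp` are assumed field names.
    """
    return {
        "bool": {
            "filter": [
                # Only look at entities identified as changed.
                {"terms": {"host": sorted(changed_hosts)}},
                # Ignore documents that arrived after the new checkpoint.
                {"range": {"timestamp": {"lte": new_checkpoint}}},
            ]
        }
    }


query = build_change_filter({"G", "A", "C"}, "2021-04-19T13:00:00Z")
```

With a filter like this, the composite aggregation only has to recompute the
groups that changed instead of every group in the source indices.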

This checkpoint process involves both search and indexing activity on the
@@ -49,6 +54,25 @@ support both the composite aggregation search and the indexing of its results.
TIP: If the cluster experiences unsuitable performance degradation due to the
{transform}, stop the {transform} and refer to <<transform-performance>>.


[discrete]
[[ml-transform-checkpoint-heuristics]]
== Change detection heuristics

When the {transform} runs in continuous mode, it updates the documents in the
destination index as new data comes in. The {transform} uses a set of heuristics
called change detection to update the destination index with fewer operations.

For example, suppose the data is grouped by host name. Change detection
identifies which host names have changed, for example hosts `A`, `C`, and `G`,
and updates only the documents for those hosts. It does not update documents
that store information about host `B`, `D`, or any other host that has not
changed.

Another heuristic can be applied for time buckets when a `date_histogram` is
used to group by time buckets. Change detection detects which time buckets have
changed and updates only those.
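
Both heuristics can be sketched in a few lines of Python. This is a simplified
model under stated assumptions, not the {transform} implementation: documents
are plain dictionaries, `host` and `timestamp` are hypothetical field names, and
time buckets are fixed at one hour.

```python
from datetime import datetime, timezone


def in_window(doc, last_checkpoint, new_checkpoint):
    """True if the document arrived between the two checkpoints."""
    return last_checkpoint < doc["timestamp"] <= new_checkpoint


def changed_hosts(docs, last_checkpoint, new_checkpoint):
    """Entity heuristic: hosts with at least one new document."""
    return {d["host"] for d in docs
            if in_window(d, last_checkpoint, new_checkpoint)}


def changed_hour_buckets(docs, last_checkpoint, new_checkpoint):
    """Time bucket heuristic: hourly buckets that received new documents."""
    return {d["timestamp"].replace(minute=0, second=0, microsecond=0)
            for d in docs if in_window(d, last_checkpoint, new_checkpoint)}


def ts(hour, minute):
    return datetime(2021, 4, 19, hour, minute, tzinfo=timezone.utc)


docs = [
    {"host": "A", "timestamp": ts(11, 0)},
    {"host": "A", "timestamp": ts(12, 5)},
    {"host": "B", "timestamp": ts(11, 30)},
    {"host": "C", "timestamp": ts(12, 40)},
    {"host": "D", "timestamp": ts(10, 15)},
    {"host": "G", "timestamp": ts(12, 59)},
]

hosts = changed_hosts(docs, ts(12, 0), ts(13, 0))
buckets = changed_hour_buckets(docs, ts(12, 0), ts(13, 0))
print(sorted(hosts))  # ['A', 'C', 'G']
```

Only hosts `A`, `C`, and `G` and the 12:00 bucket received documents between the
two checkpoints, so only the corresponding destination documents would need to
be recomputed.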


[discrete]
[[ml-transform-checkpoint-errors]]
== Error handling