[DOCS]explain how checkpoints works for time buckets #71472

82 changes: 51 additions & 31 deletions docs/reference/transform/checkpoints.asciidoc
<titleabbrev>How checkpoints work</titleabbrev>
++++

Each time a {transform} examines the source indices and creates or updates the
destination index, it generates a _checkpoint_.

If your {transform} runs only once, there is logically only one checkpoint. If
your {transform} runs continuously, however, it creates checkpoints as it
ingests and transforms new source data. The `sync` configuration object in the
{transform} configures checkpointing, for example by specifying a time field.
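For reference, a continuous {transform} configuration with checkpointing
enabled might include a fragment like the following. The field name
`timestamp` and the interval values are illustrative, not prescriptive:

[source,js]
----
{
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "timestamp",
      "delay": "60s"
    }
  }
}
----

Here `frequency` controls how often the {transform} checks for changes, and
`sync.time.field` names the time field used to synchronize source and
destination.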

To create a checkpoint, the {ctransform}:

. Checks for changes to source indices.
+
Using a simple periodic timer, the {transform} checks for changes to the source
indices. This check is done based on the interval defined in the transform's
`frequency` property.
+
If the source indices remain unchanged or if a checkpoint is already in
progress, the {transform} waits for the next timer. If changes are found, a
checkpoint is created.

. Identifies which entities or time buckets have changed.
+
The {transform} searches to see which entities or time buckets have changed
between the last and the new checkpoint. It uses those values to synchronize
the source and destination indices with fewer operations than a full re-run.

. Updates the destination index (the {dataframe}) with the changes.
+
--
The {transform} applies changes related to new or changed entities or time
buckets to the destination index. The set of changes can be paginated. The
{transform} performs a composite aggregation, as in the batch case, but injects
query filters based on step 2 to reduce the amount of work. After all changes
have been applied, the checkpoint is complete.
--
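As an illustration of the filters injected in step 2, a continuous {transform}
that groups on a time field might narrow its update search with a range filter
over the changed interval. The field name and timestamps below are
illustrative; the actual filter is generated internally by the {transform}:

[source,js]
----
"query": {
  "bool": {
    "filter": [
      {
        "range": {
          "timestamp": {
            "gte": "2021-04-29T10:00:00Z",
            "lt": "2021-04-29T10:05:00Z"
          }
        }
      }
    ]
  }
}
----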

This checkpoint process involves both search and indexing activity on the
cluster. We have attempted to favor control over performance while developing
{transforms}. We decided it was preferable for the {transform} to take longer to
complete, rather than to finish quickly and take precedence in resource
consumption. That being said, the cluster still requires enough resources to
support both the composite aggregation search and the indexing of its results.

TIP: If the cluster experiences unsuitable performance degradation due to the
{transform}, stop the {transform} and refer to <<transform-performance>>.

[discrete]
[[ml-transform-checkpoint-heuristics]]
== Change detection heuristics
When a {transform} runs in continuous mode, it updates the documents in the
destination index as new data comes in. The {transform} uses a set of heuristics
called change detection to update the destination index with fewer operations.

For example, suppose you are grouping on host names. Change detection identifies
which host names have changed, for example hosts `A`, `C`, and `G`, and updates
only the documents for those hosts. Documents that store information about hosts
`B`, `D`, and so on are left untouched.
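Grouping on host names in a pivot might look like the following fragment. The
field names `hostname` and `bytes` are illustrative:

[source,js]
----
"pivot": {
  "group_by": {
    "hostname": {
      "terms": { "field": "hostname" }
    }
  },
  "aggregations": {
    "bytes_sum": {
      "sum": { "field": "bytes" }
    }
  }
}
----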

Another heuristic applies to time buckets when you use a `date_histogram` to
group by time. Change detection identifies which time buckets have changed and
updates only those.
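A `date_histogram` group-by that benefits from this heuristic might look like
the following fragment. The field name and interval are illustrative:

[source,js]
----
"group_by": {
  "time_bucket": {
    "date_histogram": {
      "field": "timestamp",
      "calendar_interval": "1h"
    }
  }
}
----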

[discrete]
[[ml-transform-checkpoint-errors]]
== Error handling
persisted periodically.
Checkpoint failures can be categorized as follows:

* Temporary failures: The checkpoint is retried. If 10 consecutive failures
occur, the {transform} has a failed status. For example, this situation might
occur when there are shard failures and queries return only partial results.
* Irrecoverable failures: The {transform} immediately fails. For example, this
situation occurs when the source index is not found.
* Adjustment failures: The {transform} retries with adjusted settings. For
example, if parent circuit breaker memory errors occur during the composite
aggregation, the {transform} receives partial results. The aggregated search is
retried with a smaller number of buckets. This retry is performed at the
interval defined in the `frequency` property for the {transform}. If the search
is retried to the point where it reaches a minimal number of buckets, an
irrecoverable failure occurs.

If the node running the {transforms} fails, the {transform} restarts from the
most recent persisted cursor position. This recovery process might repeat some
of the work the {transform} had already done, but it ensures data consistency.