diff --git a/docs/reference/transform/checkpoints.asciidoc b/docs/reference/transform/checkpoints.asciidoc
index bff52e51c726a..21bdd9389f3de 100644
--- a/docs/reference/transform/checkpoints.asciidoc
+++ b/docs/reference/transform/checkpoints.asciidoc
@@ -5,50 +5,70 @@
 <titleabbrev>How checkpoints work</titleabbrev>
 ++++
 
-Each time a {transform} examines the source indices and creates or updates the 
+Each time a {transform} examines the source indices and creates or updates the
 destination index, it generates a _checkpoint_.
 
-If your {transform} runs only once, there is logically only one checkpoint. If 
-your {transform} runs continuously, however, it creates checkpoints as it 
-ingests and transforms new source data. 
+If your {transform} runs only once, there is logically only one checkpoint. If
+your {transform} runs continuously, however, it creates checkpoints as it
+ingests and transforms new source data. The `sync` configuration object in the
+{transform} configures checkpointing, for example by specifying the time field
+that is used to identify new and changed documents.
 
 To create a checkpoint, the {ctransform}:
 
 . Checks for changes to source indices.
 +
-Using a simple periodic timer, the {transform} checks for changes to the source 
-indices. This check is done based on the interval defined in the transform's 
+Using a simple periodic timer, the {transform} checks for changes to the source
+indices. This check is done based on the interval defined in the transform's
 `frequency` property.
 +
 If the source indices remain unchanged or if a checkpoint is already in progress
 then it waits for the next timer.
+If changes are found, a checkpoint is created.
 
-. Identifies which entities have changed.
+. Identifies which entities or time buckets have changed.
 +
-The {transform} searches to see which entities have changed since the last time
-it checked. The `sync` configuration object in the {transform} identifies a time
-field in the source indices. The {transform} uses the values in that field to
-synchronize the source and destination indices.
-
-. Updates the destination index (the {dataframe}) with the changed entities.
+The {transform} searches to see which entities or time buckets have changed
+between the last and the new checkpoint. It uses these changed entities and
+time buckets to synchronize the source and destination indices with fewer
+operations than a full re-run.
+
+. Updates the destination index (the {dataframe}) with the changes.
 +
 --
-The {transform} applies changes related to either new or changed entities to the
-destination index. The set of changed entities is paginated. For each page, the
-{transform} performs a composite aggregation using a `terms` query. After all
-the pages of changes have been applied, the checkpoint is complete.
+The {transform} applies the changes related to new or changed entities or time
+buckets to the destination index. The set of changes can be paginated. The
+{transform} performs a composite aggregation in the same way as in the run-once
+case; however, it also injects query filters derived from step 2 to reduce the
+amount of work. After all changes have been applied, the checkpoint is
+complete.
 --
 
 This checkpoint process involves both search and indexing activity on the
 cluster. We have attempted to favor control over performance while developing
-{transforms}. We decided it was preferable for the {transform} to take longer to 
-complete, rather than to finish quickly and take precedence in resource 
-consumption. That being said, the cluster still requires enough resources to 
-support both the composite aggregation search and the indexing of its results.
+{transforms}. We decided it was preferable for the {transform} to take longer to
+complete, rather than to finish quickly and take precedence in resource
+consumption. That being said, the cluster still requires enough resources to
+support both the composite aggregation search and the indexing of its results.
 
 TIP: If the cluster experiences unsuitable performance degradation due to the
 {transform}, stop the {transform} and refer to <<transform-performance>>.
 
+[discrete]
+[[ml-transform-checkpoint-heuristics]]
+== Change detection heuristics
+
+When a {transform} runs in continuous mode, it updates the documents in the
+destination index as new data comes in. The {transform} uses a set of
+heuristics called change detection to update the destination index with fewer
+operations.
+
+For example, assume you are grouping on hostnames. Change detection detects
+which hostnames have changed, for example hosts `A`, `C`, and `G`, and only
+updates the documents for those hosts; it does not update the documents that
+store information about the other hosts.
+
+Another heuristic can be applied if you use a `date_histogram` to group by time
+buckets: change detection detects which time buckets have changed and only
+updates those.
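+
+The following sketch shows a continuous {transform} that can benefit from both
+heuristics because it groups on a `terms` field and on a `date_histogram`. The
+transform ID, index names, and field names are illustrative placeholders, not
+part of the checkpointing behavior itself:
+
+[source,console]
+----
+PUT _transform/example-host-summary
+{
+  "source": { "index": "host-metrics" },
+  "dest": { "index": "host-metrics-summary" },
+  "frequency": "5m", <1>
+  "sync": {
+    "time": {
+      "field": "@timestamp", <2>
+      "delay": "60s"
+    }
+  },
+  "pivot": {
+    "group_by": {
+      "host": { "terms": { "field": "host.name" } }, <3>
+      "bucket": {
+        "date_histogram": {
+          "field": "@timestamp",
+          "calendar_interval": "1h"
+        }
+      }
+    },
+    "aggregations": {
+      "avg_bytes": { "avg": { "field": "bytes" } }
+    }
+  }
+}
+----
+<1> The interval at which the {transform} checks the source indices for
+changes.
+<2> The time field that the `sync` object uses to identify new and changed
+documents.
+<3> Grouping on a `terms` field and a `date_histogram` allows change detection
+to limit updates to the hostnames and time buckets that have actually changed.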
+
 [discrete]
 [[ml-transform-checkpoint-errors]]
 == Error handling
@@ -61,18 +81,18 @@ persisted periodically.
 Checkpoint failures can be categorized as follows:
 
 * Temporary failures: The checkpoint is retried. If 10 consecutive failures
-occur, the {transform} has a failed status. For example, this situation might 
+occur, the {transform} has a failed status. For example, this situation might
 occur when there are shard failures and queries return only partial results.
-* Irrecoverable failures: The {transform} immediately fails. For example, this 
+* Irrecoverable failures: The {transform} immediately fails. For example, this
 situation occurs when the source index is not found.
-* Adjustment failures: The {transform} retries with adjusted settings. For 
-example, if a parent circuit breaker memory errors occur during the composite 
-aggregation, the {transform} receives partial results. The aggregated search is 
-retried with a smaller number of buckets. This retry is performed at the 
-interval defined in the `frequency` property for the {transform}. If the search 
-is retried to the point where it reaches a minimal number of buckets, an 
+* Adjustment failures: The {transform} retries with adjusted settings. For
+example, if a parent circuit breaker memory error occurs during the composite
+aggregation, the {transform} receives partial results. The aggregated search is
+retried with a smaller number of buckets. This retry is performed at the
+interval defined in the `frequency` property for the {transform}. If the search
+is retried to the point where it reaches a minimal number of buckets, an
+irrecoverable failure occurs.
 irrecoverable failure occurs.
 
-If the node running the {transforms} fails, the {transform} restarts from the 
-most recent persisted cursor position. This recovery process might repeat some 
+If the node running the {transforms} fails, the {transform} restarts from the
+most recent persisted cursor position. This recovery process might repeat some
 of the work the {transform} had already done, but it ensures data consistency.
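+
+For example, an illustrative way to investigate a failed checkpoint (the
+transform ID below is a placeholder) is to check the status, failure reason,
+and checkpointing progress of the {transform} with the stats API:
+
+[source,console]
+----
+GET _transform/example-host-summary/_stats
+----
+
+The response contains the `state` of the {transform}, a `reason` field when it
+has failed, and a `checkpointing` object that describes the last and next
+checkpoints. If adjustment failures caused by circuit breaker exceptions occur
+frequently, one option is to lower the `max_page_search_size` setting of the
+{transform}, which controls the initial page size of the composite aggregation.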