diff --git a/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
index acb3109e8b3cd..48269886a35d3 100644
--- a/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
@@ -46,8 +46,8 @@ Controls the amount of time to wait for {infer} results. Defaults to 10 seconds.
 `docs`::
 (Required, array)
 An array of objects to pass to the model for inference. The objects should
-contain a field matching your configured trained model input. Typically, the field
-name is `text_field`. Currently, only a single value is allowed.
+contain a field matching your configured trained model input. Typically, the
+field name is `text_field`. Currently, only a single value is allowed.
 
 ////
 [[infer-trained-model-deployment-results]]
@@ -62,8 +62,8 @@ name is `text_field`. Currently, only a single value is allowed.
 [[infer-trained-model-deployment-example]]
 == {api-examples-title}
 
-The response depends on the task the model is trained for. If it is a
-text classification task, the response is the score. For example:
+The response depends on the task the model is trained for. If it is a text
+classification task, the response is the score. For example:
 
 [source,console]
 --------------------------------------------------
@@ -123,8 +123,8 @@ The API returns in this case:
 ----
 // NOTCONSOLE
 
-Zero-shot classification tasks require extra configuration defining the class labels.
-These labels are passed in the zero-shot inference config.
+Zero-shot classification tasks require extra configuration defining the class
+labels. These labels are passed in the zero-shot inference config.
 
 [source,console]
 --------------------------------------------------
@@ -150,7 +150,8 @@ POST _ml/trained_models/model2/deployment/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-The API returns the predicted label and the confidence, as well as the top classes:
+The API returns the predicted label and the confidence, as well as the top
+classes:
 
 [source,console-result]
 ----
@@ -204,8 +205,8 @@ POST _ml/trained_models/model2/deployment/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length`, the `is_truncated` field appears in the response.
 
 [source,console-result]
 ----
diff --git a/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc b/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
index 51a43b845f3e7..b036245def169 100644
--- a/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
+++ b/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
@@ -6,7 +6,11 @@
 Infer trained model
 ++++
 
-Evaluates a trained model. The model may be any supervised model either trained by {dfanalytics} or imported.
+Evaluates a trained model. The model may be any supervised model either trained
+by {dfanalytics} or imported.
+
+NOTE: For model deployments with caching enabled, results may be returned
+directly from the {infer} cache.
 
 beta::[]
 
@@ -102,7 +106,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-fill-mask]
 =====
 `num_top_classes`::::
 (Optional, integer)
-Number of top predicted tokens to return for replacing the mask token. Defaults to `0`.
+Number of top predicted tokens to return for replacing the mask token. Defaults
+to `0`.
 
 `results_field`::::
 (Optional, string)
@@ -272,7 +277,8 @@ The maximum amount of words in the answer. Defaults to `15`.
 
 `num_top_classes`::::
 (Optional, integer)
-The number the top found answers to return. Defaults to `0`, meaning only the best found answer is returned.
+The number of top found answers to return. Defaults to `0`, meaning only the
+best found answer is returned.
 
 `question`::::
 (Required, string)
@@ -368,7 +374,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-text-classific
 
 `num_top_classes`::::
 (Optional, integer)
-Specifies the number of top class predictions to return. Defaults to all classes (-1).
+Specifies the number of top class predictions to return. Defaults to all classes
+(-1).
 
 `results_field`::::
 (Optional, string)
@@ -879,8 +886,8 @@ POST _ml/trained_models/model2/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length`, the `is_truncated` field appears in the response.
 
 [source,console-result]
 ----
diff --git a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
index baf2e086c3421..86210998731a0 100644
--- a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <> API.
 Scaling inference performance can be achieved by setting the parameters
 `number_of_allocations` and `threads_per_allocation`.
 
-Increasing `threads_per_allocation` means more threads are used when
-an inference request is processed on a node. This can improve inference speed
-for certain models. It may also result in improvement to throughput.
+Increasing `threads_per_allocation` means more threads are used when an
+inference request is processed on a node. This can improve inference speed for
+certain models. It may also improve throughput.
 
-Increasing `number_of_allocations` means more threads are used to
-process multiple inference requests in parallel resulting in throughput
-improvement. Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process
+multiple inference requests in parallel, which improves throughput.
+Each model allocation uses a number of threads defined by
 `threads_per_allocation`.
 
-Model allocations are distributed across {ml} nodes. All allocations assigned
-to a node share the same copy of the model in memory. To avoid
-thread oversubscription which is detrimental to performance, model allocations
-are distributed in such a way that the total number of used threads does not
-surpass the node's allocated processors.
+Model allocations are distributed across {ml} nodes. All allocations assigned to
+a node share the same copy of the model in memory. To avoid thread
+oversubscription, which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass
+the node's allocated processors.
 
 [[start-trained-model-deployment-path-params]]
 == {api-path-parms-title}
@@ -57,33 +57,36 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]
 
 `cache_size`::
 (Optional, <>)
-The inference cache size (in memory outside the JVM heap) per node for the model.
-The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+The inference cache size (in memory outside the JVM heap) per node for the
+model. The default value is the size of the model as reported by the
+`model_size_bytes` field in the <>. To disable the
+cache, `0b` can be provided.
 
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput.
-Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to 1.
 
 `queue_capacity`::
 (Optional, integer)
 Controls how many inference requests are allowed in the queue at a time.
 Every machine learning node in the cluster where the model can be allocated
 has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+new requests are rejected with a 429 error. Defaults to 1024. The maximum
+allowed value is 1000000.
 
 `threads_per_allocation`::
 (Optional, integer)
-Sets the number of threads used by each model allocation during inference. This generally increases
-the speed per inference request. The inference process is a compute-bound process;
-`threads_per_allocations` must not exceed the number of available allocated processors per node.
-Defaults to 1. Must be a power of 2. Max allowed value is 32.
+Sets the number of threads used by each model allocation during inference. This
+generally increases the speed per inference request. The inference process is a
+compute-bound process; `threads_per_allocation` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2.
+The maximum allowed value is 32.
 
 `timeout`::
 (Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults
-to 20 seconds.
+Controls the amount of time to wait for the model to deploy. Defaults to 20
+seconds.
 
 `wait_for`::
 (Optional, string)
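
The start-deployment query parameters documented in the final hunk can be read together as a single request. The snippet below is an illustrative sketch only, not part of the patch; the model ID `my_model` and every parameter value are hypothetical.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4&queue_capacity=1024&cache_size=1gb&timeout=20s
--------------------------------------------------
// TEST[skip:illustrative sketch, `my_model` is hypothetical]

With these hypothetical values, each of the 2 allocations processes requests with 4 threads, so the deployment uses at most 2 * 4 = 8 threads across the {ml} nodes. Per the constraints in the hunk above, those threads must fit within the nodes' allocated processors, and `threads_per_allocation` must be a power of 2 no greater than 32.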