[8.5] [DOCS] Highlights inference caching behavior (#91608) #91610

Merged · 1 commit · Nov 16, 2022

@@ -46,8 +46,8 @@ Controls the amount of time to wait for {infer} results. Defaults to 10 seconds.
`docs`::
(Required, array)
An array of objects to pass to the model for inference. The objects should
-contain a field matching your configured trained model input. Typically, the field
-name is `text_field`. Currently, only a single value is allowed.
+contain a field matching your configured trained model input. Typically, the
+field name is `text_field`. Currently, only a single value is allowed.

////
[[infer-trained-model-deployment-results]]
@@ -62,8 +62,8 @@ name is `text_field`. Currently, only a single value is allowed.
[[infer-trained-model-deployment-example]]
== {api-examples-title}

-The response depends on the task the model is trained for. If it is a
-text classification task, the response is the score. For example:
+The response depends on the task the model is trained for. If it is a text
+classification task, the response is the score. For example:

[source,console]
--------------------------------------------------
@@ -123,8 +123,8 @@ The API returns in this case:
----
// NOTCONSOLE

-Zero-shot classification tasks require extra configuration defining the class labels.
-These labels are passed in the zero-shot inference config.
+Zero-shot classification tasks require extra configuration defining the class
+labels. These labels are passed in the zero-shot inference config.

[source,console]
--------------------------------------------------
@@ -150,7 +150,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]

-The API returns the predicted label and the confidence, as well as the top classes:
+The API returns the predicted label and the confidence, as well as the top
+classes:

[source,console-result]
----
@@ -204,8 +205,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]

-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length` the `is_truncated` field appears in the response.

[source,console-result]
----
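For reference, a minimal request in the `docs` format described above might look like the sketch below. It reuses the `model2` deployment path that appears in this file's hunk headers; the input text is a placeholder, and the snippet is illustrative rather than part of the change.

[source,console]
--------------------------------------------------
POST _ml/trained_models/model2/deployment/_infer
{
  "docs": [
    {
      "text_field": "The movie was awesome!!"
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]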
docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc (19 changes: 13 additions & 6 deletions)
@@ -6,7 +6,11 @@
<titleabbrev>Infer trained model</titleabbrev>
++++

-Evaluates a trained model. The model may be any supervised model either trained by {dfanalytics} or imported.
+Evaluates a trained model. The model may be any supervised model either trained
+by {dfanalytics} or imported.
+
+NOTE: For model deployments with caching enabled, results may be returned
+directly from the {infer} cache.

beta::[]
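To make the new NOTE concrete, a hedged sketch follows: caching is configured when the deployment is started, and a repeated, identical request may then be answered from the {infer} cache rather than re-evaluated. The model ID, cache size, and input text are placeholders, and passing `cache_size` as a query parameter is an assumption based on the start trained model deployment parameters shown further down; the snippet is not part of this change.

[source,console]
--------------------------------------------------
# Start the deployment with an explicit inference cache size (placeholder values).
POST _ml/trained_models/my_model/deployment/_start?cache_size=100mb

# A repeated, identical document may be served from the cache.
POST _ml/trained_models/my_model/_infer
{
  "docs": [
    {
      "text_field": "The movie was awesome!!"
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]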

@@ -102,7 +106,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-fill-mask]
=====
`num_top_classes`::::
(Optional, integer)
-Number of top predicted tokens to return for replacing the mask token. Defaults to `0`.
+Number of top predicted tokens to return for replacing the mask token. Defaults
+to `0`.

`results_field`::::
(Optional, string)
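As a hedged illustration of `num_top_classes` for the fill-mask task (not part of this change): the request below assumes a model whose mask token is `[MASK]` and uses an inline `inference_config` override; the model ID and input text are placeholders.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_fill_mask_model/_infer
{
  "docs": [
    {
      "text_field": "The capital of France is [MASK]."
    }
  ],
  "inference_config": {
    "fill_mask": {
      "num_top_classes": 3
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]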
@@ -272,7 +277,8 @@ The maximum amount of words in the answer. Defaults to `15`.

`num_top_classes`::::
(Optional, integer)
-The number the top found answers to return. Defaults to `0`, meaning only the best found answer is returned.
+The number the top found answers to return. Defaults to `0`, meaning only the
+best found answer is returned.

`question`::::
(Required, string)
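A hedged sketch of the question answering options above (not part of this change): the model ID, passage, and question are placeholders, and the `question_answering` block is assumed to accept the fields documented here.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_qa_model/_infer
{
  "docs": [
    {
      "text_field": "The Amazon rainforest covers most of the Amazon basin in South America."
    }
  ],
  "inference_config": {
    "question_answering": {
      "question": "Where is the Amazon rainforest?",
      "num_top_classes": 3
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]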
@@ -368,7 +374,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-text-classific

`num_top_classes`::::
(Optional, integer)
-Specifies the number of top class predictions to return. Defaults to all classes (-1).
+Specifies the number of top class predictions to return. Defaults to all classes
+(-1).

`results_field`::::
(Optional, string)
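Similarly, a hedged sketch of `num_top_classes` for the text classification task (not part of this change); the model ID and input text are placeholders.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_text_classification_model/_infer
{
  "docs": [
    {
      "text_field": "This was a waste of two hours."
    }
  ],
  "inference_config": {
    "text_classification": {
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]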
@@ -879,8 +886,8 @@ POST _ml/trained_models/model2/_infer
--------------------------------------------------
// TEST[skip:TBD]

-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length` the `is_truncated` field appears in the response.

[source,console-result]
----
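The truncated-input response itself is elided from this extract; the sketch below is an assumption about how the `is_truncated` flag might appear alongside a prediction, not output copied from the documentation, and the surrounding field names are placeholders.

[source,console-result]
----
{
  "inference_results": [
    {
      "predicted_value": "POSITIVE",
      "is_truncated": true
    }
  ]
}
----
// NOTCONSOLE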
@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

-Increasing `threads_per_allocation` means more threads are used when
-an inference request is processed on a node. This can improve inference speed
-for certain models. It may also result in improvement to throughput.
+Increasing `threads_per_allocation` means more threads are used when an
+inference request is processed on a node. This can improve inference speed for
+certain models. It may also result in improvement to throughput.

-Increasing `number_of_allocations` means more threads are used to
-process multiple inference requests in parallel resulting in throughput
-improvement. Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process
+multiple inference requests in parallel resulting in throughput improvement.
+Each model allocation uses a number of threads defined by
`threads_per_allocation`.

-Model allocations are distributed across {ml} nodes. All allocations assigned
-to a node share the same copy of the model in memory. To avoid
-thread oversubscription which is detrimental to performance, model allocations
-are distributed in such a way that the total number of used threads does not
-surpass the node's allocated processors.
+Model allocations are distributed across {ml} nodes. All allocations assigned to
+a node share the same copy of the model in memory. To avoid thread
+oversubscription which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass
+the node's allocated processors.

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
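To ground the scaling discussion above, here is a hedged sketch of starting a deployment with both settings (not part of this change). The model ID and values are placeholders, and the query-parameter form is an assumption consistent with the parameter list below.

[source,console]
--------------------------------------------------
# 2 allocations with 4 threads each; allocations are spread across ML nodes so that
# the threads in use on any node do not exceed its allocated processors.
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]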
@@ -57,33 +57,36 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`cache_size`::
(Optional, <<byte-units,byte value>>)
-The inference cache size (in memory outside the JVM heap) per node for the model.
-The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+The inference cache size (in memory outside the JVM heap) per node for the
+model. The default value is the size of the model as reported by the
+`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
+cache, `0b` can be provided.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput.
-Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+new requests are rejected with a 429 error. Defaults to 1024. Max allowed value
+is 1000000.

`threads_per_allocation`::
(Optional, integer)
-Sets the number of threads used by each model allocation during inference. This generally increases
-the speed per inference request. The inference process is a compute-bound process;
-`threads_per_allocations` must not exceed the number of available allocated processors per node.
-Defaults to 1. Must be a power of 2. Max allowed value is 32.
+Sets the number of threads used by each model allocation during inference. This
+generally increases the speed per inference request. The inference process is a
+compute-bound process; `threads_per_allocations` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2.
+Max allowed value is 32.

`timeout`::
(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults
-to 20 seconds.
+Controls the amount of time to wait for the model to deploy. Defaults to 20
+seconds.

`wait_for`::
(Optional, string)
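Finally, a hedged sketch combining several of the parameters above (not part of this change): disabling the inference cache with `0b`, enlarging the request queue, and allowing more time for the deployment to start. The model ID and values are placeholders, and the query-parameter form is an assumption.

[source,console]
--------------------------------------------------
# Disable the inference cache, allow a larger queue, and wait up to one minute.
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b&queue_capacity=2048&timeout=1m
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]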