diff --git a/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
index acb3109e8b3cd..48269886a35d3 100644
--- a/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc
@@ -46,8 +46,8 @@ Controls the amount of time to wait for {infer} results. Defaults to 10 seconds.
`docs`::
(Required, array)
An array of objects to pass to the model for inference. The objects should
-contain a field matching your configured trained model input. Typically, the field
-name is `text_field`. Currently, only a single value is allowed.
+contain a field matching your configured trained model input. Typically, the
+field name is `text_field`. Currently, only a single value is allowed.
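+
+For example, the following request (a sketch using the illustrative model ID
+`model1`) passes a single document whose `text_field` value is evaluated:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/model1/deployment/_infer
+{
+  "docs": [
+    {
+      "text_field": "The movie was awesome!!"
+    }
+  ]
+}
+--------------------------------------------------
+// TEST[skip:TBD]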
////
[[infer-trained-model-deployment-results]]
@@ -62,8 +62,8 @@ name is `text_field`. Currently, only a single value is allowed.
[[infer-trained-model-deployment-example]]
== {api-examples-title}
-The response depends on the task the model is trained for. If it is a
-text classification task, the response is the score. For example:
+The response depends on the task the model is trained for. If it is a text
+classification task, the response is the score. For example:
[source,console]
--------------------------------------------------
@@ -123,8 +123,8 @@ The API returns in this case:
----
// NOTCONSOLE
-Zero-shot classification tasks require extra configuration defining the class labels.
-These labels are passed in the zero-shot inference config.
+Zero-shot classification tasks require extra configuration defining the class
+labels. These labels are passed in the zero-shot inference config.
[source,console]
--------------------------------------------------
@@ -150,7 +150,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]
-The API returns the predicted label and the confidence, as well as the top classes:
+The API returns the predicted label and the confidence, as well as the top
+classes:
[source,console-result]
----
@@ -204,8 +205,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length`, the `is_truncated` field appears in the response.
[source,console-result]
----
diff --git a/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc b/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
index 51a43b845f3e7..b036245def169 100644
--- a/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
+++ b/docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
@@ -6,7 +6,11 @@
Infer trained model
++++
-Evaluates a trained model. The model may be any supervised model either trained by {dfanalytics} or imported.
+Evaluates a trained model. The model may be any supervised model, either
+trained by {dfanalytics} or imported.
+
+NOTE: For model deployments with caching enabled, results may be returned
+directly from the {infer} cache.
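+
+For example, the following request (reusing the illustrative model ID `model2`
+from the examples later on this page) evaluates a single document:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/model2/_infer
+{
+  "docs": [
+    {
+      "text_field": "This is a great film."
+    }
+  ]
+}
+--------------------------------------------------
+// TEST[skip:TBD]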
beta::[]
@@ -102,7 +106,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-fill-mask]
=====
`num_top_classes`::::
(Optional, integer)
-Number of top predicted tokens to return for replacing the mask token. Defaults to `0`.
+Number of top predicted tokens to return for replacing the mask token. Defaults
+to `0`.
`results_field`::::
(Optional, string)
@@ -272,7 +277,8 @@ The maximum amount of words in the answer. Defaults to `15`.
`num_top_classes`::::
(Optional, integer)
-The number the top found answers to return. Defaults to `0`, meaning only the best found answer is returned.
+The number of top found answers to return. Defaults to `0`, meaning only the
+best found answer is returned.
`question`::::
(Required, string)
@@ -368,7 +374,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-text-classific
`num_top_classes`::::
(Optional, integer)
-Specifies the number of top class predictions to return. Defaults to all classes (-1).
+Specifies the number of top class predictions to return. Defaults to all classes
+(-1).
`results_field`::::
(Optional, string)
@@ -879,8 +886,8 @@ POST _ml/trained_models/model2/_infer
--------------------------------------------------
// TEST[skip:TBD]
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length`, the `is_truncated` field appears in the response.
[source,console-result]
----
diff --git a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
index baf2e086c3421..86210998731a0 100644
--- a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.
-Increasing `threads_per_allocation` means more threads are used when
-an inference request is processed on a node. This can improve inference speed
-for certain models. It may also result in improvement to throughput.
+Increasing `threads_per_allocation` means more threads are used when an
+inference request is processed on a node. This can improve inference speed for
+certain models. It may also improve throughput.
-Increasing `number_of_allocations` means more threads are used to
-process multiple inference requests in parallel resulting in throughput
-improvement. Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process
+multiple inference requests in parallel, resulting in improved throughput.
+Each model allocation uses a number of threads defined by
`threads_per_allocation`.
-Model allocations are distributed across {ml} nodes. All allocations assigned
-to a node share the same copy of the model in memory. To avoid
-thread oversubscription which is detrimental to performance, model allocations
-are distributed in such a way that the total number of used threads does not
-surpass the node's allocated processors.
+Model allocations are distributed across {ml} nodes. All allocations assigned to
+a node share the same copy of the model in memory. To avoid thread
+oversubscription, which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass
+the node's allocated processors.
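+
+For example, the following request (a sketch using the illustrative model ID
+`my_model`) starts a deployment with two allocations, each using four threads:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
+--------------------------------------------------
+// TEST[skip:TBD]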
[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
@@ -57,33 +57,36 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]
`cache_size`::
(Optional, <<byte-units,byte value>>)
-The inference cache size (in memory outside the JVM heap) per node for the model.
-The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+The inference cache size (in memory outside the JVM heap) per node for the
+model. The default value is the size of the model as reported by the
+`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
+cache, `0b` can be provided.
`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput.
-Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to 1.
`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+new requests are rejected with a 429 error. Defaults to 1024. Max allowed value
+is 1000000.
`threads_per_allocation`::
(Optional, integer)
-Sets the number of threads used by each model allocation during inference. This generally increases
-the speed per inference request. The inference process is a compute-bound process;
-`threads_per_allocations` must not exceed the number of available allocated processors per node.
-Defaults to 1. Must be a power of 2. Max allowed value is 32.
+Sets the number of threads used by each model allocation during inference. This
+generally increases the speed per inference request. The inference process is a
+compute-bound process; `threads_per_allocation` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2.
+Max allowed value is 32.
`timeout`::
(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults
-to 20 seconds.
+Controls the amount of time to wait for the model to deploy. Defaults to 20
+seconds.
`wait_for`::
(Optional, string)