[8.5] [DOCS] Highlights inference caching behavior (#91608) #91610

Merged · 1 commit · Nov 16, 2022

@@ -46,8 +46,8 @@ Controls the amount of time to wait for {infer} results. Defaults to 10 seconds.
`docs`::
(Required, array)
An array of objects to pass to the model for inference. The objects should
-contain a field matching your configured trained model input. Typically, the field
-name is `text_field`. Currently, only a single value is allowed.
+contain a field matching your configured trained model input. Typically, the
+field name is `text_field`. Currently, only a single value is allowed.

////
[[infer-trained-model-deployment-results]]
@@ -62,8 +62,8 @@ name is `text_field`. Currently, only a single value is allowed.
[[infer-trained-model-deployment-example]]
== {api-examples-title}

-The response depends on the task the model is trained for. If it is a
-text classification task, the response is the score. For example:
+The response depends on the task the model is trained for. If it is a text
+classification task, the response is the score. For example:

[source,console]
--------------------------------------------------
@@ -123,8 +123,8 @@ The API returns in this case:
----
// NOTCONSOLE

-Zero-shot classification tasks require extra configuration defining the class labels.
-These labels are passed in the zero-shot inference config.
+Zero-shot classification tasks require extra configuration defining the class
+labels. These labels are passed in the zero-shot inference config.

[source,console]
--------------------------------------------------
@@ -150,7 +150,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]

-The API returns the predicted label and the confidence, as well as the top classes:
+The API returns the predicted label and the confidence, as well as the top
+classes:

[source,console-result]
----
@@ -204,8 +205,8 @@ POST _ml/trained_models/model2/deployment/_infer
--------------------------------------------------
// TEST[skip:TBD]

-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length` the `is_truncated` field appears in the response.

[source,console-result]
----
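For reference, a minimal request in the `docs` format described above might look like the sketch below. It reuses the `model2` deployment path that appears in this file's hunk headers; the input text is a placeholder, and the snippet is illustrative rather than part of the change.

[source,console]
--------------------------------------------------
POST _ml/trained_models/model2/deployment/_infer
{
  "docs": [
    {
      "text_field": "The movie was awesome!!"
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]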
docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc (19 changes: 13 additions & 6 deletions)
@@ -6,7 +6,11 @@
<titleabbrev>Infer trained model</titleabbrev>
++++

-Evaluates a trained model. The model may be any supervised model either trained by {dfanalytics} or imported.
+Evaluates a trained model. The model may be any supervised model either trained
+by {dfanalytics} or imported.
+
+NOTE: For model deployments with caching enabled, results may be returned
+directly from the {infer} cache.

beta::[]
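To make the new NOTE concrete, a hedged sketch follows: caching is configured when the deployment is started, and a repeated, identical request may then be answered from the {infer} cache rather than re-evaluated. The model ID, cache size, and input text are placeholders, and passing `cache_size` as a query parameter is an assumption based on the start trained model deployment parameters shown further down; the snippet is not part of this change.

[source,console]
--------------------------------------------------
# Start the deployment with an explicit inference cache size (placeholder values).
POST _ml/trained_models/my_model/deployment/_start?cache_size=100mb

# A repeated, identical document may be served from the cache.
POST _ml/trained_models/my_model/_infer
{
  "docs": [
    {
      "text_field": "The movie was awesome!!"
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]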

@@ -102,7 +106,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-fill-mask]
=====
`num_top_classes`::::
(Optional, integer)
-Number of top predicted tokens to return for replacing the mask token. Defaults to `0`.
+Number of top predicted tokens to return for replacing the mask token. Defaults
+to `0`.

`results_field`::::
(Optional, string)
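As a hedged illustration of `num_top_classes` for the fill-mask task (not part of this change): the request below assumes a model whose mask token is `[MASK]` and uses an inline `inference_config` override; the model ID and input text are placeholders.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_fill_mask_model/_infer
{
  "docs": [
    {
      "text_field": "The capital of France is [MASK]."
    }
  ],
  "inference_config": {
    "fill_mask": {
      "num_top_classes": 3
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]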
@@ -272,7 +277,8 @@ The maximum amount of words in the answer. Defaults to `15`.

`num_top_classes`::::
(Optional, integer)
-The number the top found answers to return. Defaults to `0`, meaning only the best found answer is returned.
+The number the top found answers to return. Defaults to `0`, meaning only the
+best found answer is returned.

`question`::::
(Required, string)
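A hedged sketch of the question answering options above (not part of this change): the model ID, passage, and question are placeholders, and the `question_answering` block is assumed to accept the fields documented here.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_qa_model/_infer
{
  "docs": [
    {
      "text_field": "The Amazon rainforest covers most of the Amazon basin in South America."
    }
  ],
  "inference_config": {
    "question_answering": {
      "question": "Where is the Amazon rainforest?",
      "num_top_classes": 3
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]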
@@ -368,7 +374,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-text-classific

`num_top_classes`::::
(Optional, integer)
-Specifies the number of top class predictions to return. Defaults to all classes (-1).
+Specifies the number of top class predictions to return. Defaults to all classes
+(-1).

`results_field`::::
(Optional, string)
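Similarly, a hedged sketch of `num_top_classes` for the text classification task (not part of this change); the model ID and input text are placeholders.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_text_classification_model/_infer
{
  "docs": [
    {
      "text_field": "This was a waste of two hours."
    }
  ],
  "inference_config": {
    "text_classification": {
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]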
@@ -879,8 +886,8 @@ POST _ml/trained_models/model2/_infer
--------------------------------------------------
// TEST[skip:TBD]

-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's
+`max_sequence_length` the `is_truncated` field appears in the response.

[source,console-result]
----
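The truncated-input response itself is elided from this extract; the sketch below is an assumption about how the `is_truncated` flag might appear alongside a prediction, not output copied from the documentation, and the surrounding field names are placeholders.

[source,console-result]
----
{
  "inference_results": [
    {
      "predicted_value": "POSITIVE",
      "is_truncated": true
    }
  ]
}
----
// NOTCONSOLE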
@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

-Increasing `threads_per_allocation` means more threads are used when
-an inference request is processed on a node. This can improve inference speed
-for certain models. It may also result in improvement to throughput.
+Increasing `threads_per_allocation` means more threads are used when an
+inference request is processed on a node. This can improve inference speed for
+certain models. It may also result in improvement to throughput.

-Increasing `number_of_allocations` means more threads are used to
-process multiple inference requests in parallel resulting in throughput
-improvement. Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process
+multiple inference requests in parallel resulting in throughput improvement.
+Each model allocation uses a number of threads defined by
`threads_per_allocation`.

-Model allocations are distributed across {ml} nodes. All allocations assigned
-to a node share the same copy of the model in memory. To avoid
-thread oversubscription which is detrimental to performance, model allocations
-are distributed in such a way that the total number of used threads does not
-surpass the node's allocated processors.
+Model allocations are distributed across {ml} nodes. All allocations assigned to
+a node share the same copy of the model in memory. To avoid thread
+oversubscription which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass
+the node's allocated processors.

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
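To ground the scaling discussion above, here is a hedged sketch of starting a deployment with both settings (not part of this change). The model ID and values are placeholders, and the query-parameter form is an assumption consistent with the parameter list below.

[source,console]
--------------------------------------------------
# 2 allocations with 4 threads each; allocations are spread across ML nodes so that
# the threads in use on any node do not exceed its allocated processors.
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]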
@@ -57,33 +57,36 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`cache_size`::
(Optional, <<byte-units,byte value>>)
-The inference cache size (in memory outside the JVM heap) per node for the model.
-The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+The inference cache size (in memory outside the JVM heap) per node for the
+model. The default value is the size of the model as reported by the
+`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
+cache, `0b` can be provided.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput.
-Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+new requests are rejected with a 429 error. Defaults to 1024. Max allowed value
+is 1000000.

`threads_per_allocation`::
(Optional, integer)
-Sets the number of threads used by each model allocation during inference. This generally increases
-the speed per inference request. The inference process is a compute-bound process;
-`threads_per_allocations` must not exceed the number of available allocated processors per node.
-Defaults to 1. Must be a power of 2. Max allowed value is 32.
+Sets the number of threads used by each model allocation during inference. This
+generally increases the speed per inference request. The inference process is a
+compute-bound process; `threads_per_allocations` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2.
+Max allowed value is 32.

`timeout`::
(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults
-to 20 seconds.
+Controls the amount of time to wait for the model to deploy. Defaults to 20
+seconds.

`wait_for`::
(Optional, string)
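Finally, a hedged sketch combining several of the parameters above (not part of this change): disabling the inference cache with `0b`, enlarging the request queue, and allowing more time for the deployment to start. The model ID and values are placeholders, and the query-parameter form is an assumption.

[source,console]
--------------------------------------------------
# Disable the inference cache, allow a larger queue, and wait up to one minute.
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b&queue_capacity=2048&timeout=1m
--------------------------------------------------
// TEST[skip:illustrative sketch, not part of this change]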