Move bulk API's batch_size parameter to processors (#7719)
* Deprecate batch_size from bulk API & introduce batch_size in two processors

Signed-off-by: Liyun Xiu <[email protected]>

* Remove empty line

Signed-off-by: Liyun Xiu <[email protected]>

* Update _api-reference/document-apis/bulk.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ingest-pipelines/processors/sparse-encoding.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ingest-pipelines/processors/text-embedding.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ml-commons-plugin/remote-models/batch-ingestion.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/remote-models/batch-ingestion.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Liyun Xiu <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
3 people authored Jul 29, 2024
1 parent 98886f8 commit 9f9e6d5
Showing 4 changed files with 8 additions and 4 deletions.
2 changes: 1 addition & 1 deletion _api-reference/document-apis/bulk.md
@@ -59,7 +59,7 @@ routing | String | Routes the request to the specified shard.
timeout | Time | How long to wait for the request to return. Default `1m`.
type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using a type of `_doc` for all indexes.
wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is 1 (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have two replicas distributed across two additional nodes for the request to succeed.
batch_size | Integer | Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `1` (documents are ingested by an ingest pipeline one at a time). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches.
batch_size | Integer | **(Deprecated)** Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `2147483647` (documents are ingested by an ingest pipeline all at once). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches.
{% comment %}_source | List | asdf
_source_excludes | list | asdf
_source_includes | list | asdf{% endcomment %}
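For context, the now-deprecated request-level parameter was passed as a query parameter on the Bulk API. The following sketch illustrates this usage; the index name, pipeline name, and documents are placeholders:

```json
POST _bulk?batch_size=5&pipeline=my-ingest-pipeline
{ "index": { "_index": "my-index", "_id": "1" } }
{ "passage_text": "first document" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "passage_text": "second document" }
```

After this change, batching is configured on the ingest processor itself rather than on each bulk request.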
1 change: 1 addition & 0 deletions _ingest-pipelines/processors/sparse-encoding.md
@@ -41,6 +41,7 @@ The following table lists the required and optional parameters for the `sparse_e
`field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated vector embeddings.
`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
`batch_size` | Integer | Optional | Specifies the number of documents to be batched and processed each time. Default is `1`. |

## Using the processor

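For illustration, a `sparse_encoding` pipeline using the new processor-level parameter might be defined as follows. This is a sketch based on the documented parameters; the pipeline name, model ID, and field names are placeholders:

```json
PUT /_ingest/pipeline/sparse-pipeline
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_text": "passage_embedding"
        },
        "batch_size": 5
      }
    }
  ]
}
```

With `batch_size` set to `5`, the processor sends documents to the externally hosted model in batches of up to five per request.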
1 change: 1 addition & 0 deletions _ingest-pipelines/processors/text-embedding.md
@@ -41,6 +41,7 @@ The following table lists the required and optional parameters for the `text_emb
`field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated text embeddings.
`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
`batch_size` | Integer | Optional | Specifies the number of documents to be batched and processed each time. Default is `1`. |

## Using the processor

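Similarly, a `text_embedding` pipeline can set the same processor-level parameter. The following is a sketch; the pipeline name, model ID, and field names are placeholders:

```json
PUT /_ingest/pipeline/embedding-pipeline
{
  "description": "A text embedding ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_text": "passage_embedding"
        },
        "batch_size": 5
      }
    }
  ]
}
```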
8 changes: 5 additions & 3 deletions _ml-commons-plugin/remote-models/batch-ingestion.md
@@ -14,10 +14,11 @@ grand_parent: Integrating ML models

If you are ingesting multiple documents and generating embeddings by invoking an externally hosted model, you can use batch ingestion to improve performance.

The [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) accepts a `batch_size` parameter that specifies to process documents in batches of a specified size. Processors that support batch ingestion will send each batch of documents to an externally hosted model in a single request.
When using the [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) to ingest documents, processors that support batch ingestion will split documents into batches and send each batch of documents to an externally hosted model in a single request.

The [`text_embedding`]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-embedding/) and [`sparse_encoding`]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/sparse-encoding/) processors currently support batch ingestion.
## Step 1: Register a model group

You can register a model in two ways:
@@ -212,7 +213,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
"model_id": "cleMb4kBJ1eYAeTMFFg4",
"field_map": {
"passage_text": "passage_embedding"
}
},
"batch_size": 5
}
}
]
@@ -222,7 +224,7 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline

## Step 6: Perform bulk indexing

To ingest documents in bulk, call the Bulk API and provide the `batch_size` and `pipeline` parameters. If you don't provide a `pipeline` parameter, the default ingest pipeline for the index will be used for ingestion:
To ingest documents in bulk, call the Bulk API and provide the `pipeline` parameter. If you don't provide a `pipeline` parameter, then the default ingest pipeline for the index will be used for ingestion:

```json
POST _bulk?batch_size=5&pipeline=nlp-ingest-pipeline
