Move bulk API's batch_size parameter to processors (#7719)
* Deprecate batch_size from bulk API & introduce batch_size in two processors

Signed-off-by: Liyun Xiu <[email protected]>

* Remove empty line

Signed-off-by: Liyun Xiu <[email protected]>

* Update _api-reference/document-apis/bulk.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ingest-pipelines/processors/sparse-encoding.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ingest-pipelines/processors/text-embedding.md

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Liyun Xiu <[email protected]>

* Update _ml-commons-plugin/remote-models/batch-ingestion.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/remote-models/batch-ingestion.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Liyun Xiu <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
3 people authored Jul 29, 2024
1 parent 98886f8 commit 9f9e6d5
Showing 4 changed files with 8 additions and 4 deletions.
2 changes: 1 addition & 1 deletion _api-reference/document-apis/bulk.md
@@ -59,7 +59,7 @@ routing | String | Routes the request to the specified shard.
timeout | Time | How long to wait for the request to return. Default `1m`.
type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using a type of `_doc` for all indexes.
wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is 1 (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have two replicas distributed across two additional nodes for the request to succeed.
batch_size | Integer | Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `1` (documents are ingested by an ingest pipeline one at a time). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches.
batch_size | Integer | **(Deprecated)** Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `2147483647` (documents are ingested by an ingest pipeline all at once). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches.
{% comment %}_source | List | asdf
_source_excludes | list | asdf
_source_includes | list | asdf{% endcomment %}
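For context, the now-deprecated request-level parameter was passed as a query parameter on the Bulk API. The following sketch illustrates this usage; the index name, pipeline name, and documents are placeholders:

```json
POST _bulk?batch_size=5&pipeline=my-ingest-pipeline
{ "index": { "_index": "my-index", "_id": "1" } }
{ "passage_text": "first document" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "passage_text": "second document" }
```

After this change, batching is configured on the ingest processor itself rather than on each bulk request.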
1 change: 1 addition & 0 deletions _ingest-pipelines/processors/sparse-encoding.md
@@ -41,6 +41,7 @@ The following table lists the required and optional parameters for the `sparse_e
`field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated vector embeddings.
`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
`batch_size` | Integer | Optional | Specifies the number of documents to be batched and processed each time. Default is `1`. |

## Using the processor

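For illustration, a `sparse_encoding` pipeline using the new processor-level parameter might be defined as follows. This is a sketch based on the documented parameters; the pipeline name, model ID, and field names are placeholders:

```json
PUT /_ingest/pipeline/sparse-pipeline
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_text": "passage_embedding"
        },
        "batch_size": 5
      }
    }
  ]
}
```

With `batch_size` set to `5`, the processor sends documents to the externally hosted model in batches of up to five per request.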
1 change: 1 addition & 0 deletions _ingest-pipelines/processors/text-embedding.md
@@ -41,6 +41,7 @@ The following table lists the required and optional parameters for the `text_emb
`field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated text embeddings.
`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
`batch_size` | Integer | Optional | Specifies the number of documents to be batched and processed each time. Default is `1`. |

## Using the processor

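Similarly, a `text_embedding` pipeline can set the same processor-level parameter. The following is a sketch; the pipeline name, model ID, and field names are placeholders:

```json
PUT /_ingest/pipeline/embedding-pipeline
{
  "description": "A text embedding ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_text": "passage_embedding"
        },
        "batch_size": 5
      }
    }
  ]
}
```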
8 changes: 5 additions & 3 deletions _ml-commons-plugin/remote-models/batch-ingestion.md
@@ -14,10 +14,11 @@ grand_parent: Integrating ML models

If you are ingesting multiple documents and generating embeddings by invoking an externally hosted model, you can use batch ingestion to improve performance.

The [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) accepts a `batch_size` parameter that specifies to process documents in batches of a specified size. Processors that support batch ingestion will send each batch of documents to an externally hosted model in a single request.
When using the [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) to ingest documents, processors that support batch ingestion will split documents into batches and send each batch of documents to an externally hosted model in a single request.

The [`text_embedding`]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-embedding/) and [`sparse_encoding`]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/sparse-encoding/) processors currently support batch ingestion.
## Step 1: Register a model group

You can register a model in two ways:
@@ -212,7 +213,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
"model_id": "cleMb4kBJ1eYAeTMFFg4",
"field_map": {
"passage_text": "passage_embedding"
}
},
"batch_size": 5
}
}
]
@@ -222,7 +224,7 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline

## Step 6: Perform bulk indexing

To ingest documents in bulk, call the Bulk API and provide the `batch_size` and `pipeline` parameters. If you don't provide a `pipeline` parameter, the default ingest pipeline for the index will be used for ingestion:
To ingest documents in bulk, call the Bulk API and provide the `pipeline` parameter. If you don't provide a `pipeline` parameter, then the default ingest pipeline for the index will be used for ingestion:

```json
POST _bulk?batch_size=5&pipeline=nlp-ingest-pipeline
