From c817d9b929a673e93051f08a3c4d2761338ff163 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Wed, 7 Aug 2024 18:31:08 +0000 Subject: [PATCH] Add documentation for ingest-attachment plugin (#7891) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * add ingest-attachment plugin doc Signed-off-by: Ricky Lippmann * extend ingest-attachment with information how to limit content Signed-off-by: Ricky Lippmann * Added target_bulk_bytes to the docs for logstash-output plugin (#7869) * Added target_bulk_bytes Signed-off-by: Sander van de Geijn * Update _tools/logstash/ship-to-opensearch.md Nice Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Sander van de Geijn * Update _tools/logstash/ship-to-opensearch.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update ship-to-opensearch.md * Remove "we" * Update ship-to-opensearch.md * Update ship-to-opensearch.md * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Sander van de Geijn Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * Add doc for binary format support in k-NN (#7840) * Add doc for binary format support in k-NN Signed-off-by: Junqiu Lei * Resolve tech feedback Signed-off-by: Junqiu Lei * Doc review Signed-off-by: Fanit Kolchina * Add newline Signed-off-by: Fanit Kolchina * Formatting Signed-off-by: Fanit Kolchina * Link fix Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: 
Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Add query results to examples Signed-off-by: Junqiu Lei * Rephrased sentences and changed vector field name Signed-off-by: Fanit Kolchina * Editorial review Signed-off-by: Fanit Kolchina * Remove details from one of the requests Signed-off-by: Fanit Kolchina --------- Signed-off-by: Junqiu Lei Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * Edit for redundant information and sections across Data Prepper (#7127) * Edit for redundant information and sections across Data Prepper Signed-off-by: Melissa Vagi * Edit for redundant information and sections across Data Prepper Signed-off-by: Melissa Vagi * Rewrite expression syntax and reorganize doc structure for readability Signed-off-by: Melissa Vagi * Rewrite expression syntax and reorganize doc structure for readability Signed-off-by: Melissa Vagi * Rewrite expression syntax and reorganize doc structure for readability Signed-off-by: Melissa Vagi * Rewrite expression syntax and reorganize doc structure for readability Signed-off-by: Melissa Vagi * Rewrite expression syntax and reorganize doc structure for readability Signed-off-by: Melissa Vagi * Update _data-prepper/index.md Signed-off-by: Melissa Vagi * Update configuring-data-prepper.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi * Update _data-prepper/pipelines/expression-syntax.md Signed-off-by: Melissa Vagi * Update _data-prepper/pipelines/expression-syntax.md Signed-off-by: Melissa Vagi * Update _data-prepper/pipelines/pipelines.md Signed-off-by: Melissa Vagi * Update expression-syntax.md Signed-off-by: Melissa Vagi * Create Functions subpages Signed-off-by: Melissa Vagi * Create functions subpages Signed-off-by: Melissa Vagi * Copy edit 
Signed-off-by: Melissa Vagi * add remaining subpages Signed-off-by: Melissa Vagi * Update _data-prepper/index.md Co-authored-by: Nathan Bower Signed-off-by: Heather Halter * Apply suggestions from code review Accepted editorial suggestions. Co-authored-by: Nathan Bower Signed-off-by: Heather Halter * Apply suggestions from code review Accepted more editorial suggestions that were hidden. Co-authored-by: Nathan Bower Signed-off-by: Heather Halter * Apply suggestions from code review Co-authored-by: Heather Halter Signed-off-by: David Venable * removed-line Signed-off-by: Heather Halter * Fixed broken link to pipelines Signed-off-by: Heather Halter * Fixed broken links on Update add-entries.md Signed-off-by: Heather Halter * Fixed broken link in Update dynamo-db.md Signed-off-by: Heather Halter * Fixed link syntax in Update index.md Signed-off-by: Heather Halter --------- Signed-off-by: Melissa Vagi Signed-off-by: Heather Halter Signed-off-by: David Venable Signed-off-by: Heather Halter Co-authored-by: Heather Halter Co-authored-by: Nathan Bower Co-authored-by: David Venable Signed-off-by: Ricky Lippmann * Update index.md (#7893) fixed typo Signed-off-by: Philipp Dünnebeil <53494432+PhilD90@users.noreply.github.com> Signed-off-by: Ricky Lippmann * Fix typo and make left nav heading uniform for neural sparse processor (#7895) Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: Ricky Lippmann * Add custom JSON lexer and highlighting color scheme (#7892) * Add custom JSON lexer and highlighting color scheme Signed-off-by: Fanit Kolchina * Update _getting-started/quickstart.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * Add model names to Vale (#7901) Signed-off-by: Fanit Kolchina Signed-off-by: 
Ricky Lippmann * Renamed data prepper files to have dashes for consistency (#7790) * Renamed data prepper files to have dashes for consistency Signed-off-by: Fanit Kolchina * More files Signed-off-by: Fanit Kolchina --------- Signed-off-by: Fanit Kolchina Signed-off-by: Ricky Lippmann * Add documentation for ml inference search request processor/ search response processor (#7852) * draft ml inference search request processor Signed-off-by: Mingshi Liu * add doc Signed-off-by: Mingshi Liu * add doc Signed-off-by: Mingshi Liu * Doc review Signed-off-by: Fanit Kolchina * Fixed links Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Unify processor docs Signed-off-by: Fanit Kolchina * Update _query-dsl/geo-and-xy/xy.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Remove note Signed-off-by: Fanit Kolchina * Fix link Signed-off-by: Fanit Kolchina --------- Signed-off-by: Mingshi Liu Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * Refactor k-NN documentation (#7890) * Refactor k-NN documentation Signed-off-by: Fanit Kolchina * Change field name for cohesiveness Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Heather Halter Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Heather Halter Co-authored-by: Nathan Bower Signed-off-by: Ricky 
Lippmann * Ml commons batch inference (#7899) * add batch inference API Signed-off-by: Xun Zhang * add more links and mark the api as experimental Signed-off-by: Xun Zhang * use openAI as the blueprint example details Signed-off-by: Xun Zhang * address comments Signed-off-by: Xun Zhang * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Xun Zhang Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * Remove repeated sentence in distributed tracing doc (#7906) Signed-off-by: Peter Alfonsi Co-authored-by: Peter Alfonsi Signed-off-by: Ricky Lippmann * Add apostrophe token filter page #7871 (#7884) * adding apostrophe token filter page #7871 Signed-off-by: AntonEliatra * fixing vale error Signed-off-by: AntonEliatra * Update apostrophe-token-filter.md Signed-off-by: AntonEliatra * updating the naming Signed-off-by: AntonEliatra * updating as per the review comments Signed-off-by: AntonEliatra * updating the heading to Apostrophe token filter Signed-off-by: AntonEliatra * updating as per PR comments Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: AntonEliatra --------- Signed-off-by: AntonEliatra Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower Signed-off-by: Ricky Lippmann * removed unnecessary backslash Signed-off-by: Ricky Lippmann * fix:add missing whitespace in table Signed-off-by: Ricky Lippmann * docs: add link to tika supported file 
formats Signed-off-by: Ricky Lippmann * Update ingest-attachment-plugin.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * adjust to keep technical specific information with improved wording Signed-off-by: Ricky Lippmann * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Ricky Lippmann Signed-off-by: Sander van de Geijn Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Junqiu Lei Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: Melissa Vagi Signed-off-by: Heather Halter Signed-off-by: David Venable Signed-off-by: Heather Halter Signed-off-by: Philipp Dünnebeil <53494432+PhilD90@users.noreply.github.com> Signed-off-by: Mingshi Liu Signed-off-by: Xun Zhang Signed-off-by: Peter Alfonsi Signed-off-by: AntonEliatra Co-authored-by: Sander van de Geijn Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower Co-authored-by: Junqiu Lei Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Melissa Vagi Co-authored-by: Heather Halter Co-authored-by: David Venable Co-authored-by: Philipp Dünnebeil <53494432+PhilD90@users.noreply.github.com> Co-authored-by: Mingshi Liu Co-authored-by: Xun Zhang Co-authored-by: Peter Alfonsi Co-authored-by: Peter Alfonsi Co-authored-by: AntonEliatra (cherry picked from commit 8b731c55e23b63a68ba77c73370fd4116f5c4604) Signed-off-by: github-actions[bot] --- .../additional-plugins/index.md | 6 +- .../ingest-attachment-plugin.md | 228 ++++++++++++++++++ 2 files 
changed, 231 insertions(+), 3 deletions(-)
 create mode 100644 _install-and-configure/additional-plugins/ingest-attachment-plugin.md

diff --git a/_install-and-configure/additional-plugins/index.md b/_install-and-configure/additional-plugins/index.md
index de97af0b1a..87d0662442 100644
--- a/_install-and-configure/additional-plugins/index.md
+++ b/_install-and-configure/additional-plugins/index.md
@@ -9,7 +9,6 @@ nav_order: 10
 
 There are many more plugins available in addition to those provided by the standard distribution of OpenSearch. These additional plugins have been built by OpenSearch developers or members of the OpenSearch community. While it isn't possible to provide an exhaustive list (because many plugins are not maintained in an OpenSearch GitHub repository), the following plugins, available in the [OpenSearch/plugins](https://github.com/opensearch-project/OpenSearch/tree/main/plugins) directory on GitHub, are some of the plugins that can be installed using one of the installation options, for example, using the command `bin/opensearch-plugin install `.
 
-
 | Plugin name | Earliest available version |
 | :--- | :--- |
 | analysis-icu | 1.0.0 |
@@ -22,7 +21,7 @@ There are many more plugins available in addition to those provided by the stand
 | discovery-azure-classic | 1.0.0 |
 | discovery-ec2 | 1.0.0 |
 | discovery-gce | 1.0.0 |
-| ingest-attachment | 1.0.0 |
+| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 |
 | mapper-annotated-text | 1.0.0 |
 | mapper-murmur3 | 1.0.0 |
 | [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 |
@@ -34,7 +33,8 @@ There are many more plugins available in addition to those provided by the stand
 | store-smb | 1.0.0 |
 | transport-nio | 1.0.0 |
 
-
 ## Related articles
+
 [Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/)
+[`ingest-attachment` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/)
 [`mapper-size` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/)
diff --git a/_install-and-configure/additional-plugins/ingest-attachment-plugin.md b/_install-and-configure/additional-plugins/ingest-attachment-plugin.md
new file mode 100644
index 0000000000..d2062f441b
--- /dev/null
+++ b/_install-and-configure/additional-plugins/ingest-attachment-plugin.md
@@ -0,0 +1,228 @@
+---
+layout: default
+title: Ingest-attachment plugin
+parent: Installing plugins
+nav_order: 20
+
+---
+
+# Ingest-attachment plugin
+
+The `ingest-attachment` plugin enables OpenSearch to extract content and other information from files using the Apache text extraction library [Tika](https://tika.apache.org/).
+Supported document formats include PPT, PDF, RTF, ODF, and many more (see the Tika [Supported Document Formats](https://tika.apache.org/2.9.2/formats.html) page).
+
+The input field must be a base64-encoded binary.
+
+## Installing the plugin
+
+Install the `ingest-attachment` plugin using the following command:
+
+```sh
+./bin/opensearch-plugin install ingest-attachment
+```
+
+## Attachment processor options
+
+| Name | Required | Default | Description |
+| :--- | :--- | :--- | :--- |
+| `field` | Yes | N/A | The field from which to get the base64-encoded binary. |
+| `target_field` | No | `attachment` | The field that stores the attachment information. |
+| `properties` | No | All properties | An array of properties that should be stored. Can be `content`, `language`, `date`, `title`, `author`, `keywords`, `content_type`, or `content_length`. |
+| `indexed_chars` | No | `100_000` | The maximum number of characters to extract, which prevents fields from becoming too large. Use `-1` for no limit. |
+| `indexed_chars_field` | No | `null` | The name of a document field whose value overrides the number of characters used for extraction, for example, `max_chars`. |
+| `ignore_missing` | No | `false` | When `true`, the processor exits without modifying the document when the specified field doesn't exist. |
+
+## Example
+
+The following steps show you how to get started with the `ingest-attachment` plugin.
+
+### Step 1: Create an index for storing your attachments
+
+The following command creates an index for storing your attachments:
+
+```json
+PUT /example-attachment-index
+{
+  "mappings": {
+    "properties": {}
+  }
+}
+```
+
+### Step 2: Create a pipeline
+
+The following command creates a pipeline containing the attachment processor:
+
+```json
+PUT _ingest/pipeline/attachment
+{
+  "description" : "Extract attachment information",
+  "processors" : [
+    {
+      "attachment" : {
+        "field" : "data"
+      }
+    }
+  ]
+}
+```
+
+### Step 3: Store an attachment
+
+Convert the attachment to a base64 string to pass it as `data`.
+In this example, the `base64` command converts the file `lorem.rtf`:
+
+```sh
+base64 lorem.rtf
+```
+
+Alternatively, you can use Node.js to read the file as a base64 string, as shown in the following code:
+
+```typescript
+import * as fs from "node:fs/promises";
+import path from "node:path";
+
+const filePath = path.join(import.meta.dirname, "lorem.rtf");
+const base64File = await fs.readFile(filePath, { encoding: "base64" });
+
+console.log(base64File);
+```
+
+The `.rtf` file contains the following text, shown with its base64 encoding:
+
+`Lorem ipsum dolor sit amet`:
+`e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=`.
+
+```json
+PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
+{
+  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
+}
+```
+
+### Query results
+
+With the attachment processed, you can now search through the data using search queries, as shown in the following example:
+
+```json
+POST example-attachment-index/_search
+{
+  "query": {
+    "match": {
+      "attachment.content": "ipsum"
+    }
+  }
+}
+```
+
+OpenSearch responds with the following:
+
+```json
+{
+  "took": 5,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 1,
+      "relation": "eq"
+    },
+    "max_score": 1.1724279,
+    "hits": [
+      {
+        "_index": "example-attachment-index",
+        "_id": "lorem_rtf",
+        "_score": 1.1724279,
+        "_source": {
+          "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
+          "attachment": {
+            "content_type": "application/rtf",
+            "language": "pt",
+            "content": "Lorem ipsum dolor sit amet",
+            "content_length": 28
+          }
+        }
+      }
+    ]
+  }
+}
+```
+
+## Extracted information
+
+The following fields can be extracted using the plugin:
+
+- `content`
+- `language`
+- `date`
+- `title`
+- `author`
+- `keywords`
+- `content_type`
+- `content_length`
+
+To extract only a subset of these fields, define them in the `properties` of the
+pipeline processor, as shown in the following example:
+
+```json
+PUT _ingest/pipeline/attachment
+{
+  "description" : "Extract attachment information",
+  "processors" : [
+    {
+      "attachment" : {
+        "field" : "data",
+        "properties": ["content", "title", "author"]
+      }
+    }
+  ]
+}
+```
+
+## Limit the extracted content
+
+To avoid extracting too many characters and overloading node memory, extraction is limited to `100_000` characters by default.
+You can change this value using the `indexed_chars` setting. For example, you can use `-1` for unlimited characters, but you need to make sure you have enough heap space on your OpenSearch node to extract the content of large documents.
+
+You can also define this limit per document using the `indexed_chars_field` option, which names a field in the document.
+If a document contains the field named by `indexed_chars_field`, its value overrides the `indexed_chars` setting, as shown in the following example:
+
+```json
+PUT _ingest/pipeline/attachment
+{
+  "description" : "Extract attachment information",
+  "processors" : [
+    {
+      "attachment" : {
+        "field" : "data",
+        "indexed_chars" : 10,
+        "indexed_chars_field" : "max_chars"
+      }
+    }
+  ]
+}
+```
+
+With the attachment pipeline configured, you can extract the default `10` characters without specifying `max_chars` in the request, as shown in the following example:
+
+```json
+PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
+{
+  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
+}
+```
+
+Alternatively, you can set `max_chars` per document in order to extract up to `15` characters, as shown in the following example:
+
+```json
+PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
+{
+  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
+  "max_chars": 15
+}
+```
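
As a rough client-side sketch of the per-document limit described in the last section of the new page, the following TypeScript builds the two request bodies shown there. The names `AttachmentDocument`, `toBase64`, and `buildAttachmentDocument` are hypothetical and not part of the plugin, and the sketch assumes the pipeline sets `indexed_chars_field` to `max_chars`. It only constructs the JSON body and does not send anything to OpenSearch.

```typescript
// Illustrative sketch only; `AttachmentDocument`, `toBase64`, and
// `buildAttachmentDocument` are hypothetical names, not plugin APIs.
interface AttachmentDocument {
  data: string; // base64-encoded file content, read by the processor's `field`
  max_chars?: number; // assumed to match `indexed_chars_field` in the pipeline
}

// Encode raw bytes as base64 using the global `btoa`.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Build the request body; omit `max_chars` to fall back to `indexed_chars`.
function buildAttachmentDocument(bytes: Uint8Array, maxChars?: number): AttachmentDocument {
  const doc: AttachmentDocument = { data: toBase64(bytes) };
  if (maxChars !== undefined) {
    doc.max_chars = maxChars;
  }
  return doc;
}

// The RTF source of the `lorem.rtf` example used in the new page.
const rtfBytes = new TextEncoder().encode(
  "{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"
);
const limitedDoc = buildAttachmentDocument(rtfBytes, 15);
console.log(JSON.stringify(limitedDoc));
```

Leaving the limit out of the body, rather than sending a placeholder value, keeps the pipeline-level `indexed_chars` default in effect for documents that do not need their own limit.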