Add documentation for ingest-attachment plugin #7891

Merged
merged 25 commits into main from add-documentation-ingest-attachment-plugin on Aug 7, 2024
Changes from 1 commit
25 commits
8abd551
add ingest-attachment plugin doc
ldrick Aug 1, 2024
a24fe1c
extend ingest-attachment with information how to limit content
ldrick Aug 2, 2024
0138e74
Added target_bulk_bytes to the docs for logstash-output plugin (#7869)
sandervandegeijn Aug 1, 2024
0fc2c26
Add doc for binary format support in k-NN (#7840)
junqiu-lei Aug 1, 2024
b1e3a77
Edit for redundant information and sections across Data Prepper (#7127)
vagimeli Aug 1, 2024
db40898
Update index.md (#7893)
PhilD90 Aug 2, 2024
9100add
Fix typo and make left nav heading uniform for neural sparse processo…
kolchfa-aws Aug 2, 2024
c81a16a
Add custom JSON lexer and highlighting color scheme (#7892)
kolchfa-aws Aug 2, 2024
ebe683b
Add model names to Vale (#7901)
kolchfa-aws Aug 2, 2024
aa5c433
Renamed data prepper files to have dashes for consistency (#7790)
kolchfa-aws Aug 2, 2024
fbbd2fd
Add documentation for ml inference search request processor/ search r…
mingshl Aug 2, 2024
e7fdc75
Refactor k-NN documentation (#7890)
kolchfa-aws Aug 5, 2024
0d69f35
Ml commons batch inference (#7899)
Zhangxunmt Aug 5, 2024
e0a9283
Remove repeated sentence in distributed tracing doc (#7906)
peteralfonsi Aug 6, 2024
c71cee8
Add apostrophe token filter page #7871 (#7884)
AntonEliatra Aug 6, 2024
5d8563a
removed unnecessary backslash
ldrick Aug 6, 2024
ed4d3c6
fix:add missing whitespace in table
ldrick Aug 6, 2024
d327a3c
docs: add link to tika supported file formats
ldrick Aug 6, 2024
76cbf7f
Merge branch 'main' into add-documentation-ingest-attachment-plugin
ldrick Aug 6, 2024
19a5838
Update ingest-attachment-plugin.md
Naarcha-AWS Aug 6, 2024
02169c4
Apply suggestions from code review
Naarcha-AWS Aug 6, 2024
073cafe
Merge branch 'main' into add-documentation-ingest-attachment-plugin
Naarcha-AWS Aug 7, 2024
c94789b
adjust to keep technical specific information with improved wording
ldrick Aug 7, 2024
0724235
Apply suggestions from code review
Naarcha-AWS Aug 7, 2024
6be0bba
Apply suggestions from code review
Naarcha-AWS Aug 7, 2024
Apply suggestions from code review
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Naarcha-AWS and natebower authored Aug 7, 2024
commit 0724235d630bab6431d9dd2cd1de2668f8d9df11
@@ -8,8 +8,8 @@

# Ingest-attachment plugin

The Ingest-attachment plugin enables OpenSearch to extract content and other information from files using the Apache text extraction library [Tika](https://tika.apache.org/).
Supported document formats include PPT, PDF, RTF, ODF and many more ([Tika Supported Document Formats](https://tika.apache.org/2.9.2/formats.html)).
The `ingest-attachment` plugin enables OpenSearch to extract content and other information from files using the Apache text extraction library [Tika](https://tika.apache.org/).

Supported document formats include PPT, PDF, RTF, ODF, and many more. For a complete list, see the Tika [Supported Document Formats](https://tika.apache.org/2.9.2/formats.html) page.

The input field must be a base64-encoded binary.

@@ -25,18 +25,18 @@

| Name | Required | Default | Description |
| :--- | :--- | :--- | :--- |
| `field` | yes | - | The field to get base64 encoded binary from. |
| `target_field` | no | attachment | The field that holds the attachment information. |
| `properties` | no | all properties | An array of properties, which should be stored. Can be `content`, `language`, `date`, `title`, `author`, `keywords`, `content_type`, `content_length`. |
| `indexed_chars` | no | `100_000` | The number of character used for extraction to prevent fields from becoming to large. Use `-1` for no limit. |
| `indexed_chars_field` | no | `null` | The field name from which you can overwrite the number of chars being used for extraction, for example, `indexed_chars`. |
| `ignore_missing` | no | `false` | When `true`, the processor exits without modifying the document when the specified field doesn't exist. |
| `field` | Yes | - | The field from which to get the base64-encoded binary. |
| `target_field` | No | `attachment` | The field that stores the attachment information. |
| `properties` | No | All properties | An array of properties that should be stored. Can be `content`, `language`, `date`, `title`, `author`, `keywords`, `content_type`, or `content_length`. |
| `indexed_chars` | No | `100_000` | The number of characters used for extraction to prevent fields from becoming too large. Use `-1` for no limit. |
| `indexed_chars_field` | No | `null` | The name of a document field that, when present, overrides the `indexed_chars` setting for that document. |
| `ignore_missing` | No | `false` | When `true`, the processor exits without modifying the document when the specified field doesn't exist. |
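
As a quick illustration of these parameters, the following sketch configures a processor that reads from the `data` field used in the examples below, keeps only a few of the available properties, and caps extraction. The pipeline name and the specific values shown here are assumptions chosen for illustration only:

```json
PUT _ingest/pipeline/attachment-example
{
  "description": "Illustrative attachment processor configuration",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment",
        "properties": ["content", "content_type", "content_length"],
        "indexed_chars": 100000,
        "ignore_missing": true
      }
    }
  ]
}
```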

natebower marked this conversation as resolved.
Show resolved Hide resolved
## Example

The following steps show how to get started with the Ingest-attachment plugin.
The following steps show you how to get started with the `ingest-attachment` plugin.

natebower marked this conversation as resolved.
Show resolved Hide resolved
### Create an index to store your attachments
### Step 1: Create an index for storing your attachments

The following command creates an index for storing your attachments:

@@ -49,9 +49,9 @@
}
```
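
A minimal version of this request might look like the following sketch. The index name `example-attachment-index` is taken from the indexing examples later on this page; the mapping shown is an assumption:

```json
PUT example-attachment-index
{
  "mappings": {
    "properties": {
      "data": {
        "type": "text"
      }
    }
  }
}
```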

### Create a pipeline with attachment processor
### Step 2: Create a pipeline

The following command creates a pipeline which contains the attachment processor:
The following command creates a pipeline containing the attachment processor:

```json
PUT _ingest/pipeline/attachment
@@ -67,16 +67,16 @@
}
```
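
For reference, a complete pipeline definition along these lines might look like the following sketch. The pipeline name and the `data` source field match the examples on this page, while the description text is an assumption:

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
```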

### Store an attachment
### Step 3: Store an attachment

Convert the attachment to base64 string, to pass it as `data`.
In this example the Unix-like system `base64` command converts the file `lorem.rtf`:
Convert the attachment to a base64 string to pass it as `data`.
In this example, the `base64` command converts the file `lorem.rtf`:

```sh
base64 lorem.rtf
```

Alternatively you can use Node.js to read the file to `base64`, as shown in the following commands:
Alternatively, you can use Node.js to read the file to `base64`, as shown in the following commands:

```typescript
import * as fs from "node:fs/promises";
@@ -88,7 +88,7 @@
console.log(base64File);
```
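
A complete version of that Node.js snippet might look like the following sketch. Apart from the import and the `base64File` variable shown above, the file name and variable names are assumptions:

```typescript
import * as fs from "node:fs/promises";

// Read the file into a Buffer, then encode its contents as a base64 string.
const fileContents = await fs.readFile("lorem.rtf");
const base64File = fileContents.toString("base64");
console.log(base64File);
```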

The following base64 string is for an `.rtf` file containing the text
The following base64 string represents an `.rtf` file containing the text

`Lorem ipsum dolor sit amet`:
`e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=`.
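
Putting these pieces together, the indexing request through the pipeline might look like the following sketch; the endpoint, document ID, pipeline name, and `data` field are taken from the examples later on this page:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```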
@@ -153,7 +153,7 @@
}
```

## Extracted Information
## Extracted information

The following fields can be extracted using the plugin:

@@ -167,7 +167,7 @@
- `content_length`

To extract only a subset of these fields, define them in the `properties` of the
pipelines processor, as shown in the following example:
pipeline processor, as shown in the following example:

```json
PUT _ingest/pipeline/attachment
@@ -186,8 +186,8 @@

## Limit the extracted content

To prevent extracting too many characters and overload the node memory, the default limit is `100_000`.
You can change this value using the setting `indexed_chars`. For example, you can use `-1` for unlimited characters but you need to make sure you have enough HEAP space on your OpenSearch-Node to extract the content of large documents.
To prevent extracting too many characters and overloading the node memory, the default limit is `100_000`.
You can change this value using the `indexed_chars` setting. For example, you can use `-1` for unlimited characters, but you need to make sure you have enough heap space on your OpenSearch node to extract the content of large documents.

You can also define this limit per document using the `indexed_chars_field` request field.
If a document contains `indexed_chars_field`, it will overwrite the `indexed_chars` setting, as shown in the following example:
@@ -208,7 +208,7 @@
}
```
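
The pipeline assumed by the next two requests might look like the following sketch, reconstructed from the `10`-character default and the `max_chars` field referenced below; the description text is an assumption:

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information, limited to 10 characters by default",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": 10,
        "indexed_chars_field": "max_chars"
      }
    }
  ]
}
```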

With the attachment pipeline set, you can extract the above defaulted `10` characters without specifying `max_chars` in the request, as shown in the following example:
With the attachment pipeline configured, you can extract the default `10` characters without specifying `max_chars` in the request, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
@@ -217,7 +217,7 @@
}
```

Alternatively, you can change the `max_char` per document to extract up to `15` characters, as shown in the following example:
Alternatively, you can change the `max_chars` value per document in order to extract up to `15` characters, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment