Add documentation for ingest-attachment plugin #7891

Merged
merged 25 commits into from
Aug 7, 2024
Changes from 24 commits
Commits
8abd551
add ingest-attachment plugin doc
ldrick Aug 1, 2024
a24fe1c
extend ingest-attachment with information how to limit content
ldrick Aug 2, 2024
0138e74
Added target_bulk_bytes to the docs for logstash-output plugin (#7869)
sandervandegeijn Aug 1, 2024
0fc2c26
Add doc for binary format support in k-NN (#7840)
junqiu-lei Aug 1, 2024
b1e3a77
Edit for redundant information and sections across Data Prepper (#7127)
vagimeli Aug 1, 2024
db40898
Update index.md (#7893)
PhilD90 Aug 2, 2024
9100add
Fix typo and make left nav heading uniform for neural sparse processo…
kolchfa-aws Aug 2, 2024
c81a16a
Add custom JSON lexer and highlighting color scheme (#7892)
kolchfa-aws Aug 2, 2024
ebe683b
Add model names to Vale (#7901)
kolchfa-aws Aug 2, 2024
aa5c433
Renamed data prepper files to have dashes for consistency (#7790)
kolchfa-aws Aug 2, 2024
fbbd2fd
Add documentation for ml inference search request processor/ search r…
mingshl Aug 2, 2024
e7fdc75
Refactor k-NN documentation (#7890)
kolchfa-aws Aug 5, 2024
0d69f35
Ml commons batch inference (#7899)
Zhangxunmt Aug 5, 2024
e0a9283
Remove repeated sentence in distributed tracing doc (#7906)
peteralfonsi Aug 6, 2024
c71cee8
Add apostrophe token filter page #7871 (#7884)
AntonEliatra Aug 6, 2024
5d8563a
removed unnecessary backslash
ldrick Aug 6, 2024
ed4d3c6
fix:add missing whitespace in table
ldrick Aug 6, 2024
d327a3c
docs: add link to tika supported file formats
ldrick Aug 6, 2024
76cbf7f
Merge branch 'main' into add-documentation-ingest-attachment-plugin
ldrick Aug 6, 2024
19a5838
Update ingest-attachment-plugin.md
Naarcha-AWS Aug 6, 2024
02169c4
Apply suggestions from code review
Naarcha-AWS Aug 6, 2024
073cafe
Merge branch 'main' into add-documentation-ingest-attachment-plugin
Naarcha-AWS Aug 7, 2024
c94789b
adjust to keep technical specific information with improved wording
ldrick Aug 7, 2024
0724235
Apply suggestions from code review
Naarcha-AWS Aug 7, 2024
6be0bba
Apply suggestions from code review
Naarcha-AWS Aug 7, 2024
6 changes: 3 additions & 3 deletions _install-and-configure/additional-plugins/index.md
@@ -9,7 +9,6 @@ nav_order: 10

There are many more plugins available in addition to those provided by the standard distribution of OpenSearch. These additional plugins have been built by OpenSearch developers or members of the OpenSearch community. While it isn't possible to provide an exhaustive list (because many plugins are not maintained in an OpenSearch GitHub repository), the following plugins, available in the [OpenSearch/plugins](https://github.com/opensearch-project/OpenSearch/tree/main/plugins) directory on GitHub, are some of the plugins that can be installed using one of the installation options, for example, using the command `bin/opensearch-plugin install <plugin-name>`.


| Plugin name | Earliest available version |
| :--- | :--- |
| analysis-icu | 1.0.0 |
@@ -22,7 +21,7 @@ There are many more plugins available in addition to those provided by the stand
| discovery-azure-classic | 1.0.0 |
| discovery-ec2 | 1.0.0 |
| discovery-gce | 1.0.0 |
| ingest-attachment | 1.0.0 |
| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 |
| mapper-annotated-text | 1.0.0 |
| mapper-murmur3 | 1.0.0 |
| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 |
@@ -34,7 +33,8 @@ There are many more plugins available in addition to those provided by the stand
| store-smb | 1.0.0 |
| transport-nio | 1.0.0 |


## Related articles

[Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/)
[`ingest-attachment` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/)
[`mapper-size` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/)
@@ -0,0 +1,228 @@
---
layout: default
title: Ingest-attachment plugin
parent: Installing plugins
nav_order: 20

---

# Ingest-attachment plugin

The `ingest-attachment` plugin enables OpenSearch to extract content and other information from files using the Apache text extraction library [Tika](https://tika.apache.org/).

Check failure on line 11 in _install-and-configure/additional-plugins/ingest-attachment-plugin.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: Tika. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. (line 11, column 145, severity: ERROR)

Supported document formats include PPT, PDF, RTF, ODF, and many more. For the full list, see the Tika [Supported Document Formats](https://tika.apache.org/2.9.2/formats.html) page.

Check failure on line 12 in _install-and-configure/additional-plugins/ingest-attachment-plugin.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: Tika. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. (line 12, column 70, severity: ERROR)

The input field must be a base64-encoded binary.

## Installing the plugin

Install the `ingest-attachment` plugin using the following command:

```sh
./bin/opensearch-plugin install ingest-attachment
```

## Attachment processor options

| Name | Required | Default | Description |
| :--- | :--- | :--- | :--- |
| `field` | Yes | - | The field from which to get the base64-encoded binary. |
| `target_field` | No | `attachment` | The field that stores the attachment information. |
| `properties` | No | All properties | An array of properties that should be stored. Can be `content`, `language`, `date`, `title`, `author`, `keywords`, `content_type`, or `content_length`. |
| `indexed_chars` | No | `100_000` | The maximum number of characters to extract, which prevents fields from becoming too large. Use `-1` for no limit. |
| `indexed_chars_field` | No | `null` | The name of the document field whose value overrides the number of characters used for extraction (see `indexed_chars`). |
| `ignore_missing` | No | `false` | When `true`, the processor exits without modifying the document when the specified field doesn't exist. |
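
The options in the preceding table can be captured in a small typed helper so that misspelled option or property names are caught at compile time. The following is a minimal sketch: the `AttachmentProcessorOptions` type and the `buildAttachmentPipeline` function are hypothetical names used only for illustration and are not part of any OpenSearch client library.

```typescript
// Hypothetical helper types; these names are illustrative only.
type AttachmentProperty =
  | "content" | "language" | "date" | "title"
  | "author" | "keywords" | "content_type" | "content_length";

interface AttachmentProcessorOptions {
  field: string;                     // Required: field holding the base64-encoded binary
  target_field?: string;             // Documented default: attachment
  properties?: AttachmentProperty[]; // Documented default: all properties
  indexed_chars?: number;            // Documented default: 100_000; -1 means no limit
  indexed_chars_field?: string;      // Documented default: null
  ignore_missing?: boolean;          // Documented default: false
}

// Builds a request body suitable for `PUT _ingest/pipeline/<name>`.
function buildAttachmentPipeline(
  description: string,
  options: AttachmentProcessorOptions
) {
  if (options.field.trim() === "") {
    throw new Error("The `field` option is required");
  }
  return {
    description,
    processors: [{ attachment: { ...options } }],
  };
}

const pipelineBody = buildAttachmentPipeline("Extract attachment information", {
  field: "data",
  properties: ["content", "title"],
  ignore_missing: true,
});

console.log(JSON.stringify(pipelineBody, null, 2));
```

Options that are omitted are left out of the request body, so OpenSearch applies the defaults listed in the table.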

## Example

The following steps show you how to get started with the `ingest-attachment` plugin.

### Step 1: Create an index for storing your attachments

The following command creates an index for storing your attachments:

```json
PUT /example-attachment-index
{
  "mappings": {
    "properties": {}
  }
}
```

### Step 2: Create a pipeline

The following command creates a pipeline containing the attachment processor:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
```

### Step 3: Store an attachment

Convert the attachment to a base64 string so that you can pass it as `data`.
In this example, the `base64` command converts the file `lorem.rtf`:

```sh
base64 lorem.rtf
```

Alternatively, you can use Node.js to read the file as a base64 string, as shown in the following example:

```typescript
import * as fs from "node:fs/promises";
import path from "node:path";

const filePath = path.join(import.meta.dirname, "lorem.rtf");
const base64File = await fs.readFile(filePath, { encoding: "base64" });

console.log(base64File);
```

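The `lorem.rtf` file contains the text `Lorem ipsum dolor sit amet`. Encoding the file produces the following base64 string:

`e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=`

Pass this string as `data` when indexing the document through the `attachment` pipeline: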

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```
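
Before sending a document, it can be useful to confirm that the `data` value really is the base64 encoding of the intended file. The following sketch decodes the string from this example back into its RTF source and re-encodes it. It relies only on the Node.js `Buffer` API; the variable names are illustrative.

```typescript
import { Buffer } from "node:buffer";

const encoded =
  "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=";

// Decode the base64 string back into the raw RTF markup.
const rtfSource = Buffer.from(encoded, "base64").toString("utf8");

// JSON.stringify makes the control words and line endings visible.
console.log(JSON.stringify(rtfSource));

// Re-encoding the decoded bytes reproduces the original string.
const roundTrip = Buffer.from(rtfSource, "utf8").toString("base64");
console.log(roundTrip === encoded); // true
```

The decoded value is the RTF markup that wraps the visible text, which is why the base64 string is longer than the encoded form of the plain sentence alone would be.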

### Query results

With the attachment processed, you can now search through the data using search queries, as shown in the following example:

```json
POST example-attachment-index/_search
{
  "query": {
    "match": {
      "attachment.content": "ipsum"
    }
  }
}
```

OpenSearch responds with the following:

```json
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.1724279,
    "hits": [
      {
        "_index": "example-attachment-index",
        "_id": "lorem_rtf",
        "_score": 1.1724279,
        "_source": {
          "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
          "attachment": {
            "content_type": "application/rtf",
            "language": "pt",
            "content": "Lorem ipsum dolor sit amet",
            "content_length": 28
          }
        }
      }
    ]
  }
}
```
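
On the client side, the extracted fields are available under `_source.attachment` for each hit. The following sketch shows one way to read them from a response shaped like the preceding example. The `SearchResponse` interface covers only the fields used here, and `summarizeHits` is a hypothetical helper name, not part of any client library.

```typescript
// Only the parts of the search response that this sketch reads.
interface AttachmentHit {
  _id: string;
  _source: {
    attachment?: {
      content?: string;
      content_type?: string;
      language?: string;
      content_length?: number;
    };
  };
}

interface SearchResponse {
  hits: { hits: AttachmentHit[] };
}

// Produces one summary line per hit, tolerating missing attachment fields.
function summarizeHits(response: SearchResponse): string[] {
  return response.hits.hits.map((hit) => {
    const attachment = hit._source.attachment;
    const type = attachment?.content_type ?? "unknown";
    const content = attachment?.content ?? "";
    return `${hit._id} (${type}): ${content}`;
  });
}

// Sample data copied from the response shown above.
const sampleResponse: SearchResponse = {
  hits: {
    hits: [
      {
        _id: "lorem_rtf",
        _source: {
          attachment: {
            content_type: "application/rtf",
            language: "pt",
            content: "Lorem ipsum dolor sit amet",
            content_length: 28,
          },
        },
      },
    ],
  },
};

const summaries = summarizeHits(sampleResponse);
console.log(summaries);
```

The attachment fields are typed as optional because the `properties` option can restrict which of them the processor stores.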

## Extracted information

The following fields can be extracted using the plugin:

- `content`
- `language`
- `date`
- `title`
- `author`
- `keywords`
- `content_type`
- `content_length`

To extract only a subset of these fields, define them in the `properties` of the
pipeline processor, as shown in the following example:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": ["content", "title", "author"]
      }
    }
  ]
}
```

## Limit the extracted content

To avoid extracting too many characters and overloading the node's memory, extraction is limited to `100_000` characters by default.
You can change this limit using the `indexed_chars` setting. For example, you can use `-1` for unlimited characters, but make sure that your OpenSearch node has enough heap space to extract the content of large documents.

You can also define this limit per document using the `indexed_chars_field` setting, which names a document field that holds the limit.
If a document contains that field, its value overrides the `indexed_chars` setting, as shown in the following example:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 10,
        "indexed_chars_field" : "max_chars"
      }
    }
  ]
}
```

With the attachment pipeline configured, you can extract the default `10` characters without specifying `max_chars` in the request, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```

Alternatively, you can set `max_chars` per document in order to extract up to `15` characters, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "max_chars": 15
}
```
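
The precedence between the two settings can be modeled in a few lines. The following sketch is a simplified illustration of the documented behavior, not the plugin's implementation: `effectiveLimit` and `applyLimit` are hypothetical names, and the real extractor may count whitespace or other characters differently than a plain string slice does.

```typescript
const DEFAULT_INDEXED_CHARS = 100_000;

interface LimitConfig {
  indexed_chars?: number;
  indexed_chars_field?: string;
}

// A per-document value, when present, takes precedence over the pipeline
// setting, which in turn takes precedence over the documented default.
function effectiveLimit(
  config: LimitConfig,
  doc: Record<string, unknown>
): number {
  const perDocument =
    config.indexed_chars_field !== undefined
      ? doc[config.indexed_chars_field]
      : undefined;
  if (typeof perDocument === "number") {
    return perDocument;
  }
  return config.indexed_chars ?? DEFAULT_INDEXED_CHARS;
}

// The documentation defines only -1 as "no limit". Treating every negative
// value the same way is an assumption made for this sketch.
function applyLimit(text: string, limit: number): string {
  return limit < 0 ? text : text.slice(0, limit);
}

const limitConfig: LimitConfig = {
  indexed_chars: 10,
  indexed_chars_field: "max_chars",
};
const text = "Lorem ipsum dolor sit amet";

// No max_chars in the document: the pipeline value of 10 applies.
const withPipelineLimit = applyLimit(text, effectiveLimit(limitConfig, {}));
console.log(withPipelineLimit); // Lorem ipsu

// max_chars in the document overrides the pipeline value.
const withDocumentLimit = applyLimit(
  text,
  effectiveLimit(limitConfig, { max_chars: 15 })
);
console.log(withDocumentLimit); // Lorem ipsum dol
```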