Add fingerprint processor (opensearch-project#7631)
* Add fingerprint processor

Signed-off-by: gaobinlong <[email protected]>

* Completed doc review

Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update nav order

Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/fingerprint.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _ingest-pipelines/processors/index-processors.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

---------

Signed-off-by: gaobinlong <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: [email protected] <[email protected]>
3 people authored and leanneeliatra committed Jul 24, 2024
1 parent bbac877 commit daeb6e6
Showing 2 changed files with 159 additions and 0 deletions.
158 changes: 158 additions & 0 deletions _ingest-pipelines/processors/fingerprint.md
@@ -0,0 +1,158 @@
---
layout: default
title: Fingerprint
parent: Ingest processors
nav_order: 105
---

# Fingerprint processor
Introduced 2.16
{: .label .label-purple }

The `fingerprint` processor generates a hash value for either specified fields or all fields in a document. The hash value can be used to deduplicate documents within an index and to collapse search results.

For each field, the field name, the length of the field value, and the field value itself are concatenated and separated by the pipe character `|`. For example, if a document contains the field `field1` with the value `value1` and the field `field2` with the value `value2`, then the concatenated string would be `|field1|6:value1|field2|6:value2|`, where `6` is the length of each value. For object fields, the field name is flattened by joining the nested field names with a period `.`. For instance, if the object field is `root_field` with a sub-field `sub_field1` having the value `value1` and another sub-field `sub_field2` with the value `value2`, then the concatenated string would be `|root_field.sub_field1|6:value1|root_field.sub_field2|6:value2|`.
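
The concatenation rule described above can be illustrated with a short Python sketch. It is not the processor's implementation: the helper names (`flatten`, `concat_fields`, `fingerprint`), the alphabetical field ordering, and the choice of SHA-1 with Base64 encoding are assumptions made for illustration, so the resulting hash is not guaranteed to match the value that OpenSearch produces.

```python
import base64
import hashlib


def flatten(doc, prefix=""):
    """Yield (dotted_name, value) pairs, joining nested field names with '.'."""
    for name in sorted(doc):  # assumption: sorted only to make the output deterministic
        value = doc[name]
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def concat_fields(doc):
    """Build the '|name|length:value|' string from the field name, value length, and value."""
    parts = [f"{name}|{len(str(value))}:{value}" for name, value in flatten(doc)]
    return "|" + "|".join(parts) + "|"


def fingerprint(doc):
    """Hash the concatenated string (SHA-1 and Base64 are assumptions of this sketch)."""
    digest = hashlib.sha1(concat_fields(doc).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


print(concat_fields({"field1": "value1", "field2": "value2"}))
# |field1|6:value1|field2|6:value2|
print(concat_fields({"root_field": {"sub_field1": "value1", "sub_field2": "value2"}}))
# |root_field.sub_field1|6:value1|root_field.sub_field2|6:value2|
```

Prefixing each value with its length keeps field boundaries unambiguous, so two documents whose values merely concatenate to the same text still produce different strings.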

The following is the syntax for the `fingerprint` processor:

```json
{
  "fingerprint": {
    "fields": ["foo", "bar"],
    "target_field": "fingerprint",
    "hash_method": "SHA-1@2.16.0"
  }
}
```
{% include copy-curl.html %}

## Configuration parameters

The following table lists the required and optional parameters for the `fingerprint` processor.

Parameter | Required/Optional | Description |
|-----------|-----------|-----------|
`fields` | Optional | A list of fields used to generate a hash value. |
`exclude_fields` | Optional | Specifies the fields to be excluded from hash value generation. It is mutually exclusive with the `fields` parameter; if both `exclude_fields` and `fields` are empty or null, then all fields are included in the hash value calculation. |
`hash_method` | Optional | Specifies the hashing algorithm to be used, with options being `MD5@2.16.0`, `SHA-1@2.16.0`, `SHA-256@2.16.0`, or `SHA3-256@2.16.0`. Default is `SHA-1@2.16.0`. The version number is appended to ensure consistent hashing across OpenSearch versions, and new versions will support new hash methods. |
`target_field` | Optional | Specifies the name of the field in which the generated hash value will be stored. If not provided, then the hash value is stored in the `fingerprint` field by default. |
`ignore_missing` | Optional | Specifies whether the processor should exit quietly if one of the required fields is missing. Default is `false`. |
`description` | Optional | A brief description of the processor. |
`if` | Optional | A condition for running the processor. |
`ignore_failure` | Optional | If set to `true`, then failures are ignored. Default is `false`. |
`on_failure` | Optional | A list of processors to run if the processor fails. |
`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type. |
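
How `fields` and `exclude_fields` interact can be sketched in Python as follows. This is only an illustration of the rules in the table, not the OpenSearch implementation: `select_fields` is a hypothetical helper, it considers top-level field names only, and it does not model `ignore_missing` (absent fields are simply skipped here).

```python
def select_fields(source, fields=None, exclude_fields=None):
    """Return the part of `source` that would feed the hash, per the table above."""
    if fields and exclude_fields:
        # The two parameters are mutually exclusive.
        raise ValueError("fields and exclude_fields are mutually exclusive")
    if fields:
        # Only the listed fields contribute (missing-field handling is not modeled).
        return {name: source[name] for name in fields if name in source}
    if exclude_fields:
        # Every field except the excluded ones contributes.
        return {k: v for k, v in source.items() if k not in exclude_fields}
    # Both parameters empty or null: all fields are included.
    return dict(source)


doc = {"foo": "foo", "bar": "bar", "timestamp": "2024-03-11"}
print(select_fields(doc, fields=["foo", "bar"]))        # {'foo': 'foo', 'bar': 'bar'}
print(select_fields(doc, exclude_fields=["timestamp"]))  # {'foo': 'foo', 'bar': 'bar'}
```

Excluding volatile fields such as a timestamp is one way to make otherwise identical documents hash to the same value.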

## Using the processor

Follow these steps to use the processor in a pipeline.

**Step 1: Create a pipeline**

The following query creates a pipeline named `fingerprint_pipeline` that uses the `fingerprint` processor to generate a hash value for specified fields in the document:

```json
PUT /_ingest/pipeline/fingerprint_pipeline
{
  "description": "Generates a hash value for specified fields in the document",
  "processors": [
    {
      "fingerprint": {
        "fields": ["foo", "bar"]
      }
    }
  ]
}
```
{% include copy-curl.html %}

**Step 2 (Optional): Test the pipeline**

It is recommended that you test your pipeline before ingesting documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/fingerprint_pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source": {
        "foo": "foo",
        "bar": "bar"
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The following example response confirms that the pipeline is working as expected:

```json
{
  "docs": [
    {
      "doc": {
        "_index": "testindex1",
        "_id": "1",
        "_source": {
          "foo": "foo",
          "bar": "bar",
          "fingerprint": "SHA-1@2.16.0:fYeen7hTJ2zs9lpmUnk6nvH54sM="
        },
        "_ingest": {
          "timestamp": "2024-03-11T02:17:22.329823Z"
        }
      }
    }
  ]
}
```
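
The stored value consists of a prefix and a Base64-encoded digest separated by a colon. The following Python sketch splits such a value and checks the digest length. The helper name `split_fingerprint` is hypothetical, and the `SHA-1@2.16.0` prefix assumes that the processor ran with its default hash method.

```python
import base64
import hashlib


def split_fingerprint(value):
    """Split '<prefix>:<base64 digest>' into the prefix and the raw digest bytes."""
    prefix, sep, encoded = value.partition(":")
    if not sep:
        raise ValueError("expected '<prefix>:<base64 digest>'")
    return prefix, base64.b64decode(encoded, validate=True)


prefix, digest = split_fingerprint("SHA-1@2.16.0:fYeen7hTJ2zs9lpmUnk6nvH54sM=")
print(prefix)       # SHA-1@2.16.0
print(len(digest))  # 20, the size of a SHA-1 digest in bytes
assert len(digest) == hashlib.sha1().digest_size
```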

**Step 3: Ingest a document**

The following query ingests a document into an index named `testindex1`:

```json
PUT testindex1/_doc/1?pipeline=fingerprint_pipeline
{
  "foo": "foo",
  "bar": "bar"
}
```
{% include copy-curl.html %}

#### Response

The request indexes the document into the `testindex1` index:

```json
{
  "_index": "testindex1",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
```

**Step 4 (Optional): Retrieve the document**

To retrieve the document, run the following query:

```json
GET testindex1/_doc/1
```
{% include copy-curl.html %}
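
Because documents with the same hashed fields share a fingerprint value, duplicates can be detected by comparing that field. The following Python sketch shows one client-side way to do this over documents that have already been retrieved. The `deduplicate` helper and the sample documents are hypothetical, and this approach is separate from any server-side deduplication or result collapsing.

```python
def deduplicate(docs, field="fingerprint"):
    """Keep the first document seen for each fingerprint value."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.get(field)
        # Documents without a fingerprint are always kept.
        if key is None or key not in seen:
            unique.append(doc)
        if key is not None:
            seen.add(key)
    return unique


docs = [
    {"_id": "1", "fingerprint": "hash-a"},
    {"_id": "2", "fingerprint": "hash-a"},  # same fingerprint as document 1
    {"_id": "3", "fingerprint": "hash-b"},
]
print([doc["_id"] for doc in deduplicate(docs)])  # ['1', '3']
```
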
1 change: 1 addition & 0 deletions _ingest-pipelines/processors/index-processors.md
@@ -40,6 +40,7 @@ Processor type | Description
`dot_expander` | Expands a field with dots into an object field.
`drop` |Drops a document without indexing it or raising any errors.
`fail` | Raises an exception and stops the execution of a pipeline.
`fingerprint` | Generates a hash value for either certain specified fields or all fields in a document.
`foreach` | Allows for another processor to be applied to each element of an array or an object field in a document.
`geoip` | Adds information about the geographical location of an IP address.
`geojson-feature` | Indexes GeoJSON data into a geospatial field.