Merge branch 'main' into add-documentation-ingest-attachment-plugin
ldrick authored Aug 2, 2024
2 parents 83e08a6 + bc28bf9 commit 2a0cd6d
Showing 26 changed files with 849 additions and 599 deletions.
55 changes: 19 additions & 36 deletions _data-prepper/index.md
@@ -18,42 +18,24 @@ Data Prepper is a server-side data collector capable of filtering, enriching, tr

With Data Prepper you can build custom pipelines to improve the operational view of applications. Two common use cases for Data Prepper are trace analytics and log analytics. [Trace analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/trace-analytics/) can help you visualize event flows and identify performance problems. [Log analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/log-analytics/) equips you with tools to enhance your search capabilities, conduct comprehensive analysis, and gain insights into your applications' performance and behavior.

## Concepts
## Key concepts and fundamentals

Data Prepper includes one or more **pipelines** that collect and filter data based on the components set within the pipeline. Each component is pluggable, enabling you to use your own custom implementation of each component. These components include the following:
Data Prepper ingests data through customizable [pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/). These pipelines consist of pluggable components that you can customize to fit your needs, even allowing you to plug in your own implementations. A Data Prepper pipeline consists of the following components:

- One [source](#source)
- One or more [sinks](#sink)
- (Optional) One [buffer](#buffer)
- (Optional) One or more [processors](#processor)
- One [source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/sources/)
- One or more [sinks]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/sinks/)
- (Optional) One [buffer]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/)
- (Optional) One or more [processors]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/)

A single instance of Data Prepper can have one or more pipelines.
Each pipeline contains two required components: `source` and `sink`. If a `buffer`, a `processor`, or both are missing from the pipeline, then Data Prepper uses the default `bounded_blocking` buffer and a no-op processor. Note that a single instance of Data Prepper can have one or more pipelines.

Each pipeline definition contains two required components: **source** and **sink**. If buffers and processors are missing from the Data Prepper pipeline, Data Prepper uses the default buffer and a no-op processor.
## Basic pipeline configurations

### Source
To understand how the pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format. For more information and examples, see [Pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/).

Source is the input component that defines the mechanism through which a Data Prepper pipeline will consume events. A pipeline can have only one source. The source can consume events either by receiving the events over HTTP or HTTPS or by reading from external endpoints like OTeL Collector for traces and metrics and Amazon Simple Storage Service (Amazon S3). Sources have their own configuration options based on the format of the events (such as string, JSON, Amazon CloudWatch logs, or open telemetry trace). The source component consumes events and writes them to the buffer component.
### Minimal configuration

### Buffer

The buffer component acts as the layer between the source and the sink. Buffer can be either in-memory or disk based. The default buffer uses an in-memory queue called `bounded_blocking` that is bounded by the number of events. If the buffer component is not explicitly mentioned in the pipeline configuration, Data Prepper uses the default `bounded_blocking`.

### Sink

Sink is the output component that defines the destination(s) to which a Data Prepper pipeline publishes events. A sink destination could be a service, such as OpenSearch or Amazon S3, or another Data Prepper pipeline. When using another Data Prepper pipeline as the sink, you can chain multiple pipelines together based on the needs of the data. Sink contains its own configuration options based on the destination type.

### Processor

Processors are units within the Data Prepper pipeline that can filter, transform, and enrich events using your desired format before publishing the record to the sink component. The processor is not defined in the pipeline configuration; the events publish in the format defined in the source component. You can have more than one processor within a pipeline. When using multiple processors, the processors are run in the order they are defined inside the pipeline specification.

## Sample pipeline configurations

To understand how all pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format.

### Minimal component

This pipeline configuration reads from the file source and writes to another file in the same path. It uses the default options for the buffer and processor.
The following minimal pipeline configuration reads from the file source and writes the data to another file on the same path. It uses the default options for the `buffer` and `processor` components.

```yml
sample-pipeline:
@@ -65,13 +47,13 @@ sample-pipeline:
path: <path/to/output-file>
```
### All components
### Comprehensive configuration
The following pipeline uses a source that reads string events from the `input-file`. The source then pushes the data to the buffer, bounded by a max size of `1024`. The pipeline is configured to have `4` workers, each of them reading a maximum of `256` events from the buffer for every `100 milliseconds`. Each worker runs the `string_converter` processor and writes the output of the processor to the `output-file`.
The following comprehensive pipeline configuration uses both required and optional components:
```yml
sample-pipeline:
workers: 4 #Number of workers
workers: 4 # Number of workers
delay: 100 # in milliseconds, how often the workers should run
source:
file:
@@ -88,9 +70,10 @@ sample-pipeline:
path: <path/to/output-file>
```
## Next steps

To get started building your own custom pipelines with Data Prepper, see [Getting started]({{site.url}}{{site.baseurl}}/clients/data-prepper/get-started/).
In the preceding pipeline configuration, the `source` component reads string events from the `input-file` and pushes the data to a bounded buffer with a maximum size of `1024`. The `workers` setting specifies `4` concurrent threads that process events from the buffer, each reading a maximum of `256` events from the buffer every `100` milliseconds. Each worker runs the `string_converter` processor, which converts the strings to uppercase, and writes the processed output to the `output-file`.

<!---Delete this comment.--->
## Next steps

- [Get started with Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/getting-started/).
- [Get familiar with Data Prepper pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/).
- [Explore common use cases]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/common-use-cases/).
@@ -103,8 +103,7 @@ check_interval | No | Duration | Specifies the time between checks of the heap s

### Extension plugins

Since Data Prepper 2.5, Data Prepper provides support for user configurable extension plugins. Extension plugins are shared common
configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#concepts).
Data Prepper provides support for user-configurable extension plugins. Extension plugins are common configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#key-concepts-and-fundamentals).

### AWS extension plugins

24 changes: 24 additions & 0 deletions _data-prepper/pipelines/cidrcontains.md
@@ -0,0 +1,24 @@
---
layout: default
title: cidrContains()
parent: Functions
grand_parent: Pipelines
nav_order: 5
---

# cidrContains()

The `cidrContains()` function is used to check if an IP address is contained within a specified Classless Inter-Domain Routing (CIDR) block or range of CIDR blocks. It accepts two or more arguments:

- The first argument is a JSON pointer, which represents the key or path to the field containing the IP address to be checked. It supports both IPv4 and IPv6 address formats.

- The subsequent arguments are strings representing one or more CIDR blocks or IP address ranges. The function checks if the IP address specified in the first argument matches or is contained within any of these CIDR blocks.

For example, if your data contains an IP address field named `client.ip` and you want to check if it belongs to the CIDR blocks `192.168.0.0/16` or `10.0.0.0/8`, you can use the `cidrContains()` function as follows:

```
cidrContains('/client.ip', '192.168.0.0/16', '10.0.0.0/8')
```
{% include copy-curl.html %}

This function returns `true` if the IP address matches any of the specified CIDR blocks or `false` if it does not.
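
As a sketch of where such a check might appear, the following hypothetical pipeline calls `cidrContains()` in the `add_when` condition of an `add_entries` processor. The pipeline name, the `http` source, the `stdout` sink, the `sourceIp` field, and the `network_zone` key are assumptions chosen for illustration. The expression is written with an unquoted JSON pointer and double-quoted strings, the same form as the `length(/key)` and `getMetadata("length")` expressions in the `add_entries` documentation.

```yml
cidr-sample-pipeline:          # hypothetical pipeline name
  source:
    http:                      # assumed source; any source that produces events works
  processor:
    - add_entries:
        entries:
          - key: "network_zone"    # hypothetical key added to matching events
            value: "internal"
            # Add the entry only when the address falls within one of the CIDR blocks.
            add_when: 'cidrContains(/sourceIp, "192.168.0.0/16", "10.0.0.0/8")'
  sink:
    - stdout:
```

Events whose `sourceIp` value is outside both blocks pass through this processor unchanged.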
8 changes: 6 additions & 2 deletions _data-prepper/pipelines/configuration/buffers/buffers.md
@@ -3,9 +3,13 @@ layout: default
title: Buffers
parent: Pipelines
has_children: true
nav_order: 20
nav_order: 30
---

# Buffers

Buffers store data as it passes through the pipeline. If you implement a custom buffer, it can be memory based, which provides better performance, or disk based, which is larger in size.
The `buffer` component acts as an intermediary layer between the `source` and `sink` components in a Data Prepper pipeline. It serves as temporary storage for events, decoupling the `source` from the downstream processors and sinks. Buffers can be either in-memory or disk based.

If not explicitly specified in the pipeline configuration, Data Prepper uses the default `bounded_blocking` buffer, which is an in-memory queue bounded by the number of events it can store. The `bounded_blocking` buffer is a convenient option when the event volume and processing rates are manageable within the available memory constraints.
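
To make the default explicit, a pipeline can declare the `bounded_blocking` buffer itself. The following is a minimal sketch: the pipeline name is hypothetical, and the `buffer_size` and `batch_size` option names are assumed from the `bounded_blocking` buffer reference. The values mirror the maximum buffer size of `1024` and the per-worker read limit of `256` described for the comprehensive configuration.

```yml
buffer-sample-pipeline:        # hypothetical pipeline name
  source:
    file:
      path: <path/to/input-file>
  buffer:
    bounded_blocking:
      buffer_size: 1024        # maximum number of events the buffer holds
      batch_size: 256          # maximum number of events a worker reads at once
  sink:
    - file:
        path: <path/to/output-file>
```

Omitting the `buffer` block entirely selects the same buffer type with its default settings.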


@@ -21,8 +21,8 @@ You can configure the `add_entries` processor with the following options.
| `metadata_key` | No | The key for the new metadata attribute. The argument must be a literal string key and not a JSON Pointer. Either one string key or `metadata_key` is required. |
| `value` | No | The value of the new entry to be added, which can be used with any of the following data types: strings, Booleans, numbers, null, nested objects, and arrays. |
| `format` | No | A format string to use as the value of the new entry, for example, `${key1}-${key2}`, where `key1` and `key2` are existing keys in the event. Required if neither `value` nor `value_expression` is specified. |
| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions returning number/string/integer. For example, `length(/key)` will return the length of the key in the event when the key is a string. For more information about keys, see [Expression syntax](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). |
| `add_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. |
| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions that return a number, a string, or an integer. For example, `length(/key)` returns the length of the value of `/key` when that value is a string. For more information about keys, see [Expression syntax]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/). |
| `add_when` | No | A [conditional expression]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"`, that is evaluated to determine whether the processor runs on the event. |
| `overwrite_if_key_exists` | No | When set to `true`, the existing value is overwritten if `key` already exists in the event. The default value is `false`. |
| `append_if_key_exists` | No | When set to `true`, the existing value will be appended if a `key` already exists in the event. An array will be created if the existing value is not an array. Default is `false`. |

@@ -135,7 +135,7 @@ When the input event contains the following data:
{"message": "hello"}
```

The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see the [`getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata) documentation.
The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see [`getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/).


### Example: Add a dynamic key
10 changes: 6 additions & 4 deletions _data-prepper/pipelines/configuration/processors/processors.md
@@ -3,12 +3,14 @@ layout: default
title: Processors
has_children: true
parent: Pipelines
nav_order: 25
nav_order: 35
---

# Processors

Processors perform an action on your data, such as filtering, transforming, or enriching.
Processors are components within a Data Prepper pipeline that enable you to filter, transform, and enrich events using your desired format before publishing records to the `sink` component. If no `processor` is defined in the pipeline configuration, then the events are published in the format specified by the `source` component. You can incorporate multiple processors within a single pipeline, and they are executed sequentially as defined in the pipeline.

Prior to Data Prepper 1.3, these components were named *preppers*. In Data Prepper 1.3, the term *prepper* was deprecated in favor of *processor*. In Data Prepper 2.0, the term *prepper* was removed.
{: .note }
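
The sequential behavior can be sketched with a pipeline that lists two processors. The pipeline name, the `http` source, the `stdout` sink, and the `pipeline_stage` key are assumptions for illustration. The `upper_case` option of `string_converter` is assumed from that processor's reference.

```yml
processor-sample-pipeline:     # hypothetical pipeline name
  source:
    http:                      # assumed source; any source that produces events works
  processor:
    # Processors run in the order listed: string_converter first, then add_entries.
    - string_converter:
        upper_case: true
    - add_entries:
        entries:
          - key: "pipeline_stage"    # hypothetical key
            value: "converted"
  sink:
    - stdout:
```

Because `add_entries` runs after `string_converter`, the entry it adds is not passed through the converter. Reversing the order of the two list items would change that.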


Prior to Data Prepper 1.3, processors were named preppers. Starting in Data Prepper 1.3, the term *prepper* is deprecated in favor of the term *processor*. Data Prepper will continue to support the term *prepper* until 2.0, where it will be removed.
{: .note }
16 changes: 9 additions & 7 deletions _data-prepper/pipelines/configuration/sinks/sinks.md
@@ -3,20 +3,22 @@ layout: default
title: Sinks
parent: Pipelines
has_children: true
nav_order: 30
nav_order: 25
---

# Sinks

Sinks define where Data Prepper writes your data to.
A `sink` is an output component that specifies the destination(s) to which a Data Prepper pipeline publishes events. Sink destinations can be services like OpenSearch, Amazon Simple Storage Service (Amazon S3), or even another Data Prepper pipeline, enabling chaining of multiple pipelines. The sink component has the following configurable options that you can use to customize the destination type.

## General options for all sink types
## Configuration options

The following table describes the options that you can use to configure any sink.

Option | Required | Type | Description
:--- | :--- |:------------| :---
routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
tags_target_key | No | String | When specified, includes event tags in the output of the provided key.
include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
`routes` | No | String list | A list of routes to which the sink applies. If not provided, then the sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
`tags_target_key` | No | String | When specified, includes event tags in the output under the provided key.
`include_keys` | No | String list | When specified, provides only the listed keys in the data sent to the sink. Some codecs and sinks may not support this field.
`exclude_keys` | No | String list | When specified, excludes the listed keys from the data sent to the sink. Some codecs and sinks may not support this field.
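
As an illustration of the `routes` option, the following sketch sends events to different sinks based on pipeline routes. The pipeline name, the route names, the `log_type` field, and the file paths are hypothetical. The `route` block follows the conditional routing syntax referenced in the table.

```yml
routing-sample-pipeline:       # hypothetical pipeline name
  source:
    http:                      # assumed source
  route:
    - application-logs: '/log_type == "application"'
    - http-logs: '/log_type == "apache"'
  sink:
    - file:
        path: <path/to/application-logs-file>
        routes:
          - application-logs   # receives only events matching this route
    - file:
        path: <path/to/http-logs-file>
        routes:
          - http-logs
    - stdout:                  # no routes listed, so this sink receives all events
```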


2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/dynamo-db.md
@@ -92,7 +92,7 @@ Option | Required | Type | Description

## Exposed metadata attributes

The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata).
The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/).

* `primary_key`: The primary key of the DynamoDB item. For tables that only contain a partition key, this value provides the partition key. For tables that contain both a partition and sort key, the `primary_key` attribute will be equal to the partition and sort key, separated by a `|`, for example, `partition_key|sort_key`.
* `partition_key`: The partition key of the DynamoDB item.
6 changes: 4 additions & 2 deletions _data-prepper/pipelines/configuration/sources/sources.md
@@ -3,9 +3,11 @@ layout: default
title: Sources
parent: Pipelines
has_children: true
nav_order: 15
nav_order: 20
---

# Sources

Sources define where your data comes from within a Data Prepper pipeline.
A `source` is an input component that specifies how a Data Prepper pipeline ingests events. Each pipeline has a single source that either receives events over HTTP(S) or reads from external endpoints, such as OpenTelemetry Collector or Amazon Simple Storage Service (Amazon S3). Sources have configurable options based on the event format (string, JSON, Amazon CloudWatch logs, OpenTelemetry traces). The source consumes events and passes them to the [`buffer`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/) component.


36 changes: 36 additions & 0 deletions _data-prepper/pipelines/contains.md
@@ -0,0 +1,36 @@
---
layout: default
title: contains()
parent: Functions
grand_parent: Pipelines
nav_order: 10
---

# contains()

The `contains()` function is used to check if a substring exists within a given string or the value of a field in an event. It takes two arguments:

- The first argument is either a literal string or a JSON pointer that represents the field or value to be searched.

- The second argument is the substring to be searched for within the first argument.

The function returns `true` if the substring specified in the second argument is found within the string or field value represented by the first argument. It returns `false` if it is not.

For example, if you want to check if the string `"abcd"` is contained within the value of a field named `message`, you can use the `contains()` function as follows:

```
contains('/message', 'abcd')
```
{% include copy-curl.html %}

This will return `true` if the field `message` contains the substring `abcd` or `false` if it does not.

Alternatively, you can also use a literal string as the first argument:

```
contains('This is a test message', 'test')
```
{% include copy-curl.html %}

In this case, the function will return `true` because the substring `test` is present within the string `This is a test message`.

Note that the `contains()` function performs a case-sensitive search by default. If you need to perform a case-insensitive search, you can use the `containsIgnoreCase()` function instead.
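
As a sketch of how the function might be used in a pipeline, the following hypothetical configuration drops events whose `message` field contains the substring `abcd`. The pipeline name, the `http` source, the `stdout` sink, and the use of the `drop_events` processor with its `drop_when` option are assumptions for illustration. The expression uses an unquoted JSON pointer and a double-quoted string, the same form as the `length(/key)` and `getMetadata("length")` expressions in the `add_entries` documentation.

```yml
contains-sample-pipeline:      # hypothetical pipeline name
  source:
    http:                      # assumed source; any source that produces events works
  processor:
    - drop_events:
        # Drop the event when the message field contains the substring.
        drop_when: 'contains(/message, "abcd")'
  sink:
    - stdout:
```

Because the search is case sensitive, an event whose `message` contains only `ABCD` does not satisfy this condition.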