---
layout: default
title: Data Prepper
nav_order: 1
has_children: false
has_toc: false
nav_exclude: true
permalink: /data-prepper/
redirect_from:
---
Data Prepper is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analysis and visualization. Data Prepper is the preferred data ingestion tool for OpenSearch. It is recommended for most data ingestion use cases in OpenSearch and for processing large, complex datasets.
With Data Prepper, you can build custom pipelines to improve the operational view of your applications. Two common use cases for Data Prepper are trace analytics and log analytics. Trace analytics can help you visualize event flows and identify performance problems. Log analytics equips you with tools to enhance your search capabilities, conduct comprehensive analysis, and gain insights into your applications' performance and behavior.
Data Prepper includes one or more pipelines that collect and filter data based on the components set within the pipeline. Each component is pluggable, enabling you to use your own custom implementation of each component. A pipeline consists of the following components:
- One source
- One or more sinks
- (Optional) One buffer
- (Optional) One or more processors
A single instance of Data Prepper can have one or more pipelines.
Each pipeline definition contains two required components: a source and a sink. If a buffer or processor is not defined in the pipeline, Data Prepper uses the default buffer and a no-op processor.
The source is the input component that defines the mechanism through which a Data Prepper pipeline consumes events. A pipeline can have only one source. The source can consume events either by receiving them over HTTP or HTTPS or by reading from external endpoints, such as an OpenTelemetry (OTel) Collector for traces and metrics or Amazon Simple Storage Service (Amazon S3). Each source has its own configuration options based on the format of the events (such as string, JSON, Amazon CloudWatch logs, or OpenTelemetry traces). The source component consumes events and writes them to the buffer component.
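As a rough illustration of a source that receives events over HTTP, the following sketch assumes the `http` source plugin and a `stdout` sink are available in your Data Prepper version. The pipeline name `http-example-pipeline` and the port value are illustrative choices, not requirements.

```yaml
http-example-pipeline:
  source:
    http:
      port: 2021   # assumed port; adjust to your environment
  sink:
    - stdout:      # prints events to standard output for quick inspection
```

Because no buffer or processor is listed, a pipeline like this would rely on the default buffer and a no-op processor.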
The buffer component acts as the layer between the source and the sink. The buffer can be either in-memory or disk-based. The default buffer, `bounded_blocking`, uses an in-memory queue that is bounded by the number of events. If the buffer component is not explicitly specified in the pipeline configuration, Data Prepper uses the default `bounded_blocking` buffer.
The sink is the output component that defines the destination(s) to which a Data Prepper pipeline publishes events. A sink destination can be a service, such as OpenSearch or Amazon S3, or another Data Prepper pipeline. By using another Data Prepper pipeline as the sink, you can chain multiple pipelines together based on the needs of the data. Each sink has its own configuration options based on the destination type.
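Chaining pipelines can be sketched as follows, assuming the `pipeline` plugin is used as both a sink in the first pipeline and a source in the second. The names `entry-pipeline` and `downstream-pipeline`, along with the `http` source and `stdout` sink, are hypothetical choices for illustration.

```yaml
entry-pipeline:
  source:
    http:
  sink:
    - pipeline:
        name: "downstream-pipeline"   # forwards events to the pipeline below

downstream-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"          # receives events from the pipeline above
  sink:
    - stdout:
```

In this arrangement, each pipeline keeps a single source, and the second pipeline can apply its own buffer, processors, and sinks independently of the first.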
Processors are units within the Data Prepper pipeline that can filter, transform, and enrich events into your desired format before publishing the records to the sink component. If a processor is not defined in the pipeline configuration, the events are published in the format defined in the source component. You can have more than one processor within a pipeline. When using multiple processors, the processors run in the order in which they are defined in the pipeline specification.
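To illustrate processor ordering, the following sketch assumes the `grok` and `delete_entries` processors and an incoming event field named `log` that contains an Apache common log line. The field name, pipeline name, source, and sink are assumptions made for this example.

```yaml
ordered-processors-pipeline:
  source:
    http:
  processor:
    # Runs first: parses the raw line in the assumed "log" field into separate fields
    - grok:
        match:
          log: ["%{COMMONAPACHELOG}"]
    # Runs second: removes the original "log" field after it has been parsed
    - delete_entries:
        with_keys: ["log"]
  sink:
    - stdout:
```

Reversing the two entries would delete the `log` field before `grok` could parse it, which is why the order of the list matters.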
To understand how all pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a YAML file format.
This pipeline configuration reads from the file source and writes to another file in the same path. It uses the default options for the buffer and processor.
```yaml
sample-pipeline:
  source:
    file:
      path: <path/to/input-file>
  sink:
    - file:
        path: <path/to/output-file>
```
The following pipeline uses a source that reads string events from the `input-file`. The source then pushes the data to the buffer, bounded by a max size of `1024`. The pipeline is configured to have `4` workers, each of them reading a maximum of `256` events from the buffer every `100` milliseconds. Each worker runs the `string_converter` processor and writes the output of the processor to the `output-file`.
```yaml
sample-pipeline:
  workers: 4 # number of workers
  delay: 100 # in milliseconds, how often the workers should run
  source:
    file:
      path: <path/to/input-file>
  buffer:
    bounded_blocking:
      buffer_size: 1024 # max number of events the buffer will accept
      batch_size: 256 # max number of events the buffer will drain for each read
  processor:
    - string_converter:
        upper_case: true
  sink:
    - file:
        path: <path/to/output-file>
```
To get started building your own custom pipelines with Data Prepper, see Getting started.