Updates the Data Prepper S3 source documentation (#3813)
* Updates the Data Prepper S3 source documentation with additional instructions for usage and new features in 2.2. Includes some clean-up as well.

Signed-off-by: David Venable <[email protected]>

* Add back in example

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Caroline <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Caroline <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Caroline <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Caroline <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update s3.md

* Apply suggestions from code review

Co-authored-by: Caroline <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update s3.md

Implement Caroline's feedback

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: David Venable <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Caroline <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
4 people authored Apr 25, 2023
1 parent 6202845 commit aa2033c
Showing 1 changed file with 131 additions and 26 deletions.
157 changes: 131 additions & 26 deletions _data-prepper/pipelines/configuration/sources/s3.md
@@ -1,21 +1,61 @@
---
layout: default
title: Amazon S3 source
title: s3 source
parent: Sources
grand_parent: Pipelines
nav_order: 20
---

# `s3` source

The Amazon Simple Storage Service (Amazon S3) source plugin reads events from [S3](https://aws.amazon.com/s3/) objects. The following table describes options you can use to configure the `s3` source.
# s3 source

`s3` is a source plugin that reads events from [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. It requires an [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) queue that receives [S3 Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html). After Amazon SQS is configured, the `s3` source receives messages from Amazon SQS. When the SQS message indicates that an S3 object was created, the `s3` source loads the S3 objects and then parses them using the configured [codec](#codec). You can also configure the `s3` source to use [Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) instead of Data Prepper to parse S3 objects.

## IAM permissions

In order to use the `s3` source, configure your AWS Identity and Access Management (IAM) permissions to grant Data Prepper access to Amazon S3. You can use a configuration similar to the following JSON configuration:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "s3-access",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<YOUR-BUCKET>/*"
        },
        {
            "Sid": "sqs-access",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage"
            ],
            "Resource": "arn:aws:sqs:<YOUR-REGION>:<123456789012>:<YOUR-SQS-QUEUE>"
        },
        {
            "Sid": "kms-access",
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:<YOUR-REGION>:<123456789012>:key/<YOUR-KMS-KEY>"
        }
    ]
}
```

If your S3 objects or Amazon SQS queues do not use [AWS Key Management Service (AWS KMS)](https://aws.amazon.com/kms/), remove the `kms:Decrypt` permission.


## Configuration

You can use the following options to configure the `s3` source.

Option | Required | Type | Description
:--- | :--- | :--- | :---
notification_type | Yes | String | Must be `sqs`.
compression | No | String | The compression algorithm to apply: `none`, `gzip`, or `automatic`. Default value is `none`.
codec | Yes | Codec | The codec to apply. Must be `newline`, `json`, or `csv`.
sqs | Yes | sqs | The [Amazon Simple Queue Service (SQS)](https://aws.amazon.com/sqs/) (Amazon SQS) configuration. See [sqs](#sqs) for details.
codec | Yes | Codec | The [codec](#codec) to apply.
sqs | Yes | sqs | The SQS configuration. See [sqs](#sqs) for details.
aws | Yes | aws | The AWS configuration. See [aws](#aws) for details.
on_error | No | String | Determines how to handle errors in Amazon SQS. Can be either `retain_messages` or `delete_messages`. If `retain_messages`, then Data Prepper will leave the message in the Amazon SQS queue and try again. This is recommended for dead-letter queues. If `delete_messages`, then Data Prepper will delete failed messages. Default value is `retain_messages`.
buffer_timeout | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before a timeout occurs. Any events that the `s3` source cannot write to the buffer in this time will be discarded. Default value is 10 seconds.
@@ -24,6 +64,7 @@ metadata_root_key | No | String | Base key for adding S3 metadata to each Event.
disable_bucket_ownership_validation | No | Boolean | If `true`, the S3Source will not attempt to validate that the bucket is owned by the expected account. The expected account is the same account that owns the Amazon SQS queue. Defaults to `false`.
acknowledgments | No | Boolean | If `true`, enables `s3` sources to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks.
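
As a rough sketch of how the optional top-level settings combine, the following fragment sets error handling, the buffer timeout, and acknowledgments together. The values are illustrative assumptions, the `10s` duration notation is assumed to match the notation used for the `sqs` defaults, and the required `codec`, `sqs`, and `aws` blocks are omitted for brevity:

```yaml
source:
  s3:
    notification_type: sqs
    compression: automatic       # one of none, gzip, or automatic
    on_error: retain_messages    # leave failed messages in the queue (the default)
    buffer_timeout: 10s          # assumed notation for the 10-second default
    acknowledgments: true        # enable end-to-end acknowledgments
```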


## sqs

The following parameters allow you to configure usage for Amazon SQS in the `s3` source plugin.
@@ -32,7 +73,7 @@ Option | Required | Type | Description
:--- | :--- | :--- | :---
queue_url | Yes | String | The URL of the Amazon SQS queue from which messages are received.
maximum_messages | No | Integer | The maximum number of messages to receive from the Amazon SQS queue in any single request. Default value is `10`.
visibility_timeout | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the Amazon S3 objects in a batch. Default value is `30s`.
visibility_timeout | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the S3 objects in a batch. Default value is `30s`.
wait_time | No | Duration | The amount of time to wait for long polling on the Amazon SQS API. Default value is `20s`.
poll_delay | No | Duration | A delay to place between reading/processing a batch of Amazon SQS messages and making a subsequent request. Default value is `0s`.
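
The options above might be combined as in the following sketch, which lengthens the visibility timeout for objects that take longer to read. The queue URL is the placeholder used in the example at the end of this page, and the specific values are assumptions for illustration, not recommendations:

```yaml
source:
  s3:
    notification_type: sqs
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
      maximum_messages: 5        # fewer messages per request (default is 10)
      visibility_timeout: 120s   # assumed upper bound on the time to read one batch of S3 objects
      wait_time: 20s             # long-polling wait (the default)
      poll_delay: 0s             # no pause between batches (the default)
```

If the visibility timeout is shorter than the time Data Prepper needs to read a batch, Amazon SQS can make the same messages visible again before they are deleted, so the same objects may be read more than once.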

@@ -43,47 +84,111 @@ Option | Required | Type | Description
:--- | :--- | :--- | :---
region | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to null, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
aws_sts_header_overrides | No | Map | A map of header overrides that the IAM role assumes for the sink plugin.

## codec

## file
The `codec` determines how the `s3` source parses each S3 object.

Source for flat file input.
### newline codec

Option | Required | Type | Description
:--- | :--- | :--- | :---
path | Yes | String | The path to the input file (e.g. `logs/my-log.log`).
format | No | String | The format of each line in the file. Valid options are `json` or `plain`. Default value is `plain`.
record_type | No | String | The record type to store. Valid options are `string` or `event`. Default value is `string`. If you would like to use the file source for log analytics use cases like grok, set this option to `event`.
The `newline` codec parses each line as a single log event. This is ideal for most application logs because each event is written on its own line. It is also suitable for S3 objects that contain one JSON object per line, in which case you can pair it with the [parse_json]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor to parse each line.

## pipeline
Use the following options to configure the `newline` codec.

Source for reading from another pipeline.
Option | Required | Type | Description
:--- | :--- |:--------| :---
skip_lines | No | Integer | The number of lines to skip before creating events. You can use this configuration to skip common header rows. Default is `0`.
header_destination | No | String | A key value to assign to the header line of the S3 object. If this option is specified, then each event will contain a header_destination field.

### json codec

The `json` codec parses each S3 object as a JSON document containing a JSON array and then creates a Data Prepper log event for each object in the array.

### csv codec

The `csv` codec parses objects in comma-separated value (CSV) format, with each row producing a Data Prepper log event. Use the following options to configure the `csv` codec.

Option | Required | Type | Description
:--- |:---------|:------------| :---
delimiter | No | String | The delimiter separating columns. Default is `,`.
quote_character | No | String | The character used as a text qualifier for CSV data. Default is `"`.
header | No | String list | The header containing the column names used to parse CSV data.
detect_header | No | Boolean | Whether the first line of the S3 object should be interpreted as a header. Default is `true`.
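
For instance, a gzip-compressed, semicolon-delimited object without a header row might be read with a configuration along the following lines. This is a sketch only: the column names are hypothetical, and the required `sqs` and `aws` blocks are omitted:

```yaml
source:
  s3:
    notification_type: sqs
    compression: gzip
    codec:
      csv:
        delimiter: ";"
        quote_character: "\""
        detect_header: false     # the object has no header row
        header:                  # hypothetical column names
          - timestamp
          - level
          - message
```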

## Using `s3_select` with the `s3` source

When configuring `s3_select` to parse S3 objects, use the following options.

Option | Required | Type | Description
:--- |:-----------------------|:------------| :---
expression | Yes, when using `s3_select` | String | The expression used to query the object. Maps directly to the [expression](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html#AmazonS3-SelectObjectContent-request-Expression) property.
expression_type | No | String | The type of the provided expression. Default value is `SQL`. Maps directly to the [ExpressionType](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html#AmazonS3-SelectObjectContent-request-ExpressionType).
input_serialization | Yes, when using `s3_select` | String | Provides the S3 Select file format. Amazon S3 uses this format to parse object data into records and returns only records that match the specified SQL expression. May be `csv`, `json`, or `parquet`.
compression_type | No | String | Specifies an object's compression format. Maps directly to the [CompressionType](https://docs.aws.amazon.com/AmazonS3/latest/API/API_InputSerialization.html#AmazonS3-Type-InputSerialization-CompressionType).
csv | No | [csv](#s3_select_csv) | Provides the CSV configuration for processing CSV data.
json | No | [json](#s3_select_json) | Provides the JSON configuration for processing JSON data.
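
A minimal sketch of these options follows. It assumes that `s3_select` is a key nested directly under `s3`, and that `file_header_info` accepts a lowercase form of the S3 Select `USE` value; neither detail is stated in the table above, so verify both against your Data Prepper version:

```yaml
source:
  s3:
    notification_type: sqs
    s3_select:                   # assumed placement under the s3 source
      expression: "select * from s3object s"
      expression_type: SQL       # the default
      input_serialization: csv
      csv:
        file_header_info: use    # assumed casing; treat the first line as a header
```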

### csv<a name="s3_select_csv"></a>

Use the following options in conjunction with the `csv` configuration for `s3_select` to determine how your parsed CSV file should be formatted.

These options map directly to options available in the S3 Select [CSVInput](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html) data type.

Option | Required | Type | Description
:--- |:---------|:------------| :---
file_header_info | No | String | Describes the first line of input. Maps directly to the [FileHeaderInfo](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-FileHeaderInfo) property.
quote_escape | No | String | A single character used for escaping the quotation mark character inside an already escaped value. Maps directly to the [QuoteEscapeCharacter](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-QuoteEscapeCharacter) property.
comments | No | String | A single character used to indicate that a row should be ignored when the character is present at the start of that row. Maps directly to the [Comments](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-Comments) property.

### json<a name="s3_select_json"></a>

Use the following option in conjunction with `json` for `s3_select` to determine how S3 Select processes the JSON file.

Option | Required | Type | Description
:--- |:---------|:------------| :---
type | No | String | The type of JSON array. May be either `DOCUMENT` or `LINES`. Maps directly to the [Type](https://docs.aws.amazon.com/AmazonS3/latest/API/API_JSONInput.html#AmazonS3-Type-JSONInput-Type) property.

Option | Required | Type | Description
:--- | :--- | :--- | :---
name | Yes | String | Name of the pipeline to read from.

## Metrics

The `s3` processor includes the following metrics.
The `s3` source includes the following metrics.

### Counters

* `s3ObjectsFailed`: The number of Amazon S3 objects that the `s3` source failed to read.
* `s3ObjectsNotFound`: The number of Amazon S3 objects that the `s3` source failed to read due to an Amazon S3 "Not Found" error. These are also counted toward `s3ObjectsFailed`.
* `s3ObjectsAccessDenied`: The number of Amazon S3 objects that the `s3` source failed to read due to an "Access Denied" or "Forbidden" error. These are also counted toward `s3ObjectsFailed`.
* `s3ObjectsSucceeded`: The number of Amazon S3 objects that the `s3` source successfully read.
* `s3ObjectsFailed`: The number of S3 objects that the `s3` source failed to read.
* `s3ObjectsNotFound`: The number of S3 objects that the `s3` source failed to read due to an S3 "Not Found" error. These are also counted toward `s3ObjectsFailed`.
* `s3ObjectsAccessDenied`: The number of S3 objects that the `s3` source failed to read due to an "Access Denied" or "Forbidden" error. These are also counted toward `s3ObjectsFailed`.
* `s3ObjectsSucceeded`: The number of S3 objects that the `s3` source successfully read.
* `sqsMessagesReceived`: The number of Amazon SQS messages received from the queue by the `s3` source.
* `sqsMessagesDeleted`: The number of Amazon SQS messages deleted from the queue by the `s3` source.
* `sqsMessagesFailed`: The number of Amazon SQS messages that the `s3` source failed to parse.

### Timers

* `s3ObjectReadTimeElapsed`: Measures the amount of time the `s3` source takes to perform a request to GET an S3 object, parse it, and write events to the buffer.
* `sqsMessageDelay`: Measures the amount of time from when Amazon S3 creates an object to when it is fully parsed.
* `sqsMessageDelay`: Measures the time elapsed from when S3 creates an object to when it is fully parsed.

### Distribution summaries

* `s3ObjectSizeBytes`: Measures the size of Amazon S3 objects as reported by the Amazon S3 `Content-Length`. For compressed objects, this is the compressed size.
* `s3ObjectSizeBytes`: Measures the size of S3 objects as reported by the S3 `Content-Length`. For compressed objects, this is the compressed size.
* `s3ObjectProcessedBytes`: Measures the bytes processed by the `s3` source for a given object. For compressed objects, this is the uncompressed size.
* `s3ObjectsEvents`: Measures the number of events (sometimes called records) produced by an S3 object.

## Example: Uncompressed logs

The following `pipeline.yaml` file shows the minimum configuration for reading uncompressed newline-delimited logs:

```yaml
source:
  s3:
    notification_type: sqs
    codec:
      newline:
    compression: none
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
    aws:
      region: "us-east-1"
      sts_role_arn: "arn:aws:iam::123456789012:role/Data-Prepper"
```
