From a931ca62409e0bec67d292fadec094ec7d408ea0 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 23 Jan 2024 09:30:26 -0700 Subject: [PATCH 01/20] Add feature to this section Signed-off-by: Melissa Vagi --- .../configuration/sources/selective-download.md | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 _data-prepper/pipelines/configuration/sources/selective-download.md diff --git a/_data-prepper/pipelines/configuration/sources/selective-download.md b/_data-prepper/pipelines/configuration/sources/selective-download.md new file mode 100644 index 0000000000..a3e0609323 --- /dev/null +++ b/_data-prepper/pipelines/configuration/sources/selective-download.md @@ -0,0 +1,10 @@ +--- +layout: default +title: Selective download +parent: Sources +grand_parent: Pipelines +nav_order: 21 +--- + +# Selective download + From 1b4d8df296cfbc7f57e4ad16fe311ce24ee3b992 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 31 Jan 2024 12:15:28 -0700 Subject: [PATCH 02/20] add content Signed-off-by: Melissa Vagi --- .../sources/selective-download.md | 45 +++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/_data-prepper/pipelines/configuration/sources/selective-download.md b/_data-prepper/pipelines/configuration/sources/selective-download.md index a3e0609323..43af1e2e7f 100644 --- a/_data-prepper/pipelines/configuration/sources/selective-download.md +++ b/_data-prepper/pipelines/configuration/sources/selective-download.md @@ -8,3 +8,48 @@ nav_order: 21 # Selective download +If your pipeline uses an S3 source, you can use SQL expressions to perform filtering and computations on the contents of S3 objects before ingesting them into a pipeline. + +The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. + +The following pipeline example downloads data in incoming S3 objects, encoded in the Parquet File Format: + +```json +pipeline: + source: + s3: + s3_select: + expression: "select * from s3object s" + input_serialization: parquet + notification_type: "sqs" +... +``` +{% include copy-curl.html %} + +The following pipeline example downloads only the first 10,000 records in the objects: + +```json +pipeline: + source: + s3: + s3_select: + expression: "select * from s3object s LIMIT 10000" + input_serialization: parquet + notification_type: "sqs" +... +``` +{% include copy-curl.html %} + +The following pipeline example checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: + +```json +pipeline: + source: + s3: + s3_select: + expression: "select s.* from s3object s where s.data_value > 200 and s.data_value < 500 " + input_serialization: parquet + notification_type: "sqs" +... 
+``` +{% include copy-curl.html %} From 934a19e7853847279d1cdadc7dcbd7702267c09d Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 31 Jan 2024 12:45:56 -0700 Subject: [PATCH 03/20] Copy edits Signed-off-by: Melissa Vagi --- .../pipelines/configuration/sources/selective-download.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sources/selective-download.md b/_data-prepper/pipelines/configuration/sources/selective-download.md index 43af1e2e7f..5412b2f5f7 100644 --- a/_data-prepper/pipelines/configuration/sources/selective-download.md +++ b/_data-prepper/pipelines/configuration/sources/selective-download.md @@ -12,7 +12,7 @@ If your pipeline uses an S3 source, you can use SQL expressions to perform filte The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. -The following pipeline example downloads data in incoming S3 objects, encoded in the Parquet File Format: +The following example pipeline downloads data in incoming S3 objects, encoded in the Parquet File Format: ```json pipeline: @@ -26,7 +26,7 @@ pipeline: ``` {% include copy-curl.html %} -The following pipeline example downloads only the first 10,000 records in the objects: +The following example pipeline downloads only the first 10,000 records in the objects: ```json pipeline: @@ -40,7 +40,7 @@ pipeline: ``` {% include copy-curl.html %} -The following pipeline example checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: +The following example pipeline checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: ```json pipeline: From 2ee9fc06f891a585f81b8f0635251144e80844c8 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 6 Mar 2024 09:53:05 -0700 Subject: [PATCH 04/20] Update selective-download.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- .../pipelines/configuration/sources/selective-download.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sources/selective-download.md b/_data-prepper/pipelines/configuration/sources/selective-download.md index 5412b2f5f7..fe5f2f1395 100644 --- a/_data-prepper/pipelines/configuration/sources/selective-download.md +++ b/_data-prepper/pipelines/configuration/sources/selective-download.md @@ -1,9 +1,8 @@ --- layout: default title: Selective download -parent: Sources -grand_parent: Pipelines -nav_order: 21 +parent: Common use cases +nav_order: 50 --- # Selective download From 06cf450b490f87a103d56c360516374d9f7f0466 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Mon, 13 May 2024 11:46:32 -0600 Subject: [PATCH 05/20] Address tech review comments Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 60 +++++++++++++++++-- .../sources/selective-download.md | 54 ----------------- 2 files changed, 54 insertions(+), 60 deletions(-) delete mode 100644 _data-prepper/pipelines/configuration/sources/selective-download.md diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 7986a7eef8..3e19f30889 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -20,7 +20,7 @@ The following diagram shows the overall architecture of 
the components involved. S3 source architecture{: .img-fluid} -The flow of data is as follows. +The data flow involving the components is as follows: 1. A system produces logs into the S3 bucket. 2. S3 creates an S3 event notification in the SQS queue. @@ -44,7 +44,6 @@ Before Data Prepper can read log data from S3, you need the following prerequisi - An S3 bucket. - A log producer that writes logs to S3. The exact log producer will vary depending on your specific use case, but could include writing logs to S3 or a service such as Amazon CloudWatch. - ## Getting started Use the following steps to begin loading logs from S3 with Data Prepper. @@ -88,6 +87,7 @@ Use the following example to set up permissions: ] } ``` +{% include copy-curl.html %} If your S3 objects or SQS queues do not use KMS, you can remove the `kms:Decrypt` permission. @@ -104,8 +104,8 @@ The following diagram shows the system architecture when using SQS with DLQ. To use an SQS dead-letter queue, perform the following steps: -1. Create a new SQS standard queue to act as your DLQ. -2. Configure your SQS's redrive policy [to use your DLQ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-dead-letter-queue.html). Consider using a low value such as 2 or 3 for the "Maximum Receives" setting. +1. Create a new SQS standard queue to act as the DLQ. +2. Configure your SQS redrive policy [to use DLQ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-dead-letter-queue.html). Consider using a low value such as 2 or 3 for the **Maximum Receives** setting. 3. Configure the Data Prepper `s3` source to use `retain_messages` for `on_error`. This is the default behavior. ## Pipeline design @@ -125,6 +125,7 @@ s3-log-pipeline: queue_url: "arn:aws:sqs::<123456789012>:" visibility_timeout: "2m" ``` +{% include copy-curl.html %} Configure the following options according to your use case: @@ -164,10 +165,11 @@ s3-log-pipeline: password: "admin" index: s3_logs ``` +{% include copy-curl.html %} ## Multiple Data Prepper pipelines -We recommend that you have one SQS queue per Data Prepper pipeline. In addition, you can have multiple nodes in the same cluster reading from the same SQS queue, which doesn't require additional configuration with Data Prepper. +It is recommended that you have one SQS queue per Data Prepper pipeline. In addition, you can have multiple nodes in the same cluster reading from the same SQS queue, which doesn't require additional configuration with Data Prepper. If you have multiple pipelines, you must create multiple SQS queues for each pipeline, even if both pipelines use the same S3 bucket. @@ -175,6 +177,52 @@ If you have multiple pipelines, you must create multiple SQS queues for each pip To meet the scale of logs produced by S3, some users require multiple SQS queues for their logs. You can use [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) (Amazon SNS) to route event notifications from S3 to an SQS [fanout pattern](https://docs.aws.amazon.com/sns/latest/dg/sns-common-scenarios.html). Using SNS, all S3 event notifications are sent directly to a single SNS topic, where you can subscribe to multiple SQS queues. -To make sure that Data Prepper can directly parse the event from the SNS topic, configure [raw message delivery](https://docs.aws.amazon.com/sns/latest/dg/sns-large-payload-raw-message-delivery.html) on the SNS to SQS subscription. 
Setting this option will not affect other SQS queues that are subscribed to that SNS topic. +To make sure that Data Prepper can directly parse the event from the SNS topic, configure [raw message delivery](https://docs.aws.amazon.com/sns/latest/dg/sns-large-payload-raw-message-delivery.html) on the SNS to SQS subscription. Setting this option does not affect other SQS queues that are subscribed to that SNS topic. + +## Selective download + +If a pipeline uses an S3 source, you can use SQL expressions to perform filtering and computations on the contents of S3 objects before ingesting them into the pipeline. + +The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. + +The following example pipeline downloads data in incoming S3 objects, encoded in the Parquet File Format: + +```json +pipeline: + source: + s3: + s3_select: + expression: "select * from s3object s" + input_serialization: parquet + notification_type: "sqs" +... +``` +{% include copy-curl.html %} + +The following example pipeline downloads only the first 10,000 records in the objects: +```json +pipeline: + source: + s3: + s3_select: + expression: "select * from s3object s LIMIT 10000" + input_serialization: parquet + notification_type: "sqs" +... +``` +{% include copy-curl.html %} +The following example pipeline checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: + +```json +pipeline: + source: + s3: + s3_select: + expression: "select s.* from s3object s where s.data_value > 200 and s.data_value < 500 " + input_serialization: parquet + notification_type: "sqs" +... +``` +{% include copy-curl.html %} diff --git a/_data-prepper/pipelines/configuration/sources/selective-download.md b/_data-prepper/pipelines/configuration/sources/selective-download.md deleted file mode 100644 index fe5f2f1395..0000000000 --- a/_data-prepper/pipelines/configuration/sources/selective-download.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -layout: default -title: Selective download -parent: Common use cases -nav_order: 50 ---- - -# Selective download - -If your pipeline uses an S3 source, you can use SQL expressions to perform filtering and computations on the contents of S3 objects before ingesting them into a pipeline. - -The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. - -The following example pipeline downloads data in incoming S3 objects, encoded in the Parquet File Format: - -```json -pipeline: - source: - s3: - s3_select: - expression: "select * from s3object s" - input_serialization: parquet - notification_type: "sqs" -... -``` -{% include copy-curl.html %} - -The following example pipeline downloads only the first 10,000 records in the objects: - -```json -pipeline: - source: - s3: - s3_select: - expression: "select * from s3object s LIMIT 10000" - input_serialization: parquet - notification_type: "sqs" -... 
-``` -{% include copy-curl.html %} - -The following example pipeline checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: - -```json -pipeline: - source: - s3: - s3_select: - expression: "select s.* from s3object s where s.data_value > 200 and s.data_value < 500 " - input_serialization: parquet - notification_type: "sqs" -... -``` -{% include copy-curl.html %} From d53813f204b52f2feef1b3a15e9816df57a54b80 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Mon, 13 May 2024 11:54:14 -0600 Subject: [PATCH 06/20] Address tech review comments Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 3e19f30889..fce7c23e12 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -9,7 +9,6 @@ nav_order: 40 Data Prepper allows you to load logs from [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3), including traditional logs, JSON documents, and CSV logs. - ## Architecture Data Prepper can read objects from S3 buckets using an [Amazon Simple Queue Service (SQS)](https://aws.amazon.com/sqs/) (Amazon SQS) queue and [Amazon S3 Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html). @@ -28,7 +27,6 @@ The data flow involving the components is as follows: 4. Data Prepper downloads the content from the S3 object. 5. Data Prepper sends a document to OpenSearch for the content in the S3 object. - ## Pipeline overview Data Prepper supports reading data from S3 using the [`s3` source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/). @@ -56,8 +54,7 @@ Use the following steps to begin loading logs from S3 with Data Prepper. ### Setting permissions for Data Prepper -To view S3 logs, Data Prepper needs access to Amazon SQS and S3. -Use the following example to set up permissions: +To view S3 logs, Data Prepper needs access to Amazon SQS and S3. Use the following example to set up permissions: ```json { @@ -93,7 +90,7 @@ If your S3 objects or SQS queues do not use KMS, you can remove the `kms:Decrypt ### SQS dead-letter queue -The are two options for how to handle errors resulting from processing S3 objects. +The two options for how to handle errors resulting from processing S3 objects are as follows: - Use an SQS dead-letter queue (DLQ) to track the failure. This is the recommended approach. - Delete the message from SQS. You must manually find the S3 object and correct the error. @@ -105,7 +102,7 @@ The following diagram shows the system architecture when using SQS with DLQ. To use an SQS dead-letter queue, perform the following steps: 1. Create a new SQS standard queue to act as the DLQ. -2. Configure your SQS redrive policy [to use DLQ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-dead-letter-queue.html). Consider using a low value such as 2 or 3 for the **Maximum Receives** setting. +2. Configure your SQS re-drive policy [to use DLQ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-dead-letter-queue.html). Consider using a low value such as 2 or 3 for the **Maximum Receives** setting. 3. Configure the Data Prepper `s3` source to use `retain_messages` for `on_error`. This is the default behavior. 
## Pipeline design From 5644771632130b92eabb900259c1ba26b72df2d8 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 24 May 2024 13:31:24 -0600 Subject: [PATCH 07/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: David Venable Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index fce7c23e12..fc067a7c9d 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -176,7 +176,7 @@ To meet the scale of logs produced by S3, some users require multiple SQS queues To make sure that Data Prepper can directly parse the event from the SNS topic, configure [raw message delivery](https://docs.aws.amazon.com/sns/latest/dg/sns-large-payload-raw-message-delivery.html) on the SNS to SQS subscription. Setting this option does not affect other SQS queues that are subscribed to that SNS topic. -## Selective download +## Filtering and retrieving data using Amazon S3 Select If a pipeline uses an S3 source, you can use SQL expressions to perform filtering and computations on the contents of S3 objects before ingesting them into the pipeline. From 3f99c9c855031c4ef297d957f1f3302f38920e81 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 24 May 2024 13:31:30 -0600 Subject: [PATCH 08/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: David Venable Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index fc067a7c9d..ebbd72a5b8 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -182,7 +182,7 @@ If a pipeline uses an S3 source, you can use SQL expressions to perform filterin The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. 
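Because the SQL expression is evaluated by Amazon S3 before any data is transferred, you can also reduce the amount of data downloaded by selecting only the columns that the pipeline needs. The following sketch shows that pattern using the same `s3_select` options as the examples that follow; the column names `request_ip` and `status` are hypothetical and should be replaced with fields that exist in your objects:

```yaml
pipeline:
  source:
    s3:
      s3_select:
        # Download two columns from each Parquet object rather than full records.
        expression: "select s.request_ip, s.status from s3object s"
        input_serialization: parquet
      notification_type: "sqs"
...
```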
-The following example pipeline downloads data in incoming S3 objects, encoded in the Parquet File Format: +The following example pipeline retrieves all data S3 objects encoded in the Parquet File Format: ```json pipeline: From eb59294f77e8c38bd841797e6b94194297e353f3 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 24 May 2024 13:31:39 -0600 Subject: [PATCH 09/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: David Venable Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index ebbd72a5b8..94891cf6a4 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -196,7 +196,7 @@ pipeline: ``` {% include copy-curl.html %} -The following example pipeline downloads only the first 10,000 records in the objects: +The following example pipeline retrieves only the first 10,000 records in the objects: ```json pipeline: From 1e4537bd3ddc5527d646a277e8b595c732056785 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 24 May 2024 13:31:47 -0600 Subject: [PATCH 10/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: David Venable Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 94891cf6a4..e50e199661 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -210,7 +210,7 @@ pipeline: ``` {% include copy-curl.html %} -The following example pipeline checks for the minimum and maximum values of `data_value` before ingesting events into the pipeline: +The following example pipeline retrieves records from S3 objects that have a `data_value` in the given range of 200-500. ```json pipeline: From 7fa5358b02fe0b938884cfb2d997ba45d252d6d2 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 29 May 2024 13:24:22 -0600 Subject: [PATCH 11/20] Update s3-logs.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index e50e199661..1c29349466 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -210,7 +210,7 @@ pipeline: ``` {% include copy-curl.html %} -The following example pipeline retrieves records from S3 objects that have a `data_value` in the given range of 200-500. +The following example pipeline retrieves records from S3 objects that have a `data_value` in the given range of 200--500. 
```json pipeline: From 5509831951ff9a456b5e24a0392f897e5cc14739 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 29 May 2024 13:41:52 -0600 Subject: [PATCH 12/20] Update s3-logs.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 1c29349466..7a885a5e68 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -182,6 +182,9 @@ If a pipeline uses an S3 source, you can use SQL expressions to perform filterin The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. +Refer to S3 Select user guides [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. +{: .note} + The following example pipeline retrieves all data S3 objects encoded in the Parquet File Format: ```json From 372c24f3b84439bf66016fe892a8886e78db5af2 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Wed, 29 May 2024 13:43:03 -0600 Subject: [PATCH 13/20] Update s3-logs.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 7a885a5e68..0d9243eac2 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -182,7 +182,7 @@ If a pipeline uses an S3 source, you can use SQL expressions to perform filterin The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. -Refer to S3 Select user guides [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. +Refer to user guides [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. 
{: .note} The following example pipeline retrieves all data S3 objects encoded in the Parquet File Format: From e146721709246c634919e2721e2898d6044b0fee Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:49:14 -0600 Subject: [PATCH 14/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 0d9243eac2..f04bc80d34 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -213,7 +213,7 @@ pipeline: ``` {% include copy-curl.html %} -The following example pipeline retrieves records from S3 objects that have a `data_value` in the given range of 200--500. +The following example pipeline retrieves records from S3 objects that have a `data_value` in the given range of 200--500: ```json pipeline: From 1615a036b006f3a619090b542c944bdf296929e7 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:49:53 -0600 Subject: [PATCH 15/20] Update _data-prepper/common-use-cases/s3-logs.md Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index f04bc80d34..ea408649f2 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -19,7 +19,7 @@ The following diagram shows the overall architecture of the components involved. S3 source architecture{: .img-fluid} -The data flow involving the components is as follows: +The component data flow is as follows: 1. A system produces logs into the S3 bucket. 2. S3 creates an S3 event notification in the SQS queue. From 18c4245b5030953e40fc82109909e1bf709ee3af Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:50:13 -0600 Subject: [PATCH 16/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index ea408649f2..4ad22c1768 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -90,7 +90,7 @@ If your S3 objects or SQS queues do not use KMS, you can remove the `kms:Decrypt ### SQS dead-letter queue -The two options for how to handle errors resulting from processing S3 objects are as follows: +The following two options can be used to handle S3 object processing errors: - Use an SQS dead-letter queue (DLQ) to track the failure. This is the recommended approach. - Delete the message from SQS. You must manually find the S3 object and correct the error. 
From ac30dec585fc7f2ed61b7ed8d4df9090937536cf Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:50:32 -0600 Subject: [PATCH 17/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 4ad22c1768..e13fe135a3 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -166,7 +166,7 @@ s3-log-pipeline: ## Multiple Data Prepper pipelines -It is recommended that you have one SQS queue per Data Prepper pipeline. In addition, you can have multiple nodes in the same cluster reading from the same SQS queue, which doesn't require additional configuration with Data Prepper. +It is recommended that you have one SQS queue per Data Prepper pipeline. In addition, you can have multiple nodes in the same cluster reading from the same SQS queue, which doesn't require additional Data Prepper configuration. If you have multiple pipelines, you must create multiple SQS queues for each pipeline, even if both pipelines use the same S3 bucket. From f2b57d29fd46e9b8092db49ba627dbdd25a9a9ce Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:50:51 -0600 Subject: [PATCH 18/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index e13fe135a3..0b084c101b 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -174,7 +174,7 @@ If you have multiple pipelines, you must create multiple SQS queues for each pip To meet the scale of logs produced by S3, some users require multiple SQS queues for their logs. You can use [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) (Amazon SNS) to route event notifications from S3 to an SQS [fanout pattern](https://docs.aws.amazon.com/sns/latest/dg/sns-common-scenarios.html). Using SNS, all S3 event notifications are sent directly to a single SNS topic, where you can subscribe to multiple SQS queues. -To make sure that Data Prepper can directly parse the event from the SNS topic, configure [raw message delivery](https://docs.aws.amazon.com/sns/latest/dg/sns-large-payload-raw-message-delivery.html) on the SNS to SQS subscription. Setting this option does not affect other SQS queues that are subscribed to that SNS topic. +To make sure that Data Prepper can directly parse the event from the SNS topic, configure [raw message delivery](https://docs.aws.amazon.com/sns/latest/dg/sns-large-payload-raw-message-delivery.html) on the SNS-to-SQS subscription. Applying this option does not affect other SQS queues subscribed to the SNS topic. 
## Filtering and retrieving data using Amazon S3 Select From 46ca0b60956e4bffc5ebc01fcd6305bfe40863bb Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:51:11 -0600 Subject: [PATCH 19/20] Update _data-prepper/common-use-cases/s3-logs.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index 0b084c101b..efaca41156 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -182,7 +182,7 @@ If a pipeline uses an S3 source, you can use SQL expressions to perform filterin The `s3_select` option supports objects in the [Parquet File Format](https://parquet.apache.org/docs/). It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only) and supports columnar compression for the Parquet File Format using GZIP and Snappy. -Refer to user guides [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. +Refer to [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. {: .note} The following example pipeline retrieves all data S3 objects encoded in the Parquet File Format: From 155138e86b4b93fa798e2b3f968e4f0976f5e84c Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 27 Jun 2024 10:51:44 -0600 Subject: [PATCH 20/20] Update _data-prepper/common-use-cases/s3-logs.md Signed-off-by: Melissa Vagi --- _data-prepper/common-use-cases/s3-logs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/common-use-cases/s3-logs.md b/_data-prepper/common-use-cases/s3-logs.md index efaca41156..8d5a9ce967 100644 --- a/_data-prepper/common-use-cases/s3-logs.md +++ b/_data-prepper/common-use-cases/s3-logs.md @@ -185,7 +185,7 @@ The `s3_select` option supports objects in the [Parquet File Format](https://par Refer to [Filtering and retrieving data using Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) and [SQL reference for Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference.html) for comprehensive information about using Amazon S3 Select. {: .note} -The following example pipeline retrieves all data S3 objects encoded in the Parquet File Format: +The following example pipeline retrieves all data from S3 objects encoded in the Parquet File Format: ```json pipeline: