
Add event aggregation use case to Data Prepper documentation #6206

Merged: 12 commits, Feb 22, 2024

Conversation

@vagimeli (Contributor)

Description

Adds event aggregation use case to Data Prepper documentation

Issues Resolved

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@vagimeli (Contributor Author)

Hi @lizsnyder: Please review this use case carryover to Data Prepper at your availability. Thank you :)

@lizsnyder (Member) left a comment:

LGTM

@vagimeli (Contributor Author)

@dlvenable or @Naarcha-AWS: Please provide a doc review of this PR. This use case is one in a series of use cases we're bringing over from OpenSearch Service to Data Prepper. I've updated or removed language as applicable to OpenSearch open source and Data Prepper. I'll tag you in the other PRs for your doc reviews. Thanks, @vagimeli

@vagimeli vagimeli added 4 - Doc review PR: Doc review in progress and removed 3 - Tech review PR: Tech review in progress labels Jan 18, 2024

# Event aggregation with Data Prepper

You can use Data Prepper to aggregate data from different events over a period of time. Aggregating events can help reduce unnecessary log volume and handle use cases like multiline logs that come in as separate events. The [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) is a stateful processor that groups events based on the values for a set of specified identification keys, and performs a configurable action on each group.
Comment (Member):

I think it may be advantageous to format the Aggregate processor as the `aggregate` processor.

Thus, change:

The [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/)

to

The [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/)


## Basic usage

The following example pipeline extracts the fields `sourceIp`, `destinationIp`, and `port` using the [Grok processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/), and then aggregates on those fields over a period of 30 seconds using the [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) and the `put_all` action. At the end of the 30 seconds, the aggregated log is sent to the OpenSearch sink.
Comment (Member):

Similar comment as above with [Aggregate processor].

Reply (Contributor Author):

I'll carry over this change across the other use case PRs.

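The pipeline configuration itself is elided in this capture. As a rough sketch only, a pipeline matching the description above might look like the following; the source path, grok pattern, and sink settings are illustrative assumptions, not the PR's actual example:

```
# Sketch of the basic-usage pipeline described above; all names and values are assumed.
log-aggregate-pipeline:
  source:
    http:
      path: "/log/ingest"    # assumed HTTP source path
  processor:
    - grok:
        match:
          # Assumed pattern; it must extract the keys that aggregate will group on.
          log: ["%{IPORHOST:sourceIp} %{IPORHOST:destinationIp} %{NUMBER:port:int}"]
    - aggregate:
        group_duration: "30s"    # aggregate each group over a 30-second window
        identification_keys: ["sourceIp", "destinationIp", "port"]
        action:
          put_all:    # merge the fields of all events in a group into one event
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]    # assumed cluster address
        index: "aggregated_logs"             # assumed index name
```

The `put_all` action combines the events in a group by putting all of their fields into a single aggregated event, which is what gets sent to the sink when the 30-second window closes.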

The Grok processor will extract the `identification_keys` to create the following logs:
Comment (Member):

The grok processor is not creating the following logs. It is modifying the existing log events to look like what follows. Also, it extracts keys more generally; the pipeline author has to extract them to match what aggregate expects for `identification_keys`.

The grok processor will extract keys such that the log events look like the following. These events now have the data that the aggregate processor will need for the `identification_keys`.
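As an illustration only (the actual sample logs are not shown in this capture), an event after grok extraction might carry fields like these, which give the `aggregate` processor its `identification_keys`:

```
{
  "log": "203.0.113.5 198.51.100.2 443",
  "sourceIp": "203.0.113.5",
  "destinationIp": "198.51.100.2",
  "port": 443
}
```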

vagimeli and others added 2 commits January 19, 2024 15:53
Signed-off-by: Melissa Vagi <[email protected]>
@vagimeli (Contributor Author)

@dlvenable I've addressed your comments and let @natebower know this is ready for editorial review. Thanks for your review.

@natebower (Collaborator) left a comment:

@vagimeli Please see my comments and changes and tag me when addressed. I'd like to reread line 57 before approving. Thanks!

_data-prepper/common-use-cases/event-aggregation.md

# Event aggregation with Data Prepper

You can use Data Prepper to aggregate data from different events over a period of time. Aggregating events can help reduce unnecessary log volume and handle use cases like multiline logs that come in as separate events. The [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) is a stateful processor that groups events based on the values for a set of specified identification keys, and performs a configurable action on each group.
Comment (Collaborator):

"multiline logs that are received as"?



State in the `aggregate` processor is stored in memory. For example, in order to combine four events into one, the processor needs to retain pieces of the first three events. The state of an aggregate group of events is kept for a configurable amount of time. Depending on your logs, the aggregate action being used, and the amount of memory options in the processor configuration, the aggregation could take place over a long period of time.
Comment (Collaborator):

Suggested change: "the amount of memory options" → "the number of memory options".


Comment (Collaborator):

"The aggregate processor state is stored in memory"? Otherwise, can the sentence start with "The state"?


## Basic usage

The following example pipeline extracts the fields `sourceIp`, `destinationIp`, and `port` using the [`grok` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/), and then aggregates on those fields over a period of 30 seconds using the [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) and the `put_all` action. At the end of the 30 seconds, the aggregated log is sent to the OpenSearch sink.
Comment (Collaborator):

Suggested change: remove the comma before "and then aggregates", and change "At the end of the 30 seconds" to "At the end of the 30-second period".


## Log aggregation and conditional routing

You can use multiple plugins to combine log aggregation with conditional routing. In this example, the sub-pipeline `log-aggregate-pipeline` receives logs by using an HTTP client like FluentBit and extracts important values from the logs by matching the value in the `log` key against the common Apache log pattern.
Comment (Collaborator):

Suggested change: set off "like FluentBit" with commas: "receives logs by using an HTTP client, like FluentBit, and extracts important values".


Comment (Collaborator):

End of last sentence: "against the Apache Common Log Format"?



Two of the values the sub-pipeline extracts from the logs with a grok pattern include `response` and `clientip`. The `aggregate` processor then uses the `clientip` value, along with the `remove_duplicates` option, to drop any logs that contain a `clientip` that has already been processed within the given `group_duration`.
Comment (Collaborator):

Suggested change: "Two of the values the sub-pipeline extracts" → "Two of the values that the sub-pipeline extracts".


Comment (Collaborator):

Should "Grok" be capitalized?

Reply (Contributor Author):

It looks like we capitalize it when it's part of a proper name, such as Grok Debugger, and lowercase it in common usage.

Reply (Collaborator):

It looks like "Grok pattern" is used here and here and in many other places, with Elastic being the notable exception. I would lean toward capitalizing it if it's a reference to a proper noun, which it appears to be.

Reply (Contributor Author):

Okay. I'll clean up the capitalization across Data Prepper and OpenSearch ingest processors.



Three routes, or conditional statements, exist in the pipeline. These routes separate the value of the response into 2xx/3xx, 4xx, and 5xx responses. Logs with a 2xx and 3xx status are sent to the aggregated_2xx_3xx index, logs with a 4xx status are sent to the aggregated_4xx index, and logs with a 5xx status are sent to the aggregated_5xx index.
Comment (Collaborator):

Second sentence: Confirm that it shouldn't be "2xx or 3xx status". Should the index names be in code font?

Reply (Member):

@natebower is correct. These are logs that have either a 2xx or 3xx status.

Reply (Member):

@vagimeli, we want to change "Logs with a 2xx and 3xx status" to "Logs with a 2xx or 3xx status".
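Pulling the pieces of this thread together, the aggregation-plus-routing pipeline under discussion might be sketched as follows; the route names, condition expressions, and sink settings are assumptions for illustration, not the PR's actual configuration:

```
# Sketch only: aggregate with deduplication, then route by response status.
log-aggregate-pipeline:
  source:
    http:
      port: 2021    # assumed port for an HTTP client such as FluentBit
  processor:
    - grok:
        match:
          log: ["%{COMMONAPACHELOG_DATATYPED}"]    # common Apache log pattern
    - aggregate:
        identification_keys: ["clientip"]
        action:
          remove_duplicates:    # drop repeat clientip values within the window
        group_duration: "180s"
  route:
    - 2xx_3xx_status: '/response >= 200 and /response < 400'
    - 4xx_status: '/response >= 400 and /response < 500'
    - 5xx_status: '/response >= 500 and /response < 600'
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]    # assumed cluster address
        index: "aggregated_2xx_3xx"
        routes: ["2xx_3xx_status"]
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "aggregated_4xx"
        routes: ["4xx_status"]
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "aggregated_5xx"
        routes: ["5xx_status"]
```

Each sink's `routes` list restricts it to events matching the named conditions, which is how the three response classes end up in three separate indexes.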

---
layout: default
title: Event aggregation with Data Prepper
Comment (Member):

Can we also drop the " with Data Prepper" here?

@vagimeli (Contributor Author)

@dlvenable @natebower Please see the revised documentation based on your reviews. Let me know if any other changes are needed. Thanks, Melissa

@natebower (Collaborator) left a comment:

LGTM

@vagimeli (Contributor Author)

Thanks @natebower. @dlvenable Please review and provide your approval at your availability. Thanks, Melissa

Signed-off-by: Melissa Vagi <[email protected]>
@vagimeli vagimeli added 3 - Tech review PR: Tech review in progress and removed 4 - Doc review PR: Doc review in progress labels Jan 31, 2024
@dlvenable (Member) left a comment:

This looks great. Thank you!

@vagimeli vagimeli merged commit d20468b into main Feb 22, 2024
4 checks passed
@vagimeli vagimeli deleted the event-aggregation branch February 22, 2024 17:21
@vagimeli vagimeli added 3 - Done Issue is done/complete backport 2.12 PR: Backport label for 2.12 and removed 3 - Tech review PR: Tech review in progress labels Mar 2, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 5, 2024
* Add event aggregation use case to Data Prepper documentation

---------

Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit d20468b)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
vagimeli pushed a commit that referenced this pull request Mar 5, 2024
…6575)

* Add event aggregation use case to Data Prepper documentation

---------




(cherry picked from commit d20468b)

Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
oeyh pushed a commit to oeyh/documentation-website that referenced this pull request Mar 14, 2024
…rch-project#6206)

* Add event aggregation use case to Data Prepper documentation

---------

Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Labels: 3 - Done, backport 2.12, Content gap, data-prepper
5 participants