Add event aggregation use case to Data Prepper documentation #6206
Conversation
Signed-off-by: Melissa Vagi <[email protected]>
Hi @lizsnyder: Please review this use case carryover to Data Prepper at your availability. Thank you :)
LGTM
@dlvenable or @Naarcha-AWS: Please provide a doc review of this PR. This use case is one in a series of use cases we're bringing over from OpenSearch Service to Data Prepper. I've updated or removed language as applicable to OpenSearch open source and Data Prepper. I'll tag you in the other PRs for your doc reviews. Thanks, @vagimeli
# Event aggregation with Data Prepper

You can use Data Prepper to aggregate data from different events over a period of time. Aggregating events can help reduce unnecessary log volume and handle use cases like multiline logs that come in as separate events. The [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) is a stateful processor that groups events based on the values for a set of specified identification keys, and performs a configurable action on each group.
I think it may be advantageous to format the aggregate processor as `aggregate` processor.
Thus, change:
The [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/)
to
The [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/)
## Basic usage

The following example pipeline extracts the fields `sourceIp`, `destinationIp`, and `port` using the [Grok processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/), and then aggregates on those fields over a period of 30 seconds using the [Aggregate processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) and the `put_all` action. At the end of the 30 seconds, the aggregated log is sent to the OpenSearch sink.
Similar comment as above with [Aggregate processor].
I'll carry over this change across the other use case PRs.
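The example pipeline configuration being reviewed looks roughly like the following. This is a minimal sketch reconstructed from the description above; the pipeline name, the HTTP source path, and the OpenSearch sink settings are assumptions rather than the exact doc content.

```yaml
aggregate_pipeline:
  source:
    http:
      # Path that an HTTP client posts logs to; assumed value.
      path: "/${pipelineName}/logs"
  processor:
    - grok:
        match:
          # Extract the three identification keys from the raw log line.
          log: ["%{IPORHOST:sourceIp} %{IPORHOST:destinationIp} %{NUMBER:port:int}"]
    - aggregate:
        # Group events that share all three key values.
        identification_keys: ["sourceIp", "destinationIp", "port"]
        # Merge all fields from grouped events into one event.
        action:
          put_all:
        # Emit the aggregated event after 30 seconds.
        group_duration: "30s"
  sink:
    - opensearch:
        # Placeholder connection details.
        hosts: ["https://localhost:9200"]
        index: aggregated_logs
```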
{% include copy-curl.html %}
The Grok processor will extract the `identification_keys` to create the following logs:
The grok processor is not creating the following logs. It is modifying the existing log events to look like what follows. Also, it extracts keys more generally; the pipeline author has to extract them to match what `aggregate` expects for `identification_keys`.
Suggested wording: The grok processor will extract keys such that the log events look like the following. These events now have the data that the `aggregate` processor will need for the `identification_keys`.
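For reference, a post-grok log event would look something like this (sample values assumed):

```json
{"log": "127.0.0.1 192.168.0.1 80", "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80}
```

The `sourceIp`, `destinationIp`, and `port` fields are now present for the `aggregate` processor's `identification_keys`.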
Co-authored-by: Naarcha-AWS <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
@dlvenable I've addressed your comments and let @natebower know this is ready for editorial review. Thanks for your review.
@vagimeli Please see my comments and changes and tag me when addressed. I'd like to reread line 57 before approving. Thanks!
# Event aggregation with Data Prepper

You can use Data Prepper to aggregate data from different events over a period of time. Aggregating events can help reduce unnecessary log volume and handle use cases like multiline logs that come in as separate events. The [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) is a stateful processor that groups events based on the values for a set of specified identification keys, and performs a configurable action on each group.
"multiline logs that are received as"?
State in the `aggregate` processor is stored in memory. For example, in order to combine four events into one, the processor needs to retain pieces of the first three events. The state of an aggregate group of events is kept for a configurable amount of time. Depending on your logs, the aggregate action being used, and the amount of memory options in the processor configuration, the aggregation could take place over a long period of time.
Suggested change: "the amount of memory options in the processor configuration" → "the number of memory options in the processor configuration".
State in the `aggregate` processor is stored in memory. For example, in order to combine four events into one, the processor needs to retain pieces of the first three events. The state of an aggregate group of events is kept for a configurable amount of time. Depending on your logs, the aggregate action being used, and the amount of memory options in the processor configuration, the aggregation could take place over a long period of time.
"The aggregate
processor state is stored in memory"? Otherwise, can the sentence start with "The state"?
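For context on the in-memory state being discussed, the retention window is controlled by the processor's `group_duration` option. The following is a sketch reusing the keys from the example above; the 180-second value is illustrative:

```yaml
- aggregate:
    identification_keys: ["sourceIp", "destinationIp", "port"]
    action:
      put_all:
    # Group state is held in memory for up to this duration before the
    # aggregated event is emitted and the state is released.
    group_duration: "180s"
```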
## Basic usage

The following example pipeline extracts the fields `sourceIp`, `destinationIp`, and `port` using the [`grok` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/), and then aggregates on those fields over a period of 30 seconds using the [`aggregate` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/aggregate/) and the `put_all` action. At the end of the 30 seconds, the aggregated log is sent to the OpenSearch sink.
Suggested change: drop the comma before "and then aggregates", and change "At the end of the 30 seconds" to "At the end of the 30-second period".
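As an illustration of the `put_all` action, events that share the same identification key values within the window are merged into a single event containing all of their fields (sample values assumed):

```json
// Two events arrive within the 30-second window...
{"sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "status": 200}
{"sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "bytes": 1024}
// ...and are combined into a single aggregated event:
{"sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "status": 200, "bytes": 1024}
```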
## Log aggregation and conditional routing

You can use multiple plugins to combine log aggregation with conditional routing. In this example, the sub-pipeline `log-aggregate-pipeline` receives logs by using an HTTP client like FluentBit and extracts important values from the logs by matching the value in the `log` key against the common Apache log pattern.
Suggested change: "an HTTP client like FluentBit" → "an HTTP client, like FluentBit," (set off with commas).
You can use multiple plugins to combine log aggregation with conditional routing. In this example, the sub-pipeline `log-aggregate-pipeline` receives logs by using an HTTP client like FluentBit and extracts important values from the logs by matching the value in the `log` key against the common Apache log pattern.
End of last sentence: "against the Apache Common Log Format"?
Two of the values the sub-pipeline extracts from the logs with a grok pattern include `response` and `clientip`. The `aggregate` processor then uses the `clientip` value, along with the `remove_duplicates` option, to drop any logs that contain a `clientip` that has already been processed within the given `group_duration`.
Suggested change: "Two of the values the sub-pipeline extracts" → "Two of the values that the sub-pipeline extracts".
Two of the values the sub-pipeline extracts from the logs with a grok pattern include `response` and `clientip`. The `aggregate` processor then uses the `clientip` value, along with the `remove_duplicates` option, to drop any logs that contain a `clientip` that has already been processed within the given `group_duration`.
Should "Grok" be capitalized?
It looks like we capitalize it when it's part of a proper name, such as Grok Debugger, and lowercase it in common usage.
Okay. I'll clean up the capitalization across Data Prepper and OpenSearch ingest processors.
Three routes, or conditional statements, exist in the pipeline. These routes separate the value of the response into 2xx/3xx, 4xx, and 5xx responses. Logs with a 2xx and 3xx status are sent to the aggregated_2xx_3xx index, logs with a 4xx status are sent to the aggregated_4xx index, and logs with a 5xx status are sent to the aggregated_5xx index.
Second sentence: Confirm that it shouldn't be "2xx or 3xx status". Should the index names be in code font?
@natebower is correct. These are logs that have either a 2xx or 3xx status.
@vagimeli, We want to change this:

"Logs with a `2xx` and `3xx` status"

to:

"Logs with a `2xx` or `3xx` status"
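Putting this section's pieces together, the pipeline under discussion combines aggregation with conditional routing roughly as follows. This is a sketch based on the prose above; the grok pattern, route conditions, and sink connection details are assumptions:

```yaml
log-aggregate-pipeline:
  source:
    http:
      # Fluent Bit (or another HTTP client) posts logs here; assumed path.
      path: "/${pipelineName}/logs"
  processor:
    - grok:
        match:
          # Parse the value of the log key against an Apache common log
          # pattern, typing numeric fields such as response.
          log: ["%{COMMONAPACHELOG_DATATYPED}"]
    - aggregate:
        identification_keys: ["clientip"]
        action:
          # Drop any further logs from a clientip already processed
          # within the group_duration window.
          remove_duplicates:
        group_duration: "180s"
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 3xx_status: "/response >= 300 and /response < 400"
    - 4xx_status: "/response >= 400 and /response < 500"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]  # placeholder connection details
        index: aggregated_2xx_3xx
        routes: [2xx_status, 3xx_status]
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: aggregated_4xx
        routes: [4xx_status]
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: aggregated_5xx
        routes: [5xx_status]
```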
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: Melissa Vagi <[email protected]>
@@ -0,0 +1,135 @@
---
layout: default
title: Event aggregation with Data Prepper
Can we also drop the " with Data Prepper" here?
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
@dlvenable @natebower Please see the revised documentation based on your reviews. Let me know if any other changes are needed. Thanks, Melissa
LGTM
Thanks @natebower. @dlvenable Please review and provide your approval at your availability. Thanks, Melissa
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
This looks great. Thank you!
* Add event aggregation use case to Data Prepper documentation --------- Signed-off-by: Melissa Vagi <[email protected]> Co-authored-by: Naarcha-AWS <[email protected]> Co-authored-by: Nathan Bower <[email protected]> (cherry picked from commit d20468b) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…6575) * Add event aggregation use case to Data Prepper documentation --------- (cherry picked from commit d20468b) Signed-off-by: Melissa Vagi <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Naarcha-AWS <[email protected]> Co-authored-by: Nathan Bower <[email protected]>
…rch-project#6206) * Add event aggregation use case to Data Prepper documentation --------- Signed-off-by: Melissa Vagi <[email protected]> Co-authored-by: Naarcha-AWS <[email protected]> Co-authored-by: Nathan Bower <[email protected]>
Description
Adds event aggregation use case to Data Prepper documentation