Provide a way to terminate an aggregation group early in the aggregation processor #5240
Labels: enhancement, good first issue, help wanted
Is your feature request related to a problem? Please describe.
I have a pipeline that ingests logs into OpenSearch, and I use the aggregation processor with the `put_map` action. At the moment, the only way a group can close with this action is to wait for the `group_duration` to expire. That means all records that have been merged, but whose group is not yet closed, still live in memory on the Data Prepper nodes.
For a high-throughput pipeline, a high-latency pipeline where you have to specify a large `group_duration`, or both, a lot of memory is wasted on already merged records that are just waiting for the group to expire. There should be a way to terminate a group and flush the result to the next processor or sink when you know you do not need to wait.
Describe the solution you'd like
The solution could work in two steps:
The pipeline configuration could look like:
Describe alternatives you've considered (Optional)
Another option: add a `close_when` option, common to all `AggregateAction` implementations, that takes a custom expression guarding the closure of the group. This expression can be evaluated when `AggregateGroupManager.getGroupsToConclude()` is called, so the changes in `AggregateProcessor` are minimal.
Additional context
The aggregate processor first checks for groups to conclude and then processes the current batch. This logic should be reversed so the events are flushed immediately after the aggregation.
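To make the request concrete, here is a minimal sketch of the two ideas together: a `close_when` guard checked when groups are concluded, and a batch that is aggregated before that check runs. It is a toy model in Python, not Data Prepper's Java code. `GroupManager`, `process_batch`, the callable form of `close_when`, and the `status` field are all hypothetical names chosen for illustration; in the real processor the guard would presumably be a Data Prepper expression string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Optional

Event = Dict[str, object]


@dataclass
class Group:
    """Merged state for one identification key."""
    started_at: float
    merged: Event = field(default_factory=dict)


class GroupManager:
    """Toy stand-in for a group manager (hypothetical, not the real class)."""

    def __init__(self, group_duration: float,
                 close_when: Optional[Callable[[Event], bool]] = None):
        self.group_duration = group_duration
        self.close_when = close_when  # hypothetical early-close guard
        self.groups: Dict[Hashable, Group] = {}

    def merge(self, key: Hashable, event: Event, now: float) -> None:
        """Merge the event's fields into the group for this key."""
        group = self.groups.setdefault(key, Group(started_at=now))
        group.merged.update(event)

    def groups_to_conclude(self, now: float) -> List[Event]:
        """Remove and return groups that expired or whose guard is true."""
        done = []
        for key, group in list(self.groups.items()):
            expired = now - group.started_at >= self.group_duration
            closed_early = (self.close_when is not None
                            and self.close_when(group.merged))
            if expired or closed_early:
                done.append(self.groups.pop(key).merged)
        return done


def process_batch(manager: GroupManager, batch: List[Event],
                  key_field: str, now: float) -> List[Event]:
    """Aggregate the batch first, then conclude groups, so a group closed
    by an event in this batch is flushed in the same call."""
    for event in batch:
        manager.merge(event[key_field], event, now)
    return manager.groups_to_conclude(now)


# Usage: the group for "a" is flushed as soon as the closing event arrives,
# long before the 180-second group_duration would have expired.
manager = GroupManager(group_duration=180.0,
                       close_when=lambda merged: merged.get("status") == "done")
first = process_batch(manager, [{"id": "a", "msg": "start"}], "id", now=0.0)
second = process_batch(manager, [{"id": "a", "status": "done"}], "id", now=1.0)
```

In this model, checking for groups to conclude before aggregating the batch would leave the closed group in memory until the next batch arrives, which is the delay the reordering is meant to avoid.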