Allow using event fields in s3 sink object_key #3310

Closed
cameronattard opened this issue Sep 7, 2023 · 13 comments

@cameronattard

cameronattard commented Sep 7, 2023

Is your feature request related to a problem? Please describe.
Currently it seems like all objects from the s3 sink are sent using the same prefix, with only date-time being configurable. This means in order to retrieve a subset of events, e.g. logs from a specific hostname, you need to query all events for the time period.

Describe the solution you'd like
We would like to send events to different s3 object prefixes based on specific event fields, for example, hostname. This makes searching events in s3 simpler and cheaper as you can directly query the relevant subset of events.

Describe alternatives you've considered (Optional)
We could potentially use separate sinks for each subset of logs but this is not really dynamic or scalable.

Additional context
N/A

@dlvenable
Member

@cameronattard, thank you for this suggestion. I think this could be a useful feature. It could also allow for Hive-style partitioning, which is useful with use cases such as Amazon Athena.

https://docs.aws.amazon.com/athena/latest/ug/partitions.html

One difficulty with this solution is that we would also need to route events to the desired object and have multiple objects "in-flight". This could work quite nicely with the new multipart buffer.

Would you be interested in taking this up?

@dlvenable added the enhancement label and removed the untriaged label on Sep 13, 2023
@cameronattard
Author

@dlvenable thanks for the feedback. Unfortunately I have neither the expertise nor the bandwidth to implement this.

@kkondaka
Collaborator

kkondaka commented Oct 4, 2023

@dlvenable, it looks like the ask here is to make the pattern in https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/s3-sink/src/main/java/org/opensearch/dataprepper/plugins/sink/s3/configuration/ObjectKeyOptions.java configurable. We can make the pattern configurable and allow expressions in it, which I think would help here. We could also make a getHostName() function available in expressions, which would address the example case mentioned.

@dlvenable added this to the v2.6 milestone on Oct 4, 2023
@dlvenable moved this from Unplanned to To do in Data Prepper Tracking Board on Oct 4, 2023
@dlvenable
Member

@kkondaka, yes, that is the basic ask. However, it is somewhat more complicated because the S3 sink will need to keep multiple S3 objects open and group events into those objects. For example, if the pattern includes the timestamp's year, month, and day, then we must group the events into different objects corresponding to each event's timestamp, not the current timestamp.

Also, we should consider how this intersects with the thresholds. Should the thresholds be applied per group? Or for the entire sink? The per-group approach is natural, but could lead to memory issues as the sink could have dozens of groups.
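
To make the grouping and threshold question concrete, here is a minimal Python sketch of the per-group approach. It is not Data Prepper code: the `route` and `group_key` functions, the `max_events` parameter, and the `timestamp` field holding an ISO 8601 string are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def group_key(event):
    """Build a prefix from the event's own timestamp, not the wall clock."""
    # Assumption: the event carries an ISO 8601 string under "timestamp".
    ts = datetime.fromisoformat(event["timestamp"])
    return ts.strftime("events/%Y/%m/%d/")


def route(events, max_events=2):
    """Group events by prefix and flush a group once it reaches max_events.

    Returns (flushed, in_flight). Each (prefix, events) batch in flushed
    would become one S3 object; in_flight holds the buffers that are
    still open when the input ends.
    """
    in_flight = defaultdict(list)
    flushed = []
    for event in events:
        key = group_key(event)
        in_flight[key].append(event)
        # The threshold is checked per group (hypothetical policy).
        if len(in_flight[key]) >= max_events:
            flushed.append((key, in_flight.pop(key)))
    return flushed, dict(in_flight)
```

In this sketch the number of open buffers grows with the number of distinct prefixes, which is the memory concern raised above. A sink-wide threshold would instead need to track the total across all groups.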

@dlvenable
Member

Also, Data Prepper should support Hadoop file system partitioning. For example, you can partition by a timestamp:

path_prefix: "events/year=%{yyyy}/month=%{MM}/day=%{dd}/"

The example above partitions by the current time, but we really want to partition by the event's timestamp. We will need some additional capability in Data Prepper to extract part of a timestamp.

Perhaps a date-time format method?

path_prefix: 'events/year=${date_time_format(eventTime, "yyyy")}/month=${date_time_format(eventTime, "MM")}/day=${date_time_format(eventTime, "dd")}/'
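
As a rough illustration of what such a method would need to produce, the Python sketch below derives a Hive-style prefix from the event's own time. The `hive_prefix` function is hypothetical, and it assumes `eventTime` holds an ISO 8601 string; it does not reflect how Data Prepper implements this.

```python
from datetime import datetime


def hive_prefix(event, time_field="eventTime"):
    """Return a Hive-style prefix derived from the event's own timestamp."""
    # Assumption: the field is an ISO 8601 string such as
    # "2023-10-04T12:30:00".
    ts = datetime.fromisoformat(event[time_field])
    return f"events/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


print(hive_prefix({"eventTime": "2023-10-04T12:30:00"}))
```

Two events received at the same moment but stamped on different days would land under different `day=` partitions, which is the behavior the current-time pattern cannot provide.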

@dlvenable
Member

I created #3434 for the timestamp formatting.

@cameronattard, If you are looking to use time formatting, please take a look and provide any feedback on that proposal. Thanks!

@cameronattard
Author

I should clarify that hostname is just a generic example. Ideally we should be able to inject any arbitrary event field into the object key.

@kkondaka
Collaborator

kkondaka commented Oct 7, 2023

@cameronattard, of course. That's why I suggested adding support for expressions, so that any field or function can be part of the object name.

@dlvenable modified the milestones: v2.6, v2.7 on Oct 23, 2023
@dlvenable modified the milestones: v2.7, v2.8 on Nov 1, 2023
@faisalabujabal

@dlvenable, using expressions in the s3 sink config is a feature our project really needs. Can it also be applied to the s3 bucket name, to support dynamic buckets extracted or constructed from the event?

@dlvenable
Member

@graytaylor0, is this resolved by #4346 and #4385?

@graytaylor0
Member

Yes, those add dynamic path_prefix and dynamic bucket support. They do not add support for configuring the object_key in the s3 sink, but the ask here is just about configuring path_prefix and bucket, so I am closing this issue.

@MrR0807

MrR0807 commented Jul 8, 2024

Hello,

Can you please add documentation and pipeline examples showing how to use this functionality? It is very useful, but I cannot work out the correct syntax, and paid AWS support does not know how to write one either.

@cameronattard
Author

I'm using it at the moment. Here is an example:

        object_key:
          path_prefix: "opensearch-ingestion/${/your_field_name}/%{yyyy}/%{MM}/%{dd}/%{HH}"
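
For readers trying to picture what that template expands to, the Python sketch below emulates the substitution. It is an illustration only, not the sink's implementation: the `resolve_prefix` function is hypothetical, `${/...}` is treated as a slash-separated path into the event, only the four date tokens from the example are mapped, and the date parts are assumed to come from the current time in UTC.

```python
import re
from datetime import datetime, timezone

# Only the date tokens used in the example above are mapped (assumption).
_DATE_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}
_PLACEHOLDER = re.compile(r"\$\{(/[^}]+)\}|%\{([^}]+)\}")


def resolve_prefix(template, event, now=None):
    """Expand ${/field} from the event and %{token} from the clock."""
    now = now or datetime.now(timezone.utc)

    def substitute(match):
        pointer, date_token = match.groups()
        if pointer is not None:
            # Walk nested dicts: "/a/b" looks up event["a"]["b"].
            value = event
            for part in pointer.strip("/").split("/"):
                value = value[part]
            return str(value)
        return now.strftime(_DATE_TOKENS[date_token])

    return _PLACEHOLDER.sub(substitute, template)


template = "opensearch-ingestion/${/hostname}/%{yyyy}/%{MM}/%{dd}/%{HH}"
fixed_now = datetime(2024, 7, 8, 9, 0, tzinfo=timezone.utc)
print(resolve_prefix(template, {"hostname": "web-01"}, now=fixed_now))
```

With the fixed clock above, an event whose `hostname` is `web-01` resolves to `opensearch-ingestion/web-01/2024/07/08/09`, so each distinct field value gets its own prefix.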
