
S3 Event Decoding Consistency #1687

Closed
dlvenable opened this issue Aug 23, 2022 · 2 comments · Fixed by #1803
Labels: enhancement (New feature or request)

Comments

@dlvenable (Member)

Is your feature request related to a problem? Please describe.

The s3 source includes two codecs in 1.5, and a new codec for CSV processing is coming in 2.0. These populate Events somewhat differently.

  • newline-delimited -> Each line is saved to the message key of the Event as a single string.
  • json -> The JSON is expanded into message. So, if the JSON has a key named sourceIp, it is populated in /message/sourceIp.
  • csv -> Each key is expanded directly into the root of the Event (/). Thus, if the CSV has a key named sourceIp, it is populated in /sourceIp.

Also, the s3 source adds two special keys to all Events: bucket and key. These indicate the S3 bucket and key, respectively, for the object. The s3 source populates these, not the codecs.
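The three behaviors above can be sketched with hypothetical events for the same record. This is purely an illustration: Data Prepper's real Event is a Java object, not a Python dict, and the bucket and object key values are made up.

```python
# Hypothetical sketch of how each codec populates an Event for the same
# S3 object line '{"sourceIp": "10.0.0.1"}' (illustration only).

record = '{"sourceIp": "10.0.0.1"}'

# newline-delimited: the whole line lands in the message key as one string.
newline_event = {"message": record, "bucket": "my-bucket", "key": "logs/a.json"}

# json: the parsed object is expanded under message.
json_event = {"message": {"sourceIp": "10.0.0.1"}, "bucket": "my-bucket", "key": "logs/a.json"}

# csv: each column is expanded directly into the root of the Event.
csv_event = {"sourceIp": "10.0.0.1", "bucket": "my-bucket", "key": "logs/a.json"}

# The JSON codec exposes the field at /message/sourceIp ...
assert json_event["message"]["sourceIp"] == "10.0.0.1"
# ... while the CSV codec exposes the same field at /sourceIp.
assert csv_event["sourceIp"] == "10.0.0.1"
```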

Describe the solution you'd like

First, all codecs should put the data in the same place consistently. Second, we should decide where we want this data to reside (/message or /). Third, whichever location we choose should avoid conflicting with the bucket and key fields.

One possible solution is to change the s3 source to save the bucket and key to a top-level object named s3, while the codecs save their data to the root (/). This could lead to conflicts if the actual data has a column or field named s3, but if we make this key configurable, pipeline authors could avoid such conflicts.
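The proposed layout can be sketched as follows. This is a hedged illustration, not Data Prepper's implementation: the build_event helper, the parameter names, and the "aws_s3" alternative key are all hypothetical.

```python
# Sketch of the proposal: codecs write data to the root (/) and the s3
# source nests bucket/key under a configurable metadata key ("s3" by
# default). Illustration only; names other than bucket/key/s3 are invented.

def build_event(data: dict, bucket: str, key: str, metadata_key: str = "s3") -> dict:
    if metadata_key in data:
        raise ValueError(
            f"data already contains '{metadata_key}'; "
            "configure a different metadata key to avoid the conflict"
        )
    return {**data, metadata_key: {"bucket": bucket, "key": key}}

event = build_event({"sourceIp": "10.0.0.1"}, "my-bucket", "logs/a.json")
assert event["sourceIp"] == "10.0.0.1"
assert event["s3"] == {"bucket": "my-bucket", "key": "logs/a.json"}

# If the data itself has an "s3" column, the pipeline author can choose
# a different metadata key to sidestep the conflict:
event2 = build_event({"s3": "value"}, "my-bucket", "logs/a.json", metadata_key="aws_s3")
assert event2["s3"] == "value"
assert event2["aws_s3"]["key"] == "logs/a.json"
```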

Describe alternatives you've considered (Optional)

An alternative would be more robust support for Event metadata. The bucket and key could be saved as metadata. However, Data Prepper's conditional routing and processors don't support Event metadata presently.


@asifsmohammed (Collaborator)

I like the suggestion that where we save the data should be configurable to avoid conflicts.

@finnroblin (Contributor)

I also like the suggestion that the destination (message or /) is configurable. Many processors use message as the default source, so changing the csv codec to write keys to message by default would preserve consistency with the other codecs and with other sources.

From my perspective, it's easier to deal with shallow (root-level) fields in OpenSearch and other processors because message/ doesn't need to be prepended to each field when applying a transformation to it. However, other users might have a different perspective and use case, so making this configurable could help everyone.

I like the suggestion that the key and bucket live within a top-level s3 key. We could get around conflicts by moving any conflicting data key to a longer path: if s3 is a key in the data and the data is configured to write to the root, then write the data's value to message/s3 and keep s3/bucket and s3/key for the metadata.
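The conflict-avoidance idea in the comment above can be sketched like this. Again an illustration only: the merge_with_rename helper and its behavior are hypothetical, not something Data Prepper implements.

```python
# Sketch of the rename-on-conflict idea: if the data contains a key that
# collides with the metadata key ("s3"), relocate the data's value under
# message/ and keep the S3 metadata at the top level. Illustration only.

def merge_with_rename(data: dict, s3_meta: dict, metadata_key: str = "s3") -> dict:
    event = dict(data)
    if metadata_key in event:
        # Move the conflicting data field to message/<metadata_key>.
        event.setdefault("message", {})[metadata_key] = event.pop(metadata_key)
    event[metadata_key] = s3_meta
    return event

event = merge_with_rename({"s3": "data-value", "sourceIp": "10.0.0.1"},
                          {"bucket": "my-bucket", "key": "logs/a.json"})
assert event["message"]["s3"] == "data-value"   # data value relocated
assert event["s3"]["bucket"] == "my-bucket"     # metadata kept at s3/
assert event["sourceIp"] == "10.0.0.1"          # non-conflicting keys untouched
```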

@dlvenable dlvenable added this to the v2.0 milestone Sep 9, 2022
@dlvenable dlvenable added enhancement New feature or request and removed untriaged labels Sep 9, 2022
@dlvenable dlvenable self-assigned this Sep 24, 2022
dlvenable added a commit to dlvenable/data-prepper that referenced this issue Sep 24, 2022
dlvenable added a commit to dlvenable/data-prepper that referenced this issue Sep 24, 2022
dlvenable added a commit that referenced this issue Sep 26, 2022
Support a configurable key for S3 metadata and make this base key s3/ by default. Moved the output of JSON from S3 objects into the root of the Event from the message key. Resolves #1687

Signed-off-by: David Venable <[email protected]>
3 participants