Add new InputCodec interface to support seek-able input and corresponding implementation and tests for S3 objects #2727

Merged
merged 27 commits into opensearch-project:main from parquet-codecs
Jun 5, 2023

Conversation

asuresh8
Contributor

Description

Adds a new codec interface that supports seeking through an input, along with a corresponding implementation and tests for reading S3 objects this way.

Issues Resolved

#1532

Check List

  • New functionality includes testing.
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* @param eventConsumer The consumer which handles each event from the stream
* @throws IOException throws IOException when invalid input is received or incorrect codec name is provided
*/
void parse(InputFile inputFile, Consumer<Record<Event>> eventConsumer) throws IOException;
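For illustration, a minimal sketch of how a source might drive this overload; the package names and the driver class below are assumptions, not code from this PR:

import java.io.IOException;
import java.util.function.Consumer;

import org.opensearch.dataprepper.model.codec.InputCodec;
import org.opensearch.dataprepper.model.event.Event;
import org.opensearch.dataprepper.model.io.InputFile;
import org.opensearch.dataprepper.model.record.Record;

class CodecDriver {
    // Hands a seekable file to the codec; each decoded event goes to the
    // consumer, e.g. a buffer writer inside the S3 source.
    void readFile(final InputCodec codec,
                  final InputFile inputFile,
                  final Consumer<Record<Event>> eventConsumer) throws IOException {
        codec.parse(inputFile, eventConsumer);
    }
}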
Member

We shouldn't add a Parquet file in here since this couples the whole Data Prepper API with Parquet.

Ideally, we have a simpler model here and then adapt that to the Parquet InputFile in the parquet-codecs project.

Member

We also may need to consider how to let the InputCodec advertise whether or not it requires being seekable. We are planning to use the InputCodec in other sources as well, which may not always be seekable. For example, the Kafka source may use them (though that is probably seekable, as it may just be a byte array in the end).

Contributor Author

Updated to have separate interfaces, InputFile and OutputFile, that extend the Parquet InputFile and OutputFile interfaces.

I don't think the InputCodec should need to advertise whether it requires seeking or not. In general, InputStreams are not seek-able; they should be read sequentially. This is especially true of Kafka consumers, where data should be read sequentially. Files, on the other hand, are not required to be read sequentially.

I like the idea of presenting either as an option. Kafka is a good example of a stream being more efficient; local files, S3, or HDFS are good examples of a file being more efficient. The decision of which to use should be made in the source or sink, depending on which is more efficient.
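A minimal sketch of what presenting both entry points could look like (javadoc trimmed, package names assumed; this shows the shape rather than quoting the final change):

import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

import org.opensearch.dataprepper.model.event.Event;
import org.opensearch.dataprepper.model.io.InputFile;
import org.opensearch.dataprepper.model.record.Record;

public interface InputCodec {
    // Sequential path: fits stream-oriented sources such as Kafka.
    void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException;

    // Random-access path: fits file-backed sources (local files, S3, HDFS)
    // where formats like Parquet must first seek to the footer.
    void parse(InputFile inputFile, Consumer<Record<Event>> eventConsumer) throws IOException;
}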

Member

Thanks for making some of these changes. I'm thinking of having Parquet entirely removed from the data-prepper-api and s3-source projects.

We can have an interface in data-prepper-api which represents something similar (hopefully with only what we really need). There can be an S3 implementation similar to what you have presently. And then finally, the parquet-codecs project can adapt that interface into the Parquet InputFile model.

I'll see if I can put together something a little more concrete and get back to you on it.
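For concreteness, the kind of minimal, Parquet-free interface being floated might look like this (a sketch assuming only a length and a seekable stream are needed; SeekableInputStream would be a matching Data Prepper abstraction, sketched further down this thread):

import java.io.IOException;

// Lives in data-prepper-api with no Parquet dependency: just enough for
// codecs that need random access to the underlying bytes.
public interface InputFile {
    // Total length of the underlying object, needed by footer-first formats.
    long getLength() throws IOException;

    // Opens a new stream that supports seek() and getPos().
    SeekableInputStream newStream() throws IOException;
}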

Contributor Author

I agree that dependency management can be done better, as we are now pulling in Parquet MapReduce dependencies, but I don't think that's a blocker. We can always go back and clean up the dependencies by adding our own implementations later on.

Member

I think a simple fix that can be applied here is:

  • Copy the existing Parquet interfaces and mostly re-use them. There may be some opportunity to simplify, as they have both byte[] and ByteBuffer read methods that seem similar (see the sketch below). This will remove Parquet from data-prepper-api, which is the most important part.
  • Keep the dependency within the S3 source for now. We can fix this part in a later PR.
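As a sketch of that simplification (an assumption about where it could land, not settled API), the copied stream abstraction might keep only the ByteBuffer read family:

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// A Parquet-free seekable stream: drops the parallel byte[] methods and
// keeps a single ByteBuffer-based read family.
public abstract class SeekableInputStream extends InputStream {
    public abstract long getPos() throws IOException;

    public abstract void seek(long newPos) throws IOException;

    // Reads up to buf.remaining() bytes; returns -1 at end of stream.
    public abstract int read(ByteBuffer buf) throws IOException;

    // Reads exactly buf.remaining() bytes or fails with an exception.
    public abstract void readFully(ByteBuffer buf) throws IOException;
}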

Contributor Author

I don't think this is that simple. We definitely want to use the ParquetReader library because it handles a lot of situations for us; re-implementing it would be a lot of unnecessary work. To use the library, files just need to implement InputFile, which is a pretty minimal ask. The only dependency is parquet-common.

We could build a set of converters to avoid this dependency, but that adds another layer of complexity, because each implementation of our InputFile would need to map to a Parquet implementation of InputFile. It's much simpler to ensure that our implementation of an InputFile is itself a Parquet InputFile.
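Roughly the shape being defended here (package name assumed):

// Depends only on parquet-common. Because the Data Prepper interface
// extends Parquet's, any implementation (e.g. an S3-backed file) can be
// handed straight to ParquetReader with no adapter layer.
public interface InputFile extends org.apache.parquet.io.InputFile {
}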

Member

I'm not asking to re-implement the ParquetReader. Only adapting between the two interfaces.

So, create a new interface in data-prepper-api. Then create an adapter which adapts the Parquet InputFile to the data-prepper-api InputFile.
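A sketch of that adapter (class names illustrative; it would live in parquet-codecs, so parquet-common stays out of data-prepper-api):

import java.io.IOException;

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.SeekableInputStream;
import org.opensearch.dataprepper.model.io.InputFile;

// Wraps the Parquet-free data-prepper-api InputFile so that ParquetReader
// can consume it; only the parquet-codecs project sees Parquet types.
public class ParquetInputFileAdapter implements org.apache.parquet.io.InputFile {
    private final InputFile delegate;

    public ParquetInputFileAdapter(final InputFile delegate) {
        this.delegate = delegate;
    }

    @Override
    public long getLength() throws IOException {
        return delegate.getLength();
    }

    @Override
    public SeekableInputStream newStream() throws IOException {
        final org.opensearch.dataprepper.model.io.SeekableInputStream inner = delegate.newStream();
        // DelegatingSeekableInputStream supplies the read methods; only
        // position tracking is forwarded to the wrapped stream.
        return new DelegatingSeekableInputStream(inner) {
            @Override
            public long getPos() throws IOException {
                return inner.getPos();
            }

            @Override
            public void seek(final long newPos) throws IOException {
                inner.seek(newPos);
            }
        };
    }
}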

Contributor Author

That's a matter of opinion. It's not a blocker for adding this feature. You are welcome to make that change on top of this change.

@asuresh8 asuresh8 force-pushed the parquet-codecs branch 2 times, most recently from aa9c5ae to 1654410 Compare May 23, 2023 19:35
@asuresh8 asuresh8 requested a review from dlvenable May 23, 2023 19:39
@asuresh8 asuresh8 force-pushed the parquet-codecs branch 5 times, most recently from e65af5b to 5abd083 Compare May 30, 2023 22:11
chenqi0805
chenqi0805 previously approved these changes Jun 2, 2023
dinujoh
dinujoh previously approved these changes Jun 5, 2023
chenqi0805
chenqi0805 previously approved these changes Jun 5, 2023
dinujoh
dinujoh previously approved these changes Jun 5, 2023
asuresh8 added 20 commits June 5, 2023 12:27
The InputCodec currently only supports reading from an InputStream. This
interface does not work for inputs that require random read access. In
order to support random access reads in addition to sequential reads,
the InputCodec interface should support reading from an InputFile. For
convenience, the existing InputFile interface defined in the
parquet-common package is being used.

Signed-off-by: Adi Suresh <[email protected]>
Previously, S3 files could only be sequentially streamed. Random access
to an S3 file required downloading the file locally. This change
introduces an input stream on top of the S3 object that supports random
read access.

Additionally, the tests have been updated to test all codecs.

Signed-off-by: Adi Suresh <[email protected]>
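The underlying technique is an HTTP ranged GET per seek; a minimal sketch with the AWS SDK v2 (the class name is assumed, and the real implementation's buffering and metrics are omitted):

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Seekable stream over an S3 object: seek() just records an offset and
// drops the open stream; the next read issues a ranged GET from there.
class S3RangedInputStream extends InputStream {
    private final S3Client s3Client;
    private final String bucket;
    private final String key;
    private InputStream current;
    private long position;

    S3RangedInputStream(final S3Client s3Client, final String bucket, final String key) {
        this.s3Client = s3Client;
        this.bucket = bucket;
        this.key = key;
    }

    void seek(final long newPosition) throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
        position = newPosition;
    }

    @Override
    public int read() throws IOException {
        if (current == null) {
            current = s3Client.getObject(GetObjectRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .range("bytes=" + position + "-")  // open at the current offset
                    .build());
        }
        final int b = current.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }
}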
Instead of directly using the Parquet InputFile interface, codecs use
the Data Prepper InputFile interface, which implements the Parquet
InputFile interface.

Signed-off-by: Adi Suresh <[email protected]>
The decompression engine should be passed along with the InputFile in
cases where the file does not contain a hint as to whether a
decompression codec should be used or not.

Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
This change adds the correct JSON encoding logic and corresponding tests
to assert the correct behavior.

Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: Adi Suresh <[email protected]>
@dlvenable dlvenable dismissed their stale review June 5, 2023 19:37

We will have to fix this dependency issue later.

@dlvenable dlvenable merged commit 09a6497 into opensearch-project:main Jun 5, 2023
@asuresh8 asuresh8 deleted the parquet-codecs branch June 5, 2023 22:31