Support S3 Select on native readers #17522

alexjo2144 · 2023-05-16T19:55:54Z

Description

Add S3 Select pushdown support to native readers for JSON and CSV.

Based on: #17563
I am not sure we want to do this until fixing the rest of the issues in: #17775

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

alexjo2144 · 2023-05-16T19:56:37Z

@trinodb/maintainers someone mind kicking off a build with secrets? I couldn't get the S3 tests running locally for some reason.

findepi · 2023-05-16T20:59:39Z

/test-with-secrets sha=71aa2ee869ce9fcf5b6da2e21d3ac395b6871f3f

github-actions · 2023-05-16T23:55:36Z

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4997055706

electrum · 2023-07-06T19:33:46Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPageSourceProvider.java

+import static java.lang.Math.toIntExact;
+import static java.util.Objects.requireNonNull;
+
+public class S3SelectPageSourceProvider


Name this S3SelectPageSourceFactory to match the others

electrum · 2023-07-06T19:46:35Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectTrinoInput.java

+        long start = request.getScanRange().getStart();
+        SelectObjectContentRequest contentRequest = request.withScanRange(new ScanRange()
+                .withStart(start)
+                .withEnd(start + length));


This seems wrong. readTail() is intended to read (up to) the last N bytes of the file. It is used for reading the footer of ORC and Parquet files.

Per the ScanRange docs:

If only the End parameter is supplied, it is interpreted to mean scan the last N bytes of the file.

I believe the correct implementation would be

SelectObjectContentRequest contentRequest = request.withScanRange(new ScanRange().withEnd(length));

However, we shouldn't need to implement this method as it won't be used.
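To make the intended readTail() semantics concrete, here is a minimal in-memory sketch. TailReadSketch and its byte-array "file" are hypothetical stand-ins, not Trino or AWS SDK classes; the sketch only illustrates "read up to the last N bytes", which is the contract the comment above describes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: a byte array plays the role of the remote object.
final class TailReadSketch {
    private TailReadSketch() {}

    // Copies up to `length` bytes from the END of `file` into `buffer`,
    // starting at `offset`, and returns the number of bytes copied.
    static int readTail(byte[] file, byte[] buffer, int offset, int length) {
        // A file shorter than the requested tail yields the whole file.
        int count = Math.min(length, file.length);
        int start = file.length - count;
        System.arraycopy(file, start, buffer, offset, count);
        return count;
    }

    public static void main(String[] args) {
        byte[] file = "header,body,FOOTER".getBytes(StandardCharsets.US_ASCII);
        byte[] buffer = new byte[6];
        int read = readTail(file, buffer, 0, buffer.length);
        System.out.println(read + " " + new String(buffer, 0, read, StandardCharsets.US_ASCII));
    }
}
```

The range starts at the file length minus the requested length, not at the current scan start, which is the difference from the code under review.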

electrum · 2023-07-06T19:50:24Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectTrinoInput.java

+    }
+
+    @Override
+    public void readFully(long position, byte[] buffer, int offset, int length)


Is this method used? TextLineReader is created via TextLineReaderFactory which does inputFile.newStream(), so I think we could change S3SelectInputFile.newInput() to throw UnsupportedOperationException. Text files don't use positioned reads.
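A rough sketch of what a streaming-only input file could look like. StreamOnlyInputFile is a hypothetical class that does not implement the real TrinoInputFile interface, and the exception message is an assumption; it only shows the shape of "newStream() works, newInput() is unsupported".

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical stand-in for an input file that backs text formats only.
final class StreamOnlyInputFile {
    private final byte[] content;

    StreamOnlyInputFile(byte[] content) {
        this.content = content.clone();
    }

    // Text line readers consume the file sequentially through a stream.
    InputStream newStream() {
        return new ByteArrayInputStream(content);
    }

    // Positioned reads are only needed by columnar formats such as ORC and
    // Parquet, so this sketch rejects them outright.
    Object newInput() {
        throw new UnsupportedOperationException("Positioned reads are not supported for S3 Select");
    }
}
```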

electrum · 2023-07-06T19:51:09Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectUtils.java

+            TupleDomain<HiveColumnHandle> effectivePredicate,
+            List<HiveColumnHandle> readerColumns)
+    {
+        //There are no effective predicates and readercolumns and columntypes are identical to schema


Nit: space after //

electrum · 2023-07-06T19:51:44Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectUtils.java

+    private S3SelectUtils()
+    { }


private S3SelectUtils() {}

electrum · 2023-07-06T19:55:29Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPageSourceProvider.java

+            default -> throw new IllegalStateException("Unknown s3 select data type: " + s3SelectDataType);
+        }
+
+        if (!lineReaderFactory.getHiveOutputFormatClassName().equals(schema.getProperty(FILE_INPUT_FORMAT)) ||


When would this return false? S3SelectSerDeDataTypeMapper already maps from the serde, so it seems strange that we need to check this again.

electrum · 2023-07-06T20:02:48Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectInputStream.java

+
+    private void closeStream()
+    {
+        if (input == null) {


Let's remove this check. It's legal to use try-with-resources on a null reference, and we want to close the client regardless of whether there is an input stream. Also, we never actually set it to null.

electrum · 2023-07-06T20:05:32Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectInputStream.java

+        catch (IOException ignored) {
+        }
+        finally {
+            input = null;


We could remove this, since we otherwise never set it to null, and it's legal to close a Closeable multiple times (not that we do here).

electrum · 2023-07-06T20:05:59Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectInputStream.java

+            return;
+        }
+        closed = true;
+        closeStream();


We can inline this method since it's only used once
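Taken together, the three comments on closeStream() suggest a close() roughly like the following sketch. SelectStreamSketch and its Closeable client are hypothetical stand-ins for S3SelectInputStream and the S3 Select client. Unlike the code under review, this version lets an IOException from close propagate instead of swallowing it; that is a choice made for the sketch, not something the review asked for.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

final class SelectStreamSketch extends InputStream {
    private final Closeable client;  // hypothetical stand-in for the S3 Select client
    private final InputStream input; // may be null if no request was ever started
    private boolean closed;

    SelectStreamSketch(Closeable client, InputStream input) {
        this.client = java.util.Objects.requireNonNull(client, "client is null");
        this.input = input;
    }

    @Override
    public int read() throws IOException {
        if (closed) {
            throw new IOException("Stream closed");
        }
        return (input == null) ? -1 : input.read();
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        // try-with-resources skips resources that are null, so no explicit
        // null check is needed. Resources close in reverse order: input
        // first, then the client, and the client is closed even when
        // closing the input throws.
        try (Closeable ignoredClient = client; InputStream ignoredInput = input) {
            // nothing to do; the block exists only to close both resources
        }
    }
}
```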

electrum · 2023-07-06T20:11:51Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectInputStream.java

+        }
+
+        // for negative seek, reopen the file
+        if (position < this.position) {


I don't think we need to support negative seek for text files. Also, it seems like TrinoS3SelectClient is designed such that you can only call getRecordsContent() once. Otherwise, the SelectObjectContentResult won't be closed and the requestComplete flag won't be handled correctly.

So, we should probably fail here:

throw new IOException("Negative seek is not supported for S3 Select");
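A minimal sketch of a forward-only seek along these lines. ForwardOnlySeekSketch is a hypothetical wrapper over a plain InputStream, not the real S3SelectInputStream, and the end-of-stream handling is an assumption added to keep the example complete.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ForwardOnlySeekSketch {
    private final InputStream input;
    private long position;

    ForwardOnlySeekSketch(InputStream input) {
        this.input = input;
    }

    long getPosition() {
        return position;
    }

    void seek(long newPosition) throws IOException {
        // The underlying result stream can be consumed only once, so
        // moving backwards is rejected instead of reopening the request.
        if (newPosition < position) {
            throw new IOException("Negative seek is not supported for S3 Select");
        }
        long remaining = newPosition - position;
        while (remaining > 0) {
            long skipped = input.skip(remaining);
            if (skipped <= 0) {
                // skip() may legitimately return 0; read one byte to
                // distinguish "no progress yet" from end of stream.
                if (input.read() < 0) {
                    throw new EOFException("Cannot seek past end of stream");
                }
                skipped = 1;
            }
            position += skipped;
            remaining -= skipped;
        }
    }

    int read() throws IOException {
        int value = input.read();
        if (value >= 0) {
            position++;
        }
        return value;
    }
}
```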

ankushChatterjee · 2023-07-07T08:39:10Z

@alexjo2144 Since this is a new implementation, shouldn't AWS SDK V2 be used to implement this?

If it helps, I have a commit converting the current S3 Select code to V2: ankushChatterjee@1d67a14

(I have done some basic sanity testing)

This is based off the branch from PR: #17866

electrum · 2023-07-11T02:16:13Z

@ankushChatterjee It is a new implementation, but it reuses TrinoS3ClientFactory, so converting that to SDK V2 is out of scope for this PR. Can you put up a PR with your change? If it's ready, we can merge it now and rebase this PR on top.

Note that the old S3 Select implementation will be removed when we remove all of the Hive reader code.

ankushChatterjee · 2023-07-11T12:08:56Z

@electrum thanks for the response. The branch is currently based on PR #17866, which is not merged yet. I will cherry-pick the commit onto master, create a new branch with the changes, and raise a PR.

alexjo2144 · 2023-07-11T19:24:01Z

It is a new implementation, but it reuses TrinoS3ClientFactory

I may need to re-implement this anyway, because of conflicts with #18146

The TrinoS3ClientFactory works by passing S3 relevant properties through the Configuration object, but I'll need to switch it out. I'm not exactly sure how to pipe those in now, but I'll give it a stab.

ankushChatterjee · 2023-07-13T06:58:11Z

Raised PR #18270 for the SDK v2 migration of the old code.

electrum · 2023-07-31T01:24:42Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3FileSystem.java

@@ -44,16 +47,18 @@
 import static com.google.common.collect.Multimaps.toMultimap;
 import static java.util.Objects.requireNonNull;

-final class S3FileSystem
+public final class S3FileSystem


Need to make the constructor package-private with this change

electrum · 2023-07-31T01:25:04Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3FileSystemFactory.java

+                config.isRequesterPays(),
+                config.getSseType(),
+                config.getSseKmsKeyId());
+


Remove trailing blank line

electrum · 2023-07-31T01:25:44Z

lib/trino-filesystem-s3/pom.xml

@@ -73,6 +73,11 @@
            </exclusions>
        </dependency>

+        <dependency>
+            <groupId>software.amazon.awssdk</groupId>
+            <artifactId>aws-crt-client</artifactId>


We don't want to use AWS CRT since it's in C and thus goes through JNI

electrum · 2023-07-31T18:13:18Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3SelectInputFile.java

+    @Override
+    public TrinoInputStream newStream()
+    {
+        return null; // new S3InputStream(location(), client, newGetObjectRequest(), length);


This seems backwards. Since S3 Select is only for text formats, we should only need the input stream version. TrinoInput is only used by ORC and Parquet.

electrum · 2023-07-31T18:14:08Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3SelectInputFile.java

+        return selectObjectContentRequest.build();
+    }
+
+    private boolean headObject()


We could reuse or delegate to S3InputFile

electrum · 2023-07-31T18:53:14Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3SelectInput.java

+                    this.isDone = this.next.isEmpty();
+                }
+                catch (InterruptedException e) {
+                    throw new TrinoException(StandardErrorCode.GENERIC_INTERNAL_ERROR, "Interrupted"); // TODO: Better error message


We can simply use RuntimeException here
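A short sketch of the suggested handling. InterruptSketch and takeUnchecked are hypothetical names, and the BlockingQueue stands in for whatever the real code waits on. Restoring the interrupt flag before rethrowing is a common Java convention added here as an assumption; the review comment itself only asks for RuntimeException.

```java
import java.util.concurrent.BlockingQueue;

final class InterruptSketch {
    private InterruptSketch() {}

    // Waits for the next element and converts an interrupt into an
    // unchecked exception.
    static <T> T takeUnchecked(BlockingQueue<T> queue) {
        try {
            return queue.take();
        }
        catch (InterruptedException e) {
            // Re-set the interrupt flag so callers further up the stack
            // can still observe that the thread was interrupted.
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for S3 Select records", e);
        }
    }
}
```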

electrum · 2023-07-31T18:53:39Z

lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3SelectInput.java

+    }
+
+    /**
+     * Below classes are required for compatibility between AWS Java SDK 1.x and 2.x


This comment doesn't seem relevant in the context of a new implementation

mosabua · 2023-08-11T22:33:18Z

Should we close this?

cla-bot bot added the cla-signed label May 16, 2023

github-actions bot added hive Hive connector tests:hive labels May 16, 2023

alexjo2144 force-pushed the hive/s3-select-migration branch 2 times, most recently from 34f1a78 to 677ad44 Compare June 8, 2023 20:31

alexjo2144 requested review from electrum and dain June 8, 2023 21:20

electrum reviewed Jul 6, 2023

alexjo2144 added 3 commits July 10, 2023 14:00

Fix JSON S3 select queries with quote characters

766a546

Disable S3 Select pushdown on decimal columns

3e73bca

Minio tests produced the correct results, however tests against a real S3 bucket did not.

Put S3 Select for CSV files behind an experimental flag

d5aea6b

S3 Select queries on CSV files are shown to have correctness problems. JSON files can still be enabled/disabled using the existing config and session properties.

alexjo2144 and others added 2 commits July 11, 2023 10:22

empty

c65d4ba

Support S3 Select on native readers

4b7197f

Add S3 Select pushdown support to native readers for JSON and CSV.

electrum mentioned this pull request Jul 11, 2023

Decouple Trino from Hadoop and Hive codebases #15921

Closed

ankushChatterjee mentioned this pull request Jul 13, 2023

Migrate S3 Select to use AWS Java SDK v2 #18270

Closed

WIP

0406c54

alexjo2144 force-pushed the hive/s3-select-migration branch from 677ad44 to 0406c54 Compare July 14, 2023 15:18

github-actions bot added the docs label Jul 14, 2023

electrum reviewed Jul 31, 2023

alexjo2144 closed this Sep 1, 2023

kekwan mentioned this pull request Oct 21, 2024

Will Trino S3 file system support AWS Common Runtime (CRT) #23855

Open
