Remove support for S3 Select
electrum committed Sep 6, 2023
1 parent a231676 commit bcbfd39
Showing 49 changed files with 244 additions and 3,200 deletions.
82 changes: 0 additions & 82 deletions docs/src/main/sphinx/connector/hive-s3.md
@@ -312,85 +312,3 @@ classpath and must be able to communicate with your custom key management system
the `org.apache.hadoop.conf.Configurable` interface from the Hadoop Java API, then the Hadoop configuration
is passed in after the object instance is created, and before it is asked to provision or retrieve any
encryption keys.

(s3selectpushdown)=

## S3 Select pushdown

S3 Select pushdown enables pushing down projection (SELECT) and predicate (WHERE)
processing to [S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html).
With S3 Select Pushdown, Trino only retrieves the required data from S3 instead
of entire S3 objects, reducing both latency and network usage.
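
The effect is easiest to see with a selective query. The following sketch is
illustrative only: the catalog, schema, table, and column names are invented,
and it assumes the Trino CLI is available as `trino`.

```bash
# Hypothetical query shape that benefits from pushdown: only two columns are
# projected and the predicate is highly selective, so S3 Select can return a
# small slice of each JSON object instead of the whole file.
trino --execute "
    SELECT request_id, status
    FROM hive.web.access_logs
    WHERE status >= 500"
```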

### Is S3 Select a good fit for my workload?

Performance of S3 Select pushdown depends on the amount of data filtered by the
query. Filtering a large number of rows should result in better performance. If
the query doesn't filter any data, then pushdown may not add any value, and you
are still charged for S3 Select requests. We therefore recommend that you
benchmark your workloads with and without S3 Select to see whether it is
suitable for them. By default, S3 Select pushdown is disabled; enable it in
production only after proper benchmarking and cost analysis. For more
information on S3 Select request cost, see
[Amazon S3 Cloud Storage Pricing](https://aws.amazon.com/s3/pricing/).

Use the following guidelines to determine if S3 Select is a good fit for your
workload:

- Your query filters out more than half of the original data set.
- Your query filter predicates use columns that have a data type supported by
Trino and S3 Select.
The `TIMESTAMP`, `DECIMAL`, `REAL`, and `DOUBLE` data types are not
supported by S3 Select Pushdown. For more information about supported data
types for S3 Select, see the
[Data Types documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-data-types.html).
- Your network connection between Amazon S3 and your Trino cluster has good
transfer speed and available bandwidth. Amazon S3 Select does not compress
HTTP responses, so the response size may increase for compressed input files.

### Considerations and limitations

- Only objects stored in JSON format are supported. Objects can be uncompressed
  or compressed with gzip or bzip2.
- The `AllowQuotedRecordDelimiters` property is not supported. If this property
  is specified, the query fails.
- Amazon S3 server-side encryption with customer-provided encryption keys
(SSE-C) and client-side encryption are not supported.
- S3 Select Pushdown is not a substitute for using columnar or compressed file
formats such as ORC and Parquet.

### Enabling S3 Select pushdown

You can enable S3 Select pushdown using the `s3_select_pushdown_enabled`
Hive session property, or using the `hive.s3select-pushdown.enabled`
configuration property. The session property overrides the configuration
property, allowing you to enable or disable it on a per-query basis.
Non-filtering queries (`SELECT * FROM table`) are not pushed down to
S3 Select, as they retrieve the entire object content.
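
A minimal sketch of both approaches, assuming a catalog named `hive` whose
properties file is `etc/catalog/hive.properties` and that the Trino CLI is
available as `trino`; the table names are invented:

```bash
# Enable pushdown for the whole catalog (takes effect after a restart).
cat >> etc/catalog/hive.properties <<'EOF'
hive.s3select-pushdown.enabled=true
EOF

# Or toggle it per session. Catalog session properties are referenced as
# <catalog>.<property>, so this assumes the catalog is named "hive".
trino --execute "
    SET SESSION hive.s3_select_pushdown_enabled = true;
    SELECT l_orderkey FROM hive.tpch.lineitem WHERE l_quantity > 49"
```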

For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
requests run across the byte ranges of the internal Hive splits for the query fragments
pushed down to S3 Select. Changes in the Hive connector {ref}`performance tuning
configuration properties <hive-performance-tuning-configuration>` are likely to impact
S3 Select pushdown performance.
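
For example, the split sizing properties from that section change how many
scan range requests a query issues. A hedged sketch, assuming the catalog
properties file is `etc/catalog/hive.properties`; the values are illustrative,
not recommendations:

```bash
# Larger splits mean fewer, larger S3 Select scan-range requests; smaller
# splits mean more parallel requests. Example values only.
cat >> etc/catalog/hive.properties <<'EOF'
hive.max-initial-split-size=32MB
hive.max-split-size=128MB
EOF
```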

S3 Select can be enabled for TEXTFILE data using the
`hive.s3select-pushdown.experimental-textfile-pushdown-enabled` configuration
property; however, this has been shown to produce incorrect results. For more
information, see [the GitHub issue](https://github.com/trinodb/trino/issues/17775).

### Understanding and tuning the maximum connections

Trino can use its native S3 file system or EMRFS. When using the native file
system, the maximum number of connections is configured via the
`hive.s3.max-connections` configuration property. When using EMRFS, the
maximum number of connections is configured via the `fs.s3.maxConnections`
Hadoop configuration property.

S3 Select pushdown bypasses the file systems when accessing Amazon S3 for
predicate operations. In this case, the value of
`hive.s3select-pushdown.max-connections` determines the maximum number of
client connections allowed for those operations from worker nodes.

If your workload experiences the error *Timeout waiting for connection from
pool*, increase the value of both `hive.s3select-pushdown.max-connections` and
the maximum connections configuration for the file system you are using.
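
A hedged sketch of raising both pools together, again assuming
`etc/catalog/hive.properties` for the native file system; the values are
placeholders rather than recommendations:

```bash
# Native S3 file system pool and the separate S3 Select pool.
cat >> etc/catalog/hive.properties <<'EOF'
hive.s3.max-connections=1000
hive.s3select-pushdown.max-connections=1000
EOF

# With EMRFS, the file-system pool is a Hadoop property instead, typically
# set in core-site.xml rather than in the catalog properties file:
#   fs.s3.maxConnections=1000
```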
10 changes: 0 additions & 10 deletions docs/src/main/sphinx/connector/hive.md
@@ -253,16 +253,6 @@ Hive connector documentation.
- Enables automatic column level statistics collection on write. See
`Table Statistics <#table-statistics>`__ for details.
- ``true``
* - ``hive.s3select-pushdown.enabled``
- Enable query pushdown to JSON files using the AWS S3 Select service.
- ``false``
* - ``hive.s3select-pushdown.experimental-textfile-pushdown-enabled``
- Enable query pushdown to TEXTFILE tables using the AWS S3 Select service.
- ``false``
* - ``hive.s3select-pushdown.max-connections``
- Maximum number of simultaneously open connections to S3 for
:ref:`s3selectpushdown`.
- 500
* - ``hive.file-status-cache-tables``
- Cache directory listing for specific tables. Examples:
2 changes: 1 addition & 1 deletion docs/src/main/sphinx/release/release-300.md
@@ -46,7 +46,7 @@
(e.g., min > max). To disable this behavior, set the configuration
property `hive.parquet.fail-on-corrupted-statistics`
or session property `parquet_fail_with_corrupted_statistics` to false.
- Add support for {ref}`s3selectpushdown`, which enables pushing down
- Add support for S3 Select pushdown, which enables pushing down
column selection and range filters into S3 for text files.

## Kudu connector
@@ -83,7 +83,6 @@ private void bindSecurityMapping(Binder binder)
newSetBinder(binder, DynamicConfigurationProvider.class).addBinding()
.to(S3SecurityMappingConfigurationProvider.class).in(Scopes.SINGLETON);

checkArgument(!getProperty("hive.s3select-pushdown.enabled").map(Boolean::parseBoolean).orElse(false), "S3 security mapping is not compatible with S3 Select pushdown");
checkArgument(!buildConfigObject(RubixEnabledConfig.class).isCacheEnabled(), "S3 security mapping is not compatible with Hive caching");
}

67 changes: 0 additions & 67 deletions plugin/trino-hive-hadoop2/bin/run_hive_s3_select_json_tests.sh

This file was deleted.

32 changes: 0 additions & 32 deletions plugin/trino-hive-hadoop2/bin/run_hive_s3_tests.sh
@@ -46,38 +46,6 @@ exec_in_hadoop_master_container /usr/bin/hive -e "
LOCATION '${table_path}'
TBLPROPERTIES ('skip.header.line.count'='2', 'skip.footer.line.count'='2')"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_external_fs_with_pipe_delimiter/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container hadoop fs -put -f /docker/files/test_table_with_pipe_delimiter.csv{,.gz,.bz2} "${table_path}"
exec_in_hadoop_master_container /usr/bin/hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_external_fs_with_pipe_delimiter(t_bigint bigint, s_bigint bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${table_path}'"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_external_fs_with_comma_delimiter/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container hadoop fs -put -f /docker/files/test_table_with_comma_delimiter.csv{,.gz,.bz2} "${table_path}"
exec_in_hadoop_master_container /usr/bin/hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_external_fs_with_comma_delimiter(t_bigint bigint, s_bigint bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${table_path}'"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_csv_scan_range_pushdown/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container /docker/files/hadoop-put.sh /docker/files/test_table_csv_scan_range_select_pushdown_{1,2,3}.csv "${table_path}"
exec_in_hadoop_master_container sudo -Eu hive beeline -u jdbc:hive2://localhost:10000/default -n hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_csv_scan_range_pushdown(index bigint, id string, value1 bigint, value2 bigint, value3 bigint,
value4 bigint, value5 bigint, title string, firstname string, lastname string, flag string, day bigint,
month bigint, year bigint, country string, comment string, email string, identifier string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${table_path}'"

stop_unnecessary_hadoop_services

# restart hive-metastore to apply S3 changes in core-site.xml
23 changes: 0 additions & 23 deletions plugin/trino-hive-hadoop2/pom.xml
@@ -221,10 +221,6 @@
<exclude>**/TestHive.java</exclude>
<exclude>**/TestHiveThriftMetastoreWithS3.java</exclude>
<exclude>**/TestHiveFileSystemS3.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectPushdown.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectJsonPushdown.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectCsvPushdownWithSplits.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectJsonPushdownWithSplits.java</exclude>
<exclude>**/TestHiveFileSystemWasb.java</exclude>
<exclude>**/TestHiveFileSystemAbfsAccessKey.java</exclude>
<exclude>**/TestHiveFileSystemAbfsOAuth.java</exclude>
@@ -263,25 +259,6 @@
<includes>
<include>**/TestHiveThriftMetastoreWithS3.java</include>
<include>**/TestHiveFileSystemS3.java</include>
<include>**/TestHiveFileSystemS3SelectPushdown.java</include>
<include>**/TestHiveFileSystemS3SelectCsvPushdownWithSplits.java</include>
</includes>
</configuration>
</plugin>
</plugins>
</build>
</profile>
<profile>
<id>test-hive-hadoop2-s3-select-json</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<includes>
<include>**/TestHiveFileSystemS3SelectJsonPushdown.java</include>
<include>**/TestHiveFileSystemS3SelectJsonPushdownWithSplits.java</include>
</includes>
</configuration>
</plugin>
@@ -66,7 +66,6 @@ protected void setup(String host, int port, String databaseName, String containe
checkParameter(host, "host"),
port,
checkParameter(databaseName, "database name"),
false,
createHdfsConfiguration());
}

