Remove support for S3 Select
electrum committed Sep 6, 2023
1 parent a231676 commit bcbfd39
Showing 49 changed files with 244 additions and 3,200 deletions.
82 changes: 0 additions & 82 deletions docs/src/main/sphinx/connector/hive-s3.md
@@ -312,85 +312,3 @@ classpath and must be able to communicate with your custom key management system
the `org.apache.hadoop.conf.Configurable` interface from the Hadoop Java API, then the Hadoop configuration
is passed in after the object instance is created, and before it is asked to provision or retrieve any
encryption keys.

(s3selectpushdown)=

## S3 Select pushdown

S3 Select pushdown enables pushing down projection (SELECT) and predicate (WHERE)
processing to [S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html).
With S3 Select Pushdown, Trino only retrieves the required data from S3 instead
of entire S3 objects, reducing both latency and network usage.
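
The effect is easiest to see with a selective query. The following sketch is
illustrative only: the catalog, schema, table, and column names are invented,
and it assumes the Trino CLI is available as `trino`.

```bash
# Hypothetical query shape that benefits from pushdown: only two columns are
# projected and the predicate is highly selective, so S3 Select can return a
# small slice of each JSON object instead of the whole file.
trino --execute "
    SELECT request_id, status
    FROM hive.web.access_logs
    WHERE status >= 500"
```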

### Is S3 Select a good fit for my workload?

Performance of S3 Select pushdown depends on the amount of data filtered by the
query. Filtering a large number of rows should result in better performance. If
the query doesn't filter any data, then pushdown may not add any value, and you
are still charged for S3 Select requests. We therefore recommend that you
benchmark your workloads with and without S3 Select to see whether it is
suitable for them. By default, S3 Select pushdown is disabled; enable it in
production only after proper benchmarking and cost analysis. For more
information on S3 Select request cost, see
[Amazon S3 Cloud Storage Pricing](https://aws.amazon.com/s3/pricing/).

Use the following guidelines to determine if S3 Select is a good fit for your
workload:

- Your query filters out more than half of the original data set.
- Your query filter predicates use columns that have a data type supported by
Trino and S3 Select.
The `TIMESTAMP`, `DECIMAL`, `REAL`, and `DOUBLE` data types are not
supported by S3 Select Pushdown. For more information about supported data
types for S3 Select, see the
[Data Types documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-data-types.html).
- Your network connection between Amazon S3 and your Trino cluster has good
transfer speed and available bandwidth. Amazon S3 Select does not compress
HTTP responses, so the response size may increase for compressed input files.

### Considerations and limitations

- Only objects stored in JSON format are supported. Objects can be uncompressed
  or compressed with gzip or bzip2.
- The `AllowQuotedRecordDelimiters` property is not supported. If this property
  is specified, the query fails.
- Amazon S3 server-side encryption with customer-provided encryption keys
(SSE-C) and client-side encryption are not supported.
- S3 Select Pushdown is not a substitute for using columnar or compressed file
formats such as ORC and Parquet.

### Enabling S3 Select pushdown

You can enable S3 Select pushdown using the `s3_select_pushdown_enabled`
Hive session property, or using the `hive.s3select-pushdown.enabled`
configuration property. The session property overrides the configuration
property, allowing you to enable or disable it on a per-query basis.
Non-filtering queries (`SELECT * FROM table`) are not pushed down to
S3 Select, as they retrieve the entire object content.
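
A minimal sketch of both approaches, assuming a catalog named `hive` whose
properties file is `etc/catalog/hive.properties` and that the Trino CLI is
available as `trino`; the table names are invented:

```bash
# Enable pushdown for the whole catalog (takes effect after a restart).
cat >> etc/catalog/hive.properties <<'EOF'
hive.s3select-pushdown.enabled=true
EOF

# Or toggle it per session. Catalog session properties are referenced as
# <catalog>.<property>, so this assumes the catalog is named "hive".
trino --execute "
    SET SESSION hive.s3_select_pushdown_enabled = true;
    SELECT l_orderkey FROM hive.tpch.lineitem WHERE l_quantity > 49"
```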

For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
requests run across the byte ranges of the internal Hive splits for the query fragments
pushed down to S3 Select. Changes in the Hive connector {ref}`performance tuning
configuration properties <hive-performance-tuning-configuration>` are likely to impact
S3 Select pushdown performance.
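
For example, the split sizing properties from that section change how many
scan range requests a query issues. A hedged sketch, assuming the catalog
properties file is `etc/catalog/hive.properties`; the values are illustrative,
not recommendations:

```bash
# Larger splits mean fewer, larger S3 Select scan-range requests; smaller
# splits mean more parallel requests. Example values only.
cat >> etc/catalog/hive.properties <<'EOF'
hive.max-initial-split-size=32MB
hive.max-split-size=128MB
EOF
```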

S3 Select can be enabled for TEXTFILE data using the
`hive.s3select-pushdown.experimental-textfile-pushdown-enabled` configuration
property; however, this has been shown to produce incorrect results. For more
information, see [the GitHub issue](https://github.com/trinodb/trino/issues/17775).

### Understanding and tuning the maximum connections

Trino can use its native S3 file system or EMRFS. When using the native file
system, the maximum number of connections is configured via the
`hive.s3.max-connections` configuration property. When using EMRFS, the
maximum number of connections is configured via the `fs.s3.maxConnections`
Hadoop configuration property.

S3 Select pushdown bypasses the file systems when accessing Amazon S3 for
predicate operations. In this case, the value of
`hive.s3select-pushdown.max-connections` determines the maximum number of
client connections allowed for those operations from worker nodes.

If your workload experiences the error *Timeout waiting for connection from
pool*, increase the value of both `hive.s3select-pushdown.max-connections` and
the maximum connections configuration for the file system you are using.
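
A hedged sketch of raising both pools together, again assuming
`etc/catalog/hive.properties` for the native file system; the values are
placeholders rather than recommendations:

```bash
# Native S3 file system pool and the separate S3 Select pool.
cat >> etc/catalog/hive.properties <<'EOF'
hive.s3.max-connections=1000
hive.s3select-pushdown.max-connections=1000
EOF

# With EMRFS, the file-system pool is a Hadoop property instead, typically
# set in core-site.xml rather than in the catalog properties file:
#   fs.s3.maxConnections=1000
```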
10 changes: 0 additions & 10 deletions docs/src/main/sphinx/connector/hive.md
@@ -253,16 +253,6 @@ Hive connector documentation.
- Enables automatic column level statistics collection on write. See
`Table Statistics <#table-statistics>`__ for details.
- ``true``
* - ``hive.s3select-pushdown.enabled``
- Enable query pushdown to JSON files using the AWS S3 Select service.
- ``false``
* - ``hive.s3select-pushdown.experimental-textfile-pushdown-enabled``
- Enable query pushdown to TEXTFILE tables using the AWS S3 Select service.
- ``false``
* - ``hive.s3select-pushdown.max-connections``
- Maximum number of simultaneously open connections to S3 for
:ref:`s3selectpushdown`.
- 500
* - ``hive.file-status-cache-tables``
- Cache directory listing for specific tables. Examples:
2 changes: 1 addition & 1 deletion docs/src/main/sphinx/release/release-300.md
@@ -46,7 +46,7 @@
(e.g., min > max). To disable this behavior, set the configuration
property `hive.parquet.fail-on-corrupted-statistics`
or session property `parquet_fail_with_corrupted_statistics` to false.
- Add support for {ref}`s3selectpushdown`, which enables pushing down
- Add support for S3 Select pushdown, which enables pushing down
column selection and range filters into S3 for text files.

## Kudu connector
@@ -83,7 +83,6 @@ private void bindSecurityMapping(Binder binder)
newSetBinder(binder, DynamicConfigurationProvider.class).addBinding()
.to(S3SecurityMappingConfigurationProvider.class).in(Scopes.SINGLETON);

checkArgument(!getProperty("hive.s3select-pushdown.enabled").map(Boolean::parseBoolean).orElse(false), "S3 security mapping is not compatible with S3 Select pushdown");
checkArgument(!buildConfigObject(RubixEnabledConfig.class).isCacheEnabled(), "S3 security mapping is not compatible with Hive caching");
}

67 changes: 0 additions & 67 deletions plugin/trino-hive-hadoop2/bin/run_hive_s3_select_json_tests.sh

This file was deleted.

32 changes: 0 additions & 32 deletions plugin/trino-hive-hadoop2/bin/run_hive_s3_tests.sh
@@ -46,38 +46,6 @@ exec_in_hadoop_master_container /usr/bin/hive -e "
LOCATION '${table_path}'
TBLPROPERTIES ('skip.header.line.count'='2', 'skip.footer.line.count'='2')"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_external_fs_with_pipe_delimiter/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container hadoop fs -put -f /docker/files/test_table_with_pipe_delimiter.csv{,.gz,.bz2} "${table_path}"
exec_in_hadoop_master_container /usr/bin/hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_external_fs_with_pipe_delimiter(t_bigint bigint, s_bigint bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${table_path}'"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_external_fs_with_comma_delimiter/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container hadoop fs -put -f /docker/files/test_table_with_comma_delimiter.csv{,.gz,.bz2} "${table_path}"
exec_in_hadoop_master_container /usr/bin/hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_external_fs_with_comma_delimiter(t_bigint bigint, s_bigint bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${table_path}'"

table_path="s3a://${S3_BUCKET}/${test_directory}/trino_s3select_test_csv_scan_range_pushdown/"
exec_in_hadoop_master_container hadoop fs -mkdir -p "${table_path}"
exec_in_hadoop_master_container /docker/files/hadoop-put.sh /docker/files/test_table_csv_scan_range_select_pushdown_{1,2,3}.csv "${table_path}"
exec_in_hadoop_master_container sudo -Eu hive beeline -u jdbc:hive2://localhost:10000/default -n hive -e "
CREATE EXTERNAL TABLE trino_s3select_test_csv_scan_range_pushdown(index bigint, id string, value1 bigint, value2 bigint, value3 bigint,
value4 bigint, value5 bigint, title string, firstname string, lastname string, flag string, day bigint,
month bigint, year bigint, country string, comment string, email string, identifier string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${table_path}'"

stop_unnecessary_hadoop_services

# restart hive-metastore to apply S3 changes in core-site.xml
23 changes: 0 additions & 23 deletions plugin/trino-hive-hadoop2/pom.xml
@@ -221,10 +221,6 @@
<exclude>**/TestHive.java</exclude>
<exclude>**/TestHiveThriftMetastoreWithS3.java</exclude>
<exclude>**/TestHiveFileSystemS3.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectPushdown.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectJsonPushdown.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectCsvPushdownWithSplits.java</exclude>
<exclude>**/TestHiveFileSystemS3SelectJsonPushdownWithSplits.java</exclude>
<exclude>**/TestHiveFileSystemWasb.java</exclude>
<exclude>**/TestHiveFileSystemAbfsAccessKey.java</exclude>
<exclude>**/TestHiveFileSystemAbfsOAuth.java</exclude>
@@ -263,25 +259,6 @@
<includes>
<include>**/TestHiveThriftMetastoreWithS3.java</include>
<include>**/TestHiveFileSystemS3.java</include>
<include>**/TestHiveFileSystemS3SelectPushdown.java</include>
<include>**/TestHiveFileSystemS3SelectCsvPushdownWithSplits.java</include>
</includes>
</configuration>
</plugin>
</plugins>
</build>
</profile>
<profile>
<id>test-hive-hadoop2-s3-select-json</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<includes>
<include>**/TestHiveFileSystemS3SelectJsonPushdown.java</include>
<include>**/TestHiveFileSystemS3SelectJsonPushdownWithSplits.java</include>
</includes>
</configuration>
</plugin>
@@ -66,7 +66,6 @@ protected void setup(String host, int port, String databaseName, String containe
checkParameter(host, "host"),
port,
checkParameter(databaseName, "database name"),
false,
createHdfsConfiguration());
}

