Deny DML commands on Hive bucketed tables created by Spark #10815

Conversation

findinpath
Contributor

Apache Spark does not use the 'bucketing_version'
parameter while creating Hive tables.

In order to avoid corrupting the bucket files of the
Hive table by using an invalid hashing function for
the bucketing, Trino does not support any DML commands
on Hive bucketed tables created by Spark.
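
A minimal sketch of what such a guard could look like — the class, method, and constant names here are assumptions for illustration, not the PR's actual code. Spark records its bucketing in plain table parameters rather than in Hive's bucketing metadata, as the SHOW CREATE TABLE output later in this thread shows:

```java
import java.util.Map;

// Sketch only: recognize a Spark-bucketed table by the parameter key Spark
// writes into TBLPROPERTIES (key name taken from the output further down).
final class SparkBucketingCheck
{
    private static final String SPARK_NUM_BUCKETS_KEY = "spark.sql.sources.schema.numBuckets";

    static boolean isSparkBucketedTable(Map<String, String> tableParameters)
    {
        return tableParameters.containsKey(SPARK_NUM_BUCKETS_KEY);
    }

    private SparkBucketingCheck() {}
}
```

A guard like this would then throw NOT_SUPPORTED from the insert/update/delete paths instead of letting Trino write bucket files with a hash function that does not match the existing layout.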

@cla-bot cla-bot bot added the cla-signed label Jan 26, 2022
@findinpath findinpath requested review from findepi and losipiuk and removed request for findepi January 26, 2022 21:29
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch 2 times, most recently from 4c3a3fa to d6e1fcd on January 27, 2022 05:24
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from d6e1fcd to 3ee61e7 on January 28, 2022 12:34
```java
onSpark().executeQuery(
        "CREATE TABLE default." + hiveTableName + "(a_key integer, a_value integer) " +
                "USING ORC " +
                "TBLPROPERTIES ('transactional'='true') " +
```
Member

We can test it, but it feels weird. Can such a table realistically be created in Spark? Does Spark support the ORC ACID table format, or would it still write the data as if it were a non-transactional table?

Contributor Author

Note that the Spark bucketing check happens after this check:

```java
if (!isFullAcidTable(table.getParameters())) {
    throw new TrinoException(NOT_SUPPORTED, "Hive update is only supported for ACID transactional tables");
}
```

so I think that Spark does create ORC ACID tables.
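
To illustrate the ordering being discussed — a sketch extending the fragment above, where the wrapping method name and isSparkBucketedTable are assumptions, not the PR's actual code — the Spark-bucketing guard sits after the ACID check, so it is only reached by tables that already passed it:

```java
// Ordering sketch: the full-ACID check runs first, so a Spark-bucketed table
// must also be an ORC ACID table before the second guard can ever fire.
private void checkUpdateAllowed(Table table)
{
    if (!isFullAcidTable(table.getParameters())) {
        throw new TrinoException(NOT_SUPPORTED, "Hive update is only supported for ACID transactional tables");
    }
    if (isSparkBucketedTable(table.getParameters())) {
        throw new TrinoException(NOT_SUPPORTED, "Updating Spark bucketed tables is not supported");
    }
}
```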

Member

But doesn't that come from the fact that you explicitly put `TBLPROPERTIES ('transactional'='true')` in the CREATE TABLE statement? Would it make sense for Spark users to put something like that? (I guess Spark will pass all the options verbatim, but not necessarily make any use of them.)

Member

If anything can depend on this, we should test both cases: `TBLPROPERTIES ('transactional'='true')` set in Spark, and not set (maybe skipping testing with `TBLPROPERTIES ('transactional'='false')`...).
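
One way to cover both variants in a product test would be a TestNG data provider along these lines — the class, method, and provider names here are hypothetical, not from this PR:

```java
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class TestSparkBucketedTables
{
    // Hypothetical provider: one run with 'transactional'='true' set
    // explicitly, one with no TBLPROPERTIES clause at all.
    @DataProvider
    public static Object[][] transactionalTableProperties()
    {
        return new Object[][] {
                {"TBLPROPERTIES ('transactional'='true') "},
                {""},
        };
    }

    @Test(dataProvider = "transactionalTableProperties")
    public void testDmlDeniedOnSparkBucketedTable(String tblProperties)
    {
        // create the table via onSpark().executeQuery(...) with the given
        // TBLPROPERTIES clause, then assert that Trino rejects INSERT/UPDATE/DELETE
    }
}
```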

Contributor Author

```
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `spark_update_bucketed_table_43s315y438ky`( |
|   `a_key` int,                                     |
|   `a_value` int)                                   |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      |
| WITH SERDEPROPERTIES (                             |
|   'path'='hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky')  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION                                           |
|   'hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky' |
| TBLPROPERTIES (                                    |
|   'spark.sql.create.version'='3.1.2',              |
|   'spark.sql.sources.provider'='ORC',              |
|   'spark.sql.sources.schema.bucketCol.0'='a_key',  |
|   'spark.sql.sources.schema.numBucketCols'='1',    |
|   'spark.sql.sources.schema.numBuckets'='3',       |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"a_key","type":"integer","nullable":true,"metadata":{}},{"name":"a_value","type":"integer","nullable":true,"metadata":{}}]}',  |
|   'transactional'='true',                          |
|   'transient_lastDdlTime'='1643892072')            |
+----------------------------------------------------+
```

I had a look in Hive: Spark does indeed create the table as transactional even when `TBLPROPERTIES ('transactional'='true')` is missing from the CREATE TABLE statement.
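
For reference, Spark's bucketing spec can be reconstructed from the flat string properties shown above. A sketch — the key names are copied verbatim from that output, the class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: rebuild the list of Spark bucketing columns from the indexed
// 'spark.sql.sources.schema.bucketCol.<i>' parameters shown above.
final class SparkBucketSpec
{
    static List<String> bucketColumns(Map<String, String> parameters)
    {
        int numBucketCols = Integer.parseInt(
                parameters.getOrDefault("spark.sql.sources.schema.numBucketCols", "0"));
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < numBucketCols; i++) {
            columns.add(parameters.get("spark.sql.sources.schema.bucketCol." + i));
        }
        return columns;
    }

    private SparkBucketSpec() {}
}
```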

@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from 3ee61e7 to 5d418dc on January 28, 2022 16:21
@findinpath findinpath requested a review from findepi February 1, 2022 11:35
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from a25d853 to fb7bc67 on February 3, 2022 12:57
@findinpath findinpath requested a review from findepi February 3, 2022 12:57
@findepi
Member

findepi commented Feb 3, 2022

There is a conflict, please rebase.

@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch 3 times, most recently from 3029b56 to 57535f8 on February 4, 2022 09:25
Member

@findepi findepi left a comment

LGTM, but the build is red.

Apache Spark has its own bucketing algorithm which
is not fully compatible with the one used by Hive.

In order to avoid corrupting the bucket files of the
Hive table by using an invalid hashing function for
the bucketing, Trino does not support any DML commands
on Hive bucketed tables created by Spark.
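
To make the incompatibility described in this commit message concrete: for an int bucketing key, Hive assigns buckets with its own hash (for int, essentially the value itself), while Spark uses a Murmur3-based hash, so the same key generally lands in a different bucket. A simplified illustration of the Hive side only — this is a hedged sketch, not Hive's actual code:

```java
// Simplified illustration of Hive-style bucket assignment for an int key.
// Spark's assignment uses a Murmur3-based hash instead, so mixing writers
// would scatter rows across bucket files inconsistently.
final class HiveBucketExample
{
    static int hiveBucket(int key, int bucketCount)
    {
        return (key & Integer.MAX_VALUE) % bucketCount;
    }

    public static void main(String[] args)
    {
        System.out.println(hiveBucket(7, 3)); // prints 1 with 3 buckets
    }
}
```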
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from 57535f8 to f5fcac6 on February 8, 2022 07:36
@findepi
Member

findepi commented Feb 8, 2022

The last push diff https://github.com/trinodb/trino/compare/57535f8d811d83abae720773b096db3f99f44f46..f5fcac68c9ae843310bac7536b1e984900824745 is not tiny. @findinpath what has changed?

@findinpath
Contributor Author

I just did a rebase. No actual changes.

@findepi
Member

findepi commented Feb 8, 2022

Thanks @findinpath. I personally experiment with helping reviewers by adding comments like #10720 (comment), and also by separating rebases from other changes (there is an example there too). Not saying that I'll always remember.

@findepi findepi merged commit c889838 into trinodb:master Feb 8, 2022
@findepi findepi mentioned this pull request Feb 8, 2022
@github-actions github-actions bot added this to the 371 milestone Feb 8, 2022