Deny DML commands on Hive bucketed tables created by Spark #10815
Conversation
Force-pushed from 4c3a3fa to d6e1fcd
Resolved review thread (outdated) on plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Force-pushed from d6e1fcd to 3ee61e7
Resolved review thread (outdated) on ...rino-product-tests/src/main/java/io/trino/tests/product/hive/TestHiveSparkCompatibility.java
onSpark().executeQuery(
        "CREATE TABLE default." + hiveTableName + "(a_key integer, a_value integer) " +
        "USING ORC " +
        "TBLPROPERTIES ('transactional'='true') " +
We can test it, but it feels weird. Can such a table realistically be created in Spark? Does Spark support the ORC ACID table format, or would it still write data as if it were a non-transactional table?
Note that the Spark-bucketed check happens after this check:
if (!isFullAcidTable(table.getParameters())) {
    throw new TrinoException(NOT_SUPPORTED, "Hive update is only supported for ACID transactional tables");
}
so I think that Spark does create ORC ACID tables.
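For context, a minimal sketch of that ordering, assuming a hypothetical isSparkBucketedTable helper (not necessarily the name this PR uses):
// Reached only after the isFullAcidTable check above has passed, so any
// table rejected here is already known to be ACID transactional.
if (isSparkBucketedTable(table)) {
    throw new TrinoException(NOT_SUPPORTED, "Updating Spark bucketed tables is not supported");
}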
But doesn't that come from the fact that you explicitly put TBLPROPERTIES ('transactional'='true') in the CREATE TABLE statement? Would it make sense for Spark users to set something like that? (I guess Spark will pass all the options verbatim, but not necessarily make any use of them.)
If anything can depend on this, we should test both: with TBLPROPERTIES ('transactional'='true') set in Spark, and without it (maybe skipping testing with TBLPROPERTIES ('transactional'='false')...). See the sketch below.
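A hedged sketch of covering both variants in the product test; the loop is an assumption, onSpark().executeQuery and the CREATE TABLE text come from the snippet under review, and the CLUSTERED BY clause is inferred from the 3-bucket layout in the Hive output below:
// Sketch (assumption): run the same scenario with and without the
// explicit 'transactional' table property set in Spark.
for (String transactionalProperty : new String[] {"TBLPROPERTIES ('transactional'='true') ", ""}) {
    onSpark().executeQuery(
            "CREATE TABLE default." + hiveTableName + "(a_key integer, a_value integer) " +
            "USING ORC " +
            transactionalProperty +
            "CLUSTERED BY (a_key) INTO 3 BUCKETS");
    // run the DML-denial assertions here, then clean up
    onSpark().executeQuery("DROP TABLE default." + hiveTableName);
}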
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `spark_update_bucketed_table_43s315y438ky`( |
| `a_key` int, |
| `a_value` int) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| WITH SERDEPROPERTIES ( |
| 'path'='hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky' |
| TBLPROPERTIES ( |
| 'spark.sql.create.version'='3.1.2', |
| 'spark.sql.sources.provider'='ORC', |
| 'spark.sql.sources.schema.bucketCol.0'='a_key', |
| 'spark.sql.sources.schema.numBucketCols'='1', |
| 'spark.sql.sources.schema.numBuckets'='3', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"a_key","type":"integer","nullable":true,"metadata":{}},{"name":"a_value","type":"integer","nullable":true,"metadata":{}}]}', |
| 'transactional'='true', |
| 'transient_lastDdlTime'='1643892072') |
+----------------------------------------------------+
I had a look in Hive, and Spark does indeed create the table as transactional even when TBLPROPERTIES ('transactional'='true') is missing from the CREATE TABLE statement.
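A minimal sketch of asserting that observation from a product test, assuming tempto's onHive() executor, its QueryResult.rows() API, and AssertJ's assertThat (the PR itself may verify this differently):
// Assumption: scan Hive's SHOW CREATE TABLE output for the
// 'transactional'='true' property that Spark sets implicitly.
boolean transactional = onHive()
        .executeQuery("SHOW CREATE TABLE " + hiveTableName)
        .rows().stream()
        .map(row -> String.valueOf(row.get(0)))
        .anyMatch(line -> line.contains("'transactional'='true'"));
assertThat(transactional).isTrue();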
Force-pushed from 3ee61e7 to 5d418dc
Resolved review thread (outdated) on plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Force-pushed from a25d853 to fb7bc67
There is a conflict, please rebase.
Force-pushed from 3029b56 to 57535f8
LGTM, but the build is red.
Apache Spark has its own bucketing algorithm, which is not fully compatible with the one used by Hive. In order to avoid corrupting the bucket files of the Hive table by using an invalid hashing function for the bucketing, Trino does not support any DML commands on Hive bucketed tables created by Spark.
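A hedged illustration of that behavior from the product-test side; assertQueryFailure and onTrino come from the test framework, and the asserted message fragment is taken from the wording above, not from the actual implementation:
// Assumption: a DML statement from Trino against the Spark-created
// bucketed table is rejected with a NOT_SUPPORTED-style error.
assertQueryFailure(() -> onTrino().executeQuery("INSERT INTO default." + hiveTableName + " VALUES (1, 1)"))
        .hasMessageContaining("bucketed tables created by Spark");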
Force-pushed from 57535f8 to f5fcac6
The last push diff https://github.com/trinodb/trino/compare/57535f8d811d83abae720773b096db3f99f44f46..f5fcac68c9ae843310bac7536b1e984900824745 is not tiny. @findinpath, what has changed?
I just did a rebase. No actual changes.
Thanks @findinpath. I personally experiment with helping reviewers by adding comments like #10720 (comment) and also by separating a rebase from other changes (example there too). Not saying that I'll always remember.
Apache Spark does not set the 'bucketing_version'
parameter when creating Hive tables.
In order to avoid corrupting the bucket files of the
Hive table by using an invalid hashing function for
the bucketing, Trino does not support any DML commands
on Hive bucketed tables created by Spark.
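A hedged sketch of the detection idea this commit message implies; the helper name and exact parameter keys are assumptions, grounded only in the 'bucketing_version' note above and the spark.sql.* properties visible in the SHOW CREATE TABLE output earlier (java.util.Map assumed imported):
// Assumption: a Spark-created bucketed table advertises its bucketing
// through spark.sql.* table parameters (see the Hive output above) while,
// per the commit message, lacking Hive's 'bucketing_version' parameter.
private static boolean isSparkBucketedTable(Map<String, String> tableParameters)
{
    return tableParameters.containsKey("spark.sql.sources.schema.numBuckets")
            && !tableParameters.containsKey("bucketing_version");
}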