Deny DML commands on Hive bucketed tables created by Spark #10815

Conversation

findinpath
Contributor

Apache Spark does not use the 'bucketing_version'
parameter while creating Hive tables.

In order to avoid corrupting the bucket files of the
Hive table by using an invalid hashing function for
the bucketing, Trino does not support any DML commands
on Hive bucketed tables created by Spark.
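
A minimal sketch of what such a guard could look like — the class, method, and constant names here are assumptions for illustration, not the PR's actual code. Spark records its bucketing in plain table parameters rather than in Hive's bucketing metadata, as the SHOW CREATE TABLE output later in this thread shows:

```java
import java.util.Map;

// Sketch only: recognize a Spark-bucketed table by the parameter key Spark
// writes into TBLPROPERTIES (key name taken from the output further down).
final class SparkBucketingCheck
{
    private static final String SPARK_NUM_BUCKETS_KEY = "spark.sql.sources.schema.numBuckets";

    static boolean isSparkBucketedTable(Map<String, String> tableParameters)
    {
        return tableParameters.containsKey(SPARK_NUM_BUCKETS_KEY);
    }

    private SparkBucketingCheck() {}
}
```

A guard like this would then throw NOT_SUPPORTED from the insert/update/delete paths instead of letting Trino write bucket files with a hash function that does not match the existing layout.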

@cla-bot cla-bot bot added the cla-signed label Jan 26, 2022
@findinpath findinpath requested review from findepi and losipiuk and removed request for findepi January 26, 2022 21:29
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch 2 times, most recently from 4c3a3fa to d6e1fcd on January 27, 2022 05:24
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from d6e1fcd to 3ee61e7 on January 28, 2022 12:34
```java
onSpark().executeQuery(
        "CREATE TABLE default." + hiveTableName + "(a_key integer, a_value integer) " +
                "USING ORC " +
                "TBLPROPERTIES ('transactional'='true') " +
```
Member

We can test it, but it feels weird. Can such a table realistically be created in Spark? Does Spark support the ORC ACID table format, or would it still write the data as if it were a non-transactional table?

Contributor Author

Note that the Spark bucketing check happens after this check:

```java
if (!isFullAcidTable(table.getParameters())) {
    throw new TrinoException(NOT_SUPPORTED, "Hive update is only supported for ACID transactional tables");
}
```

so I think that Spark does create ORC ACID tables.
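
To illustrate the ordering being discussed — a sketch extending the fragment above, where the wrapping method name and isSparkBucketedTable are assumptions, not the PR's actual code — the Spark-bucketing guard sits after the ACID check, so it is only reached by tables that already passed it:

```java
// Ordering sketch: the full-ACID check runs first, so a Spark-bucketed table
// must also be an ORC ACID table before the second guard can ever fire.
private void checkUpdateAllowed(Table table)
{
    if (!isFullAcidTable(table.getParameters())) {
        throw new TrinoException(NOT_SUPPORTED, "Hive update is only supported for ACID transactional tables");
    }
    if (isSparkBucketedTable(table.getParameters())) {
        throw new TrinoException(NOT_SUPPORTED, "Updating Spark bucketed tables is not supported");
    }
}
```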

Member

But doesn't that come from the fact that you explicitly put `TBLPROPERTIES ('transactional'='true')` in the CREATE TABLE statement? Would it make sense for Spark users to put something like that? (I guess Spark will pass all the options verbatim, but not necessarily make any use of them.)

Member

If anything can depend on this, we should test both cases: `TBLPROPERTIES ('transactional'='true')` set in Spark, and not set (maybe skipping testing with `TBLPROPERTIES ('transactional'='false')`...).
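
One way to cover both variants in a product test would be a TestNG data provider along these lines — the class, method, and provider names here are hypothetical, not from this PR:

```java
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class TestSparkBucketedTables
{
    // Hypothetical provider: one run with 'transactional'='true' set
    // explicitly, one with no TBLPROPERTIES clause at all.
    @DataProvider
    public static Object[][] transactionalTableProperties()
    {
        return new Object[][] {
                {"TBLPROPERTIES ('transactional'='true') "},
                {""},
        };
    }

    @Test(dataProvider = "transactionalTableProperties")
    public void testDmlDeniedOnSparkBucketedTable(String tblProperties)
    {
        // create the table via onSpark().executeQuery(...) with the given
        // TBLPROPERTIES clause, then assert that Trino rejects INSERT/UPDATE/DELETE
    }
}
```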

Contributor Author

```
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `spark_update_bucketed_table_43s315y438ky`( |
|   `a_key` int,                                     |
|   `a_value` int)                                   |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      |
| WITH SERDEPROPERTIES (                             |
|   'path'='hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky')  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION                                           |
|   'hdfs://hadoop-master:9000/user/hive/warehouse/spark_update_bucketed_table_43s315y438ky' |
| TBLPROPERTIES (                                    |
|   'spark.sql.create.version'='3.1.2',              |
|   'spark.sql.sources.provider'='ORC',              |
|   'spark.sql.sources.schema.bucketCol.0'='a_key',  |
|   'spark.sql.sources.schema.numBucketCols'='1',    |
|   'spark.sql.sources.schema.numBuckets'='3',       |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"a_key","type":"integer","nullable":true,"metadata":{}},{"name":"a_value","type":"integer","nullable":true,"metadata":{}}]}',  |
|   'transactional'='true',                          |
|   'transient_lastDdlTime'='1643892072')            |
+----------------------------------------------------+
```

I had a look in Hive: Spark does indeed create the table as transactional even when `TBLPROPERTIES ('transactional'='true')` is missing from the CREATE TABLE statement.
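
For reference, Spark's bucketing spec can be reconstructed from the flat string properties shown above. A sketch — the key names are copied verbatim from that output, the class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: rebuild the list of Spark bucketing columns from the indexed
// 'spark.sql.sources.schema.bucketCol.<i>' parameters shown above.
final class SparkBucketSpec
{
    static List<String> bucketColumns(Map<String, String> parameters)
    {
        int numBucketCols = Integer.parseInt(
                parameters.getOrDefault("spark.sql.sources.schema.numBucketCols", "0"));
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < numBucketCols; i++) {
            columns.add(parameters.get("spark.sql.sources.schema.bucketCol." + i));
        }
        return columns;
    }

    private SparkBucketSpec() {}
}
```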

@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from 3ee61e7 to 5d418dc on January 28, 2022 16:21
@findinpath findinpath requested a review from findepi February 1, 2022 11:35
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from a25d853 to fb7bc67 on February 3, 2022 12:57
@findinpath findinpath requested a review from findepi February 3, 2022 12:57
@findepi
Member

findepi commented Feb 3, 2022

There is a conflict, please rebase.

@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch 3 times, most recently from 3029b56 to 57535f8 on February 4, 2022 09:25
Member

@findepi findepi left a comment

LGTM, but the build is red.

Apache Spark has its own bucketing algorithm which
is not fully compatible with the one used by Hive.

In order to avoid corrupting the bucket files of the
Hive table by using an invalid hashing function for
the bucketing, Trino does not support any DML commands
on Hive bucketed tables created by Spark.
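
To make the incompatibility described in this commit message concrete: for an int bucketing key, Hive assigns buckets with its own hash (for int, essentially the value itself), while Spark uses a Murmur3-based hash, so the same key generally lands in a different bucket. A simplified illustration of the Hive side only — this is a hedged sketch, not Hive's actual code:

```java
// Simplified illustration of Hive-style bucket assignment for an int key.
// Spark's assignment uses a Murmur3-based hash instead, so mixing writers
// would scatter rows across bucket files inconsistently.
final class HiveBucketExample
{
    static int hiveBucket(int key, int bucketCount)
    {
        return (key & Integer.MAX_VALUE) % bucketCount;
    }

    public static void main(String[] args)
    {
        System.out.println(hiveBucket(7, 3)); // prints 1 with 3 buckets
    }
}
```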
@findinpath findinpath force-pushed the hive-deny-dml-commands-on-spark-hive-bucketed-tables branch from 57535f8 to f5fcac6 on February 8, 2022 07:36
@findepi
Member

findepi commented Feb 8, 2022

The last push diff https://github.com/trinodb/trino/compare/57535f8d811d83abae720773b096db3f99f44f46..f5fcac68c9ae843310bac7536b1e984900824745 is not tiny. @findinpath what has changed?

@findinpath
Contributor Author

I just did a rebase. No actual changes.

@findepi
Member

findepi commented Feb 8, 2022

Thanks @findinpath. I personally experiment with helping reviewers by adding comments like #10720 (comment), and also by separating rebases from other changes (there is an example there too). Not saying that I'll always remember.

@findepi findepi merged commit c889838 into trinodb:master Feb 8, 2022
@findepi findepi mentioned this pull request Feb 8, 2022
@github-actions github-actions bot added this to the 371 milestone Feb 8, 2022