[SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat #23491

gengliangwang · 2019-01-08T12:45:08Z

What changes were proposed in this pull request?

Currently, Spark stores the Hive catalog storage format for each table so that Hive clients can read it. In HiveSerDe.scala, Spark uses a mapping from its data source names to Hive SerDes. The mapping is out of date and needs to be updated with the latest canonical names of the Parquet and ORC FileFormat classes.

Otherwise, the following queries record the wrong SerDe value in the Hive table (the default value, org.apache.hadoop.mapred.SequenceFileInputFormat), and Hive clients fail to read the output table:

df.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").saveAsTable(..)

df.write.format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat").saveAsTable(..)

This minor PR fixes the mapping.
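
For illustration, the lookup can be sketched as below. This is a simplified sketch, not the code in this PR: SerDeSketch and SerDeMappingSketch are hypothetical names, only the Parquet and ORC entries are shown, and the real HiveSerDe object handles more sources. The idea is that the source name is lowercased and matched by package prefix, so a fully qualified FileFormat class name resolves to the same key as the short name.

```scala
import java.util.Locale

// Hypothetical, simplified stand-in for the HiveSerDe case class.
final case class SerDeSketch(inputFormat: String, outputFormat: String, serde: String)

object SerDeMappingSketch {
  // Short source key -> Hive storage classes (Parquet and ORC only in this sketch).
  private val serdeMap: Map[String, SerDeSketch] = Map(
    "parquet" -> SerDeSketch(
      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
    "orc" -> SerDeSketch(
      "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
      "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
      "org.apache.hadoop.hive.ql.io.orc.OrcSerde"))

  def sourceToSerDe(source: String): Option[SerDeSketch] = {
    // Normalize the name, then map known package prefixes to a short key.
    val key = source.toLowerCase(Locale.ROOT) match {
      case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
      case s if s.startsWith("org.apache.spark.sql.execution.datasources.parquet") => "parquet"
      case s if s.startsWith("org.apache.spark.sql.execution.datasources.orc") => "orc"
      case s => s // unknown names fall through unchanged
    }
    // None means no Hive-compatible storage format is known for this source.
    serdeMap.get(key)
  }
}
```

With prefix cases like these, both "parquet" and the canonical ParquetFileFormat class name resolve to the Parquet SerDe, while an unmapped name yields None.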

How was this patch tested?

Unit test.

cloud-fan · 2019-01-08T13:25:18Z

LGTM, this is similar to #20165

Can we move the test in #20165 to DataSourceWithHiveMetastoreCatalogSuite as well?

cloud-fan · 2019-01-08T13:26:04Z

sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala

@@ -74,8 +74,10 @@ object HiveSerDe {
  def sourceToSerDe(source: String): Option[HiveSerDe] = {
    val key = source.toLowerCase(Locale.ROOT) match {
      case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"


I think this is stale. There was a time when the parquet data source was under the org.apache.spark.sql.parquet package.

cc @gatorsmile

+1, I think we can remove the stale ones.

Sorry, my bad: users can still create a table with USING org.apache.spark.sql.parquet, so we can't break that.

Yes, it is safer to keep it.
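
To make the compatibility concern concrete, here is a small sketch contrasting key resolution with and without the legacy prefix. LegacyPrefixSketch and both method names are hypothetical and exist only for this comparison; the variant without the legacy case is what removing the stale entry would amount to, under the assumption that nothing else maps that name.

```scala
import java.util.Locale

object LegacyPrefixSketch {
  // Key resolution that keeps the legacy package prefix (the behavior being preserved).
  def keyWithLegacy(source: String): String =
    source.toLowerCase(Locale.ROOT) match {
      case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
      case s if s.startsWith("org.apache.spark.sql.execution.datasources.parquet") => "parquet"
      case s => s
    }

  // Hypothetical variant with the legacy case removed, for contrast only.
  def keyWithoutLegacy(source: String): String =
    source.toLowerCase(Locale.ROOT) match {
      case s if s.startsWith("org.apache.spark.sql.execution.datasources.parquet") => "parquet"
      case s => s // the legacy name now falls through unmapped
    }
}
```

In the second variant, a table created with the old package name would no longer resolve to the "parquet" key, which is the breakage the comments above want to avoid.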

gengliangwang · 2019-01-08T13:37:08Z

@cloud-fan I think the test in #20165 is covered in https://github.com/apache/spark/pull/23491/files#diff-301489b65b8dec1a8ecf6135eac793e9R174

I can remove the test case in this PR.

SparkQA · 2019-01-08T16:52:59Z

Test build #100929 has finished for PR 23491 at commit 01a67ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @gengliangwang .

SparkQA · 2019-01-08T17:47:34Z

Test build #100930 has finished for PR 23491 at commit d457fda.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-08T21:33:47Z

Test build #100936 has finished for PR 23491 at commit 2d52bc4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-01-09T02:19:03Z

thanks, merging to master!

HyukjinKwon · 2019-01-09T02:33:42Z

Merged to master.

HyukjinKwon · 2019-01-09T02:33:49Z

Oops

dongjoon-hyun · 2019-01-09T02:47:11Z

Hi, @cloud-fan
Can we have this in older branches?

cloud-fan · 2019-01-09T02:48:40Z

sure, feel free to backport it

dongjoon-hyun · 2019-01-09T02:52:42Z

Thanks!

…arquet and Orc FileFormat

## What changes were proposed in this pull request?

Currently Spark table maintains Hive catalog storage format, so that Hive client can read it. In `HiveSerDe.scala`, Spark uses a mapping from its data source to HiveSerde. The mapping is old, we need to update with latest canonical name of Parquet and Orc FileFormat.

Otherwise the following queries will result in wrong Serde value in Hive table(default value `org.apache.hadoop.mapred.SequenceFileInputFormat`), and Hive client will fail to read the output table:

```
df.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").saveAsTable(..)
```

```
df.write.format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat").saveAsTable(..)
```

This minor PR is to fix the mapping.

## How was this patch tested?

Unit test.

Closes #23491 from gengliangwang/fixHiveSerdeMap.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 311f32f)
Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2019-01-09T03:09:53Z

Merged to branch-2.4, too.

fix

01a67ab

cloud-fan reviewed Jan 8, 2019

remove duplicated test case

d457fda

dongjoon-hyun approved these changes Jan 8, 2019

update test case

2d52bc4

cloud-fan approved these changes Jan 8, 2019

cloud-fan closed this in 311f32f Jan 9, 2019
