
[SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default #19499

Closed
wants to merge 1 commit into from

Conversation

dongjoon-hyun
Member

What changes were proposed in this pull request?

Like Parquet, this PR aims to turn on `spark.sql.hive.convertMetastoreOrc` by default.
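
For illustration, here is a minimal spark-shell sketch of what the new default means for users; this snippet is not part of the patch, and the opt-out shown is just the existing conf being set back to its old value:

```scala
// After this PR, the flag defaults to true, so ORC tables created with
// HiveQL syntax are processed by Spark's built-in ORC reader and writer.
spark.conf.get("spark.sql.hive.convertMetastoreOrc")   // "true"

// Sessions that still need the Hive serde path can opt out explicitly.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```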

How was this patch tested?

Pass all the existing test cases.

@SparkQA

SparkQA commented Oct 14, 2017

Test build #82763 has finished for PR 19499 at commit b9c4954.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2017

Test build #82765 has finished for PR 19499 at commit 83cde8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
}

test("test statistics of LogicalRelation converted from Hive serde tables") {
Member Author

This should be handled in a separate PR, #19500.
After #19500 is merged, I will remove this test-code change from this PR.

@SparkQA

SparkQA commented Oct 17, 2017

Test build #82824 has finished for PR 19499 at commit cf7edbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMetastoreOrc by default [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default Oct 17, 2017
@dongjoon-hyun
Member Author

Since we have started to use the new native `OrcFileFormat` by default, I'm retriggering this.
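
As a hedged aside: the "native" implementation referred to here can also be selected explicitly through the `spark.sql.orc.impl` option introduced by SPARK-20728; the snippet below is illustrative and not part of this PR:

```scala
// "native" selects the new OrcFileFormat in sql/core (based on Apache ORC);
// "hive" selects the old Hive 1.2.1 code path in sql/hive.
spark.conf.set("spark.sql.orc.impl", "native")
```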

@dongjoon-hyun
Member Author

Retest this please.

@@ -106,7 +106,7 @@ private[spark] object HiveUtils extends Logging {
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)
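
For context, a sketch of the complete config entry in `HiveUtils` around this diff; the val name `CONVERT_METASTORE_ORC` is assumed from the surrounding codebase, and only the default value changes in this PR:

```scala
// Sketch of the full entry; the doc string is taken from the diff above.
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
```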
Member

This change was made in https://issues.apache.org/jira/browse/SPARK-15705.

Has that issue been resolved?

Member Author

Yes, it's resolved, as you can see in my last comment on the JIRA.

Member Author

On the JIRA, there are results for 2.1.1 and 2.2.0. The following is the result on 2.2.1.

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> spark.version
res2: String = 2.2.1

Member

Was it fixed by this PR: #19470?

Member Author

I think it was fixed before #19470, because it already works on 2.1.1.

Member
Member Author
@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. That was resolved via https://issues.apache.org/jira/browse/SPARK-14387, by me.

Member Author

Ur, please wait a moment. I'll double-check the case to make sure.

Member Author
@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. It's resolved via SPARK-14387. The following is the result of running the SPARK-15757 example on 2.2.1.

hive> CREATE TABLE source(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT);
hive> INSERT INTO source VALUES(1,1,1,1);
hive> CREATE TABLE inventory(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC;
hive> INSERT OVERWRITE TABLE inventory SELECT * FROM source;

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> sql("SELECT * FROM inventory").show
+-----------+-----------+----------------+--------------------+
|inv_date_sk|inv_item_sk|inv_warehouse_sk|inv_quantity_on_hand|
+-----------+-----------+----------------+--------------------+
|          1|          1|               1|                   1|
+-----------+-----------+----------------+--------------------+
scala> spark.version
res2: String = 2.2.1

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84583 has finished for PR 19499 at commit cf7edbb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Good. The failures are expected and are handled by #19882.

  • org.apache.spark.sql.hive.orc.OrcQuerySuite.SPARK-8501: Avoids discovery schema from empty ORC files
    • The test case expects a failure, but the issue is fixed in the new OrcFileFormat.
  • org.apache.spark.sql.hive.orc.OrcSourceSuite.SPARK-19459/SPARK-18220: read char/varchar column written by Hive
    • This is a VARCHAR issue.

@dongjoon-hyun
Member Author

I'll retrigger this after #19882 is merged.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84604 has finished for PR 19499 at commit cf7edbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in aa1764b Dec 7, 2017
@dongjoon-hyun
Member Author

Thank you so much, @gatorsmile !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22279 branch December 8, 2017 03:19
asfgit pushed a commit that referenced this pull request Dec 11, 2017
…olumn order is different

## What changes were proposed in this pull request?

Until 2.2.1, with the default configuration, Apache Spark returns incorrect results when the ORC file schema differs from the metastore schema order. This is due to the Hive 1.2.1 library and some issues with the `convertMetastoreOrc` option.

```scala
scala> Seq(1 -> 2).toDF("c1", "c2").write.format("orc").mode("overwrite").save("/tmp/o")
scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION '/tmp/o'")
scala> spark.table("o").show    // This is wrong.
+---+---+
| c2| c1|
+---+---+
|  1|  2|
+---+---+
scala> spark.read.orc("/tmp/o").show  // This is correct.
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
```

After [SPARK-22279](#19499), the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage on what we have now in order to prevent future regressions.

## How was this patch tested?

Pass the Jenkins with a newly added test.

Author: Dongjoon Hyun <[email protected]>

Closes #19928 from dongjoon-hyun/SPARK-22267.
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 12, 2017
## What changes were proposed in this pull request?

Until 2.2.1, Spark raises a `NullPointerException` on zero-size ORC files. Usually, these zero-size ORC files are generated by third-party apps like Flume.

```scala
scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'")

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
...
Caused by: java.lang.NullPointerException at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
```

After [SPARK-22279](apache#19499), Apache Spark with the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage on what we have now in order to prevent future regressions.

## How was this patch tested?

Pass a newly added test case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#19948 from dongjoon-hyun/SPARK-19809-EMPTY-FILE.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
…by default

## What changes were proposed in this pull request?

This reverts the changes made in #19499, because they cause a regression: we should not ignore the table-specific compression conf when Hive serde tables are converted to data source tables.

## How was this patch tested?

The existing tests.

Author: gatorsmile <[email protected]>

Closes #20536 from gatorsmile/revert22279.

(cherry picked from commit 3473fda)
Signed-off-by: Wenchen Fan <[email protected]>
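
For context on the regression behind this revert, a hedged sketch of the kind of table-specific compression setting that the converted path ignored; the table name and codec are illustrative, while `orc.compress` is the standard ORC table property:

```scala
// Hypothetical Hive ORC table carrying its own compression conf. With
// convertMetastoreOrc=true, writes through the converted data source
// relation ignored this per-table property, which is why the default
// was reverted.
sql("""
  CREATE TABLE t (id INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")
```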
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 8, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018