[SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default #19499
Conversation
Test build #82763 has finished for PR 19499 at commit
Test build #82765 has finished for PR 19499 at commit
```
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }

   test("test statistics of LogicalRelation converted from Hive serde tables") {
```
Test build #82824 has finished for PR 19499 at commit
Since we start to use the new …
Retest this please.
```
@@ -106,7 +106,7 @@ private[spark] object HiveUtils extends Logging {
     .doc("When set to true, the built-in ORC reader and writer are used to process " +
       "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
     .booleanConf
-    .createWithDefault(false)
+    .createWithDefault(true)
```
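For context, here is a minimal spark-shell sketch (not from the original thread) of how a user can still opt in or out of this behavior per session after the default flips; the `SET` syntax matches what is used elsewhere in this conversation:

```scala
// Per-session opt-out of the new default: fall back to the Hive serde reader.
sql("SET spark.sql.hive.convertMetastoreOrc=false")

// Equivalent runtime-conf form.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Inspect the effective value for the current session.
println(spark.conf.get("spark.sql.hive.convertMetastoreOrc"))
```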
This change was made in https://issues.apache.org/jira/browse/SPARK-15705. Has that issue been resolved?
Yes, it's resolved, as you can see in my last comment on the JIRA.
On the JIRA, there are results for 2.1.1 and 2.2.0. The following is the result on 2.2.1.

```scala
scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> spark.version
res2: String = 2.2.1
```
By this PR: #19470?
I think it predates #19470, because it's already fixed in 2.1.1.
Yep. That was resolved via https://issues.apache.org/jira/browse/SPARK-14387 by me.
Please wait a moment; I'll double-check the case to make sure.
Yep. It's resolved via SPARK-14387. The following is the result of the SPARK-15757 example on 2.2.1.

```sql
hive> CREATE TABLE source(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT);
hive> INSERT INTO source VALUES(1,1,1,1);
hive> CREATE TABLE inventory(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC;
hive> INSERT OVERWRITE TABLE inventory SELECT * FROM source;
```

```scala
scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> sql("SELECT * FROM inventory").show
+-----------+-----------+----------------+--------------------+
|inv_date_sk|inv_item_sk|inv_warehouse_sk|inv_quantity_on_hand|
+-----------+-----------+----------------+--------------------+
|          1|          1|               1|                   1|
+-----------+-----------+----------------+--------------------+

scala> spark.version
res2: String = 2.2.1
```
Test build #84583 has finished for PR 19499 at commit
Good. The failures are expected and are handled by #19882.
I'll retrigger after merging #19882.
Retest this please.
Test build #84604 has finished for PR 19499 at commit
Thanks! Merged to master.
Thank you so much, @gatorsmile!
…olumn order is different

## What changes were proposed in this pull request?

Until 2.2.1, with the default configuration, Apache Spark returns incorrect results when the ORC file schema is different from the metastore schema order. This is due to the Hive 1.2.1 library and some issues with the `convertMetastoreOrc` option.

```scala
scala> Seq(1 -> 2).toDF("c1", "c2").write.format("orc").mode("overwrite").save("/tmp/o")

scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION '/tmp/o'")

scala> spark.table("o").show  // This is wrong.
+---+---+
| c2| c1|
+---+---+
|  1|  2|
+---+---+

scala> spark.read.orc("/tmp/o").show  // This is correct.
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
```

After [SPARK-22279](#19499), the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage of what we have now in order to prevent future regressions.

## How was this patch tested?

Pass the Jenkins with a newly added test.

Author: Dongjoon Hyun <[email protected]>

Closes #19928 from dongjoon-hyun/SPARK-22267.
## What changes were proposed in this pull request?

Until 2.2.1, Spark raises `NullPointerException` on zero-size ORC files. Usually, these zero-size ORC files are generated by 3rd-party apps like Flume.

```scala
scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'")

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
```

After [SPARK-22279](apache#19499), Apache Spark with the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage of what we have now in order to prevent future regressions.

## How was this patch tested?

Pass a newly added test case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#19948 from dongjoon-hyun/SPARK-19809-EMPTY-FILE.
…by default

## What changes were proposed in this pull request?

This is to revert the changes made in #19499, because this causes a regression. We should not ignore the table-specific compression conf when the Hive serde tables are converted to the data source tables.

## How was this patch tested?

The existing tests.

Author: gatorsmile <[email protected]>

Closes #20536 from gatorsmile/revert22279.

(cherry picked from commit 3473fda)
Signed-off-by: Wenchen Fan <[email protected]>
…by default

## What changes were proposed in this pull request?

This is to revert the changes made in apache#19499, because this causes a regression. We should not ignore the table-specific compression conf when the Hive serde tables are converted to the data source tables.

## How was this patch tested?

The existing tests.

Author: gatorsmile <[email protected]>

Closes apache#20536 from gatorsmile/revert22279.
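To make the regression behind the revert concrete, here is a minimal sketch; it is not from the original thread, and the table name and codec are illustrative assumptions. The issue was that a table-level compression setting could be ignored once reads and writes went through the converted data source path:

```scala
// Hypothetical example: a Hive ORC table that pins its own compression
// codec via the table-level 'orc.compress' property.
sql("""
  CREATE TABLE t_zlib (id INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")

// With convertMetastoreOrc=true, writes like this went through the data
// source path, which could apply the session-level ORC compression
// setting instead of honoring the table's 'orc.compress' property.
sql("INSERT INTO t_zlib VALUES (1)")
```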
## What changes were proposed in this pull request?

Like Parquet, this PR aims to turn on `spark.sql.hive.convertMetastoreOrc` by default.

## How was this patch tested?

Pass all the existing test cases.
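As a rough illustration of what the new default means in practice (a sketch, assuming a plain Hive ORC table named `t`, which is a hypothetical example; the exact plan text varies by version), reading a Hive-created ORC table should now go through Spark's built-in ORC data source rather than the Hive serde reader:

```scala
// Sketch: with convertMetastoreOrc=true by default, Hive ORC tables are
// read via the built-in ORC data source.
sql("CREATE TABLE t (id INT) STORED AS ORC")
sql("INSERT INTO t VALUES (1)")

// The physical plan should show a FileScan over ORC (data source path)
// rather than a HiveTableScan (Hive serde path).
spark.table("t").explain()
```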