
[SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default #19499

Closed
wants to merge 1 commit into from

Conversation

dongjoon-hyun
Member

What changes were proposed in this pull request?

Like Parquet, this PR aims to turn on `spark.sql.hive.convertMetastoreOrc` by default.
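
For illustration, here is a minimal spark-shell sketch of what the new default means for users; this snippet is not part of the patch, and the opt-out shown is just the existing conf being set back to its old value:

```scala
// After this PR, the flag defaults to true, so ORC tables created with
// HiveQL syntax are processed by Spark's built-in ORC reader and writer.
spark.conf.get("spark.sql.hive.convertMetastoreOrc")   // "true"

// Sessions that still need the Hive serde path can opt out explicitly.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```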

How was this patch tested?

Pass all the existing test cases.

@SparkQA

SparkQA commented Oct 14, 2017

Test build #82763 has finished for PR 19499 at commit b9c4954.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2017

Test build #82765 has finished for PR 19499 at commit 83cde8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
}

test("test statistics of LogicalRelation converted from Hive serde tables") {
Member Author

This should be handled in a separate PR, #19500.
After #19500 is merged, I will remove this test-code change from this PR.

@SparkQA

SparkQA commented Oct 17, 2017

Test build #82824 has finished for PR 19499 at commit cf7edbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMetastoreOrc by default [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default Oct 17, 2017
@dongjoon-hyun
Member Author

Since we have started to use the new native `OrcFileFormat` by default, I'm retriggering this.
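
As a hedged aside: the "native" implementation referred to here can also be selected explicitly through the `spark.sql.orc.impl` option introduced by SPARK-20728; the snippet below is illustrative and not part of this PR:

```scala
// "native" selects the new OrcFileFormat in sql/core (based on Apache ORC);
// "hive" selects the old Hive 1.2.1 code path in sql/hive.
spark.conf.set("spark.sql.orc.impl", "native")
```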

@dongjoon-hyun
Member Author

Retest this please.

@@ -106,7 +106,7 @@ private[spark] object HiveUtils extends Logging {
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)
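
For context, a sketch of the complete config entry in `HiveUtils` around this diff; the val name `CONVERT_METASTORE_ORC` is assumed from the surrounding codebase, and only the default value changes in this PR:

```scala
// Sketch of the full entry; the doc string is taken from the diff above.
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
```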
Member

This change was made in https://issues.apache.org/jira/browse/SPARK-15705.

Has that issue been resolved?

Member Author

Yes, it's resolved, as you can see in my last comment on the JIRA.

Member Author

On the JIRA, there are results for 2.1.1 and 2.2.0. The following is the result on 2.2.1.

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> spark.version
res2: String = 2.2.1

Member

Was it fixed by this PR: #19470?

Member Author

I think it was fixed before #19470, because it already works on 2.1.1.

Member
Member Author
@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. That was resolved via https://issues.apache.org/jira/browse/SPARK-14387, by me.

Member Author

Ur, please wait a moment. I'll double-check the case to make sure.

Member Author
@dongjoon-hyun dongjoon-hyun Dec 7, 2017

Yep. It's resolved via SPARK-14387. The following is the result of running the SPARK-15757 example on 2.2.1.

hive> CREATE TABLE source(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT);
hive> INSERT INTO source VALUES(1,1,1,1);
hive> CREATE TABLE inventory(inv_date_sk INT, inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC;
hive> INSERT OVERWRITE TABLE inventory SELECT * FROM source;

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
scala> sql("SELECT * FROM inventory").show
+-----------+-----------+----------------+--------------------+
|inv_date_sk|inv_item_sk|inv_warehouse_sk|inv_quantity_on_hand|
+-----------+-----------+----------------+--------------------+
|          1|          1|               1|                   1|
+-----------+-----------+----------------+--------------------+
scala> spark.version
res2: String = 2.2.1

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84583 has finished for PR 19499 at commit cf7edbb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Good. The failures are expected and are handled by #19882.

  • org.apache.spark.sql.hive.orc.OrcQuerySuite.SPARK-8501: Avoids discovery schema from empty ORC files
    • The test case expects a failure, but the issue is fixed in the new OrcFileFormat.
  • org.apache.spark.sql.hive.orc.OrcSourceSuite.SPARK-19459/SPARK-18220: read char/varchar column written by Hive
    • This is a VARCHAR issue.

@dongjoon-hyun
Member Author

I'll retrigger this after #19882 is merged.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84604 has finished for PR 19499 at commit cf7edbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in aa1764b Dec 7, 2017
@dongjoon-hyun
Member Author

Thank you so much, @gatorsmile !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22279 branch December 8, 2017 03:19
asfgit pushed a commit that referenced this pull request Dec 11, 2017
…olumn order is different

## What changes were proposed in this pull request?

Until 2.2.1, with the default configuration, Apache Spark returns incorrect results when the ORC file schema differs from the metastore schema order. This is due to the Hive 1.2.1 library and some issues with the `convertMetastoreOrc` option.

```scala
scala> Seq(1 -> 2).toDF("c1", "c2").write.format("orc").mode("overwrite").save("/tmp/o")
scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION '/tmp/o'")
scala> spark.table("o").show    // This is wrong.
+---+---+
| c2| c1|
+---+---+
|  1|  2|
+---+---+
scala> spark.read.orc("/tmp/o").show  // This is correct.
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
```

After [SPARK-22279](#19499), the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage on what we have now in order to prevent future regressions.

## How was this patch tested?

Pass the Jenkins with a newly added test.

Author: Dongjoon Hyun <[email protected]>

Closes #19928 from dongjoon-hyun/SPARK-22267.
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 12, 2017
## What changes were proposed in this pull request?

Until 2.2.1, Spark raises a `NullPointerException` on zero-size ORC files. Usually, these zero-size ORC files are generated by third-party apps like Flume.

```scala
scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'")

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
...
Caused by: java.lang.NullPointerException at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
```

After [SPARK-22279](apache#19499), Apache Spark with the default configuration doesn't have this bug. Although the Hive 1.2.1 library code path still has the problem, we had better have test coverage on what we have now in order to prevent future regressions.

## How was this patch tested?

Pass a newly added test case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#19948 from dongjoon-hyun/SPARK-19809-EMPTY-FILE.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
…by default

## What changes were proposed in this pull request?

This reverts the changes made in #19499, because they cause a regression: we should not ignore the table-specific compression conf when Hive serde tables are converted to data source tables.

## How was this patch tested?

The existing tests.

Author: gatorsmile <[email protected]>

Closes #20536 from gatorsmile/revert22279.

(cherry picked from commit 3473fda)
Signed-off-by: Wenchen Fan <[email protected]>
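
For context on the regression behind this revert, a hedged sketch of the kind of table-specific compression setting that the converted path ignored; the table name and codec are illustrative, while `orc.compress` is the standard ORC table property:

```scala
// Hypothetical Hive ORC table carrying its own compression conf. With
// convertMetastoreOrc=true, writes through the converted data source
// relation ignored this per-table property, which is why the default
// was reverted.
sql("""
  CREATE TABLE t (id INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")
```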
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 8, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018