[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema #19470
Conversation
Test build #82620 has finished for PR 19470 at commit
Hi, @gatorsmile and @cloud-fan.
Could you create test cases with different schemas between the files and the Hive metastore?
Thank you for the review, @gatorsmile. Here, this PR focuses on the missing-columns scenario.
I remember we previously hit multiple issues due to the schema difference between the actual ORC file schema and the metastore schema. Just make sure that support still exists and that this does not make the current behavior worse.
Ya, that was my question, too.
For me, the Hive metastore schema is the only valid one in Apache Spark.
To be clear, I'll file another JIRA about the ORC status on mismatched column orders.
Hi, @gatorsmile.
Based on the above, I'll proceed to add more test cases in order to prevent regression.
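A minimal sketch of the kind of missing-columns regression test being discussed, assuming a Hive ORC table whose metastore schema gains a column that the existing ORC files were written without; the table and column names here are hypothetical, not the ones in this PR:

```scala
// Hypothetical regression test: the metastore schema has a column (c2) that the
// ORC files on disk do not contain, so reading it should yield NULL rather than fail.
withTable("t_missing_col") {
  sql("CREATE TABLE t_missing_col (c1 INT) STORED AS ORC")
  sql("INSERT INTO t_missing_col VALUES (1)")
  sql("ALTER TABLE t_missing_col ADD COLUMNS (c2 STRING)")
  checkAnswer(sql("SELECT c1, c2 FROM t_missing_col"), Row(1, null))
}
```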
```scala
}

// This test case is added to prevent regression.
test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
```
This is added to prevent regression according to your request, @gatorsmile ~
It's weird to have a test verifying a bug; I think it's good enough to have a JIRA tracking this bug.
Test build #82701 has finished for PR 19470 at commit
```diff
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
```
nit:
```scala
val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
if (isEmptyFile) {
  ...
} else ...
```
```diff
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
         Iterator.empty
       } else {
-        val physicalSchema = maybePhysicalSchema.get
-        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+        OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
```
Does it work? It seems we lie to the ORC reader about the physical schema here.
Oh, I see, we only need to pass the required column indices to the ORC reader.
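For readers following along, a rough sketch of what "passing the required column indices" amounts to; the variable names (`conf`, `dataSchema`, `requiredSchema`) are assumed from the surrounding `OrcFileFormat` code and the config keys are the standard Hive column-projection keys, so treat this as an illustration rather than the exact patch:

```scala
// Resolve the requested columns against the table (metastore/Spark) schema rather
// than the file's positional schema, and hand only the indices to the ORC reader.
val requiredIds = requiredSchema.fieldNames.map(dataSchema.fieldIndex)
conf.set("hive.io.file.readcolumn.ids", requiredIds.mkString(","))
conf.set("hive.io.file.readcolumn.names", requiredSchema.fieldNames.mkString(","))
```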
```scala
case (field, ordinal) =>
  var ref = oi.getStructFieldRef(field.name)
  if (ref == null) {
    val maybeIndex = dataSchema.getFieldIndex(field.name)
```
The `requiredSchema` is guaranteed to be contained in the `dataSchema`.
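A tiny self-contained illustration of that invariant, with hypothetical schemas; a query can only request columns that exist in the table schema, so a lookup by name in `dataSchema` cannot miss:

```scala
import org.apache.spark.sql.types._

// Hypothetical table schema and a projection over it.
val dataSchema     = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
val requiredSchema = StructType(Seq(StructField("b", StringType)))

// Every required field can be found by name in the table schema.
assert(requiredSchema.fieldNames.forall(name => dataSchema.fieldNames.contains(name)))
```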
```diff
       iterator.map { value =>
         val raw = deserializer.deserialize(value)
         var i = 0
         val length = fieldRefs.length
         while (i < length) {
-          val fieldValue = oi.getStructFieldData(raw, fieldRefs(i))
+          val fieldRef = fieldRefs(i)
+          val fieldValue = if (fieldRef == null) null else oi.getStructFieldData(raw, fieldRefs(i))
           if (fieldValue == null) {
```
nit:
```scala
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
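A fuller sketch of the restructuring suggested above; `oi`, `raw`, `fieldRefs`, `fieldOrdinals`, `unwrappers`, and `mutableRow` are assumed from the surrounding unwrapping loop, so this is illustrative rather than the exact patch:

```scala
var i = 0
while (i < fieldRefs.length) {
  val fieldRef = fieldRefs(i)
  if (fieldRef == null) {
    // The column exists in the table schema but not in this ORC file: emit NULL.
    mutableRow.setNullAt(fieldOrdinals(i))
  } else {
    val fieldValue = oi.getStructFieldData(raw, fieldRef)
    if (fieldValue == null) {
      mutableRow.setNullAt(fieldOrdinals(i))
    } else {
      unwrappers(i)(fieldValue, mutableRow, fieldOrdinals(i))
    }
  }
  i += 1
}
```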
```scala
)

checkAnswer(
  sql(s"SELECT * FROM $db.t"),
```
Please list all columns here instead of `*`, to make the test more clear.
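In other words, something along these lines; the column names `c1` and `c2` are placeholders for the table's real columns:

```scala
// Select the columns explicitly so the expected Row maps clearly onto them.
checkAnswer(
  sql(s"SELECT c1, c2 FROM $db.t"),
  Row(null, "12"))
```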
```scala
Row(null, "12"))

checkAnswer(
  sql(s"SELECT * FROM $db.t"),
```
ditto
LGTM except some minor comments
@cloud-fan. Thank you so much for the review!

```scala
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
```diff
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }

  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
```
Improve the test case for checking the other formats?
since it depends on the CONVERT_METASTORE_XXX conf, maybe also test parquet.
Yep. I'll add `parquet`, too.
LGTM, pending jenkins
LGTM too.
Seq("true", "false").foreach { value => | ||
withSQLConf( | ||
HiveUtils.CONVERT_METASTORE_ORC.key -> value, | ||
HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) { |
Since you in fact separate ORC and Parquet into two tests, maybe you just need to test against one config at a time, i.e., orc -> HiveUtils.CONVERT_METASTORE_ORC, parquet -> HiveUtils.CONVERT_METASTORE_PARQUET.key.
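A sketch of what that suggestion would look like, assuming the ORC and Parquet cases already live in separate tests; the assertion bodies are placeholders:

```scala
// ORC test: toggle only the ORC conversion flag.
Seq("true", "false").foreach { value =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
    // ... ORC-specific assertions ...
  }
}

// Parquet test: toggle only the Parquet conversion flag.
Seq("true", "false").foreach { value =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
    // ... Parquet-specific assertions ...
  }
}
```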
Thank you for the review, @viirya. Yes, we could do that, but it would be a bit of overkill.
One minor comment doesn't affect this. LGTM.
Test build #82718 has finished for PR 19470 at commit
Test build #82720 has finished for PR 19470 at commit
Retest this please.
The R failure seems to be irrelevant.
Test build #82724 has finished for PR 19470 at commit
Now it passes again. :)
… ORC table instead of ORC file schema Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema. Pass the newly added test case. Author: Dongjoon Hyun <[email protected]> Closes #19470 from dongjoon-hyun/SPARK-18355. (cherry picked from commit e6e3600) Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.2!
Thank you so much, @cloud-fan, @gatorsmile, and @viirya!
BTW, @cloud-fan. Could you review #18460, too? I think we need your final approval. :)
What changes were proposed in this pull request?

Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema.

How was this patch tested?

Pass the newly added test case.
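For context, a minimal reproduction sketch of the underlying problem; the table and column names are hypothetical, and it assumes the ORC files are written through Hive's writer, which before Hive 2.0 records positional `_col0`, `_col1`, ... names instead of the real column names:

```scala
// With conversion disabled, the INSERT goes through Hive's ORC writer,
// so the resulting files carry positional column names (_col0, _col1).
withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "false") {
  sql("CREATE TABLE hive_orc_t (a INT, b STRING) STORED AS ORC")
  sql("INSERT INTO hive_orc_t VALUES (1, 'x')")
}

// With conversion enabled, the converted read path used to match `a` and `b`
// against the file's _colN names and could return nulls or wrong values;
// with this PR the columns resolve against the Spark (metastore) schema instead.
withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") {
  checkAnswer(sql("SELECT a, b FROM hive_orc_t"), Row(1, "x"))
}
```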