
[SPARK-20682][SQL] Update ORC data source based on Apache ORC library #18953

Closed
wants to merge 2 commits into from

Conversation

@dongjoon-hyun (Member) commented Aug 16, 2017

What changes were proposed in this pull request?

Since SPARK-21422, Apache Spark depends on Apache ORC 1.4.0. This PR updates the existing Hive 1.2-based ORC data source by removing the Hive dependency and using only the new Apache ORC 1.4.0 library.

The updated ORC data source in this PR makes the following easy, and more can follow later (see the usage sketch after this list).

  1. Make the ORC data source independent of the Hive module.
    (We will be able to move the new ORC data source into sql/core easily as the next step.)
  2. Support column names with dots (SPARK-21791, [SPARK-21791][SQL] ORC should support column names with dot #19004)
  3. Support pushing down filters for DATE types (SPARK-21787, [SPARK-21787][SQL] Support for pushing down filters for DateType in native OrcFileFormat #18995)
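For illustration, here is a minimal usage sketch of items 2 and 3 above, assuming the new data source handles the ORC read/write path; the paths and column names are illustrative, not taken from this PR.

import java.sql.Date
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-sketch").getOrCreate()
import spark.implicits._

Seq((Date.valueOf("2017-08-16"), 1), (Date.valueOf("2017-08-17"), 2))
  .toDF("event.date", "value")                          // column name containing a dot (item 2)
  .write.mode("overwrite").orc("/tmp/orc_dot_example")

spark.read.orc("/tmp/orc_dot_example")
  .filter($"`event.date`" > Date.valueOf("2017-08-16")) // DATE predicate, a pushdown candidate (item 3)
  .show()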

How was this patch tested?

Passed Jenkins with the updated test suites.

@SparkQA commented Aug 16, 2017

Test build #80707 has finished for PR 18953 at commit 051ed1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable

@dongjoon-hyun (Member Author):

Rebased onto master since #18640 has been merged.

@@ -1,2 +1,2 @@
org.apache.spark.sql.hive.orc.OrcFileFormat
org.apache.spark.sql.hive.orc.OrcFileFormatOld

@dongjoon-hyun (Member Author) commented on the diff:

This will be reverted after review.

@@ -47,11 +47,11 @@ import org.apache.spark.util.SerializableConfiguration
* `FileFormat` for reading ORC files. If this is moved or renamed, please update
* `DataSource`'s backwardCompatibilityMap.
*/
class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable {
class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable {

@dongjoon-hyun (Member Author) commented on the diff:

This name change will be reverted after review.

@@ -343,7 +343,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
}
}

test("SPARK-8501: Avoids discovery schema from empty ORC files") {
ignore("SPARK-8501: Avoids discovery schema from empty ORC files") {

@dongjoon-hyun (Member Author) commented on the diff:

This only happens on old Hive.

@SparkQA commented Aug 16, 2017

Test build #80710 has finished for PR 18953 at commit 22dbe35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan, @gatorsmile, @rxin, @sameeragarwal, and @viirya.
Could you review this ORC PR? I narrowed down the focus and reduced the size of the PR.
For review purposes, I replaced the old ORC with the new ORC.
Thank you always!

@SparkQA commented Aug 16, 2017

Test build #80721 has finished for PR 18953 at commit 07778ed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@cloud-fan (Contributor):

What's the project plan for this ORC work? Shall we move the old ORC data source to sql/core with ORC 1.4 first, and then send a new PR for the vectorized reader?

@dongjoon-hyun (Member Author) commented Aug 16, 2017

Hi, @cloud-fan .
In my email, I proposed the following order:

1. SPARK-21422: Depend on Apache ORC 1.4.0
2. SPARK-20682: Add a new faster ORC data source based on Apache ORC
3. SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
4. SPARK-16060: Vectorized Orc Reader

For Apache Spark 2.3, I thought we would need to keep both, switched by the option spark.sql.orc.enabled (a selection sketch follows this comment).

Do you mean removing the old ORC data source from sql/hive?

In this PR, I replace the sql/hive ORC to reduce the review burden of the test code. The new test code is in #17980.
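A minimal selection sketch under that assumption; the config name spark.sql.orc.enabled comes from the comment above, and the fully qualified class names are hypothetical placements, not this PR's final layout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Hypothetical boolean switch between the two implementations.
val useNewOrc = spark.conf.get("spark.sql.orc.enabled", "true").toBoolean

val orcFormatClass =
  if (useNewOrc) "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat" // new, Apache ORC 1.4 based (illustrative location)
  else "org.apache.spark.sql.hive.orc.OrcFileFormat"                            // old, Hive 1.2 based

spark.read.format(orcFormatClass).load("/tmp/some_orc_dir").show()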

@dongjoon-hyun (Member Author) commented Aug 16, 2017

This PR is about 1,100 lines and #17980 is about 3,833 lines (including the vectorized part).
I also updated #17980 today, so if you want to review that PR, that is also great!
Thank you for the review again, @cloud-fan!

@dongjoon-hyun (Member Author) commented Aug 16, 2017

For the reader, there are three parts (see the sketch after this list).

  1. OrcColumnarBatchReader: not included here. (This is the one that you and @viirya mentioned before.)
  2. OrcRecordIterator: included here; it uses ORC VectorizedRowBatch, but it does not use Spark vectorization.
  3. RecordReaderIterator[OrcStruct]: used here.

Like (1), I can also exclude (2) from this PR to minimize it further. Is that okay?
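For reference, here is a minimal sketch of the Apache ORC 1.4 batch-reading API that (2) builds on; this is not the PR's OrcRecordIterator itself, and the file path is illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(conf))
val rows = reader.rows()                       // org.apache.orc.RecordReader
val batch = reader.getSchema.createRowBatch()  // VectorizedRowBatch, 1024 rows by default
while (rows.nextBatch(batch)) {
  // batch.cols(i) is the ColumnVector for column i; batch.size rows are valid
  println(s"read ${batch.size} rows")
}
rows.close()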

@SparkQA commented Aug 16, 2017

Test build #80722 has finished for PR 18953 at commit 07778ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

Have the ORC APIs changed a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.

@dongjoon-hyun (Member Author) commented Aug 16, 2017

The goal is to use ORC without -Phive, so you can build Spark without the Hive profile and still use the ORC data source.

Previously, org.apache.spark.sql.hive.orc.OrcFileFormat was tightly coupled with Hive code outside the org.apache.spark.sql.hive.orc package, for example org.apache.spark.sql.hive.HiveInspectors. It also uses the following imports (a Hive-free counterpart is sketched after them).

import org.apache.hadoop.hive.conf.HiveConf.ConfVars
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.hive.serde2.objectinspector.{SettableStructObjectInspector, StructObjectInspector}
import org.apache.hadoop.hive.serde2.typeinfo.{StructTypeInfo, TypeInfoUtils}
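
For contrast, a hedged sketch of inspecting an ORC file with only Apache ORC 1.4 classes and no Hive jars on the classpath; the path is illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

// No org.apache.hadoop.hive.* classes are involved here.
val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(new Configuration()))
val schema: TypeDescription = reader.getSchema
println(schema)                  // e.g. struct<id:bigint,name:string>
println(reader.getNumberOfRows)  // row count from the file footer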

@dongjoon-hyun (Member Author):

In the case of OrcFilters.scala, the API has changed as follows.

- Some(builder.startAnd().isNull(attribute).end())
+ Some(builder.startAnd().isNull(attribute, getType(attribute)).end())

You can see more of the differences by running diff sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala.
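For illustration, here is a hedged sketch of the kind of type mapping such a getType helper needs, based only on the API change shown above; the real helper presumably looks up the attribute's Catalyst type from the schema first, and the exact package of PredicateLeaf depends on which ORC artifact is used. This is not the PR's actual code.

import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf
import org.apache.spark.sql.types._

// Map a Catalyst DataType to the SearchArgument PredicateLeaf.Type that
// builder.isNull(attribute, type) now requires.
def predicateLeafTypeSketch(dataType: DataType): PredicateLeaf.Type = dataType match {
  case BooleanType                                   => PredicateLeaf.Type.BOOLEAN
  case ByteType | ShortType | IntegerType | LongType => PredicateLeaf.Type.LONG
  case FloatType | DoubleType                        => PredicateLeaf.Type.FLOAT
  case StringType                                    => PredicateLeaf.Type.STRING
  case DateType                                      => PredicateLeaf.Type.DATE
  case TimestampType                                 => PredicateLeaf.Type.TIMESTAMP
  case _: DecimalType                                => PredicateLeaf.Type.DECIMAL
  case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
}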

@dongjoon-hyun (Member Author):

@cloud-fan, I'll rethink consolidating the old and the new. Thank you for the advice!

@dongjoon-hyun (Member Author):

So far, the current ORC-related code looks quite old and is tightly integrated with hive-exec-1.2.1.spark2.jar and the hive module side by side.
The patch also needs to touch every part because everything has changed, especially OrcInputFormat.createReader (in hive-exec), Filter, SearchArgument, and HiveInspectors.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan. As you advised, I will replace the old ORC in the current namespace and will try to move it to sql/core later. Although we cannot switch between the old and new ORC this way, we can bring the old ORC back from the code if needed. Thanks.

@dongjoon-hyun changed the title from [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC to [SPARK-20682][SQL] Update ORC data source based on Apache ORC library on Aug 17, 2017
@dongjoon-hyun (Member Author):

@cloud-fan, the PR is updated. It is now minimized to +493 and −247 lines.

@SparkQA commented Aug 17, 2017

Test build #80771 has finished for PR 18953 at commit 80c80f3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please

@SparkQA commented Aug 17, 2017

Test build #80777 has finished for PR 18953 at commit 80c80f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan, @gatorsmile, @sameeragarwal, @rxin, @viirya.
Could you review this ORC PR again? Following your advice, I'm replacing the existing ORC inside sql/hive. We can move this into sql/core and remove the unused ORC-related code later.

@SparkQA commented Aug 18, 2017

Test build #80827 has finished for PR 18953 at commit f8de872.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Aug 18, 2017

Test build #80832 has finished for PR 18953 at commit f8de872.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 18, 2017

Test build #80840 has finished for PR 18953 at commit c9321df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert("NONE" === expectedCompressionKind.name())
}
}

// Following codec is not supported in Hive 1.2.1, ignore it now
ignore("LZO compression options for writing to an ORC file not supported in Hive 1.2.1") {

@dongjoon-hyun (Member Author) commented on the diff:

This is a known improvement.
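For context, a hedged usage sketch of the write path the ignored test targets, assuming the new Apache ORC-based writer accepts the lzo codec through the standard compression option; the path is illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.range(100).write
  .mode("overwrite")
  .option("compression", "lzo")   // the Hive 1.2.1-based writer did not support lzo, hence the ignored test
  .orc("/tmp/orc_lzo_example")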

/**
* Return Spark Catalyst value from WritableComparable object.
*/
private[orc] def getCatalystValue(value: WritableComparable[_], dataType: DataType): Any = {

A reviewer (Contributor) commented on the diff:

We'd better return a function to avoid a per-row pattern match. cc @HyukjinKwon, who has fixed similar problems many times.

@SparkQA commented Aug 22, 2017

Test build #80980 has finished for PR 18953 at commit 3d602ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81012 has finished for PR 18953 at commit 263b3dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81013 has finished for PR 18953 at commit 8507aef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Builds a WritableComparable-return function ahead of time according to DataType
* to avoid pattern matching and branching costs per row.
*/
private[orc] def getWritableWrapper(dataType: DataType): Any => Any = dataType match {

@dongjoon-hyun (Member Author) commented on the diff:

Hi, @cloud-fan.
I updated the PR to return functions (a partial sketch of the idea follows). Could you review it again?
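For illustration, a hedged partial sketch of what such a per-type wrapper could look like; the writables and the exact cases here are illustrative, not the PR's actual getWritableWrapper body.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Build the converter once per DataType so the per-row path only applies a function.
def writableWrapperSketch(dataType: DataType): Any => Any = dataType match {
  case IntegerType => (v: Any) => new IntWritable(v.asInstanceOf[Int])
  case LongType    => (v: Any) => new LongWritable(v.asInstanceOf[Long])
  case StringType  => (v: Any) => new Text(v.asInstanceOf[UTF8String].getBytes)
  case _           => (v: Any) => v   // remaining types omitted in this sketch
}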

@dongjoon-hyun (Member Author):

Hi, @cloud-fan.
Could you review this again when you have some time?

@SparkQA commented Aug 25, 2017

Test build #81142 has finished for PR 18953 at commit b9b348d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 26, 2017

Test build #81148 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Now it is down to +432 −98 lines.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan and @gatorsmile.
Could you review this PR when you have some time?
If it needs more refactoring or a spin-off, please let me know.
Thank you always.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan.
Could you review this PR?

@dongjoon-hyun (Member Author):

Hi, @cloud-fan and @gatorsmile.
I know that you have been spending a lot of time reviewing my PRs (including this one).
Thank you always. If you have anything in mind, please let me know and I'll try to improve it.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Aug 31, 2017

Test build #81268 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @marmbrus, @liancheng, @yhuai.
Could you give me some advice on this ORC upgrade PR?
I tried to minimize the diff of the PR, so I didn't remove the unused old code.
Thank you in advance.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Sep 1, 2017

Test build #81326 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, all.
Although ORC does not seem to be a preferred storage format in Apache Spark, ORC is very important to me. Could anyone review this again?

@SparkQA commented Sep 7, 2017

Test build #81513 has finished for PR 18953 at commit 014f2f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Sep 9, 2017

Test build #81569 has finished for PR 18953 at commit 014f2f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2017

Test build #81597 has finished for PR 18953 at commit ed43eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcSourceSuite extends OrcSuite with SQLTestUtils

@dongjoon-hyun (Member Author):

This is resolved via #19651.

@dongjoon-hyun deleted the SPARK-20682-3 branch on January 7, 2019 at 07:04.