
[SPARK-20682][SQL] Update ORC data source based on Apache ORC library #18953

Closed
wants to merge 2 commits into from

Conversation

@dongjoon-hyun (Member) commented Aug 16, 2017

What changes were proposed in this pull request?

Since SPARK-21422, Apache Spark depends on Apache ORC 1.4.0. This PR updates the existing Hive 1.2-based ORC data source by removing the Hive dependency and using only the new Apache ORC 1.4.0 library.

The updated ORC data source in this PR makes the following easy, and more can follow later (see the usage sketch after this list).

  1. Make the ORC data source independent of the Hive module.
    (We will be able to move the new ORC data source into sql/core easily as the next step.)
  2. Support column names with dots (SPARK-21791, [SPARK-21791][SQL] ORC should support column names with dot #19004)
  3. Support pushing down filters for DATE types (SPARK-21787, [SPARK-21787][SQL] Support for pushing down filters for DateType in native OrcFileFormat #18995)
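For illustration, here is a minimal usage sketch of items 2 and 3 above, assuming the new data source handles the ORC read/write path; the paths and column names are illustrative, not taken from this PR.

import java.sql.Date
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-sketch").getOrCreate()
import spark.implicits._

Seq((Date.valueOf("2017-08-16"), 1), (Date.valueOf("2017-08-17"), 2))
  .toDF("event.date", "value")                          // column name containing a dot (item 2)
  .write.mode("overwrite").orc("/tmp/orc_dot_example")

spark.read.orc("/tmp/orc_dot_example")
  .filter($"`event.date`" > Date.valueOf("2017-08-16")) // DATE predicate, a pushdown candidate (item 3)
  .show()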

How was this patch tested?

Passed Jenkins with the updated test suites.

@SparkQA commented Aug 16, 2017

Test build #80707 has finished for PR 18953 at commit 051ed1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable

@dongjoon-hyun (Member Author):

Rebased onto master since #18640 has been merged.

@@ -1,2 +1,2 @@
org.apache.spark.sql.hive.orc.OrcFileFormat
org.apache.spark.sql.hive.orc.OrcFileFormatOld

@dongjoon-hyun (Member Author) commented on the diff:

This will be reverted after review.

@@ -47,11 +47,11 @@ import org.apache.spark.util.SerializableConfiguration
* `FileFormat` for reading ORC files. If this is moved or renamed, please update
* `DataSource`'s backwardCompatibilityMap.
*/
class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable {
class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable {

@dongjoon-hyun (Member Author) commented on the diff:

This name change will be reverted after review.

@@ -343,7 +343,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
}
}

test("SPARK-8501: Avoids discovery schema from empty ORC files") {
ignore("SPARK-8501: Avoids discovery schema from empty ORC files") {

@dongjoon-hyun (Member Author) commented on the diff:

This only happens on old Hive.

@SparkQA commented Aug 16, 2017

Test build #80710 has finished for PR 18953 at commit 22dbe35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan, @gatorsmile, @rxin, @sameeragarwal, and @viirya.
Could you review this ORC PR? I narrowed down the focus and reduced the size of the PR.
For review purposes, I replaced the old ORC with the new ORC.
Thank you always!

@SparkQA commented Aug 16, 2017

Test build #80721 has finished for PR 18953 at commit 07778ed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@cloud-fan (Contributor):

What's the project plan for this ORC work? Shall we move the old ORC data source to sql/core with ORC 1.4 first, and then send a new PR for the vectorized reader?

@dongjoon-hyun (Member Author) commented Aug 16, 2017

Hi, @cloud-fan .
In my email, I proposed the following order:

1. SPARK-21422: Depend on Apache ORC 1.4.0
2. SPARK-20682: Add a new faster ORC data source based on Apache ORC
3. SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
4. SPARK-16060: Vectorized Orc Reader

For Apache Spark 2.3, I thought we would need to keep both, switched by the option spark.sql.orc.enabled (a selection sketch follows this comment).

Do you mean removing the old ORC data source from sql/hive?

In this PR, I replace the sql/hive ORC to reduce the review burden of the test code. The new test code is in #17980.
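A minimal selection sketch under that assumption; the config name spark.sql.orc.enabled comes from the comment above, and the fully qualified class names are hypothetical placements, not this PR's final layout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Hypothetical boolean switch between the two implementations.
val useNewOrc = spark.conf.get("spark.sql.orc.enabled", "true").toBoolean

val orcFormatClass =
  if (useNewOrc) "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat" // new, Apache ORC 1.4 based (illustrative location)
  else "org.apache.spark.sql.hive.orc.OrcFileFormat"                            // old, Hive 1.2 based

spark.read.format(orcFormatClass).load("/tmp/some_orc_dir").show()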

@dongjoon-hyun (Member Author) commented Aug 16, 2017

This PR is about 1,100 lines and #17980 is about 3,833 lines (including the vectorized part).
I also updated #17980 today, so if you want to review that PR, that is also great!
Thank you for the review again, @cloud-fan!

@dongjoon-hyun (Member Author) commented Aug 16, 2017

For the reader, there are three parts (see the sketch after this list).

  1. OrcColumnarBatchReader: not included here. (This is the one that you and @viirya mentioned before.)
  2. OrcRecordIterator: included here; it uses ORC VectorizedRowBatch, but it does not use Spark vectorization.
  3. RecordReaderIterator[OrcStruct]: used here.

Like (1), I can also exclude (2) from this PR to minimize it further. Is that okay?
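For reference, here is a minimal sketch of the Apache ORC 1.4 batch-reading API that (2) builds on; this is not the PR's OrcRecordIterator itself, and the file path is illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(conf))
val rows = reader.rows()                       // org.apache.orc.RecordReader
val batch = reader.getSchema.createRowBatch()  // VectorizedRowBatch, 1024 rows by default
while (rows.nextBatch(batch)) {
  // batch.cols(i) is the ColumnVector for column i; batch.size rows are valid
  println(s"read ${batch.size} rows")
}
rows.close()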

@SparkQA commented Aug 16, 2017

Test build #80722 has finished for PR 18953 at commit 07778ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

Have the ORC APIs changed a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.

@dongjoon-hyun (Member Author) commented Aug 16, 2017

The goal is to use ORC without -Phive, so you can build Spark without the Hive profile and still use the ORC data source.

Previously, org.apache.spark.sql.hive.orc.OrcFileFormat was tightly coupled with Hive code outside the org.apache.spark.sql.hive.orc package, for example org.apache.spark.sql.hive.HiveInspectors. It also uses the following imports (a Hive-free counterpart is sketched after them).

import org.apache.hadoop.hive.conf.HiveConf.ConfVars
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.hive.serde2.objectinspector.{SettableStructObjectInspector, StructObjectInspector}
import org.apache.hadoop.hive.serde2.typeinfo.{StructTypeInfo, TypeInfoUtils}
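
For contrast, a hedged sketch of inspecting an ORC file with only Apache ORC 1.4 classes and no Hive jars on the classpath; the path is illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

// No org.apache.hadoop.hive.* classes are involved here.
val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(new Configuration()))
val schema: TypeDescription = reader.getSchema
println(schema)                  // e.g. struct<id:bigint,name:string>
println(reader.getNumberOfRows)  // row count from the file footer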

@dongjoon-hyun (Member Author):

In the case of OrcFilters.scala, the API has changed as follows.

- Some(builder.startAnd().isNull(attribute).end())
+ Some(builder.startAnd().isNull(attribute, getType(attribute)).end())

You can see more of the differences by running diff sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala.
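For illustration, here is a hedged sketch of the kind of type mapping such a getType helper needs, based only on the API change shown above; the real helper presumably looks up the attribute's Catalyst type from the schema first, and the exact package of PredicateLeaf depends on which ORC artifact is used. This is not the PR's actual code.

import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf
import org.apache.spark.sql.types._

// Map a Catalyst DataType to the SearchArgument PredicateLeaf.Type that
// builder.isNull(attribute, type) now requires.
def predicateLeafTypeSketch(dataType: DataType): PredicateLeaf.Type = dataType match {
  case BooleanType                                   => PredicateLeaf.Type.BOOLEAN
  case ByteType | ShortType | IntegerType | LongType => PredicateLeaf.Type.LONG
  case FloatType | DoubleType                        => PredicateLeaf.Type.FLOAT
  case StringType                                    => PredicateLeaf.Type.STRING
  case DateType                                      => PredicateLeaf.Type.DATE
  case TimestampType                                 => PredicateLeaf.Type.TIMESTAMP
  case _: DecimalType                                => PredicateLeaf.Type.DECIMAL
  case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
}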

@dongjoon-hyun (Member Author):

@cloud-fan, I'll rethink consolidating the old and the new. Thank you for the advice!

@dongjoon-hyun (Member Author):

So far, the current ORC-related code looks quite old and is tightly integrated with hive-exec-1.2.1.spark2.jar and the hive module side by side.
The patch also needs to touch every part because everything has changed, especially OrcInputFormat.createReader (in hive-exec), Filter, SearchArgument, and HiveInspectors.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan. As you advised, I will replace the old ORC in the current namespace and will try to move it to sql/core later. Although we cannot switch between the old and new ORC this way, we can bring the old ORC back from the code if needed. Thanks.

@dongjoon-hyun changed the title from [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC to [SPARK-20682][SQL] Update ORC data source based on Apache ORC library on Aug 17, 2017
@dongjoon-hyun (Member Author):

@cloud-fan, the PR is updated. It is now minimized to +493 and −247 lines.

@SparkQA commented Aug 17, 2017

Test build #80771 has finished for PR 18953 at commit 80c80f3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please

@SparkQA commented Aug 17, 2017

Test build #80777 has finished for PR 18953 at commit 80c80f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan, @gatorsmile, @sameeragarwal, @rxin, @viirya.
Could you review this ORC PR again? Following your advice, I'm replacing the existing ORC inside sql/hive. We can move this into sql/core and remove the unused ORC-related code later.

@SparkQA commented Aug 18, 2017

Test build #80827 has finished for PR 18953 at commit f8de872.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Aug 18, 2017

Test build #80832 has finished for PR 18953 at commit f8de872.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 18, 2017

Test build #80840 has finished for PR 18953 at commit c9321df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert("NONE" === expectedCompressionKind.name())
}
}

// Following codec is not supported in Hive 1.2.1, ignore it now
ignore("LZO compression options for writing to an ORC file not supported in Hive 1.2.1") {

@dongjoon-hyun (Member Author) commented on the diff:

This is a known improvement.
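For context, a hedged usage sketch of the write path the ignored test targets, assuming the new Apache ORC-based writer accepts the lzo codec through the standard compression option; the path is illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.range(100).write
  .mode("overwrite")
  .option("compression", "lzo")   // the Hive 1.2.1-based writer did not support lzo, hence the ignored test
  .orc("/tmp/orc_lzo_example")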

/**
* Return Spark Catalyst value from WritableComparable object.
*/
private[orc] def getCatalystValue(value: WritableComparable[_], dataType: DataType): Any = {

A reviewer (Contributor) commented on the diff:

We'd better return a function to avoid a per-row pattern match. cc @HyukjinKwon, who has fixed similar problems many times.

@SparkQA commented Aug 22, 2017

Test build #80980 has finished for PR 18953 at commit 3d602ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81012 has finished for PR 18953 at commit 263b3dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81013 has finished for PR 18953 at commit 8507aef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Builds a WritableComparable-return function ahead of time according to DataType
* to avoid pattern matching and branching costs per row.
*/
private[orc] def getWritableWrapper(dataType: DataType): Any => Any = dataType match {

@dongjoon-hyun (Member Author) commented on the diff:

Hi, @cloud-fan.
I updated the PR to return functions (a partial sketch of the idea follows). Could you review it again?
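For illustration, a hedged partial sketch of what such a per-type wrapper could look like; the writables and the exact cases here are illustrative, not the PR's actual getWritableWrapper body.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Build the converter once per DataType so the per-row path only applies a function.
def writableWrapperSketch(dataType: DataType): Any => Any = dataType match {
  case IntegerType => (v: Any) => new IntWritable(v.asInstanceOf[Int])
  case LongType    => (v: Any) => new LongWritable(v.asInstanceOf[Long])
  case StringType  => (v: Any) => new Text(v.asInstanceOf[UTF8String].getBytes)
  case _           => (v: Any) => v   // remaining types omitted in this sketch
}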

@dongjoon-hyun (Member Author):

Hi, @cloud-fan.
Could you review this again when you have some time?

@SparkQA commented Aug 25, 2017

Test build #81142 has finished for PR 18953 at commit b9b348d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 26, 2017

Test build #81148 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Now it is down to +432 −98 lines.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan and @gatorsmile.
Could you review this PR when you have some time?
If it needs more refactoring or a spin-off, please let me know.
Thank you always.

@dongjoon-hyun (Member Author):

Hi, @cloud-fan.
Could you review this PR?

@dongjoon-hyun (Member Author):

Hi, @cloud-fan and @gatorsmile.
I know that you have been spending a lot of time reviewing my PRs (including this one).
Thank you always. If you have anything in mind, please let me know and I'll try to improve it.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Aug 31, 2017

Test build #81268 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, @marmbrus, @liancheng, @yhuai.
Could you give me some advice on this ORC upgrade PR?
I tried to minimize the diff of the PR, so I didn't remove the unused old code.
Thank you in advance.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Sep 1, 2017

Test build #81326 has finished for PR 18953 at commit 6548cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Hi, all.
Although ORC does not seem to be a preferred storage format in Apache Spark, ORC is very important to me. Could anyone review this again?

@SparkQA commented Sep 7, 2017

Test build #81513 has finished for PR 18953 at commit 014f2f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Sep 9, 2017

Test build #81569 has finished for PR 18953 at commit 014f2f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2017

Test build #81597 has finished for PR 18953 at commit ed43eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcSourceSuite extends OrcSuite with SQLTestUtils

@dongjoon-hyun (Member Author):

This is resolved via #19651.

@dongjoon-hyun deleted the SPARK-20682-3 branch on January 7, 2019 at 07:04.