[SPARK-20682][SQL] Update ORC data source based on Apache ORC library #18953
Test build #80707 has finished for PR 18953 at commit
Rebased to the master since #18640 is merged.
@@ -1,2 +1,2 @@
-org.apache.spark.sql.hive.orc.OrcFileFormat
+org.apache.spark.sql.hive.orc.OrcFileFormatOld
This will be reverted after review.
@@ -47,11 +47,11 @@ import org.apache.spark.util.SerializableConfiguration
  * `FileFormat` for reading ORC files. If this is moved or renamed, please update
  * `DataSource`'s backwardCompatibilityMap.
  */
-class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable {
+class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable {
This change of name will be reverted after review.
@@ -343,7 +343,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
 }
 }

-test("SPARK-8501: Avoids discovery schema from empty ORC files") {
+ignore("SPARK-8501: Avoids discovery schema from empty ORC files") {
This only happens on old Hive.
Test build #80710 has finished for PR 18953 at commit
Hi, @cloud-fan, @gatorsmile, @rxin, @sameeragarwal, and @viirya.
Test build #80721 has finished for PR 18953 at commit
Retest this please.
What's the project plan for this ORC work? Shall we first move the old ORC data source to sql/core with ORC 1.4, and then send a new PR for the vectorized reader?
Hi, @cloud-fan .
In Apache Spark 2.3, I thought we need to keep both by option Do you mean In this PR, I replaces |
This PR is about 1,100 lines and #17980 is about 3,833 lines (including the vectorized part).
For the reader, there are three parts.
Like (1), I can exclude (2) from this PR to minimize it again. Is that okay?
Test build #80722 has finished for PR 18953 at commit
Did the ORC APIs change a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.
The goal is using ORC without Previously,
|
In case of
You can see more diff by |
@cloud-fan. I'll rethink consolidating the old and the new. Thank you for the advice!
So far, the current ORC related code looks too old and tightly integrated with
Hi, @cloud-fan. As you advised, I will replace old ORC in the current namespace and will try to move to
@cloud-fan. The PR is updated. Now, it's minimized to +493 and −247 lines.
Test build #80771 has finished for PR 18953 at commit
Retest this please
Test build #80777 has finished for PR 18953 at commit
Hi, @cloud-fan, @gatorsmile, @sameeragarwal, @rxin, @viirya.
Test build #80827 has finished for PR 18953 at commit
Retest this please.
Test build #80832 has finished for PR 18953 at commit
Test build #80840 has finished for PR 18953 at commit
assert("NONE" === expectedCompressionKind.name())
}
}

// Following codec is not supported in Hive 1.2.1, ignore it now
ignore("LZO compression options for writing to an ORC file not supported in Hive 1.2.1") {
This is a known improvement.
/**
 * Return Spark Catalyst value from WritableComparable object.
 */
private[orc] def getCatalystValue(value: WritableComparable[_], dataType: DataType): Any = {
We'd better return a function to avoid per-row pattern matching. cc @HyukjinKwon, who has fixed similar problems many times.
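To illustrate the suggestion, here is a minimal, self-contained sketch of the difference between matching on the type for every value and building a converter function once per column. It does not use the real Spark or Hadoop classes: `SimpleType`, `convertPerRow`, `makeConverter`, and `convertRows` are hypothetical names invented for this example, standing in for `DataType` and the ORC value wrappers.

```scala
// Hypothetical stand-ins for Spark's DataType hierarchy (illustration only,
// not the real org.apache.spark.sql.types classes).
sealed trait SimpleType
case object IntType extends SimpleType
case object StringType extends SimpleType
case object BoolType extends SimpleType

object ConverterSketch {
  // Per-row approach: the pattern match on `dataType` runs for every value.
  def convertPerRow(value: Any, dataType: SimpleType): Any = dataType match {
    case IntType    => value.toString.toInt
    case StringType => value.toString
    case BoolType   => value.toString.toBoolean
  }

  // Ahead-of-time approach: match once per column and return a reusable function.
  def makeConverter(dataType: SimpleType): Any => Any = dataType match {
    case IntType    => (v: Any) => v.toString.toInt
    case StringType => (v: Any) => v.toString
    case BoolType   => (v: Any) => v.toString.toBoolean
  }

  // Build the converters once for the schema, then apply them to each row
  // without any further matching on the type.
  def convertRows(schema: Seq[SimpleType], rows: Seq[Seq[Any]]): Seq[Seq[Any]] = {
    val converters = schema.map(makeConverter).toIndexedSeq
    rows.map { row =>
      row.zipWithIndex.map { case (v, i) => converters(i)(v) }
    }
  }
}
```

With this shape, the cost of the type dispatch is paid once per column rather than once per value, which is the motivation given in the review comment.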
Test build #80980 has finished for PR 18953 at commit
Test build #81012 has finished for PR 18953 at commit
Test build #81013 has finished for PR 18953 at commit
 * Builds a WritableComparable-return function ahead of time according to DataType
 * to avoid pattern matching and branching costs per row.
 */
private[orc] def getWritableWrapper(dataType: DataType): Any => Any = dataType match {
Hi, @cloud-fan .
I updated the PR to return functions. Could you review again?
Hi, @cloud-fan.
Test build #81142 has finished for PR 18953 at commit
Test build #81148 has finished for PR 18953 at commit
Now, it becomes
Hi, @cloud-fan and @gatorsmile.
Hi, @cloud-fan.
Hi, @cloud-fan and @gatorsmile.
Retest this please.
Test build #81268 has finished for PR 18953 at commit
Hi, @marmbrus, @liancheng, @yhuai.
Retest this please.
Test build #81326 has finished for PR 18953 at commit
Hi, All.
Test build #81513 has finished for PR 18953 at commit
Retest this please.
Test build #81569 has finished for PR 18953 at commit
Test build #81597 has finished for PR 18953 at commit
This is resolved via #19651.
What changes were proposed in this pull request?
Since SPARK-21422, Apache Spark depends on Apache ORC 1.4.0. This PR updates the existing Hive 1.2-based ORC data source by removing the Hive dependency and using only the new Apache ORC 1.4.0 library.
The updated ORC format in this PR will make the following easy to enable. Also, we can expect more later.
(We will be able to move the new ORC data source into sql/core easily at the next step.)
How was this patch tested?
Passes Jenkins with the updated test suites.