[SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and sql/core #17980
Test build #76924 has finished for PR 17980 at commit
Test build #76931 has started for PR 17980 at commit
@@ -55,10 +56,12 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
  }

  test("should fail to load ORC without Hive Support") {
We can remove this test case when we remove the `sql/hive` ORCFileFormat.
Retest this please.
Test build #76935 has finished for PR 17980 at commit
Test build #77011 has started for PR 17980 at commit
Retest this please
Test build #77019 has finished for PR 17980 at commit
Test build #77264 has finished for PR 17980 at commit
Hi, @sameeragarwal, @cloud-fan, @gatorsmile.
Retest this please.
Test build #78192 has finished for PR 17980 at commit
Test build #79264 has finished for PR 17980 at commit
Retest this please.
Test build #79269 has finished for PR 17980 at commit
Hi, @rxin, @sameeragarwal, @cloud-fan, @gatorsmile, @viirya.
Test build #79412 has finished for PR 17980 at commit
Test build #79570 has finished for PR 17980 at commit
In general, this PR seems to have a large number of lines.
Thank you for the review, @kiszk.
If we need to split it further, we can split out the following.
Would that help the review? In general, about 50% of the lines are new ORC test code duplicated from the old ORC test code. Could you recommend a better way?
Hi, @kiszk. I will start with
If the new test cases also work for the existing ORC component, how about updating the test cases first?
I developed the new ORC component and tests in
I see. I thought the new tests were for both the old and new ORC components.
I'm trying to split the PR as you recommended earlier. Thank you, as always.
Retest this please.
Test build #80224 has finished for PR 17980 at commit
Retest this please
Test build #80254 has finished for PR 17980 at commit
Test build #80603 has finished for PR 17980 at commit
## What changes were proposed in this pull request?

Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. Apache ORC 1.4 brings these key benefits.

- Stability: Apache ORC 1.4.0 has many fixes, and we can rely more on the ORC community.
- Maintainability: It reduces the Hive dependency, so old legacy code can be removed later.

Later, we can also get the following two key benefits by adding the new ORCFileFormat in SPARK-20728 (apache#17980).

- Usability: Users can use ORC data sources without the hive module, i.e., `-Phive`.
- Speed: Use Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes apache#18640 from dongjoon-hyun/SPARK-21422.
Test build #80727 has finished for PR 17980 at commit
Test build #80729 has finished for PR 17980 at commit
Retest this please.
Test build #80741 has finished for PR 17980 at commit
Retest this please.
Test build #81090 has finished for PR 17980 at commit
Test build #81095 has finished for PR 17980 at commit
Test build #81096 has finished for PR 17980 at commit
Test build #81190 has finished for PR 17980 at commit
Test build #81274 has finished for PR 17980 at commit
Test build #81379 has finished for PR 17980 at commit
Test build #81612 has finished for PR 17980 at commit
@dongjoon-hyun Do we still need this PR?
Thank you for the review, @jiangxb1987. Yes, this is still required. I'll rebase this PR after #19651.
Almost all of the code in this PR has been merged now, except for #19943.
What changes were proposed in this pull request?
SPARK-20682 aims to give Apache Spark a new ORCFileFormat based on Apache ORC, for many reasons.
On top of that, this PR depends on SPARK-20682 and aims to provide a configuration to choose the default ORCFileFormat from either the legacy `sql/hive` module or the new `sql/core` module.
For example, this configuration will affect the following operations.
Since the SPARK-20682 PRs (#17924 and #17943) are still under review, this PR necessarily includes the dependent code. I'll update this PR and the previous ones according to the review results.
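The choice between the two modules can be pictured as a small resolver that maps a configuration value to a file format class name. The following is only an illustrative sketch, not the code in this PR: the `OrcImplResolver` object is hypothetical, and the key `spark.sql.orc.impl` with values `native` and `hive` is the option found in released Spark 2.3, which may differ from the key proposed here. The two class names are those of the `sql/core` and `sql/hive` ORC file formats in Spark 2.3.

```scala
// Hypothetical sketch, not the PR's code: maps an ORC implementation
// setting to the fully qualified class name of a file format.
object OrcImplResolver {
  // Assumed key and values, taken from released Spark 2.3.
  val ConfKey = "spark.sql.orc.impl"

  private val implementations: Map[String, String] = Map(
    // New implementation in sql/core, based on Apache ORC.
    "native" -> "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat",
    // Legacy implementation in sql/hive.
    "hive" -> "org.apache.spark.sql.hive.orc.OrcFileFormat"
  )

  /** Returns Right(class name) for a known setting, or Left(error message). */
  def resolve(conf: Map[String, String]): Either[String, String] =
    conf.get(ConfKey) match {
      case None => Left(s"$ConfKey is not set")
      case Some(value) =>
        implementations
          .get(value.toLowerCase(java.util.Locale.ROOT))
          .toRight(s"Unknown value for $ConfKey: $value")
    }
}
```

In a Spark 2.3 or later session, the equivalent switch would be `spark.conf.set("spark.sql.orc.impl", "native")` before reading or writing with the `orc` format. The sketch returns an `Either` rather than picking a default, because the default value is not stated in this description.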
How was this patch tested?
Pass the Jenkins with new test suites.