Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and sql/core #17980

Closed
wants to merge 1 commit into from
Closed

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented May 15, 2017

What changes were proposed in this pull request?

SPARK-20682 is trying to improve Apache Spark to have a new ORCFileFormat based on Apache ORC for many reasons.

On top of that, this PR depends on SPARK-20682 and aims to provide a configuration to choose the default ORCFileFormat from legacy sql/hive module or new sql/core module.

For example, this configuration will affects the following operations.

spark.read.orc(...)
CREATE TABLE t
USING ORC
...

Since SPARK-20682 (#17924 and #17943) are still under review, I'm inevitably including the dependent code. I'll update this and previous PR according to the review result.

How was this patch tested?

Pass the Jenkins with new test suites.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76924 has finished for PR 17980 at commit 7716234.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76931 has started for PR 17980 at commit 73d56f2.

@@ -55,10 +56,12 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
}

test("should fail to load ORC without Hive Support") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test case when we remove sql/hive ORCFileFormat.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76935 has finished for PR 17980 at commit 73d56f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 17, 2017

Test build #77011 has started for PR 17980 at commit 87dd161.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented May 17, 2017

Test build #77019 has finished for PR 17980 at commit 87dd161.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 24, 2017

Test build #77264 has finished for PR 17980 at commit 05716c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @sameeragarwal , @cloud-fan , @gatorsmile .
Could you give me some advice on this? I think this is the start of Feature Parity for ORC.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 17, 2017

Test build #78192 has finished for PR 17980 at commit 05716c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2017

Test build #79264 has finished for PR 17980 at commit 39c3851.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jul 6, 2017

Test build #79269 has finished for PR 17980 at commit 39c3851.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin, @sameeragarwal , @cloud-fan , @gatorsmile , @viirya .
Since Apache Spark 2.2.0 release is going to coming soon, it's a good time to restart this.
Could you give me some advice how to make ORC support up-to-date in Apache Spark 2.3?

@SparkQA
Copy link

SparkQA commented Jul 9, 2017

Test build #79412 has finished for PR 17980 at commit cbc5bc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 12, 2017

Test build #79570 has finished for PR 17980 at commit 5db6acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Jul 14, 2017

In general, this PR seems to have the large number of lines.
Are there any ways to split this PR into multiple smaller PRs?

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jul 14, 2017

Thank you for review, @kiszk .

If we need to split more, we can split the followings.

  • Adding Apache ORC dependency (pom and dependency changes)
  • OrcReadBenchmark (388 lines)

Is that helpful for review?

In general, 50% lines are new ORC test codes which is duplicated from old ORC test codes. Could you recommend some better way?

@dongjoon-hyun
Copy link
Member Author

Hi, @kiszk . I will start with Adding Apache ORC dependency (pom and dependency changes) in
SPARK-21422 first.

@kiszk
Copy link
Member

kiszk commented Jul 15, 2017

If new test cases work for the existing orc component, how about updating test cases at first?

@dongjoon-hyun
Copy link
Member Author

I developed new ORC component and tests in sql/core module without sql/hive module. Do you mean add a sql/hive module test-dependency into sql module?

@kiszk
Copy link
Member

kiszk commented Jul 17, 2017

I see. I thought that new tests are for old and new ORC components.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jul 17, 2017

I'm trying to split the PR as you recommended before. Thank you always.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 4, 2017

Test build #80224 has finished for PR 17980 at commit 5db6acf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented Aug 4, 2017

Test build #80254 has finished for PR 17980 at commit 5db6acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 14, 2017

Test build #80603 has finished for PR 17980 at commit 5db6acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 16, 2017
## What changes were proposed in this pull request?

Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.

- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
- Maintainability: Reduce the Hive dependency and can remove old legacy code later.

Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (apache#17980), too.
- Usability: User can use ORC data sources without hive module, i.e, -Phive.
- Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.

## How was this patch tested?

Pass the jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes apache#18640 from dongjoon-hyun/SPARK-21422.
@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80727 has finished for PR 17980 at commit 5a9958d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80729 has finished for PR 17980 at commit 61b27e5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80741 has finished for PR 17980 at commit 61b27e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81090 has finished for PR 17980 at commit 61b27e5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81095 has finished for PR 17980 at commit 02371c6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81096 has finished for PR 17980 at commit 04b8494.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81190 has finished for PR 17980 at commit 257e5a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2017

Test build #81274 has finished for PR 17980 at commit edead5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 4, 2017

Test build #81379 has finished for PR 17980 at commit c50e997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 11, 2017

Test build #81612 has finished for PR 17980 at commit 01868ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

@dongjoon-hyun Do we still need this PR?

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @jiangxb1987 . Yes. This is still required. I'll rebase this PR after #19651 .

@dongjoon-hyun
Copy link
Member Author

Almost codes inside this PR are merged now except #19943 .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-20728 branch December 21, 2017 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants