[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC #17943

Closed

dongjoon-hyun
Member

What changes were proposed in this pull request?

Since SPARK-2883, Apache Spark has supported Apache ORC inside the sql/hive module with a Hive dependency. This issue aims to add a new and faster ORC data source inside sql/core and eventually replace the old ORC data source. This PR uses the latest Apache ORC 1.4.0 (released yesterday).

There are four key benefits.

  • Speed: The plan is to use Spark ColumnarBatch and ORC RowBatch together later. This PR uses only RowBatch, which is already faster than the current implementation in Spark. For ColumnarBatch, we still need to benchmark and choose the fastest way to use it. (Please refer to the discussion on [SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924.)
  • Stability: Apache ORC 1.4.0 includes many fixes, and we can rely more on the ORC community.
  • Usability: Users can use ORC data sources without the hive module, i.e., without -Phive.
  • Maintainability: This reduces the Hive dependency and lets us remove old legacy code later.

The following are two example comparisons from OrcReadBenchmark.scala.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL ORC Vectorized Reader                      278 /  320         56.5          17.7       1.0X
SQL ORC MR Reader                              348 /  358         45.2          22.1       0.8X
HIVE ORC MR Reader                             418 /  430         37.6          26.6       0.7X

Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL Read data column                           273 /  283         57.6          17.4       1.0X
SQL Read partition column                      252 /  266         62.5          16.0       1.1X
SQL Read both columns                          283 /  293         55.5          18.0       1.0X
HIVE Read data column                          510 /  520         30.8          32.4       0.5X
HIVE Read partition column                     420 /  425         37.5          26.7       0.7X
HIVE Read both columns                         527 /  538         29.9          33.5       0.5X
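
As a reading aid for the tables above, here is a minimal sketch of how the Relative column can be reproduced from the Best times. It assumes that Relative is the first row's best time divided by each row's best time; that interpretation matches the printed figures but is not stated in this PR. The helper name `relative_speed` is hypothetical and is not part of OrcReadBenchmark.scala.

```python
def relative_speed(best_ms):
    """Return each row's speed relative to the first (baseline) row.

    Assumption: relative = baseline best time / row best time, so values
    below 1.0 mean slower than the baseline and values above 1.0 faster.
    """
    baseline = next(iter(best_ms.values()))
    return {name: baseline / ms for name, ms in best_ms.items()}


# Best times (ms) copied from the "SQL Single Int Column Scan" table.
single_int_scan = {
    "SQL ORC Vectorized Reader": 278,
    "SQL ORC MR Reader": 348,
    "HIVE ORC MR Reader": 418,
}

ratios = relative_speed(single_int_scan)
for name, ratio in ratios.items():
    # Same one-decimal "X" format as the benchmark output.
    print(f"{name:<28} {ratio:.1f}X")
```

Under that assumption, the three rows print as 1.0X, 0.8X, and 0.7X, matching the table. Read the other way round, the Hive reader's best time is about 1.5 times that of the new vectorized reader (418 ms versus 278 ms).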

How was this patch tested?

Pass the Jenkins tests with the newly added test suites in sql/core.

@SparkQA

SparkQA commented May 11, 2017

Test build #76764 has finished for PR 17943 at commit 70bc00e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2017

Test build #76776 has started for PR 17943 at commit d1417aa.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented May 11, 2017

Test build #76790 has finished for PR 17943 at commit d1417aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @cloud-fan and @viirya.
Could you review the new ORC data source (without the ColumnarBatch part) again?

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 17, 2017

Test build #78193 has finished for PR 17943 at commit d1417aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Please refer to the superset PR #17980.

@dongjoon-hyun deleted the SPARK-20682-2 branch on September 9, 2017, 23:33