
[SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924

Closed
wants to merge 3 commits into from

Conversation

dongjoon-hyun (Member) commented May 9, 2017

What changes were proposed in this pull request?

Since SPARK-2883, Apache Spark has supported Apache ORC inside the sql/hive module with a Hive dependency. This issue aims to add a new, faster ORC data source inside sql/core and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used.

There are four key benefits.

  • Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This is faster than the current implementation in Spark.
  • Stability: Apache ORC 1.4.0 has many fixes, and we can depend more on the ORC community.
  • Usability: Users can use the ORC data source without the hive module, i.e., without -Phive (see the usage sketch below).
  • Maintainability: Reduce the Hive dependency and allow the old legacy code to be removed later.
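
To make the usability point concrete, here is a minimal usage sketch (not part of this PR's diff); the path and table name are placeholders, and it assumes the new format keeps the `orc` short name:

```scala
// Minimal usage sketch (not from this PR's diff): with the ORC data source in
// sql/core, ORC works through the standard APIs without building Spark with
// -Phive. The path and table name below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-usability-sketch").getOrCreate()

// DataFrame API: write and read ORC directly.
spark.range(0, 10).write.mode("overwrite").orc("/tmp/orc_usability_sketch")
spark.read.orc("/tmp/orc_usability_sketch").show()

// SQL DDL: assuming the new format is registered under the "orc" short name,
// `USING ORC` resolves to it without Hive support.
spark.sql("CREATE TABLE orc_sketch (id INT) USING ORC")
```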

The following are two example comparisons from OrcReadBenchmark.scala.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL ORC Vectorized                             170 /  194         92.5          10.8       1.0X
SQL ORC MR                                     388 /  396         40.5          24.7       0.4X
HIVE ORC MR                                    488 /  496         32.3          31.0       0.3X

Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL Read data column                           188 /  227         83.6          12.0       1.0X
SQL Read partition column                       98 /  109        161.2           6.2       1.9X
SQL Read both columns                          193 /  227         81.5          12.3       1.0X
HIVE Read data column                          530 /  530         29.7          33.7       0.4X
HIVE Read partition column                     420 /  423         37.4          26.7       0.4X
HIVE Read both columns                         558 /  562         28.2          35.5       0.3X
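
As a rough illustration of what the single-int-column scan above measures (this is not the actual OrcReadBenchmark.scala code; names, sizes, and paths are made up), the timed query is essentially a full scan with a simple aggregate:

```scala
// Rough sketch of the single-int-column scan; names, sizes, and paths are
// illustrative, not copied from OrcReadBenchmark.scala.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-scan-sketch").getOrCreate()

// Generate a single-int-column data set and persist it as ORC.
spark.range(0, 1024 * 1024 * 15)
  .selectExpr("CAST(id AS INT) AS id")
  .write.mode("overwrite").orc("/tmp/orc_scan_sketch")

// The timed portion: scan the ORC files and aggregate the column.
spark.read.orc("/tmp/orc_scan_sketch").createOrReplaceTempView("orcTable")
spark.sql("SELECT sum(id) FROM orcTable").collect()
```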

How was this patch tested?

Pass the Jenkins tests with newly added test suites in sql/core.

SparkQA commented May 9, 2017

Test build #76693 has finished for PR 17924 at commit 8bfd4bb.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcColumnarBatchReader extends RecordReader[Void, ColumnarBatch] with Logging
  • class OrcRecordIterator extends Iterator[InternalRow] with Logging

SparkQA commented May 9, 2017

Test build #76695 has finished for PR 17924 at commit 4607e0e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member, Author)

Retest this please.

SparkQA commented May 9, 2017

Test build #76699 has finished for PR 17924 at commit 4607e0e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 10, 2017

Test build #76705 has finished for PR 17924 at commit 85ef731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * ColumnarBatch for vectorized execution by whole-stage codegen.
 */
private var columnarBatch: ColumnarBatch = _
Contributor

IIRC, @viirya also has a PR for a vectorized ORC reader. In that PR, we simply wrap the ORC column batch to expose Spark's column batch interfaces, instead of copying the ORC column batch into a Spark column batch. I think that approach is more efficient.

Member Author

Oh, thank you for the comment. It sounds efficient. I'll take a look.

viirya (Member), May 10, 2017

More specifically, we wrap Hive's ColumnVector in a batch to expose Spark's ColumnVector for constructing Spark's ColumnarBatch, so we don't need to move data from one vector format to another.
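
For illustration, a simplified sketch of that idea follows; `SimpleColumnVector` is a deliberately minimal, hypothetical interface standing in for Spark's ColumnVector, and only the long case is shown:

```scala
// Simplified, hypothetical sketch of the wrapping approach: expose a Hive/ORC
// LongColumnVector through a minimal Spark-like column-vector interface
// instead of copying its values into a separate Spark ColumnarBatch.
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector

// Illustrative stand-in for Spark's ColumnVector API.
trait SimpleColumnVector {
  def isNullAt(rowId: Int): Boolean
  def getLong(rowId: Int): Long
}

class OrcLongColumnWrapper(orc: LongColumnVector) extends SimpleColumnVector {
  // ORC vectors can mark the whole column as one repeating value stored at row 0.
  private def physicalRow(rowId: Int): Int = if (orc.isRepeating) 0 else rowId

  override def isNullAt(rowId: Int): Boolean =
    !orc.noNulls && orc.isNull(physicalRow(rowId))

  override def getLong(rowId: Int): Long =
    orc.vector(physicalRow(rowId))
}
```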

Member

Btw, the PR is at #13775.

viirya (Member) commented May 10, 2017

From the current benchmark, the performance shows no obvious improvement compared with the vectorized Hive ORC reader in #13775.

Maybe with the more efficient batch approach @cloud-fan suggested, it can perform better.

Besides performance, getting rid of the Hive dependency for the ORC data source is a great advantage of this change.

}

private val SQL_ORC_FILE_FORMAT = "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
private val HIVE_ORC_FILE_FORMAT = "org.apache.spark.sql.hive.orc.OrcFileFormat"
Member

Will we keep the current Hive ORC data source even after this is in Spark?

Member Author

We need to keep both versions until the transition is complete, for safety. Instead, we can make it configurable which file format is used for the orc data source string, e.g., USING ORC (see the sketch below).
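
For example, a hypothetical way to make the implementation selectable is a conf that maps the orc short name to either class; the conf key and values below are illustrative only, not something this PR adds:

```scala
// Hypothetical sketch of a configurable ORC implementation switch.
// The conf key "spark.sql.orc.impl" and its values are illustrative only.
val SQL_ORC_FILE_FORMAT =
  "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
val HIVE_ORC_FILE_FORMAT =
  "org.apache.spark.sql.hive.orc.OrcFileFormat"

def orcFileFormatClass(conf: Map[String, String]): String =
  conf.getOrElse("spark.sql.orc.impl", "hive") match {
    case "native" => SQL_ORC_FILE_FORMAT   // new sql/core implementation
    case _        => HIVE_ORC_FILE_FORMAT  // existing sql/hive implementation
  }
```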

Member

So to avoid a data source name conflict, we may need to change the Hive ORC data source's shortName to something other than "orc".

dongjoon-hyun (Member, Author)

@cloud-fan and @viirya.

Shall we remove the vectorized part from this PR?

  • The non-vectorized ORCFileFormat is mandatory, and its performance is better than the current one.
  • After merging the sql/core ORCFileFormat, many people (including @viirya and me) can work on it in parallel.

What do you think about that?

viirya (Member) commented May 10, 2017

@dongjoon-hyun That works for me. We can also reduce the size of this PR and ease the review work.

dongjoon-hyun (Member, Author)

Yep. Since this approach adds a new dependency on Apache ORC, the non-vectorized PR will also need more support (or approval) from the committers. I'll wait for more opinions on the current status for a while.

dongjoon-hyun (Member, Author) commented May 10, 2017

Hi, @rxin, @marmbrus, @hvanhovell, @gatorsmile, @sameeragarwal.
Could you give us your opinions on this approach in the Spark SQL part, too?

dongjoon-hyun (Member, Author) commented May 11, 2017

Hi, all.
For further discussion and easy comparison, I made another PR (#17943) that excludes the ColumnarBatch part.

dongjoon-hyun (Member, Author)

Retest this please.

SparkQA commented May 15, 2017

Test build #76920 has finished for PR 17924 at commit 85ef731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member, Author)

Retest this please.

SparkQA commented Aug 14, 2017

Test build #80604 has finished for PR 17924 at commit 85ef731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cenyuhai (Contributor) commented Sep 3, 2017

@dongjoon-hyun I have a question: does this ORC data source reader support a table that contains multiple file formats?
For example:
table/
day=2017-09-01 RCFile
day=2017-09-02 ORCFile

ParquetFileFormat doesn't support this feature.

dongjoon-hyun (Member, Author)

Hi, I didn't try that, but that's not a concept supported by Spark data source tables. Please don't expect it. :)

dongjoon-hyun (Member, Author)

BTW, the latest version is maintained in #17980.
The Spark vector format was changed recently.

dongjoon-hyun (Member, Author)

Please refer to the superset PR, #17980.

asfgit pushed a commit that referenced this pull request Jan 9, 2018
## What changes were proposed in this pull request?

This PR adds an ORC columnar-batch reader to native `OrcFileFormat`. Since both Spark `ColumnarBatch` and ORC `RowBatch` are used together, it is faster than the current Spark implementation. This replaces the prior PR, #17924.

Also, this PR adds `OrcReadBenchmark` to show the performance improvement.

## How was this patch tested?

Pass the existing test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #19943 from dongjoon-hyun/SPARK-16060.

(cherry picked from commit f44ba91)
Signed-off-by: Wenchen Fan <[email protected]>