[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC #17943
What changes were proposed in this pull request?
Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used.

There are four key benefits.

- Spark `ColumnarBatch` and ORC `RowBatch` can be used together later. In this PR, only `RowBatch` is used, which is already faster than the current implementation in Spark. For `ColumnarBatch`, we need to benchmark and choose the fastest way to use it later. (Please refer to the discussion in [SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924.) A minimal sketch of the `RowBatch` read path follows this list.
- Users can use the ORC data sources without the hive module, i.e. without `-Phive`.
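The `RowBatch` mentioned above comes from the ORC core library's vectorized reader API. Below is a minimal standalone sketch of that read path, not code from this PR: it uses the public API of the plain `orc-core` artifact (Spark's build pulls in a relocated "nohive" variant, so package names differ there), and the file path and single `BIGINT` column are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.exec.vector.{LongColumnVector, VectorizedRowBatch}
import org.apache.orc.OrcFile

object RowBatchReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical ORC file with a single BIGINT column `id`.
    val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(conf))
    val rows = reader.rows()
    val batch: VectorizedRowBatch = reader.getSchema.createRowBatch()
    var sum = 0L
    while (rows.nextBatch(batch)) {
      val ids = batch.cols(0).asInstanceOf[LongColumnVector]
      var r = 0
      while (r < batch.size) {
        // A repeating vector stores its single value at index 0; null handling is omitted for brevity.
        sum += ids.vector(if (ids.isRepeating) 0 else r)
        r += 1
      }
    }
    rows.close()
    println(s"sum(id) = $sum")
  }
}
```

Reading whole batches of column vectors like this, instead of materializing one row object at a time, is largely why the new reader can outperform the Hive-based one.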
The following are two examples of comparisons in `OrcReadBenchmark.scala`.
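The benchmark tables themselves are not reproduced here. As a rough illustration of the kind of comparison `OrcReadBenchmark` makes, the sketch below times a simple aggregation scan with each reader. It is not the actual benchmark harness (which is built on Spark's internal `Benchmark` utility), and it assumes a Spark version in which `spark.sql.orc.impl` switches between the `native` and `hive` implementations; the `hive` setting also requires the Hive module on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object OrcScanTimingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orc-scan-timing-sketch")
      .getOrCreate()

    // Hypothetical output location for the generated ORC data.
    val path = "/tmp/orc-scan-timing"
    spark.range(10L * 1000 * 1000).write.mode("overwrite").orc(path)

    // Very rough wall-clock timing; a real benchmark would add warm-up runs and repetitions.
    def timeScan(impl: String): Long = {
      spark.conf.set("spark.sql.orc.impl", impl)
      val start = System.nanoTime()
      spark.read.orc(path).selectExpr("sum(id)").collect()
      (System.nanoTime() - start) / 1000000
    }

    Seq("hive", "native").foreach { impl =>
      println(s"$impl ORC reader: ${timeScan(impl)} ms")
    }
    spark.stop()
  }
}
```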
How was this patch tested?
Pass the Jenkins tests with newly added test suites in `sql/core`.
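As an illustration only, a round-trip check of the kind such a suite might contain could look like the sketch below. This is not the PR's actual test code, and it assumes ScalaTest 3.1+ and a local `SparkSession` are available.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class OrcRoundTripSketch extends AnyFunSuite {
  test("write and read back an ORC table") {
    val spark = SparkSession.builder().master("local[2]").appName("orc-round-trip").getOrCreate()
    import spark.implicits._

    // Hypothetical temporary location for the ORC output.
    val dir = java.nio.file.Files.createTempDirectory("orc-round-trip").toString
    val input = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
    input.write.mode("overwrite").orc(dir)

    val output = spark.read.orc(dir).orderBy("id").collect()
      .map(r => (r.getInt(0), r.getString(1))).toSeq
    assert(output === Seq((1, "a"), (2, "b"), (3, "c")))

    spark.stop()
  }
}
```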