[SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924
Conversation
Test build #76693 has finished for PR 17924 at commit
Test build #76695 has finished for PR 17924 at commit
Retest this please.
Test build #76699 has finished for PR 17924 at commit
Test build #76705 has finished for PR 17924 at commit
```scala
/**
 * ColumnarBatch for vectorized execution by whole-stage codegen.
 */
private var columnarBatch: ColumnarBatch = _
```
IIRC, @viirya also has a PR for a vectorized ORC reader. In that PR, we simply wrap the ORC column batch to expose Spark's column batch interfaces, instead of copying the ORC column batch into a Spark column batch. I think that approach is more efficient.
Oh, thank you for the comment. It sounds efficient. I'll take a look.
More specifically, we wrap Hive's `ColumnVector` in a batch to expose Spark's `ColumnVector` for constructing Spark's `ColumnarBatch`. So we don't need to move data from one vector format to another vector format.
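The wrap-instead-of-copy idea above can be sketched in plain Scala. Note that `SparkColumnVector`, `OrcLongVector`, and `OrcColumnVectorWrapper` below are simplified hypothetical stand-ins, not the real Spark or Hive classes; the point is only that the wrapper answers reads directly from the underlying storage, with no data movement.

```scala
// Simplified stand-in for Spark's column vector interface (hypothetical).
trait SparkColumnVector {
  def getLong(rowId: Int): Long
}

// Simplified stand-in for a Hive/ORC long column vector: a plain value array.
final class OrcLongVector(val vector: Array[Long])

// Wrapping exposes Spark's interface over ORC's storage; no values are copied.
final class OrcColumnVectorWrapper(orc: OrcLongVector) extends SparkColumnVector {
  override def getLong(rowId: Int): Long = orc.vector(rowId)
}

object WrapDemo {
  def main(args: Array[String]): Unit = {
    val orcVec = new OrcLongVector(Array(10L, 20L, 30L))
    val sparkVec: SparkColumnVector = new OrcColumnVectorWrapper(orcVec)
    println(sparkVec.getLong(1)) // prints 20
  }
}
```

A copy-based reader would instead loop over every row and column to fill a second batch, which is the per-row work the wrapper avoids.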
Btw, the PR is at #13775.
From the current benchmark, the performance shows no obvious improvement compared with the vectorized Hive ORC reader (#13775). Maybe with the more efficient batch approach @cloud-fan suggested, it can perform better. Besides performance, getting rid of the Hive dependency for the ORC data source is a great advantage of this PR.
```scala
private val SQL_ORC_FILE_FORMAT = "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
private val HIVE_ORC_FILE_FORMAT = "org.apache.spark.sql.hive.orc.OrcFileFormat"
```
Will we keep the current Hive ORC data source even after this is in Spark?
We need to keep both versions until the transition is complete, for safety. Instead, we can make it configurable which file format is used for the `orc` data source string, e.g., `USING ORC`.
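The configurable resolution suggested above could look roughly like the sketch below. The conf key `spark.sql.orc.impl` and its `"native"`/`"hive"` values are assumptions for illustration, not something this PR defines; only the two class-name constants come from the diff.

```scala
// Hypothetical sketch: resolve the `orc` short name to one of the two
// implementations based on a configuration entry.
object OrcFormatResolver {
  val SQL_ORC_FILE_FORMAT = "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
  val HIVE_ORC_FILE_FORMAT = "org.apache.spark.sql.hive.orc.OrcFileFormat"

  // conf key and values are illustrative assumptions, e.g.
  // Map("spark.sql.orc.impl" -> "hive").
  def resolve(conf: Map[String, String]): String =
    conf.getOrElse("spark.sql.orc.impl", "native") match {
      case "hive" => HIVE_ORC_FILE_FORMAT
      case _      => SQL_ORC_FILE_FORMAT
    }
}
```

With something like this, `USING ORC` keeps working unchanged while the backing implementation is switched by configuration rather than by a renamed short name.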
So to avoid a data source name conflict, we may change the Hive ORC data source's shortName to something other than "orc".
@cloud-fan and @viirya. Shall we remove the vectorized part from this PR? What do you think about that?
@dongjoon-hyun It is good for me. We can reduce the size of this PR too and lighten the review work.
Yep. Since this is an approach adding a new dependency on Apache ORC, the non-vectorized PR will also need more support (or approval) from the committers. I'll wait for more opinions on the current status for a while.
Hi, @rxin, @marmbrus, @hvanhovell, @gatorsmile, @sameeragarwal.
Hi, All.
Retest this please.
Test build #76920 has finished for PR 17924 at commit
Retest this please.
Test build #80604 has finished for PR 17924 at commit
@dongjoon-hyun I have a question: does this ORC data source reader support a table that contains multiple file formats? `ParquetFileFormat` doesn't support this feature.
Hi, I didn't try that, but that's not a concept of a Spark data source table. Please don't expect that. :)
BTW, the latest version is maintained in #17980.
Please refer to the superset in #17980.
## What changes were proposed in this pull request?

This PR adds an ORC columnar-batch reader to native `OrcFileFormat`. Since both Spark `ColumnarBatch` and ORC `RowBatch` are used together, it is faster than the current Spark implementation. This replaces the prior PR, #17924. Also, this PR adds `OrcReadBenchmark` to show the performance improvement.

## How was this patch tested?

Pass the existing test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #19943 from dongjoon-hyun/SPARK-16060.

(cherry picked from commit f44ba91)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

Since SPARK-2883, Apache Spark supports Apache ORC inside the `sql/hive` module with a Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used. There are four key benefits.

- Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is faster than the current implementation in Spark.
- Use `ORC` data sources without the hive module, i.e., `-Phive`.

The following are two examples of comparisons in `OrcReadBenchmark.scala`.

## How was this patch tested?

Pass the Jenkins tests with newly added test suites in `sql/core`.
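A read benchmark like the one mentioned above boils down to timing the same workload under two implementations. The minimal self-contained harness below (not the actual `OrcReadBenchmark.scala`; all names are illustrative) contrasts the two strategies from the discussion: materializing a second buffer versus consuming values in place.

```scala
object MiniReadBenchmark {
  // Time a block and report elapsed milliseconds.
  def time(label: String)(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$label: $elapsedMs ms")
    elapsedMs
  }

  def main(args: Array[String]): Unit = {
    val data = Array.fill(1 << 20)(scala.util.Random.nextLong())
    // "Copy" path: materialize a second buffer before consuming the values,
    // analogous to writing an ORC batch into a separate Spark batch.
    time("copy then sum") {
      val copy = new Array[Long](data.length)
      System.arraycopy(data, 0, copy, 0, data.length)
      copy.sum
    }
    // "Wrap" path: consume the values in place, with no extra data movement.
    time("sum in place")(data.sum)
  }
}
```

Real benchmarks such as `OrcReadBenchmark` additionally control for JIT warmup and run multiple iterations; this sketch only illustrates the shape of the comparison.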