[SPARK-2883] [SQL] ORC data source for Spark SQL #6194
Merged build triggered.
Merged build started.
Test build #32839 has started for PR 6194 at commit
@zhzhan Here is a rough list of my updates:
Made corresponding changes according to the new API changes.
Extracted some useful testing utility methods to
Mainly styling issues and code simplifications.
Some TODO items related to testing:
@liancheng Thanks for the follow-up. For future work, feel free to assign it to me.
.orElse(tryLeft.flatMap(_ => buildSearchArgument(left, builder)))
.orElse(tryRight.flatMap(_ => buildSearchArgument(right, builder)))

case And(left, right) =>
Should be Or?
Oops, thanks!
Test build #32839 has finished for PR 6194 at commit
Merged build finished. Test FAILed.
Test FAILed.
// children with brand new builders, and only do the actual conversion with the right builder
// instance when the children are proven to be convertible.
//
// P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only.
I checked with the Hive team. External users are expected to use the current builder approach, although Hive internally builds an XML file from ExprNodeGenericFuncDesc.
Thanks. Do you know of any other projects that use the ORC SearchArgument builder API? I'm looking for examples. I think the problem we faced should be pretty general, and I would like to see how other projects solve it.
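To make the problem discussed in this thread concrete, here is a minimal sketch of the "dry run with a fresh builder, then convert with the real builder" pattern. Everything in it (`SearchArgSketch`, `ToyBuilder`, the `Filter` case classes, `convert`) is a hypothetical stand-in invented for illustration; it does not use the Hive `SearchArgument` API or the Spark classes from this PR, and it only assumes that the real builder is stateful in a similar way.

```scala
// Toy model only: every name here is hypothetical, not a Spark or Hive class.
object SearchArgSketch {
  sealed trait Filter
  case class EqualTo(column: String, value: Int) extends Filter
  case class Unsupported(description: String) extends Filter
  case class And(left: Filter, right: Filter) extends Filter
  case class Or(left: Filter, right: Filter) extends Filter

  // A stateful builder: each call appends a token and cannot be undone.
  final class ToyBuilder {
    private val tokens = scala.collection.mutable.ArrayBuffer.empty[String]
    def startAnd(): ToyBuilder = { tokens += "and("; this }
    def startOr(): ToyBuilder = { tokens += "or("; this }
    def equalTo(column: String, value: Int): ToyBuilder = { tokens += s"$column=$value"; this }
    def end(): ToyBuilder = { tokens += ")"; this }
    def result: String = tokens.mkString(" ")
  }

  def build(filter: Filter, builder: ToyBuilder): Option[ToyBuilder] = {
    // A throwaway builder used only to check convertibility.
    def newBuilder = new ToyBuilder
    filter match {
      case EqualTo(column, value) => Some(builder.equalTo(column, value))
      case Unsupported(_) => None
      case And(left, right) =>
        // Dry run: the real builder is not touched yet.
        val tryLeft = build(left, newBuilder)
        val tryRight = build(right, newBuilder)
        val conjunction = for {
          _ <- tryLeft
          _ <- tryRight
          lhs <- build(left, builder.startAnd())
          rhs <- build(right, lhs)
        } yield rhs.end()
        // For `left AND right`, one convertible side can still be pushed down alone.
        conjunction
          .orElse(tryLeft.flatMap(_ => build(left, builder)))
          .orElse(tryRight.flatMap(_ => build(right, builder)))
      case Or(left, right) =>
        // For `left OR right`, both sides must be convertible.
        for {
          _ <- build(left, newBuilder)
          _ <- build(right, newBuilder)
          lhs <- build(left, builder.startOr())
          rhs <- build(right, lhs)
        } yield rhs.end()
    }
  }

  def convert(filter: Filter): Option[String] =
    build(filter, new ToyBuilder).map(_.result)
}
```

In this sketch the real builder is mutated only after the dry run succeeds, so a failed conversion never leaves a dangling `and(` or `or(` behind. The `Or` case deliberately has no single-side fallback, which is why matching the wrong node type in that branch matters.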
Is the SearchArgument builder API stable/compatible across different Hive versions?
@scwf BTW, with the help of the newly introduced isolated classloader mechanism, Spark SQL can always depend on the most recent version of Hive. Meanwhile, users can specify an arbitrary Hive metastore version to use. So even if this API changes across Hive versions, we don't need shim code to ensure compatibility.
Got it, thanks for the explanation.
Jenkins, test this please.
Merged build triggered.
Merged build started.
Test build #32848 has started for PR 6194 at commit
@liancheng FYI: For schema merging, I checked with some ORC experts, and filter push-down is probably not supported if the column is not in that specific ORC file (I haven't checked the implementation myself yet). In the meantime, separating ORC from Hive is an ongoing effort. We can separate ORC from Hive afterwards and upgrade ORC support to the latest version, which I think will improve performance a lot and remove potential version mismatches caused by different Hive versions.
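As an illustration of the schema merging concern above, the following sketch shows one conservative guard a reader might consider: only push a predicate down to a file when every column it references exists in that file's own schema. This is an assumption-laden toy, not what this PR or ORC does; `PushDownGuard`, `Predicate`, and `split` are hypothetical names.

```scala
// Hypothetical helper for illustration; not part of Spark, Hive, or ORC.
object PushDownGuard {
  // A predicate reduced to a label and the set of column names it references.
  case class Predicate(description: String, columns: Set[String])

  // Splits predicates into (pushable, residual) for a single file.
  // A predicate is pushable only if all of its columns exist in that file.
  def split(
      fileColumns: Set[String],
      predicates: Seq[Predicate]): (Seq[Predicate], Seq[Predicate]) =
    predicates.partition(_.columns.subsetOf(fileColumns))
}
```

Residual predicates would still have to be evaluated after the rows are read, so the guard only affects which filters reach the file reader, not the query result.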
Test build #32848 has finished for PR 6194 at commit
Merged build finished. Test PASSed.
Test PASSed.
Merged build triggered.
Merged build started.
Test build #32874 has started for PR 6194 at commit
@zhzhan Thanks for the information.
Merged build triggered.
Merged build started.
Test build #32881 has started for PR 6194 at commit
Test build #32874 has finished for PR 6194 at commit
Merged build finished. Test PASSed.
Test PASSed.
Merged build started.
Test build #32907 has started for PR 6194 at commit
Test build #32907 has finished for PR 6194 at commit
Merged build finished. Test PASSed.
Test PASSed.
Merged build triggered.
Merged build started.
Test build #32926 has started for PR 6194 at commit
LGTM with respect to API change (there isn't any).
Test build #32926 has finished for PR 6194 at commit
Merged build finished. Test PASSed.
Test PASSed.
In the last few commits, I added hiveContext.read.format("orc").load("hdfs://...") and df.write.format("orc").save("hdfs://..."). Note that the ORC data source is coupled with Hive. If users try to use it with
@marmbrus This should be ready to go.
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files.

1. Saving/loading ORC files without contacting Hive metastore
1. Support for complex data types (i.e. array, map, and struct)
1. Aware of common optimizations provided by Spark SQL:
   - Column pruning
   - Partitioning pruning
   - Filter push-down
1. Schema evolution support
1. Hive metastore table conversion

This PR also include initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support

(cherry picked from commit aa31e43)
Signed-off-by: Michael Armbrust <[email protected]>
Thanks guys! Merged to master and 1.4.
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust <[email protected]>

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust <[email protected]>

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

(cherry picked from commit fcf90b7)
Signed-off-by: Andrew Or <[email protected]>
This PR updates PR #6135 authored by @zhzhan from Hortonworks.
This PR implements a Spark SQL data source for accessing ORC files.
New Features
Future Work
Acknowledgements
This PR also includes initial work done by @scwf from Huawei (PR #3753).