ARROW-6088: [Rust] [DataFusion] Projection execution plan #4988
Codecov Report
@@             Coverage Diff              @@
##           master    #4988       +/-   ##
===========================================
- Coverage   87.56%   82.78%    -4.79%
===========================================
  Files        1002       90      -912
  Lines      143208    25599   -117609
  Branches     1418        0     -1418
===========================================
- Hits       125397    21191   -104206
+ Misses      17449     4408    -13041
+ Partials      362        0      -362
@andygrove Left some comments. I've lost track of the bigger picture here. Do you happen to have a design doc describing what you are trying to achieve with all the PRs?
    batch_size: usize,
}

impl ExecutionPlan for CsvExec {
Now that I think about it: would it be better to rename `ExecutionPlan` to something like `DataSource`? Also, with the current interface, how can we push down various filters to the storage?
A good reference may be Spark Data Source V2 APIs.
This is more like `org.apache.spark.sql.execution.SparkPlan`, which is the base class for physical operators.
Thanks. The doc helped a lot, although this is different from other designs I've worked with before (e.g., Hive, Impala), where there are query fragments and operators within a query fragment. There, parallelism and repartitioning happen at the query fragment level, whereas here they happen at the operator level.
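To make the operator-level model concrete, here is a minimal sketch in which each operator exposes its own partitions and a driver runs one thread per partition. Apart from the `ExecutionPlan` name, everything here (`Partition`, `MemPartition`, `MemExec`, `collect_parallel`, and all method signatures) is an assumption made for illustration and is not taken from this PR or from the DataFusion API.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical unit of parallel work: one partition of an operator's output.
trait Partition: Send + Sync {
    fn execute(&self) -> Vec<i64>;
}

/// Hypothetical operator interface: every operator reports its own partitions,
/// so parallelism is decided per operator rather than per query fragment.
trait ExecutionPlan {
    fn partitions(&self) -> Vec<Arc<dyn Partition>>;
}

/// In-memory partition, used only to keep the sketch self-contained.
struct MemPartition {
    rows: Vec<i64>,
}

impl Partition for MemPartition {
    fn execute(&self) -> Vec<i64> {
        self.rows.clone()
    }
}

/// In-memory operator holding one `MemPartition` per input chunk.
struct MemExec {
    parts: Vec<Arc<dyn Partition>>,
}

impl MemExec {
    fn from_rows(chunks: Vec<Vec<i64>>) -> Self {
        let parts = chunks
            .into_iter()
            .map(|rows| Arc::new(MemPartition { rows }) as Arc<dyn Partition>)
            .collect();
        MemExec { parts }
    }
}

impl ExecutionPlan for MemExec {
    fn partitions(&self) -> Vec<Arc<dyn Partition>> {
        self.parts.clone()
    }
}

/// Runs every partition on its own thread and concatenates the results
/// in partition order.
fn collect_parallel(plan: &dyn ExecutionPlan) -> Vec<i64> {
    let handles: Vec<_> = plan
        .partitions()
        .into_iter()
        .map(|p| thread::spawn(move || p.execute()))
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("partition thread panicked"))
        .collect()
}
```

Calling `collect_parallel(&MemExec::from_rows(vec![vec![1, 2], vec![3]]))` executes the two partitions concurrently and joins them in order. A fragment-based engine would instead schedule whole groups of operators as one unit.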
for entry in fs::read_dir(dir)? {
    let entry = entry?;
    let path = entry.path();
    let path_name = path.as_os_str().to_str().unwrap();
nit: avoid `unwrap` here?
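One way to act on this nit, sketched under assumptions: `to_str()` returns `None` for paths that are not valid UTF-8, so the `Option` can be converted into an error and propagated with `?` instead of panicking. The helper name `list_files_with_ext` and the choice of `io::Error` are hypothetical; the PR may use its own error type.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: returns the sorted paths of all files in `dir`
/// whose names end with `ext`, without calling `unwrap`.
fn list_files_with_ext(dir: &Path, ext: &str) -> io::Result<Vec<String>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // `to_str()` yields None for non-UTF-8 paths; report that as an
        // error to the caller instead of panicking.
        let path_name = path.to_str().ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("path is not valid UTF-8: {}", path.display()),
            )
        })?;
        if path.is_file() && path_name.ends_with(ext) {
            files.push(path_name.to_string());
        }
    }
    // `read_dir` order is platform-dependent, so sort for stable output.
    files.sort();
    Ok(files)
}
```

An alternative is `path.to_string_lossy()`, which never fails but silently replaces invalid bytes, so the returned name may no longer open the file.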
// specific language governing permissions and limitations
// under the License.

//! Defines the projection execution plan. A projection determines which columns or expressions
Hmm. Not sure if I understand this. Shouldn't this be part of logical query optimization?
Projection can happen in many places in a query plan, e.g. `SELECT sqrt(x) FROM (SELECT MAX(a) AS x FROM foo)`. This is separate from the projection that gets pushed down to the data source.
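As a rough illustration of the outer projection in that query, the sketch below evaluates a list of expressions against an input batch, where the batch stands in for the output of the inner aggregate. The `Batch` alias, the `Expr` enum, and `project` are hypothetical simplifications built on plain `Vec<f64>` columns, not the Arrow arrays or expression types used in the PR.

```rust
/// Toy columnar batch: each inner Vec is one column of f64 values.
type Batch = Vec<Vec<f64>>;

/// Hypothetical physical expressions, evaluated one batch at a time.
enum Expr {
    /// Reference to an input column by index.
    Column(usize),
    /// Square root of another expression.
    Sqrt(Box<Expr>),
}

impl Expr {
    /// Evaluates the expression, producing one output column.
    fn evaluate(&self, batch: &Batch) -> Vec<f64> {
        match self {
            Expr::Column(i) => batch[*i].clone(),
            Expr::Sqrt(inner) => inner
                .evaluate(batch)
                .iter()
                .map(|v| v.sqrt())
                .collect(),
        }
    }
}

/// Projection operator: one output column per expression.
fn project(exprs: &[Expr], input: &Batch) -> Batch {
    exprs.iter().map(|e| e.evaluate(input)).collect()
}
```

If the inner query produced a single column `x` with the value 16.0, then `project(&[Expr::Sqrt(Box::new(Expr::Column(0)))], &vec![vec![16.0]])` yields one column containing 4.0. Nothing here touches the data source, which is why this projection is distinct from column pruning at the scan.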
I see. I thought this could be done by a separate analysis pass on the logical plan, though (in Hive, this is how column pruning is done).
@sunchao Good point. I will write up a design doc to explain all of this.
@sunchao Here is a document explaining the changes: https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing
Thanks @andygrove. The doc is very helpful, and I look forward to seeing the entire piece done. It would be great if you could address the comments I left above. You also need to rebase the PR.
This PR implements the projection and CSV execution plans (I can split this into two PRs if necessary - one for CSV then one for projection).

Note that while I implement execution plans for each relational operator (projection, selection, aggregate, etc) there will be duplicate implementations because we already have the existing execution code that directly executes the logical plan. Once the new physical plan is in place, I will remove the original execution logic (and translate the logical plan to a physical plan).

Closes apache#4988 from andygrove/ARROW-6088 and squashes the following commits:

755365c <Andy Grove> Rebase and remove unwrap
fec84af <Andy Grove> test only delete temp path if exist
8f11c81 <Andy Grove> save
6db609f <Andy Grove> test passes
717dcd8 <Andy Grove> implement mutex for iterator
abf6d5e <Andy Grove> Save
a26575e <Andy Grove> rough out CSV execution plan
e806c76 <Andy Grove> formatting
768a7ae <Andy Grove> Implement Column expression
d1ede3c <Andy Grove> Implement projection logix
1875902 <Andy Grove> Roughing out projection execution plan

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Andy Grove <[email protected]>