Simplify file struct abstractions #1120

rdettai · 2021-10-15T08:06:57Z

Rationale for this change

Currently we have several abstractions with very similar names: PartitionedFile, FilePartition, ParquetPartition. This PR attempts to simplify the code by removing FilePartition and ParquetPartition.

What changes are included in this PR?

Removal of FilePartition and ParquetPartition
Simplification of the file display in the parquet exec plan with the FileGroupsDisplay wrapper
FileGroupsDisplay is not yet applied to the CSV, Avro and JSON exec plans; those will be handled in a separate PR (see Multiple files per partition for CSV Json and Avro exec plans #1122)
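
For readers who have not opened the diff, here is a minimal sketch of what a display wrapper like FileGroupsDisplay could look like. It is not the code from this PR: PartitionedFile is reduced to a hypothetical stand-in holding only a path, and the exact output format is an assumption.

```rust
use std::fmt;

// Simplified stand-in for PartitionedFile (assumption: only a path is kept here).
#[derive(Debug, Clone)]
struct PartitionedFile {
    path: String,
}

// Hypothetical wrapper that borrows the file groups and only knows how to print them.
struct FileGroupsDisplay<'a>(&'a [Vec<PartitionedFile>]);

impl<'a> fmt::Display for FileGroupsDisplay<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Render each group as "[path, path]" and join the groups with ", ".
        let groups: Vec<String> = self
            .0
            .iter()
            .map(|group| {
                let paths: Vec<&str> = group.iter().map(|pf| pf.path.as_str()).collect();
                format!("[{}]", paths.join(", "))
            })
            .collect();
        write!(f, "{}", groups.join(", "))
    }
}

fn main() {
    let groups = vec![
        vec![
            PartitionedFile { path: "a.parquet".to_string() },
            PartitionedFile { path: "b.parquet".to_string() },
        ],
        vec![PartitionedFile { path: "c.parquet".to_string() }],
    ];
    // prints "files=[a.parquet, b.parquet], [c.parquet]"
    println!("files={}", FileGroupsDisplay(&groups));
}
```

Keeping the formatting in a borrowed wrapper means the exec plan does not need a dedicated partition struct just to print its files.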

Are there any user-facing changes?

FilePartition was publicly accessible but not really part of the public API

rdettai · 2021-10-15T08:17:33Z

datafusion/src/physical_plan/file_format/parquet.rs

-/// Represents one partition of a Parquet data set and this currently means one Parquet file.
-///
-/// In the future it would be good to support subsets of files based on ranges of row groups
-/// so that we can better parallelize reads of large files across available cores (see
-/// [ARROW-10995](https://issues.apache.org/jira/browse/ARROW-10995)).
-///
-/// We may also want to support reading Parquet files that are partitioned based on a key and
-/// in this case we would want this partition struct to represent multiple files for a given
-/// partition key (see [ARROW-11019](https://issues.apache.org/jira/browse/ARROW-11019)).
-#[derive(Debug, Clone)]
-pub struct ParquetPartition {


These comments were mostly outdated, and the other features they mention are now planned in the ListingTable provider.

rdettai · 2021-10-15T08:46:48Z

datafusion/src/physical_plan/file_format/parquet.rs

-            .iter()
-            .map(|fp| fp.file_partition.files.as_slice())
-            .collect()
+    pub fn file_groups(&self) -> &[Vec<PartitionedFile>] {


We replace partitions with file_groups to try to reduce the overuse of the term "partition", which represents different (yet similar 😅) things in different contexts:

on the listing table side, a partition refers to a "hive partition", that is to say a set of files grouped into a folder because they share a common attribute

on the execution plan side, a partition is a unit of parallelism. Files are grouped together to provide a good workload for one thread/executor.
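
To illustrate the execution-plan meaning, here is a rough sketch of how a flat list of files might be chunked into file groups. The split_files helper, the simplified PartitionedFile and the hive-style paths are all hypothetical and only serve the example; this is not the grouping logic from this PR.

```rust
// Simplified stand-in for PartitionedFile (assumption: only a path is kept here).
#[derive(Debug, Clone, PartialEq)]
struct PartitionedFile {
    path: String,
}

// Hypothetical helper: split `files` into at most `target_groups` groups.
// Each group is meant to be read by one thread/executor.
fn split_files(files: Vec<PartitionedFile>, target_groups: usize) -> Vec<Vec<PartitionedFile>> {
    if files.is_empty() || target_groups == 0 {
        return Vec::new();
    }
    // Ceiling division so that no more than `target_groups` chunks are produced.
    let chunk_size = (files.len() + target_groups - 1) / target_groups;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}

fn main() {
    // All five files belong to the same hive partition (year=2021) ...
    let files: Vec<PartitionedFile> = (0..5)
        .map(|i| PartitionedFile { path: format!("year=2021/part-{}.parquet", i) })
        .collect();
    // ... but they are spread over two file groups, i.e. two units of parallelism.
    let groups = split_files(files, 2);
    for (i, group) in groups.iter().enumerate() {
        println!("file group {}: {} file(s)", i, group.len());
    }
}
```

In this sketch one hive partition ends up in two file groups, which is exactly why using the same word for both concepts gets confusing.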

houqp · 2021-10-16T04:47:08Z

cc @yjshen

yjshen · 2021-10-17T02:34:34Z

datafusion/src/physical_plan/file_format/parquet.rs

 /// Execution plan for scanning one or more Parquet partitions
 #[derive(Debug, Clone)]
 pub struct ParquetExec {
    object_store: Arc<dyn ObjectStore>,
-    /// Parquet partitions to read
-    partitions: Vec<ParquetPartition>,
+    /// List of parquet files, grouped by output partition


"output partition" is vague here.
file_group, i.e. Vec<PartitionedFile>, is the unit of parallelism and will be processed by one single executor/thread.

Are you suggesting we come up with a name that's semantically closer to concurrency?

I was referring to ExecutionPlan.output_partitioning(). Let me change this to something slightly more explicit 🙂 (I'll try to update this later today)

Yes. I think we can rephrase this line to avoid the ambiguity
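
For context on the wording discussed in this thread, a minimal sketch of the relationship between file_groups and output_partitioning(). The types below are hypothetical stand-ins, not DataFusion's real definitions (the real ParquetExec has more fields), and the doc comment is illustrative rather than the final text committed in this PR. The assumption is that the plan reports one output partition per file group.

```rust
// Simplified stand-in for PartitionedFile (assumption: only a path is kept here).
#[derive(Debug, Clone)]
struct PartitionedFile {
    path: String,
}

// Stand-in for the Partitioning enum, reduced to the one variant needed here.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    UnknownPartitioning(usize),
}

#[derive(Debug, Clone)]
struct ParquetExec {
    /// Parquet files to read. Each inner Vec is one file group: it is scanned
    /// by a single thread/executor and yields one partition of the plan output.
    file_groups: Vec<Vec<PartitionedFile>>,
}

impl ParquetExec {
    // Accessor with the signature shown in the diff.
    fn file_groups(&self) -> &[Vec<PartitionedFile>] {
        &self.file_groups
    }

    // Assumption: the number of output partitions equals the number of file groups.
    fn output_partitioning(&self) -> Partitioning {
        Partitioning::UnknownPartitioning(self.file_groups.len())
    }
}

fn main() {
    let exec = ParquetExec {
        file_groups: vec![
            vec![PartitionedFile { path: "a.parquet".to_string() }],
            vec![
                PartitionedFile { path: "b.parquet".to_string() },
                PartitionedFile { path: "c.parquet".to_string() },
            ],
        ],
    };
    println!(
        "{} file groups, partitioning: {:?}",
        exec.file_groups().len(),
        exec.output_partitioning()
    );
}
```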

rdettai · 2021-10-17T08:39:22Z

Thank you all for your reviews 😃, eager to see this merged so that I can create the PR for #1122

houqp · 2021-10-17T19:18:23Z

Thanks @rdettai !

[refacto] simplify file struct abstractions

8f97d3c

github-actions bot added the "ballista" and "datafusion" (Changes in the datafusion crate) labels Oct 15, 2021

rdettai mentioned this pull request Oct 15, 2021

Multiple files per partition for CSV Json and Avro exec plans #1122

Closed

Dandandan approved these changes Oct 15, 2021

houqp approved these changes Oct 16, 2021

houqp added the enhancement New feature or request label Oct 16, 2021

yjshen reviewed Oct 17, 2021

yjshen approved these changes Oct 17, 2021

[doc] explicit desc for file_groups

aa30373

houqp merged commit 161fcd8 into apache:master Oct 17, 2021
