API to copy an existing RowGroup, including metadata from one parquet file to another #4823

Open
alamb opened this issue Sep 16, 2023 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Sep 16, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In DataFusion, @devinjdangelo is using the append_column API to write parquet files in parallel (apache/datafusion#7562).

However, going from a RowGroupMetaData to that API, so that any bloom filters, page offsets, and other metadata are copied as well, is awkward.

Describe the solution you'd like

I would like a way to call the append_column API given a RowGroupMetaData object from the existing file.

Ideally there would be an API that produced a ColumnCloseResult from a RowGroupMetaData, or some convenience API that took a reader plus the RowGroupMetaData from another file and did the necessary copy.

Perhaps something like

impl SerializedRowGroupWriter {
...
  /// appends an entire RowGroup from the specified reader, including all
  /// metadata, to the in progress parquet file. 
  pub fn append_row_group(&mut self, rg: Box<dyn RowGroupReader>) -> Result<...> { 
   ...
  }
}

https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column
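
To make the shape of the proposed convenience API concrete, here is a minimal sketch of a row-group-level call layered on top of a per-column primitive. It deliberately uses simplified stand-in types, not the parquet crate's: `ColumnCopy`, `SourceRowGroup`, and this toy `RowGroupWriter` are hypothetical names introduced only for illustration, and the real `append_column` takes a reader and a `ColumnCloseResult` instead.

```rust
// Simplified stand-ins for illustration only; these are NOT parquet crate types.

/// Everything needed to re-register one already-encoded column chunk.
#[derive(Debug, Clone, PartialEq)]
struct ColumnCopy {
    bytes: Vec<u8>,                // encoded pages, copied verbatim
    rows: u64,                     // number of rows in the chunk
    bloom_filter: Option<Vec<u8>>, // optional per-column extra
}

/// A source row group, reduced to the column chunks it contains.
struct SourceRowGroup {
    columns: Vec<ColumnCopy>,
}

/// Toy writer for one in-progress row group.
#[derive(Default)]
struct RowGroupWriter {
    columns: Vec<ColumnCopy>,
    bytes_written: u64,
}

impl RowGroupWriter {
    /// Per-column primitive, playing the role of `append_column`.
    fn append_column(&mut self, col: ColumnCopy) -> Result<(), String> {
        // All columns of a row group must agree on the row count.
        if let Some(first) = self.columns.first() {
            if first.rows != col.rows {
                return Err(format!(
                    "row count mismatch: {} vs {}",
                    first.rows, col.rows
                ));
            }
        }
        self.bytes_written += col.bytes.len() as u64;
        self.columns.push(col);
        Ok(())
    }

    /// Convenience wrapper: copy every column of `rg`, extras included,
    /// so the caller does not assemble per-column results by hand.
    fn append_row_group(&mut self, rg: &SourceRowGroup) -> Result<(), String> {
        for col in &rg.columns {
            self.append_column(col.clone())?;
        }
        Ok(())
    }
}
```

The point of the wrapper is only ergonomic: the caller hands over a whole row group and the per-column bookkeeping happens in one place.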

Describe alternatives you've considered

Additional context

alamb added the parquet and enhancement labels Sep 16, 2023
@tustvold
Contributor

Page and column indexes aren't stored at the row group level, so I'm not sure about this. We should definitely facilitate the use case you describe, but I'm not sure this is the way to do it.

@alamb
Contributor Author

alamb commented Sep 17, 2023

> Page and column indexes aren't stored at the row group level, so I'm not sure about this. We should definitely facilitate the use-case you describe, I'm not sure this is the way to do it

It seems like the lowest level that makes sense when "copying parquet data from one file to another" is the RowGroup. Thus I think having an API in terms of the RowGroup makes sense (which is not to say there can't also be finer-grained APIs at a lower level). 🤔

@tustvold
Contributor

I think my comment was ambiguous. RowGroupMetaData doesn't contain the page indexes, so I don't think it can be the basis of this API. I have no objection to the notion of logically appending row groups, only to your suggested mechanism.
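
The objection can be illustrated with a small model in which page indexes are kept at the file level, beside the row group metadata and not inside it. This is a sketch under stated assumptions: `FileMeta`, `RowGroupMeta`, `PageIndex`, and this free-standing `append_row_group` are hypothetical stand-ins, not the parquet crate's types or any agreed design. It only shows why a copy driven by row group metadata alone would drop the indexes, while one that can see the file-level metadata can carry them over.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only; these are NOT parquet crate types.

/// Per-row-group metadata: it knows its columns but holds no page index.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    num_rows: u64,
    num_columns: usize,
}

/// Page-level index for one column chunk (first row of each page).
#[derive(Debug, Clone, PartialEq)]
struct PageIndex {
    page_first_rows: Vec<u64>,
}

/// File-level metadata: page indexes live beside the row groups,
/// keyed by (row group ordinal, column ordinal).
#[derive(Debug, Default)]
struct FileMeta {
    row_groups: Vec<RowGroupMeta>,
    page_indexes: HashMap<(usize, usize), PageIndex>,
}

/// Logically append row group `src_rg` of `src` to `dst`.
/// Returns the ordinal of the new row group in `dst`, or `None` if
/// `src_rg` is out of range. Taking the whole `FileMeta`, not just a
/// `RowGroupMeta`, is what lets the page indexes follow the data.
fn append_row_group(dst: &mut FileMeta, src: &FileMeta, src_rg: usize) -> Option<usize> {
    let meta = src.row_groups.get(src_rg)?.clone();
    let dst_rg = dst.row_groups.len();
    for col in 0..meta.num_columns {
        // Re-key each index under the row group's new ordinal.
        if let Some(index) = src.page_indexes.get(&(src_rg, col)) {
            dst.page_indexes.insert((dst_rg, col), index.clone());
        }
    }
    dst.row_groups.push(meta);
    Some(dst_rg)
}
```

In this model, passing only a `RowGroupMeta` would give the function nothing to look the indexes up in, which is the gap described above.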
