Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups #35775

Closed
xxgreg opened this issue May 26, 2023 · 8 comments · Fixed by #37786
Closed

Comments

@xxgreg
Copy link

xxgreg commented May 26, 2023

Describe the enhancement requested

I'd like to add a key/value metadata field to the Parquet metadata. The value of the field is not known until after the row groups have been written.

It looks like it is possible to do this when using parquet/file.Writer but it isn't possible when using pqarrow.FileWriter.

Would it make sense to add some API to allow for this?

Or perhaps it's already there, and I can't see it.

Component(s)

Go

xxgreg pushed a commit to xxgreg/arrow that referenced this issue May 26, 2023
xxgreg pushed a commit to xxgreg/arrow that referenced this issue May 26, 2023
@xxgreg
Copy link
Author

xxgreg commented May 26, 2023

I added a quick change to allow this. See commits above.

I can add some tests and send a PR if there's any interest.

It turns out that although you can update the key value metadata on parquet/file.Writer, the changes don't get persisted, so I added a commit which changes this also.

@mapleFU
Copy link
Member

mapleFU commented May 26, 2023

Finish api LGTM.
cc @zeroshade

@zeroshade
Copy link
Member

zeroshade commented May 26, 2023

thanks @xxgreg!! There's definitely interest in this and thanks for fixing the bug with persisting the metadata. Please add some tests and send a PR, it should automatically add me as a codeowner to review it.

@xxgreg
Copy link
Author

xxgreg commented Jun 1, 2023

Apologies for the slow reply. And thanks for the super quick replies!

I've got a couple of other priorities to sort out first. I'm planning to pick this up again in 2 weeks.

@zeroshade zeroshade changed the title [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups Jun 1, 2023
@tschaub
Copy link
Contributor

tschaub commented Aug 26, 2023

I'm also interested in being able to set the file metadata after writing row groups. The commit proposed by @xxgreg changes the signature of the metadata.Finish() method. From @mapleFU's comment above, it sounds like this API change would be acceptable. If this backwards incompatible change is acceptable, I would like to help write tests to get this in. Or alternatively, if a non-breaking change would be preferred, I could try to make that happen. Any suggestions from maintainers on the preferred direction (non-breaking change only or breaking change ok) would be appreciated.

@tschaub
Copy link
Contributor

tschaub commented Aug 26, 2023

It would also be useful to see an example of how the writer.FileMetadata and writer.KeyValueMetadata fields are intended to be used.

I only see writer.FileMetadata being set and then immediately being used in the file_writer.go file. So it isn't clear why this field needs to be exported at all (but perhaps it is used by other packages) - or even why it is is a struct field.

The writer.KeyValueMetadata field is only set by the WithWriteMetadata() function and then only accessed by the NewParquetWriter() function. If it is only used for this purpose, it could be unexported.

I'm unclear on why the FileMetadata and KeyValueMetadata fields are exported on the file.Writer struct (but I'm probably missing other uses for both of these fields). Are these meant to settable before the writer is closed? Or should there be an explicit writer.SetKeyValueMetadata() method that would be called before closing a writer (somewhat similar to segmentio/parquet-go#399).

@zeroshade
Copy link
Member

@tschaub I think it would be fine to create a backwards incompatible change here (as long as we mark the PR appropriately). While a non-breaking change would always be preferred (such as creating a new function instead of changing the existing one) it's not a deal breaker on this.

I'm unclear on why the FileMetadata and KeyValueMetadata fields are exported on the file.Writer struct (but I'm probably missing other uses for both of these fields). Are these meant to settable before the writer is closed? Or should there be an explicit writer.SetKeyValueMetadata() method that would be called before closing a writer (somewhat similar to segmentio/parquet-go#399).

You are exactly correct, they were meant to be settable before the writer is closed as Finish / Close is where the metadata gets written so they can be set / updated anytime before the writer is closed when it will actually write the metadata. I think it's reasonable to change these to be no longer exported and add a SetKeyValueMetadata method if everyone thinks this would be a better API.

@tschaub
Copy link
Contributor

tschaub commented Sep 19, 2023

I've put together a proposed set of changes in #37786.

zeroshade pushed a commit that referenced this issue Sep 21, 2023
…fter writing row groups (#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: #35775

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@zeroshade zeroshade added this to the 14.0.0 milestone Sep 21, 2023
etseidl pushed a commit to etseidl/arrow that referenced this issue Sep 28, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment