change map field default names to follow the parquet format spec #6814

Closed
wants to merge 1 commit into from

Conversation

Morgan279

Which issue does this PR close?

Closes #6213.

Rationale for this change

MapBuilder uses non-standard default field names, which is what leads to #6213. Switching the defaults to the names in the Parquet spec reduces confusion and gives users a standard naming convention.

According to the parquet-format spec, the outermost level of a Map type must be a group that contains a single field named key_value:

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the map is nullable.
The middle level, named key_value, must be a repeated group with a key field for map keys and, optionally, a value field for map values. It must not contain any other values.

Changing the default map field names to match it not only complies with the parquet spec, but also aligns with pyarrow.
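
To make the naming difference concrete, here is a minimal stdlib-only sketch that renders the three-level layout from the spec text quoted above and reports which names differ from it. `MapNames`, `parquet_schema`, and `mismatches` are hypothetical helpers written for illustration, not part of arrow-rs or pyarrow. The pyarrow names come from the schema shown in the linked issue; the arrow-rs names (`entries`, `keys`, `values`) are my assumption about the current `MapFieldNames` defaults and should be checked against the crate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapNames:
    """Names of the repeated entries group and its key/value children."""
    entry: str
    key: str
    value: str


# Names required by the parquet-format spec text quoted above.
PARQUET_SPEC = MapNames(entry="key_value", key="key", value="value")
# Names pyarrow reports for an in-memory map (see the linked issue).
PYARROW = MapNames(entry="entries", key="key", value="value")
# ASSUMPTION: current arrow-rs MapFieldNames defaults; verify against the crate.
ARROW_RS_ASSUMED = MapNames(entry="entries", key="keys", value="values")


def parquet_schema(column: str, names: MapNames, nullable: bool = True) -> str:
    """Render the three-level Parquet MAP layout for one column as text.

    The key and value types are placeholders; only the names and the
    group nesting are meaningful here.
    """
    repetition = "optional" if nullable else "required"
    return "\n".join([
        f"{repetition} group {column} (MAP) {{",
        f"  repeated group {names.entry} {{",
        f"    required <key-type> {names.key};",
        f"    optional <value-type> {names.value};",
        "  }",
        "}",
    ])


def mismatches(names: MapNames, spec: MapNames = PARQUET_SPEC) -> dict:
    """Return {role: (actual, expected)} for every name that differs from the spec."""
    roles = ("entry", "key", "value")
    return {
        role: (getattr(names, role), getattr(spec, role))
        for role in roles
        if getattr(names, role) != getattr(spec, role)
    }
```

With these definitions, `mismatches(PYARROW)` returns `{"entry": ("entries", "key_value")}`, so pyarrow's in-memory schema differs from the Parquet spec only in the name of the middle level, while the assumed arrow-rs defaults differ in all three names.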

What changes are included in this PR?

The default value of MapFieldNames.

Are there any user-facing changes?

No (I think)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 29, 2024
@tustvold
Contributor

tustvold commented Nov 29, 2024

Can you confirm that pyarrow follows this convention? The Arrow spec has different guidance:

https://github.com/apache/arrow-rs/blob/main/format%2FSchema.fbs#L133

I'm also rather wary of making this change, as it would be highly disruptive for relatively limited benefit.

Edit: in fact, the linked issue shows pyarrow coercing the names when writing to Parquet. Its in-memory schema uses different names:

import pyarrow as pa
import pyarrow.parquet as pq

pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)

# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary

This boils down to an issue similar to #6733.

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Add Option To Coerce Map Type on Parquet Write
2 participants