Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MapArray Requires Values Array #1642

Closed
tustvold opened this issue May 3, 2022 · 5 comments · Fixed by #6730
Closed

MapArray Requires Values Array #1642

tustvold opened this issue May 3, 2022 · 5 comments · Fixed by #6730
Labels
bug question Further information is requested

Comments

@tustvold
Copy link
Contributor

tustvold commented May 3, 2022

Describe the bug

The parquet specification states that the values for a MapArray are optional - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps.

However, the currently logic in ArrayReaderBuilder requires a values array, along with the current definition of DataType::Map

/// A Map is a logical nested type that is represented as
///
/// `List<entries: Struct<key: K, value: V>>`
///
/// The keys and values are each respectively contiguous.
/// The key and value types are not constrained, but keys should be
/// hashable and unique.
/// Whether the keys are sorted can be set in the `bool` after the `Field`.
///
/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.

To Reproduce

Try to read a MapArray without a values array, you will receive an error

Expected behavior

I'm not actually entirely sure, the arrow specification doesn't seem to describe the semantics of Map Arrays, however, our code seems to assume that the values array is required. I'm creating this to track the fact something isn't right here, but I don't actually know what.

Additional context

Map Arrays were added by @nevi-me as part of #395

@tustvold tustvold added question Further information is requested bug labels May 3, 2022
@HaoYang670
Copy link
Contributor

I'm not actually entirely sure, the arrow specification doesn't seem to describe the semantics of Map Arrays

MapArray seems like a logic type of ListArray whose child is a StructArray.

/// [MapArray] is physically a [crate::array::ListArray] that has a
/// [crate::array::StructArray] with 2 child fields.
pub struct MapArray {
    data: ArrayData,
    values: ArrayRef,
    value_offsets: RawPtrBox<i32>,
}

@frolovdev
Copy link
Contributor

frolovdev commented Jun 17, 2022

@tustvold

So the basic idea is to avoid the obligation of values in the map. According to

The value field encodes the map's value type and repetition. This field can be required, optional, or **omitted.**

Did I get the idea right?

So it should be possible to parse something like this without any errors

message table {
            required group map (MAP) {
                repeated group key_value {
                    REQUIRED BYTE_ARRAY key;
                }
            }
        }

@tustvold
Copy link
Contributor Author

That is my understanding, but some spelunking in the C++ or Java implementations may be warranted to confirm how they choose to handle it

@nevi-me
Copy link
Contributor

nevi-me commented Jun 18, 2022

@tustvold here's Arrow's spec: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L103-L131

/// A Map is a logical nested type that is represented as
///
/// List<entries: Struct<key: K, value: V>>
///
/// In this layout, the keys and values are each respectively contiguous. We do
/// not constrain the key and value types, so the application is responsible
/// for ensuring that the keys are hashable and unique. Whether the keys are sorted
/// may be set in the metadata for this field.
///
/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.
///
/// Map
/// ```text
///   - child[0] entries: Struct
///     - child[0] key: K
///     - child[1] value: V
/// ```
/// Neither the "entries" field nor the "key" field may be nullable.
///
/// The metadata is structured so that Arrow systems without special handling
/// for Map can make Map an alias for List. The "layout" attribute for the Map
/// field must have the same contents as a List.
table Map {
  /// Set to true if the keys within each value are sorted
  keysSorted: bool;
}

Parquet seems to allow both HashMap and HashSet, while I interpret Neither the "entries" field nor the "key" field may be nullable. to mean that Arrows Map requires both keys and values.


@frolovdev I suppose a solution is to check whether a map has both key and value, then fall back to parsing it as a list.
I think

message table {
  required group map (MAP) {
    repeated group key_value {
      REQUIRED BYTE_ARRAY key;
    }
  }
}

would then be read in as list[map]<Binary[key]>

@etseidl
Copy link
Contributor

etseidl commented Nov 8, 2024

Bumping this in light of apache/parquet-format#469. I think @nevi-me is correct. As pointed out by @wgtmac, arrow-cpp treats a MAP with no value field as a list, rather than a map with null values. I think arrow-rs should follow suit. I'm willing to take a crack at this, but it might take some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants