Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] List types with different field names are not equal #30519

Closed
asfimport opened this issue Dec 6, 2021 · 11 comments
Closed

[C++] List types with different field names are not equal #30519

asfimport opened this issue Dec 6, 2021 · 11 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Dec 6, 2021

When comparing map types, the names of the fields are ignored. This was introduced in ARROW-7173.

However for list types, they are not ignored. For example,

In [6]: l1 = pa.list_(pa.field("val", pa.int64()))

In [7]: l2 = pa.list_(pa.int64())

In [8]: l1
Out[8]: ListType(list<val: int64>)

In [9]: l2
Out[9]: ListType(list<item: int64>)

In [10]: l1 == l2
Out[10]: False

Should we make list type comparison ignore field names too?

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-14999. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
IIRC, the reason for ARROW-7173 is that implementations historically chose different field names, so we had to do that for proper interoperation. It doesn't seem to be necessary for list types.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
For list types we also had the inconsistency with the field names between Parquet vs Arrow (but that's of course not between different implementation of Arrow itself).

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
Here's the issue Joris was talking about: https://issues.apache.org/jira/browse/ARROW-11497

I encountered this issue again here: delta-io/delta-rs#684 (comment)

My recent experiences have led me to believe that (1) we should not care about field names when comparing ListType and (2) we should enable use_compliant_nested_type by default in Parquet writers. I am unable to determine any real use case for a custom field name in a ListType (there's only 1 field, so no need for a name to disambiguate), and preserving them and considering them in equality comparisons is proving to be annoying.

Though if that's not acceptable, I think we should at least provide an an option in the equals method to ignore field names in Lists and Maps.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:

am unable to determine any real use case for a custom field name

We are starting to use custom field names in the extension types for geospatial data (https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md#concrete-examples-of-extension-type-metadata, https://github.com/paleolimbot/geoarrow/).
I don't know if we currently already rely on those names, but we had the intention to do so in the future (cc @paleolimbot)

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
Do you expect to be able to roundtrip that from Parquet? It seems like the conclusion of discussion in ARROW-11497 was that we should transition in the long term towards always using "element", but maybe we would still be able to roundtrip by casting back based on the Arrow schema saved in the metadata?

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
Yes, we do use those names to communicate some information in the geoarrow spec (although I don't think we need them for type equality since it doesn't affect the storage interpretation).

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
@paleolimbot @jorisvandenbossche Wouldn't it be better to use custom metadata to convey some information? That would make sure the information is not lost.

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
We do use metadata (extension type + metadata), although the definition of the extension types is newer than the memory layout, which was the document that proposed the list-child-name-conveying-information strategy. In practice the prototype I wrote mostly ignores the list child names, with the exception being the name of a point array which conveys its dimension (e.g., xy, xyz, xym, or xyzm). Conveying the dimension of the point array using the name was done in part to avoid a parser (to parse extension type metadata since it has to be valid utf8) when writing C kernels (since 'name' at the C data array level is just a null-terminated string). At the time we designed this there was no indication that the roundtripping of the child name would be an issue; however, we could certainly work around it.

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
So here are the conclusions I've gathered so far:

  1. Equality of ListTypes and MapTypes have different behavior right now: list types with different field names are unequal, but map types with different field names are equal. We should make this behavior consistent and probably have an option in the .Equals() method to toggle checking these internal field names.
  2. For extension arrays, it's important that we preserve these field names in most operations. That means that even if the default behavior is to ignore field names in equality for List/Map, unit tests for functions should check for field name equality.

I'm leaning right now that the default for checking equality should be to ignore field names for List/Map (obviously not for struct) in cases where we also don't check metadata. For example, TypeEquals() will check metadata and field names, while DataType::Equals() will not.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:

I'm leaning right now that the default for checking equality should be to ignore field names for List/Map (obviously not for struct) in cases where we also don't check metadata.

That sounds reasonable to me.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 14847
#14847

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants