ARROW-14999: [C++] Don't check field name in ListType Equals() #13851

Member

@wjones127 wjones127 commented Aug 11, 2022

BREAKING CHANGE

Two changes for "internal fields" (fields within ListTypes and MapTypes):

  • Internal field names of ListType and MapType are now compared only where metadata is also compared; this can be overridden explicitly with options.
  • Nullability of MapType internal fields now matters in comparison.

Examples

import pyarrow as pa

lt1 = pa.list_(pa.field("item", pa.int32(), nullable=False))
lt2 = pa.list_(pa.field("item", pa.int32(), nullable=True))
lt3 = pa.list_(pa.field("element", pa.int32(), nullable=False))
lt4 = pa.list_(pa.field("item", pa.int32(), nullable=False, metadata={"hello": "world"}))

# Nullability always matters:
lt1 == lt2 # False, was False
# Field names don't matter:
lt1 == lt3 # True, but was previously False
# ...unless you explicitly ask:
lt1.equals(lt3, check_internal_field_names=True) # False
# Metadata also doesn't matter:
lt1 == lt4 # True, was True
# ...unless you explicitly ask:
lt1.equals(lt4, check_metadata=True) # False


mt1 = pa.map_(pa.utf8(), pa.int32())
mt2 = pa.map_(pa.utf8(), pa.field("value", pa.int32(), nullable=False))
mt3 = pa.map_(pa.utf8(), pa.field("other", pa.int32()))
mt4 = pa.map_(pa.utf8(), pa.field("value", pa.int32(), metadata={"hello": "world"}))

# Nullability always matters:
mt1 == mt2 # False, was previously True
# Field names don't matter:
mt1 == mt3 # True, was True
# ...unless you explicitly ask:
mt1.equals(mt3, check_internal_field_names=True) # False
# Metadata also doesn't matter:
mt1 == mt4 # True, was True
# ...unless you explicitly ask:
mt1.equals(mt4, check_metadata=True) # False
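
The rules in the examples above can also be modelled without pyarrow. What follows is a minimal pure-Python sketch, not Arrow's implementation: `FieldModel`, `ListTypeModel` and their attributes are hypothetical names, and treating the two flags as fully independent is an assumption. Only the flag names are taken from this PR's proposal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldModel:
    """Hypothetical stand-in for the internal field of a list type."""
    name: str
    type_name: str
    nullable: bool = True
    metadata: Optional[Tuple[Tuple[str, str], ...]] = None


@dataclass(frozen=True)
class ListTypeModel:
    """Hypothetical stand-in for ListType; not the Arrow class."""
    value_field: FieldModel

    def equals(self, other, check_internal_field_names=False, check_metadata=False):
        a, b = self.value_field, other.value_field
        # Value type and nullability are always compared.
        if a.type_name != b.type_name or a.nullable != b.nullable:
            return False
        # Field names and metadata are compared only on request.
        if check_internal_field_names and a.name != b.name:
            return False
        if check_metadata and a.metadata != b.metadata:
            return False
        return True


lt1 = ListTypeModel(FieldModel("item", "int32", nullable=False))
lt3 = ListTypeModel(FieldModel("element", "int32", nullable=False))
print(lt1.equals(lt3))                                   # True
print(lt1.equals(lt3, check_internal_field_names=True))  # False
```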

Comment on lines +2108 to +2155
if (value_field()->nullable()) {
ss << 'n';
} else {
ss << 'N';
}
Member Author

Nullability of the internal field is now part of the fingerprint.
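
To picture why this matters: if fingerprints are used as a shortcut for type equality, then anything that affects equality has to be encoded in them. Below is a rough Python sketch of the idea. The function name and the string layout are invented for illustration; only the `'n'`/`'N'` flag comes from the diff above.

```python
def list_fingerprint(value_fingerprint: str, nullable: bool) -> str:
    """Hypothetical fingerprint for a list type; the layout is illustrative only."""
    # 'n' marks a nullable value field, 'N' a non-nullable one (as in the diff).
    flag = "n" if nullable else "N"
    return f"list<{flag}{value_fingerprint}>"


# Types differing only in the nullability of the internal field
# now produce different fingerprints.
print(list_fingerprint("int32", True) == list_fingerprint("int32", False))  # False
```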

@jorgecarleitao
Member

I think this should be raised in the mailing list - this is a non-trivial change to the spec - afaik internal names are an integral part of the field and must be taken into account in equality.

@wjones127
Member Author

wjones127 commented Oct 31, 2022

> I think this should be raised in the mailing list - this is a non-trivial change to the spec - afaik internal names are an integral part of the field and must be taken into account in equality.

@jorgecarleitao I was thinking about doing that; thanks for the nudge. To be clear, this PR makes checking field names optional, and keeps the check on by default in code paths where strict equality (where we also check field metadata) is already on. So I don't think this breaks the spec, but I'm happy to discuss more on the mailing list.

https://lists.apache.org/thread/p6y48qznd61zxc78g3930h4nddz7oo4z

@wjones127 wjones127 force-pushed the ARROW-14999-no-compare-internal-fields branch from 8c709b6 to ba5ea71 Compare October 31, 2022 19:54
@wjones127 wjones127 marked this pull request as ready for review November 14, 2022 21:58
@wjones127 wjones127 force-pushed the ARROW-14999-no-compare-internal-fields branch from b80a544 to 6f28382 Compare November 14, 2022 21:58
@wjones127 wjones127 force-pushed the ARROW-14999-no-compare-internal-fields branch from 6f28382 to 25568ae Compare November 28, 2022 17:30
@wjones127
Member Author

@pitrou @paleolimbot Would you be willing to review?

Member

@paleolimbot paleolimbot left a comment

Just a few notes! You should take anything I say about C++ with a grain of salt, but I do wonder if the signature Equals(DataType|Field|Schema other, TypeEqualsOptions options) might be more appropriate than accumulating arguments? I can see how in some future somebody might not care about nullability either and that approach would help avoid even more arguments.
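
The shape of that suggestion can be sketched in Python. This is a toy illustration, not Arrow's API: `TypeEqualsOptions` takes its name from the comment above, while its fields, the `check_nullability` flag and the `fields_equal` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeEqualsOptions:
    """Hypothetical options bundle, named after the suggestion above."""
    check_metadata: bool = False
    check_internal_field_names: bool = False
    check_nullability: bool = True  # imagined future flag, not part of this PR


def fields_equal(a: dict, b: dict, options: TypeEqualsOptions = TypeEqualsOptions()) -> bool:
    """Compare two toy field descriptions (plain dicts) under the given options."""
    if a["type"] != b["type"]:
        return False
    if options.check_nullability and a["nullable"] != b["nullable"]:
        return False
    if options.check_internal_field_names and a["name"] != b["name"]:
        return False
    if options.check_metadata and a.get("metadata") != b.get("metadata"):
        return False
    return True
```

With a single options argument, adding another switch later changes only the dataclass and leaves every call site's signature alone.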

@@ -16,6 +16,7 @@
# under the License.

import decimal
from email.policy import strict
Member

This seems out of place?

Comment on lines +377 to +381
# TODO(ARROW-18204): metadata doesn't matter by default
# other_metadata <- list_of(field("item", int32(), # nolint
# metadata = list(hello="world"))) # nolint
# expect_equal(x, other_metadata) # nolint
# expect_false(x$Equals(other_metadata, check_metadata = TRUE)) # nolint
Member

I think this would be more effective as a reprex() in ARROW-18204 rather than commented-out code here.

Comment on lines +415 to +418
# other_metadata <- map_of(int32(), # nolint
# field("value", int32(), metadata = list(hello="world"))) # nolint
# expect_equal(x, other_metadata) # nolint
# expect_false(x$Equals(other_metadata, check_metadata = TRUE)) # nolint
Member

Again, I would put this in the Jira rather than as commented-out code here.

@pitrou
Member

pitrou commented Nov 30, 2022

Do we really want to add a new option for this or can we just reuse check_metadata? New options add cognitive overhead, and I'm not sure there's a legitimate reason to decouple those two settings.

@wjones127
Member Author

> Do we really want to add a new option for this or can we just reuse check_metadata? New options add cognitive overhead, and I'm not sure there's a legitimate reason to decouple those two settings.

I implemented it this way in #14847. I was skeptical at first, but it came out rather clean. I just added the field names to the metadata fingerprint.
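
Roughly, folding field names into the metadata fingerprint can be pictured as follows. This is toy Python, not the code from #14847: the function names, the dict-based type description and the string layouts are all made up for illustration.

```python
def type_fingerprint(value_type: str, nullable: bool) -> str:
    """Structural part, always compared. The field name is deliberately left out."""
    return f"list<{'n' if nullable else 'N'}{value_type}>"


def metadata_fingerprint(field_name: str, metadata: dict) -> str:
    """Strict part, compared only when check_metadata is requested."""
    pairs = ",".join(f"{k}={v}" for k, v in sorted(metadata.items()))
    return f"name={field_name};meta={{{pairs}}}"


def list_types_equal(a: dict, b: dict, check_metadata: bool = False) -> bool:
    """Compare two toy list-type descriptions via their fingerprints."""
    same = type_fingerprint(a["type"], a["nullable"]) == type_fingerprint(b["type"], b["nullable"])
    if same and check_metadata:
        same = (metadata_fingerprint(a["name"], a["metadata"])
                == metadata_fingerprint(b["name"], b["metadata"]))
    return same


item = {"name": "item", "type": "int32", "nullable": False, "metadata": {}}
element = dict(item, name="element")
print(list_types_equal(item, element))                       # True
print(list_types_equal(item, element, check_metadata=True))  # False
```

In this model a single `check_metadata` switch covers both metadata and internal field names, which is the coupling suggested above.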

@wjones127
Member Author

Closing in favor of #14847

@wjones127 wjones127 closed this Dec 6, 2022