-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement comparisons on nested data types such that distinct/except would work #11117
Conversation
…would work This relies on newer functionality in arrow 52 and allows DataFrame.except() to properly work on schemas with structs and lists Closes apache#10749
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rtyler -- I think this is a nice improvement. I left some suggestions on how to improve comments / naming but I do think they could go in a follow on PR
It might also make sense to see if there are other kernels which need the same handling (e.g. eq_dyn
for example)
if left.data_type().is_nested() && null_equals_null { | ||
let cmp = make_comparator(left, right, SortOptions::default())?; | ||
let len = left.len().min(right.len()); | ||
let values = (0..len).map(|i| cmp(i, i).is_eq()).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is likely quite slow as it will be doing dynamic dispatch per row.
However, slow is better than not working at first.
Could you please: update the name of the function to reflect it isn't just for null
anymore? Perhaps we could rename it to eq_dyn
or something more generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think other than the potential rename the PR is ready to go -- however I also think we could do the rename as a follow on PR
Note @jayzhan211 added similiar code to handle nested comparisons in eq_datum
in #11091 -- I wonder if we would consolidate those implementations somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do the comparison with datum function, I move it to physical-common in #11091
It will be a nice alternative for equal_rows_arr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in #11149
I am quite indifferent to the solution here as long as #10749 is resolved 😄 Happy to have this closed out in favor of a better implementation! |
This PR is great and I think a step forward (the code no longer errors!) I'll make a follow on PR to try and simplify the implementation. |
Thanks again @rtyler and @jayzhan211 |
Filed #11149 with a proposed simpler implementation |
…would work (apache#11117) This relies on newer functionality in arrow 52 and allows DataFrame.except() to properly work on schemas with structs and lists Closes apache#10749
Which issue does this PR close?
Closes #10749
Rationale for this change
This relies on newer functionality in arrow 52 and allows DataFrame.except() to properly work on schemas with structs and lists. I'm not sure if this is the appropriate way to handle this change per se, but I included the regression case from the issue as a test in order to demonstrate the correction of the issue
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?