Implement comparisons on nested data types such that distinct/except would work #11117

Merged
merged 1 commit into from
Jun 27, 2024

@rtyler rtyler commented Jun 25, 2024

Which issue does this PR close?

Closes #10749

Rationale for this change

This relies on newer functionality in arrow 52 and allows DataFrame.except() to work properly on schemas with structs and lists. I'm not sure this is the most appropriate way to handle the change, but I included the regression case from the issue as a test to demonstrate that the issue is fixed.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

…would work

This relies on newer functionality in arrow 52 and allows
DataFrame.except() to properly work on schemas with structs and lists

Closes apache#10749
@github-actions github-actions bot added the core Core DataFusion crate label Jun 25, 2024

@alamb alamb left a comment

Thank you @rtyler -- I think this is a nice improvement. I left some suggestions on how to improve comments / naming, but I do think they could go in a follow-on PR.

It might also make sense to see if there are other kernels which need the same handling (eq_dyn, for example).

if left.data_type().is_nested() && null_equals_null {
    let cmp = make_comparator(left, right, SortOptions::default())?;
    let len = left.len().min(right.len());
    let values = (0..len).map(|i| cmp(i, i).is_eq()).collect();

This is likely quite slow, as it will be doing dynamic dispatch per row.

However, slow is better than not working at first.

Could you please update the name of the function to reflect that it isn't just for nulls anymore? Perhaps we could rename it to eq_dyn or something more generic.
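
To make the per-row cost concrete, here is a minimal std-only sketch of the pattern in the snippet above. It is a model, not the Arrow API: `ListRow`, `RowComparator`, `make_row_comparator`, and `nested_rows_equal` are hypothetical names invented for illustration, and a nullable `Vec` stands in for one row of a list column. It assumes that nulls should sort first and compare equal to nulls, which is what Rust's built-in ordering on `Option` provides.

```rust
use std::cmp::Ordering;

/// One row of a nullable list-of-nullable-ints column.
/// This is a simplified model, not an Arrow array type.
type ListRow = Option<Vec<Option<i32>>>;

/// Hypothetical stand-in for a dynamic comparator: a boxed closure that
/// compares row `i` of the left column with row `j` of the right column.
type RowComparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

/// Build a comparator over two columns. `Option`'s ordering places `None`
/// before `Some` and treats `None == None`, so nulls compare equal to
/// nulls at every nesting level.
fn make_row_comparator<'a>(left: &'a [ListRow], right: &'a [ListRow]) -> RowComparator<'a> {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

/// Row-wise equality where a null row equals a null row.
/// Every row costs one call through the boxed closure (dynamic dispatch).
fn nested_rows_equal(left: &[ListRow], right: &[ListRow]) -> Vec<bool> {
    let cmp = make_row_comparator(left, right);
    // Only compare the rows both columns have.
    let len = left.len().min(right.len());
    (0..len).map(|i| cmp(i, i).is_eq()).collect()
}
```

Calling `nested_rows_equal` on two columns yields one boolean per row. The indirect call per row is the overhead mentioned above; a kernel specialized for a concrete type could avoid it, at the cost of more code.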


I think other than the potential rename the PR is ready to go -- however I also think we could do the rename as a follow on PR

Note @jayzhan211 added similar code to handle nested comparisons in eq_datum in #11091 -- I wonder if we could consolidate those implementations somehow

@jayzhan211 jayzhan211 Jun 27, 2024

We could do the comparison with the datum function; I moved it to physical-common in #11091.
It would be a nice alternative to equal_rows_arr.


in #11149

rtyler commented Jun 27, 2024

I am quite indifferent to the solution here as long as #10749 is resolved 😄

Happy to have this closed out in favor of a better implementation!

alamb commented Jun 27, 2024

> I am quite indifferent to the solution here as long as #10749 is resolved 😄
>
> Happy to have this closed out in favor of a better implementation!

This PR is great and I think a step forward (the code no longer errors!)

I'll make a follow on PR to try and simplify the implementation.

@alamb alamb merged commit d2ff218 into apache:main Jun 27, 2024
23 checks passed
alamb commented Jun 27, 2024

Thanks again @rtyler and @jayzhan211

alamb commented Jun 27, 2024

Filed #11149 with a proposed simpler implementation

@rtyler rtyler deleted the issue-10749-only branch June 27, 2024 23:38
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
Development

Successfully merging this pull request may close these issues.

DataFrame.except() does not work with structs in schema
3 participants