-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: eliminate redundant sorts on monotonic expressions #9813
Conversation
Thanks @suremarc -- I am behind on reviews -- I hope to review this PR over the next day or two. |
@mustafasrepo do you perhaps have time to review this PR? |
Sure |
I suggested an alternative solution in the PR. I welcome any review regarding that PR. |
Converting to draft as we discuss options on this PR. Thanks again @suremarc and @mustafasrepo |
@alamb @mustafasrepo re-opening this for feedback after discussion in polygon-io/arrow-datafusion#1. Thanks to PR #9819, I was able to add that sqllogictest I mentioned in the original post. I also added a negative test case based on the counterexample described here. |
I plan to review this PR carefully tomorrow. Thanks @suremarc |
I am sorry -- I ran out of brainpower today to review this PR. I will try again tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @suremarc -- it took me a bit to fully grok what this PR was doing, but now I think it makes sense to me. I had several suggestions for additional tests that I think are important before merge.
Also, I think it would make sense to always apply code (aka update collapse_lex_ordering
rather than add a new function). What do you think?
However, once the tests are in place I think this PR could also be merged as is.
Thank you again. This is very cool. I apologize for the delay in review
I think this makes sense, in fact that was my initial idea. The only thing stopping me from doing that is I'm not sure if it interferes with how orderings are normalized, and generally I wasn't sure if it just might cause bugs. I can give it a shot and see what happens. Edit: it seemed to work 🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @suremarc -- this looks like a very nice improvement to me 👍
/// that are ordered if the next entry is. | ||
/// Used in `collapse_lex_req` | ||
fn collapse_monotonic_lex_req(input: LexRequirement) -> LexRequirement { | ||
input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we could probably save a copy by using into_iter()
here rather than iter()
-- then we could skip the cloned()
at the end too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it, it seems challenging to avoid clones since the iteration looks ahead at the next item. I think it could be done if we rewrote it as a fold or reduce.
--Filter: data.ticker = Utf8("A") AND CAST(data.time AS Date32) = data.date | ||
----TableScan: data projection=[date, ticker, time] | ||
physical_plan | ||
CoalescePartitionsExec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is cool to see it knows it doesn't need to preserve the order at all (doesn't have to use sort preserving merge)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I finally realized what you mean. Since ticker
is constant the order is arbitrary. That is indeed very cool
@alamb While testing out this feature, I've noticed that the sort elimination doesn't work when certain projections are involved. E.g., EXPLAIN
SELECT "time" FROM data
WHERE "ticker" = 'A'
ORDER BY "time" fails to eliminate the sort, but It seems like some substantial work has been done for propagating ordering through projections (#7363), but I guess it didn't work in this case. Curious what your thoughts are here @mustafasrepo I still think this PR can be merged as-is, but I will make a follow-up issue unless anyone has any objections |
I think a follow on PR would be great. Thanks again @suremarc -- this is a great contribution |
(Brainstorming) The reason might be that, datafusion with current implementation infers that with existing orderings, constant expressions, equivalent expressions some of the sort expressions are already satisfied. |
@alamb @mustafasrepo false alarm, I just realized I forgot to include That said, I have been seeing issues trying to integrate this feature into my project, but I'm still unsure if it's an issue on my side or in DataFusion yet. Most likely the former, but if not I will file an issue Edit: it was a problem on my side |
Which issue does this PR close?
Closes #9812.
Rationale for this change
Eliminate redundant sorts
What changes are included in this PR?
Adds a new function,
collapse_monotonic_lex_req
that collapses sorts likef(a), a
intoa
if it is known thatf
preserves ordering, and uses it in thenormalize_sort_requirements
function ofEquivalenceProperties
.Are these changes tested?
Yes, there is a unit test. I also wanted to add a sqllogictest as demonstrated in the write-up in #9812:
but filter equality predicates are not converted into equivalence classes unless it's a constant filter, so the only way to demonstrate this change is with custom equivalence properties.
Are there any user-facing changes?
There are no public API changes, but sorts may be removed in certain cases.