Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: eliminate redundant sorts on monotonic expressions #9813

Merged
merged 13 commits into from
Apr 12, 2024

Conversation

suremarc
Copy link
Contributor

@suremarc suremarc commented Mar 26, 2024

Which issue does this PR close?

Closes #9812.

Rationale for this change

Eliminate redundant sorts

What changes are included in this PR?

Adds a new function, collapse_monotonic_lex_req that collapses sorts like f(a), a into a if it is known that f preserves ordering, and uses it in the normalize_sort_requirements function of EquivalenceProperties.

Are these changes tested?

Yes, there is a unit test. I also wanted to add a sqllogictest as demonstrated in the write-up in #9812:

CREATE UNBOUNDED EXTERNAL TABLE data (
    "date"      DATE, 
    "ticker"    VARCHAR, 
    "timestamp" TIMESTAMP,
) STORED AS CSV
WITH ORDER ("date", "ticker", "timestamp")
LOCATION './a.parquet';

explain SELECT * FROM data 
WHERE ticker = 'A' AND date = CAST(timestamp AS DATE)
ORDER BY "timestamp";

but filter equality predicates are not converted into equivalence classes unless it's a constant filter, so the only way to demonstrate this change is with custom equivalence properties.

Are there any user-facing changes?

There are no public API changes, but sorts may be removed in certain cases.

@github-actions github-actions bot added the physical-expr Physical Expressions label Mar 26, 2024
@suremarc suremarc changed the title initial impl feat: eliminate redundant sorts on monotonic expressions Mar 26, 2024
@suremarc suremarc marked this pull request as ready for review March 26, 2024 16:59
@alamb
Copy link
Contributor

alamb commented Mar 27, 2024

Thanks @suremarc -- I am behind on reviews -- I hope to review this PR over the next day or two.

@alamb alamb requested a review from mustafasrepo March 27, 2024 03:06
@alamb
Copy link
Contributor

alamb commented Mar 27, 2024

@mustafasrepo do you perhaps have time to review this PR?

@mustafasrepo
Copy link
Contributor

@mustafasrepo do you perhaps have time to review this PR?

Sure

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Mar 27, 2024

I suggested an alternative solution in the PR. I welcome any review regarding that PR.

@alamb alamb marked this pull request as draft March 27, 2024 13:13
@alamb
Copy link
Contributor

alamb commented Mar 27, 2024

Converting to draft as we discuss options on this PR. Thanks again @suremarc and @mustafasrepo

@suremarc suremarc marked this pull request as ready for review April 2, 2024 15:40
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 2, 2024
@suremarc
Copy link
Contributor Author

suremarc commented Apr 2, 2024

@alamb @mustafasrepo re-opening this for feedback after discussion in polygon-io/arrow-datafusion#1.

Thanks to PR #9819, I was able to add that sqllogictest I mentioned in the original post. I also added a negative test case based on the counterexample described here.

@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

I plan to review this PR carefully tomorrow. Thanks @suremarc

@alamb
Copy link
Contributor

alamb commented Apr 9, 2024

I am sorry -- I ran out of brainpower today to review this PR. I will try again tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @suremarc -- it took me a bit to fully grok what this PR was doing, but now I think it makes sense to me. I had several suggestions for additional tests that I think are important before merge.

Also, I think it would make sense to always apply code (aka update collapse_lex_ordering rather than add a new function). What do you think?

However, once the tests are in place I think this PR could also be merged as is.

Thank you again. This is very cool. I apologize for the delay in review

cc @mustafasrepo

@suremarc
Copy link
Contributor Author

suremarc commented Apr 10, 2024

Also, I think it would make sense to always apply code (aka update collapse_lex_ordering rather than add a new function). What do you think?

I think this makes sense, in fact that was my initial idea. The only thing stopping me from doing that is I'm not sure if it interferes with how orderings are normalized, and generally I wasn't sure if it just might cause bugs. I can give it a shot and see what happens.

Edit: it seemed to work 🎉

@suremarc suremarc requested a review from alamb April 10, 2024 21:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @suremarc -- this looks like a very nice improvement to me 👍

/// that are ordered if the next entry is.
/// Used in `collapse_lex_req`
fn collapse_monotonic_lex_req(input: LexRequirement) -> LexRequirement {
input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we could probably save a copy by using into_iter() here rather than iter() -- then we could skip the cloned() at the end too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it, it seems challenging to avoid clones since the iteration looks ahead at the next item. I think it could be done if we rewrote it as a fold or reduce.

--Filter: data.ticker = Utf8("A") AND CAST(data.time AS Date32) = data.date
----TableScan: data projection=[date, ticker, time]
physical_plan
CoalescePartitionsExec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cool to see it knows it doesn't need to preserve the order at all (doesn't have to use sort preserving merge)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I finally realized what you mean. Since ticker is constant the order is arbitrary. That is indeed very cool

@suremarc
Copy link
Contributor Author

@alamb While testing out this feature, I've noticed that the sort elimination doesn't work when certain projections are involved. E.g.,

EXPLAIN
SELECT "time" FROM data
WHERE "ticker" = 'A'
ORDER BY "time"

fails to eliminate the sort, but SELECT * works.

It seems like some substantial work has been done for propagating ordering through projections (#7363), but I guess it didn't work in this case. Curious what your thoughts are here @mustafasrepo

I still think this PR can be merged as-is, but I will make a follow-up issue unless anyone has any objections

@alamb alamb merged commit ffc8444 into apache:main Apr 12, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 12, 2024

I think a follow on PR would be great. Thanks again @suremarc -- this is a great contribution

@mustafasrepo
Copy link
Contributor

While testing out this feature, I've noticed that the sort elimination doesn't work when certain projections are involved.

(Brainstorming) The reason might be that, datafusion with current implementation infers that with existing orderings, constant expressions, equivalent expressions some of the sort expressions are already satisfied.
However, projection cannot propagate information about the pruned out expressions. This may result, equivalence or constantness information to be lost. This might be the root cause of the problem.

@suremarc
Copy link
Contributor Author

suremarc commented Apr 15, 2024

@alamb @mustafasrepo false alarm, I just realized I forgot to include CAST(time AS DATE) = date in the example query above 🤦

That said, I have been seeing issues trying to integrate this feature into my project, but I'm still unsure if it's an issue on my side or in DataFusion yet. Most likely the former, but if not I will file an issue

Edit: it was a problem on my side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eliminate redundant "prefix" sorts
3 participants