Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix optimize projections bug #8960

Merged
merged 2 commits into from
Jan 25, 2024
Merged

Fix optimize projections bug #8960

merged 2 commits into from
Jan 25, 2024

Conversation

mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented Jan 23, 2024

Which issue does this PR close?

Closes #8942.

Rationale for this change

See issue.

What changes are included in this PR?

Note: This PR is an implementation of @gruuya's suggestion in discussion, with his test in the PR.

Currently projections are not inserted to logical plan, if its input schema and its schema are same. However, this equality is not a sufficient condition. See #8942 for an example how can this assumption be a problem. On top of that condition, we need to make sure that projection is trivial (e.g it just emits its input). Then we can deem Projection unnecessary.

However, adding .iter().all(is_trivial) check inserts additional ProjectionExecs to the plan for 2 test cases as observed by @gruuya. Even if these test plans retract, I think we should do this change. Because removing Projection in those plan overfits the name being same after casting (this assumption is not safe).

I think, what we should really do for those queries is rewriting window expression so that they no longer contain CAST expr after once their input have casted result. I think that is the concern of another PR.

@gruuya suggested PR for fixing this issue without retracting these window tests. However, I think that PR solves the problem only for consecutive projections. Hence even if 2 tests regresses we should continue with these changes

Are these changes tested?

Yes

Are there any user-facing changes?

--------------WindowAggr: windowExpr=[[SUM(CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c AS annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c AS annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING]]
----------------Projection: CAST(annotated_data_infinite2.c AS Int64) AS CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c, annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.c, annotated_data_infinite2.d
------------------TableScan: annotated_data_infinite2 projection=[a, b, c, d]
------Projection: CAST(annotated_data_infinite2.c AS Int64) AS CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c, annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.c, annotated_data_infinite2.d, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND CURRENT ROW, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.b ASC NULLS LAST, annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.b ASC NULLS LAST, annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 1 FOLLOWING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these tests, there is additional Projections with Cast exprs

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo -- this makes sense to me. Maybe @sergiimk / @gruuya have some time to review as well

I think it might be valuable to add the tests case from @sergiimk that verifies the actual values from #8942 (comment) as part of the dataframe tests as well

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs

I can do so as well as a follow on PR if you prefer.

Copy link
Contributor

@gruuya gruuya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

It would be nice to do a follow-up on the extra projections in the window SLTs now. Since the results haven't changed presumably these are redundant, so there should be a heuristic to optimize them away.

Copy link
Contributor

@sergiimk sergiimk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my limited experience with DF internals this looks good 👍

@github-actions github-actions bot added the core Core DataFusion crate label Jan 25, 2024
@mustafasrepo
Copy link
Contributor Author

I think it might be valuable to add the tests case from @sergiimk that verifies the actual values from #8942 (comment) as part of the dataframe tests as well

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs

I can do so as well as a follow on PR if you prefer.

I have added test in the issue as dataframe test

@mustafasrepo mustafasrepo merged commit 4ac7de1 into apache:main Jan 25, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 25, 2024

Thanks everyone

Comment on lines +870 to +871
if &projection_schema(&input, &exprs_used)? == input.schema()
&& exprs_used.iter().all(is_expr_trivial)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines below (882) we have the same check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: Logical optimizer causes invalid query result with case expression
5 participants