Fix optimize projections bug #8960

mustafasrepo · 2024-01-23T08:24:13Z

Which issue does this PR close?

Closes #8942.

Rationale for this change

See issue.

What changes are included in this PR?

Note: This PR is an implementation of @gruuya's suggestion in discussion, with his test in the PR.

Currently, a projection is not inserted into the logical plan if its schema is the same as its input schema. However, schema equality alone is not a sufficient condition; see #8942 for an example of how this assumption can cause a problem. In addition to that condition, we need to make sure that the projection is trivial (i.e., it just emits its input). Only then can we deem the Projection unnecessary.
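As a rough illustration of the condition described above, here is a minimal sketch on a toy model. The `Expr` enum, the name-only schema, and the bodies of `projection_schema`, `is_expr_trivial`, and `is_projection_unnecessary` are assumptions made for this example and are not DataFusion's real types or implementations; in particular, "trivial" is assumed here to mean a bare column reference.

```rust
// Toy model, NOT DataFusion's real types: a schema is just an ordered list of
// output column names, and `Expr` has only a few hypothetical variants.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // CAST(expr AS Int64)
    CastToInt64(Box<Expr>),
    // expr AS name
    Alias(Box<Expr>, String),
}

impl Expr {
    /// Output column name this expression produces in the toy schema.
    fn output_name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::CastToInt64(inner) => format!("CAST({} AS Int64)", inner.output_name()),
            Expr::Alias(_, name) => name.clone(),
        }
    }
}

/// Assumption for this sketch: an expression is trivial only when it forwards
/// an input column unchanged.
fn is_expr_trivial(expr: &Expr) -> bool {
    matches!(expr, Expr::Column(_))
}

/// Schema (column names only) that a projection over `exprs` would produce.
fn projection_schema(exprs: &[Expr]) -> Vec<String> {
    exprs.iter().map(Expr::output_name).collect()
}

/// A projection can be dropped only if it keeps the schema AND every
/// expression is trivial; schema equality alone is not enough.
fn is_projection_unnecessary(input_schema: &[String], exprs: &[Expr]) -> bool {
    projection_schema(exprs).as_slice() == input_schema
        && exprs.iter().all(is_expr_trivial)
}

fn main() {
    let input_schema = vec!["a".to_string(), "b".to_string()];

    // Forwards both input columns unchanged.
    let passthrough = vec![
        Expr::Column("a".to_string()),
        Expr::Column("b".to_string()),
    ];

    // Same output names as the input, but column `a` is recomputed.
    let cast_in_place = vec![
        Expr::Alias(
            Box::new(Expr::CastToInt64(Box::new(Expr::Column("a".to_string())))),
            "a".to_string(),
        ),
        Expr::Column("b".to_string()),
    ];

    println!("passthrough unnecessary: {}", is_projection_unnecessary(&input_schema, &passthrough));
    println!("cast_in_place unnecessary: {}", is_projection_unnecessary(&input_schema, &cast_in_place));
}
```

In this sketch the second projection has the same schema as its input, so a check based on schema equality alone would drop it even though it changes the values of `a`.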

However, adding the .iter().all(is_expr_trivial) check inserts additional ProjectionExecs into the plan for 2 test cases, as observed by @gruuya. Even though these test plans regress, I think we should make this change, because removing the Projection in those plans relies on the name staying the same after casting, and that assumption is not safe.

I think what we should really do for those queries is rewrite the window expressions so that they no longer contain the CAST expr once their input already provides the cast result. I think that is the concern of another PR.
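A minimal sketch of what such a rewrite could look like, again on a toy model: once the input projection already computes the cast under some column name, the window argument can refer to that column instead of repeating the CAST. The `Expr` enum, the function `replace_with_column`, and the column name `c_as_int64` are hypothetical and only illustrate the idea; this is not how any follow-up PR necessarily implements it.

```rust
// Toy expression tree, NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // CAST(expr AS Int64)
    CastToInt64(Box<Expr>),
    // SUM(expr), standing in for a window function argument
    Sum(Box<Expr>),
}

/// Replace every occurrence of `target` inside `expr` with a reference to
/// `column`, the (hypothetical) name under which the input already emits it.
fn replace_with_column(expr: &Expr, target: &Expr, column: &str) -> Expr {
    if expr == target {
        return Expr::Column(column.to_string());
    }
    match expr {
        Expr::Column(_) => expr.clone(),
        Expr::CastToInt64(inner) => {
            Expr::CastToInt64(Box::new(replace_with_column(inner, target, column)))
        }
        Expr::Sum(inner) => Expr::Sum(Box::new(replace_with_column(inner, target, column))),
    }
}

fn main() {
    // SUM(CAST(c AS Int64)) where the input already provides CAST(c AS Int64).
    let cast = Expr::CastToInt64(Box::new(Expr::Column("c".to_string())));
    let window_arg = Expr::Sum(Box::new(cast.clone()));

    let rewritten = replace_with_column(&window_arg, &cast, "c_as_int64");
    println!("{:?}", rewritten);
}
```

After the rewrite, the window expression contains only a column reference, so the cast is computed once in the input rather than in both places.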

@gruuya suggested a PR that fixes this issue without regressing these window tests. However, I think that PR solves the problem only for consecutive projections. Hence, even if 2 tests regress, we should continue with these changes.

Are these changes tested?

Yes

Are there any user-facing changes?

mustafasrepo · 2024-01-23T08:45:13Z

datafusion/sqllogictest/test_files/window.slt

--------------WindowAggr: windowExpr=[[SUM(CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c AS annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c AS annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING]]
----------------Projection: CAST(annotated_data_infinite2.c AS Int64) AS CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c, annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.c, annotated_data_infinite2.d
------------------TableScan: annotated_data_infinite2 projection=[a, b, c, d]
+------Projection: CAST(annotated_data_infinite2.c AS Int64) AS CAST(annotated_data_infinite2.c AS Int64)annotated_data_infinite2.c, annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.c, annotated_data_infinite2.d, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.b, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND CURRENT ROW, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.b ASC NULLS LAST, annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.b ASC NULLS LAST, annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 1 FOLLOWING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING, SUM(annotated_data_infinite2.c) PARTITION BY [annotated_data_infinite2.b, annotated_data_infinite2.a, annotated_data_infinite2.d] ORDER BY [annotated_data_infinite2.c ASC NULLS LAST] ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING


In these tests, there are additional Projections with CAST exprs.

alamb

Thank you @mustafasrepo -- this makes sense to me. Maybe @sergiimk / @gruuya have some time to review as well

I think it might be valuable to add the test case from @sergiimk that verifies the actual values from #8942 (comment) as part of the dataframe tests as well

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs

I can also do so in a follow-on PR if you prefer.

gruuya

LGTM!

It would be nice to do a follow-up on the extra projections in the window SLTs now. Since the results haven't changed, presumably these are redundant, so there should be a heuristic to optimize them away.

sergiimk

In my limited experience with DF internals, this looks good 👍

mustafasrepo · 2024-01-25T08:01:10Z

> I think it might be valuable to add the test case from @sergiimk that verifies the actual values from #8942 (comment) as part of the dataframe tests as well
>
> https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
>
> I can also do so in a follow-on PR if you prefer.

I have added the test from the issue as a dataframe test.

alamb · 2024-01-25T11:27:51Z

Thanks everyone

joroKr21 · 2024-01-30T16:38:18Z

datafusion/optimizer/src/optimize_projections.rs

+        if &projection_schema(&input, &exprs_used)? == input.schema()
+            && exprs_used.iter().all(is_expr_trivial)


The same check appears a few lines below (line 882).

Fix optimize projections bug

9912a02

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 23, 2024

mustafasrepo mentioned this pull request Jan 23, 2024

Handle nested projection with derived column optimization #8951

Closed

This was referenced Jan 24, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 #8933

Closed

optimize_projections rule breaks some view operation #8978

Closed

Regression: Logical optimizer causes invalid query result with case expression #8942

Closed

alamb approved these changes Jan 24, 2024

gruuya approved these changes Jan 24, 2024

sergiimk approved these changes Jan 24, 2024

Add new dataframe test

9953cd7

github-actions bot added the core Core DataFusion crate label Jan 25, 2024

mustafasrepo merged commit 4ac7de1 into apache:main Jan 25, 2024
22 checks passed

mustafasrepo mentioned this pull request Jan 26, 2024

Cache common referred expression at the window input #9009

Merged

joroKr21 reviewed Jan 30, 2024

mustafasrepo mentioned this pull request Jan 31, 2024

[MINOR]: Add check for unnecessary projection #9079

Merged

alamb mentioned this pull request May 7, 2024

Stop copying LogicalPlan and Exprs in OptimizeProjections (2% faster planning) #10405

Merged
