Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Aggregate push down for incremental scan #10538

Merged
merged 2 commits into from
Jun 21, 2024

Conversation

huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Jun 19, 2024

Enable Aggregate push down for incremental scan. Here is the original PR

@huaxingao
Copy link
Contributor Author

In this PR, I have basically extracted the code for building an org.apache.iceberg.Scan from buildBatchScan() and reused it in the pushAggregation method. The difference between the org.apache.iceberg.Scan in pushAggregation and buildBatchScan is that stats are needed in pushAggregation, which is why I added a withStats flag. This PR is exactly the same as the old one, except I pass in expectedSchema so I don't need to call schemaWithMetadataColumns() twice.

@huaxingao
Copy link
Contributor Author

cc @szehon-ho Could you please take a look when you have a moment? Thanks a lot!


assertThat(explain1).contains("LocalTableScan", "min(data)", "max(data)", "count(data)");

Dataset<Row> noPushdownResult =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: Could we suffix the Dataset variables with ds so this would be noPushdownDs and the other one would be pushdownDs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't think there's too much additional value in verifying the noPushdown case here. I think the test should just enable agg pushdown, and explicitly verify the expected results. But not super opinionated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amogh-jahagirdar for the review! I have changed the variable names and removed the noPushdown cases.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @huaxingao the change looks great to me. I'll wait in case @szehon-ho or others have comments before merging.

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@szehon-ho szehon-ho merged commit a47937c into apache:main Jun 21, 2024
31 checks passed
@szehon-ho
Copy link
Collaborator

szehon-ho commented Jun 21, 2024

Merged, thanks @huaxingao and @amogh-jahagirdar for review, @RussellSpitzer for original review

@huaxingao
Copy link
Contributor Author

Thanks @amogh-jahagirdar, @szehon-ho, and of course, @RussellSpitzer

@huaxingao huaxingao deleted the agg_pushdown_incremental2 branch June 21, 2024 17:44
@huaxingao
Copy link
Contributor Author

I will have a follow-up PR to port the changes to Spark 3.4. @szehon-ho

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants