Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR]: Unknown input statistics in FilterExec #7544

Closed
wants to merge 3 commits into from
Closed

[MINOR]: Unknown input statistics in FilterExec #7544

wants to merge 3 commits into from

Conversation

berkaysynnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

In FilterExec::statisticsmethod, if the input statistics are None, the analysis is not performed. However, with the filter predicate, we can estimate the output column boundaries and propagate it to the next exec safely assuming the column boundaries as infinite.

What changes are included in this PR?

There is a function new_with_unbounded_columns constructing a ColumnStatistics having columns with unbounded min and max values (they are ScalarValue::SomeType(None), and these null instances are interpreted as infinite in the interval library).

After this change, a test needed to be modified due to the change of the build side of the join operation.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 13, 2023
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Sep 13, 2023
jackwener
jackwener previously approved these changes Sep 13, 2023
Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In FilterExec::statisticsmethod, if the input statistics are None, the analysis is not performed.

I think we also can modify the statistic derive/propagate in FilterExec.

I'm not sure which is better.

As a whole, LGTM.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test for code somehow, to prevent breaking it during future refactorings

@berkaysynnada
Copy link
Contributor Author

In FilterExec::statisticsmethod, if the input statistics are None, the analysis is not performed.

I think we also can modify the statistic derive/propagate in FilterExec.

I'm not sure which is better.

As a whole, LGTM.

I'm sorry but I didn't understand exactly what you mean while saying derive/propagate.

@github-actions github-actions bot added the physical-expr Physical Expressions label Sep 14, 2023
@@ -196,7 +196,8 @@ fn shrink_boundaries(
&final_result.upper.value,
&target_boundaries,
&initial_boundaries,
)?;
)
.unwrap_or(1.0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this implementation, columns with infinite bounds are not handled, so it was returning an error, and the newly calculated bounds could not be set. This is a kind of workaround but I will refactor these parts with the context of this issue

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 14, 2023
@jackwener
Copy link
Member

jackwener commented Sep 14, 2023

I'm sorry but I didn't understand exactly what you mean while saying derive/propagate.

I didn't explain it in enough detail, and I'm sorry for that.

Statistic Derive/Propagate is a process to compute a whole plan statistic. In general, we have the init statistic of table, and we will compute parent PlanNode from bottom to up recursively and we will compute fill statistic in all PlanNode.

This PR is handle a condition: In FilterExec::statisticsmethod, if the input statistics are None, the analysis is not performed. So this PR fill TypeInfo (such as ScalarValue::Boolean::Null) into Statisitc.

My point is that we can also correct the "Statistic Derive" to make input statistics already have TypeInfo (such as ScalarValue::Boolean) instead of None. If that's the case, we won't need to use new_with_unbounded_columns to inject TypeInfo directly into ColumnStatistic.

Can you get it accurately now what I mean? @berkaysynnada

@berkaysynnada
Copy link
Contributor Author

I'm sorry but I didn't understand exactly what you mean while saying derive/propagate.

I didn't explain it in enough detail, and I apologize for that.

Statistic Derive/Propagate is a process to compute a whole plan statistic. In general, we have the init statistic of table, and we will compute parent PlanNode from bottom to up recursively and we will compute fill statistic in all PlanNode.

This PR is handle a condition: In FilterExec::statisticsmethod, if the input statistics are None, the analysis is not performed. So this PR fill TypeInfo (such as ScalarValue::Null) into Statisitc.

My point is that we can also correct the "Statistic Derive" so that the input statistics are not None, but already have TypeInfo, rather than injecting TypeInfo directly into FilterExec.

Thanks for the explanation :) There is an issue, I don't know if you have found the chance to review it, but in summary, this statistics method of FilterExec needs a refactor. I will move the changes in this PR there; therefore I close this PR.

Actually what you said "if the input statistics are None, the analysis is not performed." is not what I intended. The analysis is performed with columns having infinite bounds. To do this, filling Statistics with TypeInfo is inevitable.

To reflect what you suggest in practice, I plan to add a Schema-like field to the Statistics struct (to hold TypeInfo) and let each statistics method update this schema for the PlanNode's above it. However, since each statistics method has access to its own and children's schema, it would be like carrying duplicate information, so I gave up.

@berkaysynnada berkaysynnada deleted the feature/unknown-input-stats branch September 18, 2023 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants